00:00:15.280
Welcome to my presentation on Ruby supercomputing, where we'll explore using the GPU for massive performance speedups. You can access the code right now via slideshare.net/preston/ladydrought, and if you want to follow along, feel free to clone it from that link.
00:00:24.320
However, running all the examples requires a fair number of dependencies. It's not just Ruby either; for those on NVIDIA chipsets, you'll need the NVIDIA development driver and toolkit, plus Ruby 1.9 and JRuby 1.6. I wouldn't recommend trying to set all of that up right now.
00:00:38.000
Before diving deeper, I want to challenge everyone's perspective on concurrency. When dealing with multi-threading, many developers feel as though they're swimming against the current. There are pros and cons to different architectures, especially shared-nothing models. I'm not dismissing them, but I believe we limit our computers' capabilities by avoiding threads altogether.
00:01:02.480
While complexities such as memory sharing exist, we have powerful tools at our disposal, many of them provided by the operating system. To use those tools effectively, though, it's essential to understand the fundamentals of concurrency. To illustrate, let's consider a simple example: calculating the area of tree rings.
00:01:48.720
Imagine you're developing a tree ring simulator and want to find the area of every tree ring. Start with the formula for the area of a circle, A = πr². The area of an individual ring is then the area of its full circle minus the area of the circle one ring further in, so we compute this difference for each successive radius.
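As a concrete sketch (the ring_area name and the unit ring width are illustrative assumptions, not the talk's exact code):

    # Area of the ring ending at the given radius: the full circle's area
    # minus the area of the circle one ring further in (ring width assumed 1).
    def ring_area(radius)
      Math::PI * radius**2 - Math::PI * (radius - 1)**2
    end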
00:02:59.440
Our initial approach is straightforward. We'll write a simple function that calculates the area for a given radius and returns the value, then run it in a loop over the desired number of rings. This requires no special libraries and runs directly in plain Ruby.
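A minimal single-threaded sketch along those lines, assuming the ring_area helper above and an arbitrary ring count:

    NUM_RINGS = 10_000

    # Walk every radius in order, collecting each ring's area.
    areas = (1..NUM_RINGS).map { |radius| ring_area(radius) }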
00:03:47.480
However, this is just a single-threaded approach. To enhance performance, we can utilize multiple threads. In our second example, we introduce multi-threading where we take advantage of available CPU cores. By dividing the workload among several threads, we can significantly increase performance.
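A hedged sketch of that division of labor, reusing the helper above (the thread count and contiguous slicing are illustrative choices):

    THREAD_COUNT = 4

    # Give each thread a contiguous slice of radii, then join the
    # threads and collect their results in order.
    threads = (1..NUM_RINGS).each_slice(NUM_RINGS / THREAD_COUNT).map do |radii|
      Thread.new { radii.map { |r| ring_area(r) } }
    end
    areas = threads.flat_map(&:value)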
00:04:37.280
Alternatively, we could drop down to a lower-level implementation in a language like C, which gives us direct access to the operating system's threading primitives, such as pthreads. The differences between our examples then become apparent, with the multi-threaded solutions outperforming the single-threaded ones.
00:05:57.760
Ultimately, most Ruby gems fall into one of four categories: pure Ruby with minimal dependencies; pure Ruby multi-threaded implementations, although true thread safety is rare; gems with native extensions for the performance-critical parts; and lastly, those that scale horizontally across distributed systems.
00:06:46.720
Let’s run through a few examples. The first two will be basic implementations, one considering single-threaded performance and the other with multi-threading. First, raise your hand if you think the single-threaded implementation will be faster. Now, raise your hand if you believe the multi-threaded version will outperform the single-threaded one.
00:07:37.440
In my testing, the multi-threaded implementation on Ruby 1.9 actually ran slightly slower than the single-threaded one. In contrast, on JRuby the parallel calculation across multiple CPU threads completed much faster, showcasing real threading capability. The discrepancy comes down to Ruby's global interpreter lock, which prevents more than one thread from executing Ruby code at a time and so keeps MRI from using multiple processors simultaneously.
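You can see the effect with a small benchmark sketch; the workload sizes here are arbitrary, and the point is to run it under both MRI and JRuby and compare:

    require 'benchmark'

    work = ->(n) { n.times { Math.sqrt(rand) } }

    Benchmark.bm(10) do |x|
      # One thread does all of the work.
      x.report('serial')   { work.call(4_000_000) }
      # Four threads split the same work; on MRI 1.9 the interpreter
      # lock serializes them, while JRuby can run them on separate cores.
      x.report('threaded') do
        4.times.map { Thread.new { work.call(1_000_000) } }.each(&:join)
      end
    end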
00:09:45.360
Moving forward, let's investigate the CPU utilization during these calculations. When using Ruby 1.9 multi-threading, I noticed that it rarely exceeds 100% CPU usage because it only utilizes one core due to the mentioned limitation, whereas JRuby effectively distributes the workload across multiple cores.
00:10:51.120
There are legitimate reasons to avoid multi-threading, including the performance subtleties of locking mechanisms and the testing complexity that comes with non-deterministic behavior. More often than not, we want the benefits of running threads without having to manage those complications ourselves.
00:11:39.200
Without a GPU, we are limited in how many instructions we can execute at once. A GPU's architecture lets thousands of threads run concurrently, dramatically increasing performance in suitable contexts, and many algorithms can be restructured to exploit that kind of parallel computation.
00:12:21.920
This capability can be especially useful when calculating vast amounts of information simultaneously, such as in graphics processing or computational mathematics. The potential of GPUs in performing large-scale processing tasks significantly outpaces typical CPU capabilities.
00:13:57.760
Let's look at the hardware. The NVIDIA Tesla models remove video output entirely, focusing purely on computation rather than graphics. These specialized GPUs deliver high compute performance, and because they are programmed through the same interfaces as consumer cards, they integrate readily with existing software stacks, making them widely applicable.
00:16:13.679
Transitioning to the practical aspects of GPU programming, we typically work with OpenCL, providing a way to harness the power of GPUs on various platforms, be it NVIDIA or ATI. The OpenCL language offers an abstraction for running kernels on compute units, streamlining the development process while maintaining flexibility.
00:18:57.679
However, working with these libraries requires an understanding of the terms defined by the OpenCL environment. A kernel is the function that executes on the device; the device is the GPU or CPU that performs the computation; and the host is the program that queues work for the device. It's essential to be familiar with these concepts to use GPU capabilities effectively.
00:21:55.679
Now, let's dive into some specific code utilizing OpenCL. We'll implement our tree rings example with a focus on leveraging the GPU. Using a library like Barracuda in Ruby, we can execute our function on the GPU directly, while the library handles memory management and the data transfer between the CPU and GPU.
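Here's a minimal sketch of how that looks. The kernel body is standard OpenCL C; the Program and output-buffer call shapes below follow Barracuda's documented style, but treat the exact class names and signatures as assumptions rather than a verified API reference:

    require 'barracuda'
    include Barracuda

    # OpenCL kernel: one GPU thread computes one ring's area.
    program = Program.new <<-'KERNEL'
      __kernel void ring_areas(__global float *out, int n) {
        int i = get_global_id(0);
        float r = (float)(i + 1);
        if (i < n)
          out[i] = 3.14159265f * (r * r - (r - 1.0f) * (r - 1.0f));
      }
    KERNEL

    n = 10_000
    out = OutputBuffer.new(:float, n)        # assumed output-buffer signature
    program.ring_areas(out, n, :times => n)  # :times as the global work size is assumed
    areas = out.data                         # assumed accessor for the results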
00:24:32.679
This setup simplifies GPU programming significantly. The numbers shown earlier reveal the clear performance gap between Ruby on the CPU alone and Ruby driving the GPU. Ruby's ease of use is admirable, but we still need to consider the situations where performance is essential.
00:28:16.239
As demonstrated, while Ruby implementations provide a straightforward way to write the code, they can fail to exploit the efficiency of the underlying hardware, which is where native extensions or other languages come in for optimal performance. The differences observed here illustrate how much the choice of implementation matters for throughput.
00:30:49.289
Moving on to questions: why was the Ruby GPU version slower than the C implementation? The answer comes down to data-transfer overhead between the CPU and GPU: the input and output buffers must be copied across the bus in both directions, and for a modest workload that fixed copying cost can outweigh the time saved on the computation itself.
00:34:06.639
In conclusion, pairing Ruby with GPU architectures offers tremendous performance potential for tasks suited to parallel processing. With further exploration and experimentation, frameworks may evolve that let Ruby developers harness this power more naturally.