Processing data: Ruby or SQL?

by Jônatas Paganini

In his talk "Processing Data: Ruby or SQL?" at RubyConfTH 2022, Jônatas Paganini explores how data processing can be optimized by examining the relationship between Ruby and SQL, debunking the myth that either is inherently slow. He emphasizes the importance of understanding data processing as a balance between I/O operations, CPU, and memory usage. The following key points are discussed:

  • I/O vs. CPU: The efficiency of data processing relies heavily on input/output operations compared to the actual processing done on the CPU. Paganini uses the analogy of making watermelon juice to illustrate the importance of focusing on the entire process rather than isolated components.

  • Data Retrieval: Retrieving data from the database becomes inefficient when far more information is collected than needed; in the shopping analogy, the trip to the supermarket is the database round-trip, and extra items in the trolley are unnecessary data carried back.

  • TimescaleDB Integration: Paganini details his experience using TimescaleDB, a PostgreSQL extension for handling time-series data. He highlights the benefits of hypertables, which dynamically partition data to keep performance high during massive data insertions.

  • Performance Comparisons: Through practical benchmarks, he shows that while ActiveRecord in Ruby can introduce overhead, Ruby itself can be competitive with SQL when data processing is optimized.

  • I/O Optimization: His findings reveal that a significant portion of processing time is spent on I/O operations, with serialization and deserialization processes being particularly time-consuming. The performance is often bottlenecked not by Ruby but by data management practices.

  • Real-World Examples: Paganini presents a scenario with 400,000 records to show how SQL handles large datasets faster than Ruby, suggesting that bottlenecks often occur in the data transfer phase rather than in the programming language.

The talk concludes with a reminder that optimizing data flow and processing location can lead to improvements in application performance. Developers are encouraged to focus on I/O efficiency, as both SQL and Ruby can perform similarly under the right circumstances. Paganini also reflects on his personal journey as a developer and emphasizes gratitude for the conference setting on his birthday, fostering community discussions on these critical topics.

00:00:00.120 Hello, world!
00:00:16.080 Hello everybody! I am very happy to be here. My name is Jônatas, and I am traveling from Brazil. Today is especially meaningful for me as it is my birthday, and being at this conference is a wonderful gift. I feel incredibly grateful to participate in such a high-level conference. I love all the talks; it’s amazing to see how people are raising the bar. Now, I need to try to keep up my performance.
00:00:36.540 The title of my talk might sound a bit unusual: 'Ruby or SQL?' Some of you might think I am here just to complain about Ruby. However, I want to clarify that Ruby is not slow. The main message here is that neither Ruby nor SQL is slow. The focus of this talk is to discuss how we process data and the trade-offs involved.
00:01:07.020 When we talk about data, we are usually discussing I/O operations. Processing data involves CPU and memory. How do you balance these elements to make processing efficient? I will begin with an analogy. Imagine the difference between a Puma and a Tiger. Yesterday, we discussed the Puma, and today, we have the Tiger—a beautiful design created by the designers at my company. Unfortunately, I couldn’t hire a designer for my presentation due to my travel budget, so I ended up with a rather peculiar tiger generated by AI.
00:01:41.939 Let’s say a friend calls you and says, 'Hey, I'm coming over!' Immediately, you think, 'Great! I'll make some watermelon juice!' But then you remember you don’t have any watermelon. So, you need to find a trolley bag to carry it home. You prepare, start your journey to the supermarket, and try to find a good, ripe watermelon that's sweet enough. As you navigate the supermarket, you might get distracted by all the chocolates and snacks you don’t need. After finally finding the watermelon, you check out, pay, and head home. Now it’s time to make that juice. You slice the watermelon, blend it, filter it, and deal with the cleanup while your friends are waiting.
00:02:38.580 Once you are done with the juice, you finally get to enjoy the time with your friends. Later, being the nerdy person you are, you reflect on how time-consuming it was to make that juice. You want to know how to optimize the juicing process for next time. You research the best juicers online, compare options, and start analyzing what is truly efficient. In the end, it hits you that you were optimizing the wrong aspect of the experience. You realize that the time spent shopping for the watermelon may have consumed more time than the juicing process itself. It’s time to refocus; if you optimize the process of making juice directly rather than optimizing the shopping experience, it could save you much more time.
00:04:05.220 Data processing is similar to juicing. When it comes to data processing, there are many considerations. Shopping is analogous to your database round-trip; you need to go and retrieve the data and bring it back efficiently. However, you may end up carrying much more data than needed during the retrieval process. The kitchen is the location where you process this data, like using a mixer. Regardless of how much effort you invest in the juicer or kitchen setup, the key is to start with good data, just like the watermelon. Whether you use Ruby or SQL, the mindset is crucial; traveling for data processing can be costly and cumbersome.
00:05:07.920 As we engage in data processing, we often are not merely extracting raw data but rather refining and reducing it in size. I have some quotes here, which I find intellectually interesting. I had a great time generating an accompanying image for this talk. Now, let’s pivot to the core topic: my adventures in data processing. I’ll share some benchmarks and insights I’ve gained in optimizing these processes. As a developer advocate for TimescaleDB, I have spent over 14 years working with Ruby, so I approached this endeavor with Ruby in mind.
00:06:16.620 During my time at TimescaleDB, I decided to explore how to connect Ruby with the Timescale extension, taking advantage of its functionalities. Initially, I aimed to create helper functions for Ruby to illustrate how TimescaleDB could be used effectively. I eventually packaged these helpers into a gem to streamline their use. It’s essential, however, to clarify that when I reference SQL, I am generally referring to concepts that can apply to many databases, not just PostgreSQL. TimescaleDB is a free extension for time series data based on PostgreSQL.
00:07:40.920 The TimescaleDB extension allows you to create hypertables, the core concept of this technology. When you create a hypertable, it abstracts the traditional table structure: instead of one large table, your data is partitioned dynamically into chunks based on a chosen interval, such as a day or a month. This means that during massive data insertion, sub-tables are created efficiently according to your requirements. In addition, this chunked, indexed structure can improve performance compared to scanning one big table by ID.
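As a rough sketch of what such a setup can look like from Rails, the migration below creates a plain table and then converts it into a hypertable with daily chunks via TimescaleDB's create_hypertable function. The table and column names (measurements, device_id, val, created_at) are illustrative, not taken from the talk.

```ruby
# Hypothetical migration: names are illustrative; create_hypertable is
# TimescaleDB's own SQL function.
class CreateMeasurements < ActiveRecord::Migration[7.0]
  def up
    create_table :measurements, id: false do |t|
      t.integer :device_id, null: false
      t.decimal :val
      t.timestamp :created_at, null: false
    end

    # Turn the plain table into a hypertable partitioned into daily chunks;
    # inserts are routed to the right chunk automatically.
    execute <<~SQL
      SELECT create_hypertable('measurements', 'created_at',
                               chunk_time_interval => INTERVAL '1 day');
    SQL
  end

  def down
    drop_table :measurements
  end
end
```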
00:09:03.840 For example, if you have a data table and run a query ordered by timestamps, you can pull temperature or moisture values. In SQL, I'm experimenting with functional pipelines enabled by the Toolkit extension, which brings functional programming methodologies directly into the database. You chain functions with an arrow operator, creating a more readable sequence of steps when transforming data. This contrasts with traditional SQL, where you may rely on sub-queries to collect deltas and process them. My point here is to illustrate how expressive and efficient these new methodologies can be.
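A hedged sketch of such a pipeline, run from Ruby: the volatility of each device is computed entirely in the database by chaining Toolkit pipeline elements with the arrow operator. Table and column names are illustrative, and depending on the Toolkit version the pipeline functions may live in an experimental schema that has to be added to the search path.

```ruby
# Volatility per device via a Toolkit function pipeline (illustrative names).
sql = <<~SQL
  SELECT device_id,
         timevector(created_at, val) -> sort() -> delta() -> abs() -> sum() AS volatility
  FROM measurements
  GROUP BY device_id
SQL

ActiveRecord::Base.connection.select_rows(sql)
# => pairs of [device_id, volatility]
```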
00:10:37.920 Let's now look at benchmarks focusing on volatility, a calculation used widely in fields such as finance and the sciences. I'll showcase examples of how I migrated my database to create hypertables efficiently. Once the data is set up, I plan to illustrate how we can use integrated functions to fetch and analyze it. I generated random data for five devices at five-minute intervals over a month, resulting in roughly 50,000 records. Even though this isn't an overwhelming amount of data, it's sufficient for testing performance.
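The seeding script below is not the talk's exact code, just an illustration of the dataset described: five devices, one reading every five minutes, for one month, which lands in the same ballpark as the 50,000 records mentioned. Measurement is a hypothetical model over the hypertable.

```ruby
# Illustrative seed data: 5 devices, one reading every 5 minutes for ~30 days.
finish = Time.now
start  = finish - 30 * 24 * 60 * 60 # one month back, in seconds

(1..5).each do |device_id|
  rows = []
  t = start
  while t < finish
    rows << { device_id: device_id, val: rand(100.0), created_at: t }
    t += 300 # five minutes
  end
  Measurement.insert_all(rows) # one bulk insert per device keeps round-trips low
end
```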
00:12:13.200 I used Rails scopes to run my earlier code and a hash map to group the volatility results by device ID, mirroring the grouping that SQL's GROUP BY provides for aggregating larger datasets efficiently. Running the benchmarks yields some useful figures: the Ruby ActiveRecord implementation runs significantly slower than its SQL counterpart, primarily because of ActiveRecord's overhead. While it's important to understand the performance comparison here, my objective is to emphasize that Ruby itself is not inherently slow.
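For reference, a plain-Ruby version of that grouping and volatility calculation might look like the snippet below (not the exact code from the talk; the model and column names are assumptions). Volatility here is taken as the sum of absolute deltas between consecutive readings per device.

```ruby
# Group readings by device and sum the absolute deltas between consecutive values.
volatility_by_device =
  Measurement.order(:created_at)
             .pluck(:device_id, :val)
             .group_by { |device_id, _val| device_id }
             .transform_values do |pairs|
               pairs.map { |_device_id, val| val.to_f }
                    .each_cons(2)
                    .sum { |a, b| (b - a).abs }
             end
```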
00:14:21.900 To illustrate my point, I created a cleaner, more optimized class and set of methods to handle the volatility processing. In PostgreSQL, functions like array_agg help collect grouped data into arrays for further processing. This approach let me measure fetching and processing time separately, providing a tangible comparison of how the two approaches handle data. The benchmarks show that, when the data flow is managed well, Ruby's performance can be quite competitive with the SQL approach.
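A sketch of that "fetch grouped arrays, finish in Ruby" idea: array_agg collapses each device's readings into a single row, so far fewer rows cross the wire, while the delta/abs/sum step stays in Ruby. Names are illustrative, and depending on the adapter's type casting the aggregated array may arrive already decoded or as a "{...}" string.

```ruby
rows = ActiveRecord::Base.connection.select_rows(<<~SQL)
  SELECT device_id,
         array_agg(val ORDER BY created_at) AS vals
  FROM measurements
  GROUP BY device_id
SQL

volatility_by_device = rows.to_h do |device_id, vals|
  # Handle either a decoded array or PostgreSQL's "{1,2,3}" string form.
  values = vals.is_a?(String) ? vals.delete("{}").split(",").map(&:to_f) : vals.map(&:to_f)
  [device_id, values.each_cons(2).sum { |a, b| (b - a).abs }]
end
```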
00:15:37.860 Furthermore, when you time the data fetching separately, you notice that much of the time is spent in I/O rather than in processing. Running EXPLAIN ANALYZE on my queries revealed that the database spends significant time serializing the data. In fact, in our local testing roughly half of the time went to serialization and deserialization, showing that I/O-bound steps are crucial to performance. By managing how data is structured and how queries are executed, you can often optimize performance.
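One way to see where the time goes is to ask PostgreSQL directly, as sketched below; the execution time EXPLAIN reports can then be compared with the total time measured from Ruby, and the gap hints at how much is spent serializing and shipping rows. The query and names are illustrative.

```ruby
plan = ActiveRecord::Base.connection.execute(<<~SQL)
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT device_id, val, created_at FROM measurements ORDER BY created_at
SQL

plan.each { |row| puts row["QUERY PLAN"] } # per-node timings plus total execution time
```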
00:17:57.180 As we summarize how data traverses through a web application, keep in mind that the entire life cycle—from querying the database to rendering data on the client—transfers data through pipes. How fast or slow this occurs can significantly impact user experience. If you process the data on the front-end, you’ll require more bandwidth as it involves larger data pulls, often overloading servers. However, processing in the backend or database reduces the data sent and can streamline performance significantly.
00:20:12.180 To illustrate the importance of bandwidth and processing location, I shared an example of a downsampling algorithm run both in SQL and in a Ruby application over large datasets. With about 400,000 records per table, I demonstrated how vastly the processing times differ between the SQL and Ruby implementations: the SQL version returned data in less than a second, while the Ruby version showed noticeable delays, especially on the larger datasets. The difference shrinks as the data size decreases, but for substantial volumes it is a telling metric.
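As a very rough sketch of pushing that downsampling into the database instead of pulling hundreds of thousands of rows into Ruby, the query below leans on the Toolkit's lttb (largest-triangle-three-buckets) aggregate; the exact schema and unnest incantation vary by Toolkit version, and all names here are illustrative.

```ruby
# Keep ~500 representative points per device instead of shipping every raw row.
downsampled = ActiveRecord::Base.connection.select_rows(<<~SQL)
  SELECT time, value
  FROM unnest((
    SELECT lttb(created_at, val, 500)
    FROM measurements
    WHERE device_id = 1
  ))
SQL
```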
00:22:24.780 Ultimately, I want to drive home the idea that the bottleneck in performance is often not the programming language itself but rather how we handle data transfer and processing. Ruby is fast; the issues lie in the I/O and how we manage data flow. Developers should consider where optimizations are most impactful, balancing I/O and CPU needs effectively. By addressing this, we can achieve better overall performance within our applications.