Data Science in Ruby? Is it possible? Is it Fast? Should we use it?

00:00:06.610 Good morning, everyone.

00:00:08.920 My name is Rodrigo Urubatan, and I will try to be fast. If anyone wants to download these slides to follow up while I'm talking, just scan this QR code.

00:00:30.320 Today, we'll discuss data science in Ruby: if we can do it, if it's fast, if we should use it, and when to use it.

00:00:43.850 I want to know how many people here work with data science—data scientists, data engineers, or anyone who writes applications that utilize data. Let's raise our hands. Let's start by defining what data science is. This can be a tricky question, as every company has a different definition.

00:01:12.030 Some define data science as the process of extracting meaning and interpreting data. Others see it as using statistics and machine learning to clean and manipulate data. Additionally, some argue that any use of computer software to collect, clean, and manipulate data qualifies as data science. It has even become a buzzword that combines data mining and business intelligence.

00:01:31.640 Those terms have been around for at least 20 years, maybe longer, but they are often used to justify expensive tools and a cheaper workforce. Personally, I prefer when tools are cheaper and professionals are more valued.

00:02:26.440 So, can Ruby handle data science? The quick answer is yes. We have libraries for almost everything needed for data science. However, as with any software-related question, the best answer is: it depends.

00:02:46.730 In Ruby, we have libraries to integrate with other tools like R and Python. We have data manipulation libraries and gems for distributed computing, as well as gems for structured data like Daru. We will see examples from these groups later. We also have ready-made datasets such as the Iris dataset, which has been widely used in machine learning.

00:03:24.620 Moreover, we have libraries for statistics, data visualization, and interactive computing. Some tools are excellent, while others may not perform as well. For example, some gems work well together, whereas others do not. There are gems that perform the same function, but one can be significantly faster than the other. It is crucial to define what we want to accomplish and how to integrate various tools, which can often pose a tricky challenge.

00:04:50.660 Interactive computing is exemplified by using Jupyter Notebooks, where you can write code samples, execute them, and see results immediately. This feature is helpful when evaluating data cleanup tasks or for teaching purposes.

00:05:48.940 We have libraries that allow Ruby to work with Python. I discovered a library called PyCall, created by Kenta Murata, that lets you write Ruby code using Python modules as if they were Ruby libraries. It's impressive and solves many problems efficiently.

00:06:28.370 There are libraries for data manipulation like Kiba, which was a significant help in a project where I collected data from five different sources, including databases and files. Kiba let me clean and integrate the data declaratively, which made bugs easier to find. Similarly, Jongleur is useful for integrating multiple ETL jobs.
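
To make the idea of a declarative ETL pipeline concrete, here is a minimal plain-Ruby sketch. It is not Kiba's API: the `Pipeline` class, its `source`, `transform`, and `run` methods, and the sample rows are all hypothetical and exist only to show the shape of "sources, then transforms, then a destination".

```ruby
# Hypothetical mini-pipeline (not Kiba): sources supply rows, transforms
# modify or drop rows, and run returns the rows that survive every step.
class Pipeline
  def initialize
    @sources = []
    @transforms = []
  end

  # Register any enumerable of rows (for example, records read from a
  # database or lines parsed from a file).
  def source(rows)
    @sources << rows
    self
  end

  # A transform returns the (possibly modified) row, or nil to drop it.
  def transform(&block)
    @transforms << block
    self
  end

  def run
    @sources.flat_map(&:to_a).each_with_object([]) do |row, output|
      result = @transforms.reduce(row) do |current, step|
        break nil if current.nil? # an earlier step dropped this row
        step.call(current)
      end
      output << result unless result.nil?
    end
  end
end

# Made-up rows standing in for two different data sources.
database_rows = [{ name: " Ana ", age: "31" }, { name: "", age: "40" }]
file_rows = [{ name: "Bob", age: "27" }]

clean = Pipeline.new
  .source(database_rows)
  .source(file_rows)
  .transform { |row| row.merge(name: row[:name].strip) }    # tidy names
  .transform { |row| row[:name].empty? ? nil : row }        # drop blanks
  .transform { |row| row.merge(age: Integer(row[:age])) }   # cast ages
  .run

puts clean.inspect
```

Because each step is a small, named block, a bad row can be traced to the single transform that mishandled it, which is the debugging benefit described above.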

00:07:33.350 However, libraries for distributed computing are where things start to get complicated. Projects like Apache Spark are excellent, but the integration libraries for Ruby and JRuby have not seen updates in the last three years. If you're already using them, that's great, but if you're starting something new, they're not the ideal choice due to open bugs and lack of maintenance.

00:08:15.029 We also have libraries for a variety of data structures. For instance, Daru is an implementation of DataFrames, which I believe is the most crucial data format for data science since it allows for easy data manipulation.
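
For readers new to the term, a DataFrame is essentially a table of named columns that can be filtered and aggregated as a unit. The sketch below imitates that idea with a plain Ruby hash of arrays. It does not use Daru and does not reflect Daru's API; the helper names `filter_rows` and `column_mean` and the sample values are invented for illustration.

```ruby
# A toy column-oriented table: each key is a column name, each value a column.
# Hypothetical illustration only; the values are made up.
frame = {
  species: %w[setosa setosa virginica virginica],
  petal_length: [1.4, 1.3, 6.0, 5.1]
}

# Keep only the rows where the given column satisfies the block.
def filter_rows(frame, column)
  keep = frame[column].each_index.select { |i| yield frame[column][i] }
  frame.transform_values { |values| values.values_at(*keep) }
end

# Arithmetic mean of one numeric column.
def column_mean(frame, column)
  frame[column].sum / frame[column].size.to_f
end

virginica = filter_rows(frame, :species) { |s| s == "virginica" }
puts virginica.inspect
puts column_mean(virginica, :petal_length)
```

A real DataFrame library adds indexing, missing-value handling, and typed columns on top of this basic shape.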

00:09:26.639 There are multiple similar libraries, such as Numo, NMatrix, and NArray. NArray performs excellently, while NMatrix has performance issues that have been known for two years without a resolution.

00:10:24.120 We have several ready-made datasets available for use in applications. Projects like the Ruby DataSets project offer a growing collection of datasets gathered from diverse sources.

00:10:57.209 In statistics, we have various libraries, including rb-gsl, which serves as an interface for the GSL scientific library, and enumerable-statistics by Kenta Murata, which efficiently performs simple calculations on arrays or Active Record results.
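
The "simple calculations on arrays" mentioned here are things like the mean, variance, and standard deviation. As a rough illustration, this is what those calculations look like in plain Ruby. The helper names `mean`, `sample_variance`, and `sample_stdev` are hypothetical and are not taken from any gem; a dedicated library would normally provide faster and more numerically careful versions.

```ruby
# Plain-Ruby sketch of basic descriptive statistics on an array.
# These helpers are illustrative only, not a gem's implementation.

def mean(values)
  values.sum.to_f / values.size
end

# Sample variance: the sum of squared deviations divided by (n - 1).
def sample_variance(values)
  m = mean(values)
  values.sum { |v| (v - m)**2 } / (values.size - 1)
end

def sample_stdev(values)
  Math.sqrt(sample_variance(values))
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
puts "mean:     #{mean(data)}"
puts "variance: #{sample_variance(data).round(4)}"
puts "stdev:    #{sample_stdev(data).round(4)}"
```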

00:11:47.040 We also have numerous libraries for data visualization, such as Matplotlib, which is great for rendering visual outputs. I often use Daru, which is integrated with several libraries and works nicely with Jupyter Notebooks.

00:12:39.450 Now, let's discuss the current state of data science in Ruby. We face both advantages and challenges. For starters, all data science in Python generally revolves around the SciPy project. Conversely, in Ruby, we have three different projects offering diverse approaches.

00:13:15.230 Most data science tools can be found under the SciRuby project, which has numerous matrix-centric gems and the Daru project for DataFrames. It also includes Nyaplot for plotting, along with many other libraries.

00:13:41.690 We also have the Ruby Numo project, which focuses on array-centric gems. However, a significant drawback is that there are relatively few developers actively contributing to data science in Ruby.

00:15:00.650 A new initiative, the Red Data Tools project, was established for interoperability between these projects, designed with Apache Arrow as the backend. This is promising since the next version of Pandas will also adopt this backend, facilitating smoother data reading and manipulation across libraries.

00:15:59.650 Despite these tools, challenges persist. NMatrix has had performance issues that have gone unresolved for two years, while NArray is faster but is not compatible with Daru.

00:16:49.190 While both the SciRuby and Ruby Numo projects are functional, the Red Data Tools project is still evolving, and currently lacks data manipulation libraries, focusing mainly on input-output functionalities.

00:17:32.580 In summary, working with data science in Ruby presents its difficulties. Some tools function excellently while others do not, and resources are scarce.

00:18:07.120 Python undeniably remains the crown jewel of data science, with Ruby offering mostly equivalent functionality. For instance, Daru and NMatrix in Ruby parallel Pandas and NumPy in Python, albeit with less documentation.

00:18:50.480 To illustrate, I tested performance by summing 50,000 random numbers using NMatrix, which took 0.3 seconds compared to 0.003 seconds with NumPy, a factor of about 100. Similarly, creating a simple DataFrame with 50,000 entries using Pandas took 708 microseconds, while the Daru equivalent was 57 times slower, taking about 40 milliseconds.
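
A timing like this can be reproduced with Ruby's standard `Benchmark` module. The sketch below is not the benchmark from the talk: it sums a plain Ruby `Array` rather than an NMatrix, so its timing will differ from the figures quoted. The second half only re-derives the ratios from the numbers stated above.

```ruby
require "benchmark"

# Time the sum of 50,000 random floats held in a plain Ruby Array.
# Assumption: a plain Array stands in for NMatrix here.
numbers = Array.new(50_000) { rand }
total = nil
elapsed = Benchmark.realtime { total = numbers.sum }
puts format("sum of %d numbers: %.3f in %.6f s", numbers.size, total, elapsed)

# Ratios implied by the figures quoted in the talk.
nmatrix_seconds = 0.3
numpy_seconds = 0.003
sum_slowdown = nmatrix_seconds / numpy_seconds

pandas_seconds = 708e-6
daru_factor = 57
daru_seconds = pandas_seconds * daru_factor

puts format("NMatrix vs NumPy sum: about %.0fx slower", sum_slowdown)
puts format("Daru DataFrame creation: about %.1f ms", daru_seconds * 1000)
```

For more stable measurements, repeat the timed block many times and compare medians, since a single run of a sub-millisecond operation is dominated by noise.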

00:20:18.720 Though it is possible to execute these operations in Ruby, the substantial performance differences—57 times in particular—can be problematic in production scenarios.

00:20:54.890 I find Ruby on Rails to be better suited for building web and business applications, although there is potential to perform machine learning tasks within Ruby. That discussion, however, would warrant a separate presentation.

00:21:47.040 My goal is to assist Ruby developers in using the best tools for each job to address challenging problems with fewer bugs, thereby affording them more free time.

00:22:03.500 I use PyCall, which wraps Python's libpython, to integrate Python modules into my Ruby web applications. This streamlines mixing Ruby and Python code.

00:22:57.250 As a wrap-up, I have pointers to several valuable resources: Kenta Murata's 2017 presentation on data analysis in Ruby, other talks by prominent authors in the field, and curated lists containing numerous libraries and documentation for machine learning and data science in Ruby.

00:27:01.170 Thank you for your attention. If anyone has questions, I'm happy to answer them. If you'd like to connect, please scan the QR code for my contact information.