Data Science in Ruby? Is it possible? Is it Fast? Should we use it?

Ruby

Rodrigo Urubatan

In this presentation, Rodrigo Urubatan explores the possibilities and challenges of utilizing Ruby for data science. Despite Python being the dominant language in this field, the talk emphasizes that Ruby can also be a viable option, depending on specific needs. The discussion is structured around several key points:

Definition of Data Science: The term is subjective, varying by company and context; commonly viewed as extracting insights or using statistics and machine learning for data manipulation.
Ruby's Capabilities: Ruby has libraries for data science, including those for integration with Python and R, data visualization, and statistics. Notable libraries include Kiba for data manipulation, PyCall for using Python in Ruby code, and Daru for creating DataFrames.
Tools and Libraries: The talk highlights the diversity of Ruby libraries such as rb-gsl for statistics and Matplot for data visualization. Despite this, performance issues are noted in some libraries, most notably NMatrix, while NArray performs well.
Current State of Ruby in Data Science: Urubatan describes various projects such as SciRuby and Ruby Numo that aim to facilitate data science tasks in Ruby. A new project, Red Data Tools, is mentioned for its potential to enhance interoperability within Ruby data science libraries.
Performance Comparisons: Comparisons with Python libraries reveal a significant performance gap; for instance, summing numbers with Ruby's NMatrix is notably slower than Python’s NumPy. This could present challenges in production environments.
Recommendation: While Ruby can handle data science tasks, it's best suited for web and business applications. Integrating Ruby with Python for more intensive data tasks could enhance performance and efficiency.

In conclusion, while it is possible to use Ruby for data science, developers must weigh the performance issues and consider integrating existing Python tools to make the most of their data science efforts. Urubatan aims to help Ruby developers choose the best tool for each job, so they can solve hard problems with fewer bugs. Resources for further learning were also shared at the end of the presentation, including past talks and library documentation.

00:00:06.610 Good morning, everyone.

00:00:08.920 My name is Rodrigo Urubatan, and I will try to be fast. If anyone wants to download these slides to follow up while I'm talking, just scan this QR code.

00:00:30.320 Today, we'll discuss data science in Ruby: if we can do it, if it's fast, if we should use it, and when to use it.

00:00:43.850 I want to know how many people here work with data science—data scientists, data engineers, or anyone who writes applications that utilize data. Let's raise our hands. Let's start by defining what data science is. This can be a tricky question, as every company has a different definition.

00:01:12.030 Some define data science as the process of extracting meaning and interpreting data. Others see it as using statistics and machine learning to clean and manipulate data. Additionally, some argue that any use of computer software to collect, clean, and manipulate data qualifies as data science. It has even become a buzzword that combines data mining and business intelligence.

00:01:31.640 Those terms have been around for at least 20 years, maybe longer, but they are often used to justify expensive tools and cheaper workforce. Personally, I prefer when tools are cheaper and professionals are more valued.

00:02:26.440 So, can Ruby handle data science? The quick answer is yes. We have libraries for almost everything needed for data science. However, as with any software-related question, the best answer is: it depends.

00:02:46.730 In Ruby, we have libraries to integrate with other tools like R and Python. We have data manipulation libraries and gems for distributed computing, as well as gems for structured data like Daru. We will see examples from these groups later. We also have ready-made datasets such as the Iris dataset, which has been widely used in machine learning.

00:03:24.620 Moreover, we have libraries for statistics, data visualization, and interactive computing. Some tools are excellent, while others may not perform as well. For example, some gems work well together, whereas others do not. There are gems that perform the same function, but one can be significantly faster than the other. It is crucial to define what we want to accomplish and how to integrate various tools, which can often pose a tricky challenge.

00:04:50.660 Interactive computing is exemplified by using Jupyter Notebooks, where you can write code samples, execute them, and see results immediately. This feature is helpful when evaluating data cleanup tasks or for teaching purposes.

00:05:48.940 We have libraries that allow Ruby to work with Python. I discovered a library called PyCall, created by Kenta Murata, that lets you write Ruby code using Python modules as if they were Ruby libraries. It's impressive and solves many problems efficiently.

00:06:28.370 There are libraries for data manipulation like Kiba, which was a significant help in a project where I collected data from five different sources, including databases and files. This library allowed for declarative data cleaning and integration, helping me identify bugs more easily. Similarly, Jongleur is useful for integrating multiple ETL jobs.
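
The declarative source, transform, destination style described here can be sketched in plain Ruby. This is only an illustration of the idea and is not Kiba's API: the `Pipeline` class, its method names, and the sample rows are all hypothetical, and only the standard library is used.

```ruby
# Minimal ETL sketch in plain Ruby (standard library only).
# NOTE: this is NOT Kiba's API; Pipeline and its methods are hypothetical
# names used only to illustrate the source -> transform -> destination idea.
class Pipeline
  def initialize
    @sources = []
    @transforms = []
  end

  # Register anything that responds to #each and yields row hashes.
  def source(rows)
    @sources << rows
    self
  end

  # Register a block that returns a (possibly modified) row, or nil to drop it.
  def transform(&block)
    @transforms << block
    self
  end

  # Push every row from every source through the transforms, in order.
  def run(destination)
    @sources.each do |rows|
      rows.each do |row|
        result = @transforms.reduce(row) do |current, step|
          break nil if current.nil? # an earlier step dropped the row
          step.call(current)
        end
        destination << result unless result.nil?
      end
    end
    destination
  end
end

# Two made-up sources standing in for a database and a file.
db_rows  = [{ name: " Alice ", age: "31" }, { name: "Bob", age: nil }]
csv_rows = [{ name: "carol", age: "27" }]

clean = Pipeline.new
  .source(db_rows)
  .source(csv_rows)
  .transform { |row| row[:age].nil? ? nil : row }                   # drop incomplete rows
  .transform { |row| row.merge(name: row[:name].strip.capitalize) } # normalise names
  .transform { |row| row.merge(age: Integer(row[:age])) }           # cast ages to integers
  .run([])

p clean
```

Keeping each cleaning rule in its own small step is what makes bugs easier to locate: a bad row can be traced to the single transform that produced it.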

00:07:33.350 However, libraries for distributed computing are where things start to get complicated. Projects like Apache Spark are excellent, but the integration libraries for Ruby and JRuby have not seen updates in the last three years. If you're already using them, that's great, but if you're starting something new, they're not the ideal choice due to open bugs and lack of maintenance.

00:08:15.029 We also have libraries for a variety of data structures. For instance, Daru is an implementation of DataFrames, which I believe is the most crucial data format for data science since it allows for easy data manipulation.
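
To make the DataFrame idea concrete, here is a minimal plain-Ruby sketch of a table made of named, equally long columns with a row filter. It is not the API of any DataFrame gem; `TinyFrame`, its methods, and the sample values are hypothetical and exist only for illustration.

```ruby
# A DataFrame is essentially a table of named, equally long columns.
# NOTE: plain-Ruby illustration only; TinyFrame is a hypothetical name and
# does not mirror the API of any DataFrame library.
class TinyFrame
  attr_reader :columns

  def initialize(columns)
    lengths = columns.values.map(&:size).uniq
    raise ArgumentError, "columns must have equal length" if lengths.size > 1

    @columns = columns
  end

  def row_count
    first = @columns.values.first
    first ? first.size : 0
  end

  # Return one row as a hash of column name => value.
  def row(index)
    @columns.transform_values { |values| values[index] }
  end

  # Keep only the rows for which the block returns true.
  def where
    keep = (0...row_count).select { |i| yield row(i) }
    TinyFrame.new(@columns.transform_values { |values| values.values_at(*keep) })
  end
end

frame = TinyFrame.new({ name: %w[a b c], score: [10, 25, 40] })
high  = frame.where { |r| r[:score] > 20 }

p high.columns
```

Storing data by column rather than by row is what makes whole-column operations (sums, means, filters) straightforward to express.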

00:09:26.639 There are multiple libraries like Numo, NMatrix, and NArray that are similar. NArray performs excellently, while NMatrix has performance issues that have been known for two years without a resolution.

00:10:24.120 We have several ready-made datasets available for use in applications. Projects like the Ruby DataSets project offer a growing collection of datasets gathered from diverse sources.

00:10:57.209 In statistics, we have various libraries, including rb-gsl, which serves as an interface to the GSL scientific library, and enumerable-statistics by Kenta Murata, which efficiently performs simple calculations on arrays or Active Record results.
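
As a rough idea of what "simple calculations on arrays" means, the following plain-Ruby sketch computes a mean, a sample variance, and a standard deviation. The `SimpleStats` module and the sample values are hypothetical; this is not the API or the implementation of any of the gems mentioned, which are expected to be faster and more numerically careful.

```ruby
# Plain-Ruby sketch of simple descriptive statistics on an array.
# NOTE: SimpleStats is a hypothetical helper for illustration; it is not the
# API or implementation of any statistics gem. Inputs are assumed non-empty
# (and to have at least two values for variance and stdev).
module SimpleStats
  module_function

  def mean(values)
    values.sum.to_f / values.size
  end

  # Sample variance (divides by n - 1).
  def variance(values)
    m = mean(values)
    values.sum { |v| (v - m)**2 } / (values.size - 1)
  end

  def stdev(values)
    Math.sqrt(variance(values))
  end
end

values = [2, 4, 4, 4, 5, 5, 7, 9] # made-up sample data

puts format("mean:     %.3f", SimpleStats.mean(values))
puts format("variance: %.3f", SimpleStats.variance(values))
puts format("stdev:    %.3f", SimpleStats.stdev(values))
```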

00:11:47.040 We also have numerous libraries for data visualization, such as Matplot, which is great for rendering visual outputs. I often use Daru, which is integrated with several libraries and works nicely with Jupyter Notebooks.

00:12:39.450 Now, let's discuss the current state of data science in Ruby. We face both advantages and challenges. For starters, all data science in Python generally revolves around the SciPy project. Conversely, in Ruby, we have three different projects offering diverse approaches.

00:13:15.230 Most data science tools can be found under the SciRuby project, which has numerous matrix-centric gems and the Daru project for DataFrames, along with Nyaplot for plotting and many other libraries.

00:13:41.690 We also have the Ruby Numo project, which focuses on array-centric gems. However, a significant drawback is that there are relatively few developers actively contributing to data science in Ruby.

00:15:00.650 A new initiative, the Red Data Tools project, was established for interoperability between these projects, designed with Apache Arrow as the backend. This is promising since the next version of Pandas will also adopt this backend, facilitating smoother data reading and manipulation across libraries.

00:15:59.650 Despite these tools, challenges persist. NMatrix has performance issues unresolved for two years, while NArray performs quicker but lacks compatibility with Daru.

00:16:49.190 While both the SciRuby and Ruby Numo projects are functional, the Red Data Tools project is still evolving, and currently lacks data manipulation libraries, focusing mainly on input-output functionalities.

00:17:32.580 In summary, working with data science in Ruby presents its difficulties. Some tools function excellently while others do not, and resources are scarce.

00:18:07.120 Python undeniably remains the crown jewel of data science, with Ruby offering mostly equivalent functionality. For instance, both Daru and NMatrix in Ruby parallel their counterparts, Pandas and NumPy in Python, albeit with less documentation.

00:18:50.480 To illustrate, I tested performance by summing 50,000 random numbers using NMatrix, which took 0.3 seconds compared to 0.003 seconds with NumPy. Similarly, creating a simple DataFrame with 50,000 entries using Pandas took 708 microseconds, while the Daru equivalent was 57 times slower, taking about 40 milliseconds.
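
A measurement of this kind can be sketched with Ruby's standard `Benchmark` module. The sketch below is an assumption about how such a test might be set up, not the speaker's benchmark: it sums a plain `Array` rather than an NMatrix, so its timing is not comparable to the figures quoted above. The second half only redoes the arithmetic on the quoted numbers.

```ruby
require "benchmark"

# Sum 50,000 random numbers with a plain Ruby Array and time it.
# NOTE: this uses Array#sum from core Ruby, not NMatrix or NArray, so the
# timing is not comparable to the figures quoted in the talk.
numbers = Array.new(50_000) { rand }
total = nil
elapsed = Benchmark.realtime { total = numbers.sum }
puts format("summed %d numbers in %.6f s", numbers.size, elapsed)

# Slowdown factors implied by the quoted figures.
nmatrix_vs_numpy = 0.3 / 0.003    # seconds / seconds
daru_vs_pandas   = 40_000.0 / 708 # microseconds / microseconds
puts format("NMatrix vs NumPy: about %.0fx", nmatrix_vs_numpy)
puts format("Daru vs Pandas:   about %.1fx", daru_vs_pandas)
```

The second ratio comes out slightly below 57 because "about 40 milliseconds" is a rounded figure. A single run like this is noisy; repeating the measurement several times and comparing medians gives a fairer picture.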

00:20:18.720 Though it is possible to execute these operations in Ruby, the substantial performance differences—57 times in particular—can be problematic in production scenarios.

00:20:54.890 I find Ruby on Rails to be better suited for building web and business applications, although there is potential to perform machine learning tasks within Ruby. That discussion, however, would warrant a separate presentation.

00:21:47.040 My goal is to assist Ruby developers in using the best tools for each job to address challenging problems with fewer bugs, thereby affording them more free time.

00:22:03.500 I use PyCall, which wraps Python's libpython, to call Python modules from within my Ruby web applications, which makes it easier to mix Ruby and Python code.

00:22:57.250 As a wrap-up, I have pointers to several valuable resources: Kenta Murata's 2017 presentation on data analysis in Ruby, other talks by prominent authors in the field, and curated lists containing numerous libraries and documentation for machine learning and data science in Ruby.

00:27:01.170 Thank you for your attention. If anyone has questions, I'm happy to answer them. If you'd like to connect, please scan the QR code for my contact information.

Rodrigo Urubatan

@urubatan

RubyConf TH 2019