Forecasting the Future: An Intro to Machine Learning with Ruby

00:00:09.960 Hi! My name is Landon, as you can see. I'm a Senior Software Engineer at Test Double. You can find me on Twitter, LinkedIn, and all the other social media platforms. The title of my talk is "Forecasting the Future." I'm going to discuss using Ruby for machine learning: native Ruby, not Python libraries. This whole thing started with a question I had several years ago: Why aren't people doing machine learning in native Ruby? This question plagued me for quite some time.

00:00:35.360 If you're familiar with this field, you know a lot of people use Python, C++, or R; those are some common tools. However, I didn't see a reason why we couldn't use Ruby. The goal of this talk is to show you a machine learning project built in native Ruby so that all of you can go out and build your own machine learning projects.

00:01:09.720 Before we get into that, there are two things I want to discuss: tools and libraries. For tools, I ended up using something called Jupyter Notebook. Basically, it's your IDE for data science projects. You can write code in it, and many times in the Python world, people use Jupyter notebooks with Python. In this case, I have an IRuby kernel, which allows you to write Ruby code in the Jupyter notebooks. Here’s an example where I executed 1 + 1 in Ruby, resulting in 2.

00:01:40.119 I also wrote a function called 'hello' that prints 'Hello.' When executed, 'Hello' prints to the screen. All the work we will do today—building the model and executing code—will happen in this environment, basically our Visual Studio equivalent. There are a lot of cool things you can do with it.

00:02:08.680 In addition to tools, there are libraries that you can utilize in Ruby. The only part where I'm using Python code is when I call a library called Seaborn, a data visualization library. I'm printing some data points onto a graph. You get a lot of cool visualizations if you're working on a data science project, and you can share this Jupyter notebook with others at your company or here; they can run all the code you've written.

00:02:49.680 Next, let’s talk about libraries. I used three libraries for this project: Numo::NArray, Daru, and Rumale. Numo::NArray provides an n-dimensional array structure; this is the data structure you end up feeding into your model. Daru acts like a data frame: it’s two-dimensional, similar to an Excel spreadsheet with rows and columns, allowing for more efficient data processing. Finally, Rumale is a machine learning framework for Ruby. It contains various machine learning models.

00:03:41.159 Now, we need to start off with a problem. The problem we want to solve is predicting the weather, specifically the maximum temperature. This is the goal I set out to achieve. Now that we have our problem defined, we need to collect some data.

00:04:06.080 For this project, I accessed the National Centers for Environmental Information, where I found some valuable datasets. I downloaded the historical weather data for Atlanta Airport, which provides approximately 63 years of data from 1960 until the present. This data will be used to build our machine learning model to hopefully predict the weather for a given day.

00:04:48.760 Now that we have our data, we need to prepare it. When building a machine learning model, you can't simply download data from a website, throw it into the model, and expect it to output answers. Most of the data you acquire from external sources will have some issues that we must address.

00:05:18.639 Here is the CSV for the dataset I'm using. I'm importing this CSV into a Daru data frame and creating a duplicate of the data. This allows me to modify one set while keeping the original intact. I'll be dropping columns or rows in order to refine the data into a usable format for my machine learning model.
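
A minimal sketch of the import-and-duplicate step, using Ruby's standard `csv` library as a stand-in for the Daru data frame used in the talk. The sample rows, the column names, and the helper names `load_rows` and `duplicate_rows` are assumptions made for illustration, not the notebook's actual code.

```ruby
require 'csv'

# Hypothetical sample standing in for the downloaded weather CSV.
SAMPLE_CSV = <<~ROWS
  DATE,PRCP,SNOW,SNWD,TMAX,TMIN
  1960-01-01,0.00,0.0,0.0,58,41
  1960-01-02,0.12,0.0,0.0,61,44
ROWS

# Parse CSV text into an array of hashes, turning numeric-looking
# fields into Integer or Float values.
def load_rows(text)
  CSV.parse(text, headers: true, converters: :numeric).map(&:to_h)
end

# Copy every row hash so edits to the copy never touch the original.
def duplicate_rows(rows)
  rows.map(&:dup)
end

original = load_rows(SAMPLE_CSV)
working = duplicate_rows(original)

working.first['TMAX'] = 99
puts original.first['TMAX'] # still 58; only the working copy changed
```

For a file on disk, `CSV.read` accepts the same `headers:` and `converters:` options in place of `CSV.parse`.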

00:06:03.280 As for the dataset, when I downloaded it, I received a PDF document that listed all the parameters in the dataset. Some of the column names, such as 'PRCP' for precipitation, 'SNWD' for snow depth, and 'TMAX' for maximum temperature, are not self-explanatory, so having the definitions helps clarify their meanings. These columns will be crucial in predicting the maximum temperature.

00:07:00.960 There are more columns in the dataset, but these five core values are designated as essential for predicting the maximum temperature. I thought it would simplify the project to focus on these instead of using all 48 available columns. In my new data frame, I've selected only the precipitation, snowfall, and these core values, along with the respective dates.
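
Narrowing a wide dataset to a handful of columns might look like the sketch below, written with plain hashes rather than Daru. The column list is an assumption (the date plus five core values), and `AWND` and `WT01` only stand in for the columns being discarded.

```ruby
# Assumed column list: the date plus five core values.
CORE_COLUMNS = %w[DATE PRCP SNOW SNWD TMAX TMIN].freeze

# Keep only the core columns of every row.
def select_core(rows)
  rows.map { |row| row.slice(*CORE_COLUMNS) }
end

# One made-up row with two extra columns that should be discarded.
rows = [
  { 'DATE' => '1960-01-01', 'PRCP' => 0.0, 'SNOW' => 0.0, 'SNWD' => 0.0,
    'TMAX' => 58, 'TMIN' => 41, 'AWND' => 7.2, 'WT01' => nil }
]

p select_core(rows).first.keys
```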

00:08:08.440 Once I have the core data, the next step is cleaning it up. We will need to deal with any missing values in the dataset. After all, how can you build a model with a lot of nil values? Outliers are also a concern; for instance, if the average temperature is 60°F, but one record shows 120°F, we can't use that to build our model.

00:08:51.760 We also have to handle malformed data. If, by chance, a weather sensor recorded an erroneous temperature—for example, throwing in a nil or invalid character—those values must be resolved as well. Finally, we might encounter duplicate data entries for the same day, so we need to ensure that we only count each day once when building our model.
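
The three checks just described (duplicates, malformed readings, outliers) can be sketched in plain Ruby as follows. The sample rows and the plausible temperature range are assumptions chosen for illustration; a real threshold would depend on the station's climate.

```ruby
# Assumed plausible range for a daily high, in degrees Fahrenheit.
PLAUSIBLE_TMAX = (-20..110)

def clean(rows)
  rows
    .uniq { |row| row[:date] }                          # keep the first record per day
    .select { |row| row[:tmax].is_a?(Numeric) }         # drop malformed readings
    .select { |row| PLAUSIBLE_TMAX.cover?(row[:tmax]) } # drop out-of-range readings
end

rows = [
  { date: '1960-01-01', tmax: 58 },
  { date: '1960-01-01', tmax: 58 },  # duplicate entry for the same day
  { date: '1960-01-02', tmax: 61 },
  { date: '1960-01-03', tmax: 120 }, # implausible reading
  { date: '1960-01-04', tmax: 'x' }  # malformed value
]

p clean(rows).map { |row| row[:date] }
```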

00:09:10.960 I started by writing a method to check each row for nil values. If I found any, I decided to drop those rows. While dropping rows with nil values might not be the best approach for a production model, it's straightforward for this example. There are other techniques to deal with nils, such as taking the average of other values and imputing that average into nils. However, that depends on the nature of your data.
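
Both options mentioned here, dropping rows and imputing an average, fit in a few lines of plain Ruby. This is a sketch under assumed row shapes; the helper names `drop_nil_rows` and `impute_mean` are hypothetical and not taken from the talk's repository.

```ruby
# Option 1: drop every row that contains at least one nil.
def drop_nil_rows(rows)
  rows.reject { |row| row.values.any?(&:nil?) }
end

# Option 2: replace nils in one column with the mean of the known values.
def impute_mean(rows, column)
  known = rows.map { |row| row[column] }.compact
  mean = known.sum.to_f / known.size
  rows.map { |row| row[column].nil? ? row.merge(column => mean) : row }
end

rows = [
  { date: '1960-01-01', tmax: 58,  tmin: 41 },
  { date: '1960-01-02', tmax: nil, tmin: 44 },
  { date: '1960-01-03', tmax: 62,  tmin: 40 }
]

puts drop_nil_rows(rows).size        # 2 rows survive
puts impute_mean(rows, :tmax)[1][:tmax] # 60.0, the mean of 58 and 62
```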

00:10:29.239 Now, let’s discuss an advanced topic called feature engineering. For this project, I want to add a new column to our dataset. For example, if we're trying to predict the maximum temperature for a certain day, we shouldn't use that day's maximum temperature as input. Instead, we should use the previous day's maximum temperature to make a prediction.

00:11:19.000 To accomplish this, I wrote a method that shifts maximum temperature values down the dataset, using the previous day's maximum temperature as a feature. This is one of the steps required to refine and prepare our data properly.
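
One way to sketch that shift in plain Ruby is to pair each day with the day before it. The sketch assumes the rows are already sorted by date with no missing days; `add_previous_tmax` and the `:prev_tmax` key are hypothetical names.

```ruby
# Pair each day with the day before it and copy the earlier TMAX across.
# The first day has no predecessor, so it is left out of the result.
def add_previous_tmax(rows)
  rows.each_cons(2).map do |previous, current|
    current.merge(prev_tmax: previous[:tmax])
  end
end

rows = [
  { date: '1960-01-01', tmax: 58 },
  { date: '1960-01-02', tmax: 61 },
  { date: '1960-01-03', tmax: 57 }
]

p add_previous_tmax(rows)
```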

00:12:04.080 This process means that while machine learning seems glamorous, a large portion of your time will actually be spent on preparing and cleaning data. To highlight this, there's a rule known as the 80/20 rule: you will spend 80% of your time wrangling data and only 20% on model development.

00:12:41.000 Now let’s move on to that 20%, which is training the model. To train the model, we need to split our full dataset into two parts: a training set, which will be used to build the model, and a testing set to validate our predictions. Usually, I divide the data so that 80% goes into the training set and 20% into the testing set.
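
An 80/20 split can be sketched as below. The shuffle, the fixed seed, and the helper name `train_test_split` are assumptions; the talk does not say whether its rows were shuffled before splitting.

```ruby
# Split rows into training and testing sets; ratio is the training share.
# A seeded shuffle keeps the split reproducible between runs.
def train_test_split(rows, ratio: 0.8, seed: 42)
  shuffled = rows.shuffle(random: Random.new(seed))
  cut = (shuffled.size * ratio).floor
  [shuffled.first(cut), shuffled.drop(cut)]
end

rows = (1..10).to_a # stand-in for ten prepared records
training, testing = train_test_split(rows)
puts training.size # 8
puts testing.size  # 2
```

For time-ordered data such as daily weather, cutting chronologically instead of shuffling is a common alternative.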

00:13:31.080 The model I used to build this project is linear regression, which attempts to model the relationship between variables by fitting a linear equation to the observed data. We can summarize this relationship with the familiar formula for a line: y = mx + b.

00:14:06.560 Using the training data, our input features would be the minimum temperature, snowfall, and other core values, while the outcome we're predicting will be the maximum temperature for that day. The model will establish a best-fit line, minimizing the distance between the predicted and actual values.
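
With several input features, the line y = mx + b generalizes to a weighted sum, and "minimizing the distance" is usually taken to mean minimizing the squared error. The notation below is assumed for illustration and does not come from the talk: the x values are the input features, y-hat is the predicted maximum temperature, and n is the number of training rows.

```latex
\hat{y} = b + m_1 x_1 + m_2 x_2 + \dots + m_k x_k,
\qquad
\min_{b,\, m_1, \dots, m_k} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```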

00:14:56.040 Once we have inputted the training values into the linear regression model, it will analyze all the data points to determine how to draw that best-fit line through them. This is the magic of building a model: after analyzing the relationships, it will try to generate accurate predictions based on the input values.
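
To make the best-fit line concrete, here is a hand-rolled ordinary least squares fit for a single input feature. It is only a sketch: the talk's model uses several features through a machine learning library, while this version uses one made-up feature and a hypothetical helper named `fit_line`.

```ruby
# Fit y = m * x + b by ordinary least squares for a single input feature.
def fit_line(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  variance = xs.sum { |x| (x - mean_x)**2 }
  m = covariance / variance
  b = mean_y - m * mean_x
  [m, b]
end

# Made-up data: previous day's high (x) against the day's high (y).
prev_tmax = [50.0, 55.0, 60.0, 65.0]
tmax      = [52.0, 56.0, 60.0, 64.0]

m, b = fit_line(prev_tmax, tmax)
predict = ->(x) { m * x + b }
puts predict.call(70.0) # prediction for a day following a 70-degree high
```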

00:15:42.560 For testing, after building the model, we will feed back some of the separated data into the model to verify its predictions. This means we will input data where the actual outcome is known and compare the model’s predicted results against these known values to evaluate its accuracy.
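
Comparing predictions against known outcomes needs some error measure. The talk does not name one, so the sketch below assumes mean absolute error, which reads directly as "degrees off on average". The numbers are invented.

```ruby
# Mean absolute error: the average size of the prediction misses.
def mean_absolute_error(actual, predicted)
  total = actual.zip(predicted).sum { |truth, guess| (truth - guess).abs }
  total / actual.size.to_f
end

# Hypothetical held-out highs and what a model predicted for those days.
actual    = [60.0, 65.0, 70.0]
predicted = [62.0, 64.0, 73.0]

puts mean_absolute_error(actual, predicted) # 2.0
```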

00:16:33.600 So far, we've discussed how to set up a problem, collect the data, prepare it, train the model, and make predictions. Thank you for your attention. I have a special gift for you: if you visit this URL, you can access the entire repository I used to build this machine learning model. It outlines each step involved in creating the models.

00:17:16.720 Within the repository, you can follow along, tweak things, or even use alternative machine learning algorithms. While my project uses linear regression, you can experiment with other models like neural networks or K-nearest neighbor algorithms to determine which performs better. It's important to note, however, that linear regression might not be the best choice for predicting weather.
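
As an illustration of one such alternative, a K-nearest neighbor regressor can be sketched in plain Ruby. This is not code from the repository; the feature layout, the sample values, and the helper name `knn_predict` are all assumptions.

```ruby
# Predict by averaging the targets of the k closest training points,
# using Euclidean distance between feature vectors.
def knn_predict(train_x, train_y, query, k: 3)
  distances = train_x.each_with_index.map do |features, index|
    [Math.sqrt(features.zip(query).sum { |a, b| (a - b)**2 }), index]
  end
  nearest = distances.min_by(k) { |distance, _index| distance }
  nearest.sum { |_distance, index| train_y[index] } / nearest.size.to_f
end

# Made-up features: [previous day's high, today's low].
train_x = [[50.0, 35.0], [55.0, 40.0], [60.0, 45.0], [80.0, 65.0]]
train_y = [52.0, 57.0, 61.0, 83.0]

puts knn_predict(train_x, train_y, [56.0, 41.0], k: 2)
```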

00:18:14.320 And here's a nice newsletter from Test Double that you can subscribe to if you'd like to receive more information on topics related to programming and software development. I've made several updates to this talk since I presented it previously.

00:18:48.480 The repository now includes a Docker container that allows you to reproduce the project easily. Keep in mind that it is fairly large, as it is not built from scratch and comes in at several gigabytes. Nevertheless, I believe that this project is helpful and informative.

00:19:31.680 I welcome any questions now. What would you like to know?

00:19:55.360 Audience member: How many rows were in the data?

00:20:01.280 Landon: Oh, I skipped that detail. There are about 23,000 rows in the dataset.

00:20:25.360 Audience member: How did you determine that 80% of the data should be used for training and 20% for testing?

00:20:42.800 Landon: This split is generally a rule of thumb in data science. Typically, you want to use more data for training than for testing to allow the model enough exposure to diverse inputs to learn effectively.

00:21:15.360 Audience member: Why wasn't Ruby popular for machine learning?

00:21:30.640 Landon: There are two main reasons. First, many crucial libraries for machine learning were developed in Python, giving it a head start and stronger community support. Second, there was less focus on Ruby within the data science community, which has traditionally leaned toward Python, making Ruby less appealing for data science applications.

00:22:31.480 Audience member: Can you explain how your model makes predictions for specific dates?

00:22:50.480 Landon: The model uses the training data that includes maximum temperatures from prior days to predict future temperatures. It takes the most recent maximum temperatures and utilizes them alongside other features from previous days' records.

00:23:45.720 Landon: I appreciate your attention today. Feel free to reach out if you have any more questions!