RailsConf 2023

Forecasting the Future: An Introduction to Machine Learning for Weather Prediction

by Landon Gray

In this video titled "Forecasting the Future: An Introduction to Machine Learning for Weather Prediction," speaker Landon Gray presents a compelling case for utilizing Ruby in the machine learning space, particularly in weather prediction applications. This talk is tailored for both seasoned Ruby developers and those new to the language, aiming to demystify the process of creating machine learning models in Ruby instead of the more commonly used Python.

Key points discussed in the presentation include:

- Introduction to Machine Learning in Ruby: Landon emphasizes the accessibility of machine learning in Ruby, encouraging developers who favor Ruby over Python to explore AI and ML concepts in their native language.

- Project Overview: He outlines a project where the goal is to predict the maximum temperature using historical weather data from the Atlanta airport dating back to 1960.

- Data Collection and Preparation:

  - The dataset is collected from the National Centers for Environmental Information and consists of approximately 20,000 rows and 48 columns.

  - Data cleaning is emphasized: significant time is spent preparing the dataset, typically 80% of the overall project time.

  - Landon discusses handling missing values, outliers, and duplicates to ensure the data is ready for analysis and model training.

- Machine Learning Libraries in Ruby: He introduces three key libraries:

  - Numo: A numerical array class for fast data processing.

  - Daru: Provides data frame structures for analysis and visualization, akin to Python's Pandas.

  - Rumale: A gem that implements a variety of machine learning algorithms.

- Model Training and Prediction: Landon explains the implementation of a linear regression model, highlighting how it models the relationship between variables by fitting a linear equation.

  - He provides clear, simple coding examples for training the model and making predictions based on the prepared data.

- Integration with Rails: The talk concludes with insights on how to integrate the trained ML model into a Rails application for real-time predictions. Landon encourages developers to adopt machine learning in Ruby and to iterate on his presented project using their data.

In conclusion, Landon Gray reiterates that creating machine learning models in Ruby is not only feasible but also accessible to all developers interested in exploring AI applications, reinforcing the idea that extensive academic experience is not a prerequisite for engaging with machine learning. To support these endeavors, he shares that the project will be available on GitHub for attendees to download, adapt, and utilize.

This talk serves as an introduction and encouragement for Rubyists to venture into machine learning, fostering a community of learners and practitioners.

00:00:19.160 Welcome! We have a little bit of technical difficulties, but things seem to be rolling right now. Thank you for coming to my talk. Oops, wrong way.
00:00:29.460 Hi, I'm Landon. I'm a Senior Monkey Patcher at Test Double. That is a title I came up with for myself; officially, I'm a senior software consultant at Test Double.
00:00:36.899 If you'd like to reach out to me, I'm on LinkedIn, Mastodon, and the Bird app.
00:00:52.200 The reason I'm giving this talk is that several months ago, maybe a year ago, I started thinking about machine learning, AI, and Ruby. A lot of people are engaging in machine learning using Python, and I was curious as to why no one is doing it in Ruby. That's what I want to do, as it's my native language.
00:01:01.739 I don't want to have to write Python every time I want to do something in my main programming language. So, this talk will walk through an entire project that I undertook. I have a gift for you all at the end, but I will walk through the project and demonstrate how to execute machine learning projects so you can do it in Ruby without needing to learn a bunch of Python.
00:01:16.979 To start, here is the agenda for the talk: I'm going to set up a problem, collect some data, perform data preparation, train our own machine learning model, and then make some predictions.
00:01:56.579 Before we get to that, I want to discuss two things: tools and libraries. As developers, one of our main tools is our code editor, but when it comes to data science work, one of the main tools is Jupyter Notebooks. Jupyter Notebooks allow you to build out your data science project in a way that is shareable and lets you execute code in a top-down approach.
00:02:20.120 Traditionally, Jupyter Notebooks use Python, so you write your Python code in the notebook and then execute the code within it. But here, we will execute Ruby using a tool called iRuby.
00:02:46.980 For example, in Ruby, I defined a method that just prints 'Hello, World!' You can execute it sequentially in a Jupyter Notebook. You can also incorporate some really cool visualization tools. Here is the only bit of Python code in this presentation, as I call a Python library for visualization purposes. However, there are also Ruby gems for visualization. I want to show you that you can have visualizations, download the file, and share it with your business stakeholders to showcase the entire project you've completed.
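As a rough sketch of what such a notebook cell looks like (the method name here is illustrative, not from the talk):

```ruby
# A cell in an iRuby-powered Jupyter notebook runs plain Ruby, top to bottom.
def hello
  puts 'Hello, World!'
end

hello  # executing the cell prints the greeting beneath it
```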
00:03:26.540 Next, I want to discuss libraries. For this machine learning project, I'm using three libraries called Numo, Daru, and Rumale. Numo is a numerical multidimensional array class for fast data processing and easy manipulation. Daru is a gem that provides a data structure called a DataFrame, which facilitates analysis, manipulation, and visualization of data.
00:03:40.560 I'm not sure how familiar you are with Python, but Numo and Daru have analogous libraries in Python: NumPy and Pandas, respectively. Rumale, on the other hand, allows you to utilize various machine learning algorithms. First, we will set up the problem.
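A minimal sketch of what each gem provides, assuming the numo-narray, daru, and rumale gems are installed:

```ruby
require 'numo/narray' # numerical arrays, the NumPy analogue
require 'daru'        # data frames, the Pandas analogue
require 'rumale'      # machine learning algorithms

# Numo: element-wise math over a typed numerical array
x = Numo::DFloat[1.0, 2.0, 3.0]
(x * 2.0).to_a # => [2.0, 4.0, 6.0]

# Daru: a small labeled data frame
Daru::DataFrame.new(tmin: [50, 55], tmax: [70, 75])
```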
00:04:02.520 I want to predict the weather, as I think that's super cool. Specifically, I want to predict the maximum temperature for a weather dataset. To achieve this, we first need to collect our data. I sourced a dataset from the National Centers for Environmental Information, which offers a plethora of downloadable weather data. Specifically, I downloaded the weather dataset for the Atlanta airport, dating back to the 1960s, as I thought it would be interesting to analyze.
00:04:59.699 So let's predict the maximum temperature based on some given input. Next, we will prepare the data. After collecting our dataset, we will import it into our Jupyter Notebook. We'll observe that there are approximately 20,000 rows and about 48 columns. The next step involves duplicating this data. When working on a data science project, it is essential not to alter the original data during your modifications, as you may need to reference it later.
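A hedged sketch of that import-and-duplicate step with Daru (the filename is a placeholder for the NCEI export):

```ruby
require 'daru'

# Load the downloaded NCEI CSV; the filename here is illustrative.
df = Daru::DataFrame.from_csv('atlanta_airport_weather.csv')
df.shape # => roughly [20_000, 48] for the full export

# Work on a copy so the original frame stays untouched.
data = df.dup
```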
00:05:50.400 If you drop columns and only retain a few, you might lose the ability to reference the dropped columns later. Therefore, I'm creating a duplication to continue my project while ensuring the original data remains unchanged. Now, I will drop all columns except five, which represent core values defined in the dataset.
00:06:00.240 I'm simplifying my project for this example by using these five core values as the predictors to forecast the maximum temperature. I will create a new DataFrame that includes only the necessary columns. Each dataset is structured with columnar data, and we will preview the top five rows to understand how the data is formatted.
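Something like the following selects those columns and previews the top rows (the column names assume the standard GHCN-Daily core elements; depending on your Daru version, CSV headers may come through as strings or symbols):

```ruby
# The five core GHCN-Daily elements: precipitation, snowfall,
# snow depth, and the max/min temperatures.
core    = ['PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN']
weather = data[*core]   # new DataFrame containing only these vectors

weather.head(5)         # preview the first five rows
```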
00:06:28.620 As part of the data cleanup process, you have to clean the data before applying it to a machine learning algorithm. This task is essential and often time-consuming. You may come across missing values; you'll need to determine whether to replace them with zeros, which may skew your dataset, or drop the rows entirely.
00:06:49.139 Alternatively, you could employ a method known as imputing, which replaces the missing values with the averages of the specific column. Additionally, you might encounter outliers that can significantly impact your model's performance, such as a single datum of one million amidst figures ranging from one to one hundred.
00:07:04.500 Furthermore, malformed data may arise, including misspellings or duplicate rows. Cleaning this data is a crucial part of the process. Utilizing the Daru library, I had to clean up some data by dropping the nil rows. The code implementation can be quite tedious, and I'm not entirely satisfied with it, as there are libraries that provide nice functions to drop null values.
00:07:30.240 However, I chose this route for this demonstration. Consequently, I discovered that data cleaning is extremely time-consuming—so much so that there's a term called the 80/20 rule in data science. This rule states that you will generally spend about 80 percent of your time cleaning and manipulating data while only 20 percent is spent building and tweaking models. Fortunately, we are already 80 percent of the way through the process, as we now will proceed to train and make predictions with our model.
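A minimal sketch of that nil-dropping step in Daru, with mean imputation shown as a commented-out alternative:

```ruby
# Keep only the rows that contain no missing values.
clean = weather.filter_rows { |row| row.to_a.none?(&:nil?) }

# Alternative: impute missing values with the column mean instead.
# tmin_mean = weather['TMIN'].mean
```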
00:08:04.560 When training a model, it's necessary to split the dataset before training. Typically, about 80 percent of the dataset is used for training while the remaining 20 percent is reserved for testing. The training data is used to develop the model; you will want some data points for validation to test your model's functionality, which is the role of your testing dataset.
00:09:08.400 In my case, I divided the data, taking the first 80 percent of the rows as the training dataset and the final 20 percent as the testing dataset. Since I'm using linear regression, it's worth discussing the model I selected. Linear regression is a straightforward model and is easy to grasp because many of us have been exposed to algebra. Essentially, linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data.
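A sketch of that index-based 80/20 split, assuming the cleaned frame from the previous step:

```ruby
n_rows     = clean.nrows
train_size = (n_rows * 0.8).floor

train = clean.row[0...train_size]      # first 80% of rows
test  = clean.row[train_size...n_rows] # final 20% of rows
```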
00:10:28.080 You may recognize the equation 'y = mx + b,' which represents the equation of a line. For clarity, I prefer to write it as 'f(x) = mx + b,' which aligns with how we view functions and methods in programming. In this context, we can input all the values we wish to use to predict—including precipitation and snowfall—into the x values, and the model will provide a prediction for the resulting y value, which corresponds to the maximum temperature.
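With several predictors rather than a single x, the same idea generalizes to one weight per input plus an intercept, which is the form the model actually fits here:

$$f(x_1, \dots, x_n) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$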
00:11:40.920 For illustration, let’s say our dataset exhibits a linear pattern when plotted. In this scenario, a linear regression model will seek to plot a line that closely aligns with all the data points. While this method is a common approach, there are other machine learning models that can trace through different data points yielding more precise predictions. Thus, I encourage you to explore different models in my project and provide feedback on any improvements.
00:12:19.640 To build the linear regression model, the process is remarkably simple; you only need the X values from precipitation, minimum temperature, snowfall, and others, to fit your data and generate a linear regression model. Once that’s complete, you will have your model, and then we can close out the session.
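A hedged sketch of that fit step with Rumale, assuming the Daru frames built above (the exact conversion to Numo arrays may vary by gem version):

```ruby
require 'rumale'
require 'numo/narray'

# Convert the Daru frames to Numo arrays: predictors as a
# [n_samples, n_features] matrix, target as a flat vector.
x_train = Numo::DFloat[*train['PRCP', 'SNOW', 'SNWD', 'TMIN'].to_matrix.to_a]
y_train = Numo::DFloat[*train['TMAX'].to_a]

model = Rumale::LinearModel::LinearRegression.new
model.fit(x_train, y_train)
```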
00:13:05.760 How does Rails factor into this? In developing applications, you might wonder how to utilize this project. The predictions made with the model can be implemented directly into your Rails app. By providing the model with test data, you can use a predict function that will output the desired y value.
00:14:02.520 Theoretically, if you were writing a Rails application with the preceding implementations, you could wrap the shown code within a method for repeated use to generate predictions for your users. I find this capability to be quite nifty.
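One way such a wrapper might look (the class and method names are hypothetical, not from the talk):

```ruby
require 'numo/narray'

# Hypothetical Rails-side wrapper around the trained Rumale model.
class TemperaturePredictor
  def initialize(model)
    @model = model
  end

  # features: [precipitation, snowfall, snow_depth, min_temp]
  def predict_max_temp(features)
    @model.predict(Numo::DFloat[features]).to_a.first
  end
end

# e.g. TemperaturePredictor.new(model).predict_max_temp([0.0, 0.0, 0.0, 58.0])
```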
00:14:51.120 In summary, we set up the problem, collected data, prepared that data, trained our model, and made predictions. That's essentially the comprehensive process.
00:15:32.520 I would like to thank a few individuals and organizations: Test Double and Andrew Kane, who has been contributing significantly to machine learning in the Ruby space. You should check out his blog. I've also taken courses from Great Learning to help build out Python projects and learn how to adapt them for Ruby.
00:16:01.080 Additionally, I have a special gift for you—I mentioned at the beginning. I published the project on GitHub so that you can examine it. My ultimate goal is for people to download the project, tweak it, and apply it to their own datasets, whether they are work-related or just for personal experiments.
00:16:35.220 This project is not overly complicated, and I'm eager for you all to utilize your datasets and create something unique. Ultimately, to see more machine learning incorporated into Ruby and Rails, it's necessary for all of you to start working on projects. It's important to note that you do not need an advanced knowledge of the subject to engage with this; while there’s an academic perspective, there’s also space for those who simply wish to experiment and play.
00:17:48.480 Lastly, I encourage you to sign up for our mailing list at Test Double, and I would love to leave some time for questions. I see we have time for Q&A, so feel free to ask!
00:18:30.120 Regarding the x values, those would include the parameters we established earlier, specifically the precipitation, snowfall, snow depth, and the minimum and maximum temperatures. For simplification, I reduced the predictors to essential values, which you can follow in the project.
00:19:02.520 To clarify, I also utilized the maximum temperature from prior data to forecast the following day's maximum temperature, as several of these values are interrelated, and I hope that contextualizes the approach. It's essential to acknowledge that while this logic may work in a simple project like this, there are more sophisticated models better suited for comprehensive forecasting.
00:19:46.800 In terms of using the model itself, yes, that is an important consideration. There are ways to export the trained model for practical use. Once you have trained the model and deployed it in your application, you wouldn't need to retrain it unless new data warranted it; every prediction uses the same trained model, so you only need to adapt your application to feed new data to it as it arrives.
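One common approach to that export step, which Rumale's estimators support, is Ruby's built-in Marshal serialization; a minimal sketch (filenames illustrative):

```ruby
# Persist the trained model once...
File.binwrite('max_temp_model.dat', Marshal.dump(model))

# ...then reload it later, e.g. in a Rails initializer.
model = Marshal.load(File.binread('max_temp_model.dat'))
```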