RailsConf 2017

Predicting Titanic Survivors with Machine Learning

by Ju Liu

In this video titled "Predicting Titanic Survivors with Machine Learning," Ju Liu speaks at RailsConf 2017 about employing machine learning techniques to analyze historical data from the Titanic disaster. Liu begins by introducing himself, sharing his experience with Ruby and machine learning. He outlines the talk's goal, which is to explore the survival rates of Titanic passengers using real data, live coding with Python libraries.

Key points discussed include:

  • Historical Data Overview: The talk begins with a brief introduction to the Titanic's passenger data, containing 891 passenger records with various attributes such as survival status, passenger class, age, and fare.
  • Data Visualization: Liu utilizes libraries like pandas and matplotlib to load and visualize the data, demonstrating the survival rates through bar graphs and scatter plots. He shows that approximately 40% of passengers survived and discusses age distribution in relation to survival.
  • Passenger Class Analysis: An analysis is conducted on the distribution of passenger classes, revealing that an overwhelming number were from third class, which aligns with expectations based on historical context.
  • Gender Impact on Survival: The talk highlights significant survival discrepancies based on gender, with females having a much higher survival rate compared to males. Liu emphasizes the importance of this factor while analyzing the data.
  • Machine Learning Model Predictions: Liu introduces basic prediction heuristics, such as predicting survival based solely on gender. He then uses the scikit-learn library to apply various algorithms, including logistic regression and decision trees, which yield success rates from around 78% to 98%. Liu emphasizes the importance of proper model training and avoiding overfitting in machine learning.
  • Conclusion: The session concludes with Liu summarizing the main findings, affirming that the movie aligns with historical accuracy regarding survival, especially between classes and genders.

Overall, this session provides an insightful, practical understanding of machine learning principles through a relatable case study of the Titanic, backed by real data analysis and visualization techniques.

00:00:11.929 Hi everyone, I'm super happy to be here! I've been using Ruby on Rails for ages, and it feels great speaking at RailsConf. You should follow me on Twitter; I think you can follow me just like everyone else does. Yes, I am actually Italian. I can speak Italian and cook various Italian dishes, which I think makes me pretty much Italian.
00:00:25.500 I started coding in Ruby in 2010, and in the last couple of years, I've been mostly working with JavaScript. I've also developed a keen interest in machine learning, and I climb a lot, though I must confess I'm not a great climber! We're trying to organize a climbing session tomorrow, so if anyone is interested, please follow me on Twitter and hit me up.
00:00:57.149 Currently, I'm working at a company in London called Erlang Solutions. They were very kind to send me here to talk about how awesome they are. They were actually the first company to believe in Erlang as a technology since 1999, which seems like a century ago, and they love all sorts of technologies like Ruby, React, RabbitMQ, and more.
00:01:17.850 Now, my goal today is to do some live coding. But first, I want to ask if everyone can see the screen in the back. Can you guys see it clearly? A bit bigger? Good, I'll keep going! Before we get started, I just want to show you my desktop background, which I think is really nice. It has some Photoshop objects in the front, rendered in high quality, and you see two guys in the back not even looking at each other—what's the point? Anyway, I hope you're familiar with the movie Titanic. If you're not, please raise your hand, and keep it raised!
00:01:42.360 What I want to do today is first introduce you to an amazing file, which is a CSV document containing real historical data of the Titanic passengers. There are 892 lines in total—one for the header—so essentially 891 passengers. For each passenger, we have several pieces of information, like the passenger ID, whether they survived, their passenger class, name, sex, age, the number of siblings or spouses, the number of parents and children, the fare they paid, the cabin they occupied, and where they embarked on the ship.
00:02:03.689 My main goal is to verify if the movie was accurate in its depiction of the Titanic. To achieve this, we'll use some Python libraries. The speaker before me already mentioned how they are essential when it comes to machine learning, and I hope to show you just how powerful they can be today. Let's start by creating a file called 'visualize.py' and import the libraries, including Pandas, which is great for handling CSV files, and Matplotlib, which is a powerful visualization library.
00:02:37.640 In Pandas, there's this concept called a data frame. When you load your CSV file, it gets turned into a Pandas data frame. I'll call it df. To load the CSV, we use the command df = pd.read_csv(), and it will automatically read the file. For instance, if we want to look at how many passengers survived, we can type df['Survived'].value_counts() to see how many survived and how many didn't.
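The loading step described above can be sketched as follows. This is a minimal illustration, not the speaker's actual file: the five rows are made up, and an in-memory string stands in for the real CSV so the snippet runs on its own.

```python
import io

import pandas as pd

# A few made-up rows using some of the Titanic CSV's column names (illustrative only).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
4,0,3,male,35
5,0,2,male,54
"""

# pd.read_csv accepts a file path or any file-like object;
# StringIO stands in for the real file here.
df = pd.read_csv(io.StringIO(csv_text))

# Count how many rows carry each value of the Survived column (0 = died, 1 = survived).
survived_counts = df["Survived"].value_counts()
print(survived_counts)
```

With the real file, the only change would be passing its path to pd.read_csv instead of the StringIO object.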
00:03:12.399 When I run the command, we find out that out of all the passengers, around 342 survived, while the rest unfortunately did not. To visualize this, we can create a bar graph. I’ll set it up using Matplotlib and adjust the sizing appropriately. Let me figure that out.
00:03:57.539 In just a couple of lines, we can transform this information into a visual graph. It's important to adjust the numbers for better readability, and I can normalize the data to display the percentages of survivors versus non-survivors. After running the code, we can see that roughly 40% of people survived while 60% did not. What’s even more interesting than a single graph is to analyze various aspects further. We can set up subplots to examine relationships within our data. In this case, we want to see if there is a relationship between survival rate and age. A scatter plot can serve this purpose.
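Normalizing the counts into shares is a one-argument change in pandas. The sketch below uses a made-up Survived column, so the numbers are illustrative rather than the real Titanic figures.

```python
import pandas as pd

# Made-up survival flags (0 = died, 1 = survived), for illustration only.
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# normalize=True returns each value's share of the total instead of a raw count.
shares = survived.value_counts(normalize=True)

# Convert the shares to percentages for easier reading.
percentages = (shares * 100).round(1)
print(percentages)

# With matplotlib installed, shares.plot(kind="bar") would draw these as a bar graph.
```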
00:06:29.399 So we can plot survival against the ages of the passengers and include an alpha (transparency) value so that overlapping dots are easier to see. When I plot this, it's quite unexpected because there's no apparent connection between age and survival. A large number of passengers are aged between 20 and 40 on both sides of the survival spectrum. Older individuals seem to have survived less often and younger individuals somewhat more often, but we can't draw firm conclusions from this distribution alone.
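A scatter plot of age against survival might look like the sketch below. The data are made up, the Agg backend is chosen only so the snippet runs without a display, and the use of matplotlib's alpha parameter to keep overlapping dots readable is an assumption about the technique rather than a record of the speaker's exact code.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Made-up passengers, for illustration only.
df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [22.0, 35.0, 61.0, 4.0, 26.0, 38.0],
})

fig, ax = plt.subplots(figsize=(6, 4))

# alpha makes each dot semi-transparent, so dense regions appear darker.
points = ax.scatter(df["Survived"], df["Age"], alpha=0.3)
ax.set_xlabel("Survived (0 = no, 1 = yes)")
ax.set_ylabel("Age")
ax.set_title("Age vs. survival")

# Write the figure to an in-memory PNG instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```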
00:08:48.650 Another aspect we can explore is the distribution of passenger classes. If we adjust our analysis to display the passenger classes instead, we can visualize that most passengers were in the third class, which is consistent with what we would expect based on historical data. We can then further analyze the relationship between the ages of passengers and the class they could afford.
00:10:05.500 Utilizing Python’s list comprehension, we can isolate passengers based on their class and visualize their average ages using a kernel density graph. If we run this analysis, we observe that third-class passengers are significantly younger on average, at around 20 years, compared to the second-class passengers at around 30 years, and first-class passengers averaging around 40. The data suggests that wealth may have correlated with age as wealthier individuals tended to be older.
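Splitting the ages by class can be sketched with a comprehension over the class numbers. The rows below are made up, chosen only so that the ordering of the averages mirrors the one described above.

```python
import pandas as pd

# Made-up passengers, for illustration only.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [40.0, 50.0, 30.0, 34.0, 18.0, 22.0],
})

# One Series of ages per passenger class.
ages_by_class = {
    pclass: df.loc[df["Pclass"] == pclass, "Age"] for pclass in [1, 2, 3]
}

# Average age within each class.
mean_age = {pclass: ages.mean() for pclass, ages in ages_by_class.items()}
print(mean_age)

# With matplotlib and scipy installed, ages.plot(kind="kde") on each Series
# would draw the kernel density curves.
```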
00:12:17.650 Moving forward, I discovered a factual inaccuracy in the movie regarding the ship's route. Historical data indicates that the Titanic made two stops after leaving Southampton before it sank. We can investigate the embarked column to see that 70% of passengers boarded in Southampton, followed by stops in France and Ireland.
00:13:20.160 While this historical trivia is interesting, what's more compelling is when we analyze our data to find correlations and trends with just minimal Python code—within minutes. However, while looking at survival rates, there's a key factor missing: the gender of the passengers. To investigate this further, I’ll create a separate pie chart to assess the differences in survival rates between men and women.
00:14:15.600 Using the same dataset, I’ll make more graphs to show the survival rates for men and women. The preliminary results are quite clear: around 40% of passengers survived overall, but only 20% of men survived as opposed to around 70% of women. However, this first look doesn’t tell us much until we also consider the ratio of males to females in the dataset.
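Because Survived is stored as 0 or 1, the mean of that column within a group is that group's survival rate. A minimal sketch on made-up rows, which also counts passengers per sex so the rates can be read against group sizes:

```python
import pandas as pd

# Made-up passengers, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group is the share of survivors in that group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()

# Group sizes, to put the rates in context.
count_by_sex = df["Sex"].value_counts()

print(rate_by_sex)
print(count_by_sex)
```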
00:15:30.680 We can further refine our analysis by taking a closer look at the data breakdown. If I compare the male and female survival rates by passenger class, once again, our initial assumptions are validated: first-class men had a significantly higher survival rate, around 45%, while third-class men barely had 10%. In contrast, first-class women had remarkable survival rates, indicating that socio-economic factors strongly influenced survival outcomes.
00:17:41.860 Next, we can delve deeper and analyze survival within specific demographic groups by combining gender and passenger class. I'll modify my previous code to compare the survival rates of males and females across first and third class. As expected, the large majority of first-class women survived, while third-class women had a more evenly split outcome.
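Combining gender and class amounts to grouping by two columns at once. The sketch below uses made-up rows whose pattern loosely mirrors the one described above; the real rates differ.

```python
import pandas as pd

# Made-up passengers, for illustration only.
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival rate for every (sex, class) pair, reshaped so that
# rows are sexes and columns are passenger classes.
table = df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack("Pclass")
print(table)
```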
00:19:19.550 This aligns with the movie's portrayal of characters, validating the historical accuracy behind Jack and Rose's story. There’s substantial evidence supporting their respective outcomes: third-class men were likely to perish while first-class women survived—reaffirming what many believe to be representative of the social dynamics of that era.
00:20:43.940 Understanding the relationships within our data provides us with a unique opportunity to use machine learning for predictions. We can create simple heuristic models to predict outcomes based on survival probabilities. The approach begins by importing our data and creating a training set, which will help us improve prediction accuracy.
00:21:16.950 I’ll start by creating a new column focusing on gender. If a passenger is female, we’ll predict they survived; otherwise, we’ll assume they didn’t. Once I generate predictions, I can see the results by comparing them against actual survival rates. It turns out our naive model achieves a commendable 78% accuracy, which is impressive for such a basic heuristic.
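The gender heuristic can be sketched in two lines of pandas. The column name Prediction and the rows are illustrative assumptions, so the accuracy printed here is not the 78% measured on the real data.

```python
import pandas as pd

# Made-up passengers, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Predict 1 (survived) for every woman and 0 (died) for every man.
df["Prediction"] = (df["Sex"] == "female").astype(int)

# Accuracy is the share of rows where the prediction matches the true outcome.
accuracy = (df["Prediction"] == df["Survived"]).mean()
print(accuracy)
```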
00:22:33.350 Next, we can leverage advanced libraries for a more robust machine learning approach. Scikit-learn, for instance, offers a range of machine learning models. One such model is logistic regression. By fitting this model to our data, we get a score of around 79% accuracy, roughly in line with the naive heuristic.
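Fitting a scikit-learn classifier follows a fit-then-score pattern. The sketch below uses logistic regression, the model named in the summary above, on made-up rows; the choice of features (an IsFemale flag plus Pclass) is an assumption for illustration, not necessarily what the speaker used.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up passengers, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Pclass": [3, 3, 1, 1, 1, 1, 3, 3],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Models need numeric inputs, so encode sex as a 0/1 flag (hypothetical column name).
X = pd.DataFrame({
    "IsFemale": (df["Sex"] == "female").astype(int),
    "Pclass": df["Pclass"],
})
y = df["Survived"]

model = LogisticRegression()
model.fit(X, y)

# score() reports accuracy; here it is measured on the same rows used for training.
train_accuracy = model.score(X, y)
predictions = model.predict(X)
print(train_accuracy)
```

Scoring on the training rows is the simplest check, but it can flatter the model, which is why held-out data matters.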
00:23:59.720 The model's intuitive nature allows it to grasp the essentials of our data without much assistance. Yet, there are cases where models can become too fine-tuned to historical data, resulting in overfitting—a common pitfall in machine learning. Regularization techniques help mitigate this, ensuring the model maintains a generalized ability to predict outcomes outside of its training examples.
00:25:02.000 I’ll demonstrate a pre-processing technique that transforms the features before they reach the model: polynomial transformations. Adding quadratic terms built from our predictor variables increases the model's complexity, which lets it capture relationships that the raw features alone would miss.
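To make the polynomial transformation concrete, here is a small sketch using scikit-learn's PolynomialFeatures on a single made-up row with two features, a and b. With degree 2, the output columns are 1, a, b, a², a·b, and b².

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up sample with two features: a = 2, b = 3.
X = np.array([[2.0, 3.0]])

# degree=2 adds a bias column, the squares, and the pairwise product.
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

print(X_poly)  # [[1. 2. 3. 4. 6. 9.]]
```

The expanded matrix can then be passed to a linear model in place of the raw features, at the cost of more parameters to fit.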
00:26:17.210 Additionally, it’s crucial to visualize the decision-making process of our models. By constructing a decision tree classifier, we can better interpret how decisions are derived from input features. Each decision within the tree prompts questions based on our dataset, thus revealing which attributes had the most influence on predictions.
00:27:29.480 From the visualized tree, we can see that critical decisions are made first based on gender and age, followed by class. This illustrates how important variables interact within our dataset to classify outcomes effectively. Hence, machine learning enables us to explore historical events through a lens shaped by the data itself.
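A decision tree's questions can be printed as text. In the sketch below the data are made up so that sex alone determines the outcome, which means the tree asks about the hypothetical IsFemale feature first and never needs Pclass.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows of [IsFemale, Pclass]; every woman survives and every man dies.
X = [[0, 3], [0, 1], [0, 2], [0, 3], [1, 1], [1, 3], [1, 2], [1, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text renders the learned splits as indented if/else rules.
rules = export_text(tree, feature_names=["IsFemale", "Pclass"])
print(rules)

# feature_importances_ shows how much each feature contributed to the splits.
print(tree.feature_importances_)
```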
00:28:19.580 To check the model's reliability, we can use cross-validation to confirm that its predictions hold across different subsets of the data. This method repeatedly splits the data, trains the model on one part, and scores it on the held-out part, which the model has not seen during training.
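Cross-validation is a single call in scikit-learn. The sketch below uses a synthetic dataset from make_classification rather than the Titanic data, and compares an unrestricted decision tree's accuracy on its own training rows with its accuracy on held-out folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the real passengers (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# cv=5 splits the data into five folds; each fold is held out once
# while the model is trained on the other four.
scores = cross_val_score(model, X, y, cv=5)

# An unrestricted tree can memorize its training rows.
train_accuracy = model.fit(X, y).score(X, y)

print("held-out accuracy per fold:", scores)
print("accuracy on training rows:", train_accuracy)
```

A large gap between the training accuracy and the fold scores is a typical sign of overfitting.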
00:29:13.720 After applying these adjustments, running the model yields pleasing results, showcasing that the initial predictions aren't just coincidental but reflect nuanced behaviors across the dataset.
00:30:09.160 Lastly, visualizing the model yields tree files that display decision paths leading to predictions. This representation helps bridge the gap between human intuition and machine logic, allowing us to understand the powerful conclusions reached through data analysis.
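One way to produce such a tree file with scikit-learn is export_graphviz, which emits Graphviz DOT source. Whether the speaker used this exact function is an assumption; the data, feature names, and class names below are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up rows of [IsFemale, Pclass], for illustration only.
X = [[0, 3], [0, 1], [1, 1], [1, 3]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string instead of writing a file.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=["IsFemale", "Pclass"],
    class_names=["died", "survived"],
    filled=True,
)
print(dot_source)
```

The resulting text can be saved with a .dot extension and rendered to an image with the Graphviz command-line tools.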
00:30:57.920 Thank you for your attention!