RailsConf 2013

Natural Language Processing with Ruby

by Brandon Black

The video titled "Natural Language Processing with Ruby," presented by Brandon Black at RailsConf 2013, offers a beginner-friendly introduction to Natural Language Processing (NLP) and its relevance to software engineering, particularly within the Ruby community. It outlines challenges and tools associated with NLP, emphasizing the need for Ruby developers to engage with this evolving field. The presentation is structured around the following key points:

  • Definition of NLP: NLP is the field concerned with making computers understand and generate human language, involving tasks like searching, parsing, and generating text.
  • Applications and Challenges: It covers several applications of NLP such as search engines, spell checking, and sentiment analysis, and highlights the complexities involved, including linguistic nuances and evolving language.
  • Importance in the Software Industry: The speaker explains that a significant portion of worker productivity is spent searching for information, making NLP solutions critical for improving efficiency in data handling.
  • NLP Solutions and Tools: Brandon presents existing Ruby libraries suited to NLP, including Chronic for date processing and Treat, which aims to mirror Python's Natural Language Toolkit (NLTK). He points out that Ruby lacks the comprehensive solutions available in other languages such as Python and Java, and stresses the importance of using the existing tools effectively.
  • Role of JRuby: For more advanced NLP tasks, he promotes using JRuby, as it allows Ruby applications to utilize mature Java NLP libraries, which are substantially developed and well-documented.
  • Encouragement to Engage: Lastly, Brandon encourages Rubyists to contribute to existing libraries, learn from other programming communities, and partake in educational resources, highlighting courses from Stanford and practical literature on NLP.

In conclusion, the talk underscores the critical role of NLP in enhancing computational linguistics and motivates Ruby developers to deepen their involvement in this essential aspect of computer science, stressing that Ruby can play a significant role in this space despite current limitations.

00:00:16.400 Hello, my name is Brandon, and this is actually my first RailsConf. This is my first attempt at this talk, so bear with me; you're in for some fun. I work on MongoDB at 10gen, specifically on the drivers team, where I focus on the Ruby driver for MongoDB. My role at 10gen consists of a couple of different parts. One part involves ensuring that people are using MongoDB effectively and maintaining the Ruby driver, while the other part revolves around advocating for open source and supporting the Ruby community.
00:00:42.399 Today, my talk is not centered around MongoDB itself; I might mention it a few times, but the focus is on the Ruby community and natural language processing (NLP) in Ruby. I'll explain what NLP is, what it means, and how you can get involved in this fascinating field using a language we all love.
00:01:07.040 Currently, I work for 10gen, and my past employers include companies like Facebook, Roambi, which specializes in business intelligence in San Diego, and Myspace, where I worked on the developer platform for a while. I want to reiterate that 10gen is here because we care deeply about the Ruby community. We want to invest in it and inspire you to invest as well. As part of this community, you have the opportunity to make it better, which is why we're here.
00:01:44.560 Now, on a lighter note, I think there are far too many prominent Rubyists who love cats, so I feel compelled to represent the dog crowd. This is my dog, Somi. She drools a lot, snores louder than I do, and is shaped like a potato, but who can resist her adorable face? Here are my credentials, in case anyone doubts that I'm the real deal. If you missed it, that's a Klingon, a Romulan, a Ninja Turtle t-shirt, and Pepsi shorts. I've embraced my nerdy side for as long as I can remember.
00:02:14.400 Now, let's discuss the goals and agenda of this talk. Natural language processing, in general, is a vast subject that encompasses many fields within computer science. It’s not solely limited to one aspect, making it quite broad. During the next 40 minutes, I’ll provide a quick introduction to NLP, explaining what it is, the challenges it poses, and why it’s important. We will also discuss the tools available currently, how Ruby measures up, and how to bridge the gaps where Ruby may not be sufficient.
00:02:55.360 As we wrap up, I’ll guide you on what to do next if you’d like to learn more about this topic: where to go, how to delve deeper, and how to become more involved in this aspect of our community. We’ll start with my definition of natural language processing. I'm no expert, just a guy who learned something and is sharing it with you. NLP basically involves analyzing, understanding, and generating human language to enable computers to interact with humans effectively. In essence, it's about making computers comprehend human language in both generation and understanding.
00:03:54.560 This definition may sound vague because NLP is indeed a vast and complex field that intersects with numerous studies. It holds immense relevance for all of us. The applications of NLP are plentiful, and common problems we attempt to solve in this domain include search, spell checking, auto summarization, predictive text, content categorization, and machine translation such as Google Translate. Imagine if there were a perfect computer system that could flawlessly translate text and interpret human speech and writing—there would practically be no limit to the applications of such technology. At a high level, NLP can be categorized into three basic areas: searching information, extracting information, and grouping information.
00:05:01.200 Additionally, technologies such as Siri, OCR, and image recognition fall under the umbrella of NLP, although I'll focus primarily on NLP in the context of text instead of speech or OCR. You may wonder why dealing with language is so difficult; after all, it has rules, syntax, and grammar. Yet computers struggle significantly with language comprehension. To illustrate this, consider a grammatically correct English sentence that is confusing even to us. For example, let’s analyze the sentence made up of various forms of the word 'buffalo.' It's a valid sentence, yet the nuances and context are lost. In essence, context is crucial for understanding language, which presents a considerable challenge for computers.
00:06:08.000 Another notable challenge is that there is no perfect universal solution to NLP problems. Even experts in the field often disagree about the best approaches, and fixing one issue can lead to problems elsewhere. Additionally, language is constantly evolving: different cultures, slang, and technology influence language and grammar significantly. For instance, consider how writing by hand has become less frequent with the increasing use of keyboards and instant messaging. This technological shift alters our communication style, making it challenging for existing systems to keep up.
00:07:38.080 Moreover, many challenges in NLP involve computational complexity. Over the past decade, advances in hardware have made certain aspects of NLP that were once impossible achievable, and we can now perform operations that were not feasible years ago. However, numerous outstanding problems in this field are still considered AI-complete, meaning we may not solve them until we address the central challenges of artificial intelligence, such as passing the Turing test.
00:08:57.760 So, why is it important to care about NLP? This question may arise, especially for developers. To illustrate its significance, consider a pie chart showing how an average U.S. worker spends their time in a given workweek. About 39% of the work is specifically role-related tasks, while 28% is email management—reading emails and gathering information. Roughly half of an employee’s workweek, around 45-47%, is spent sifting through and searching for information, which indicates inefficiency.
00:09:45.120 There's an increasing demand for solutions to handle these challenges as the problem space continues to grow. There's exponentially more data to sift through than in previous years, leading to what many call a big data problem, affecting various startups and organizations. For context, recent estimates indicate that over 4 billion photos were taken in the last year alone, a number that eclipses all photos taken in human history combined. Additionally, IDC estimates that we generate 1.8 zettabytes of data annually, with that number expected to grow significantly. This data explosion further emphasizes the need for efficient ways to manage and extract meaningful information.
00:11:04.160 Now, when it comes to addressing NLP challenges, there are typically three common approaches. The first is rule-based analysis, often used in libraries like ActiveRecord. The second is statistical analysis, which has fascinating applications, such as a case where a researcher identified the likely author of a leaked script for a popular anime using statistical methods. The final approach is machine learning, which is essential for tackling some of the more complex NLP challenges.
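As a rough illustration of the rule-based approach described above, here is a minimal sketch of pluralization driven by an ordered rule table. The `PLURAL_RULES` table and the `pluralize` method are hypothetical names invented for this example; they are not the rules that ActiveRecord or ActiveSupport actually ship.

```ruby
# Hypothetical rule table: each entry pairs a pattern with a replacement.
# Rules are checked top to bottom and the first match wins.
PLURAL_RULES = [
  [/(?<![aeiou])y\z/i, 'ies'],   # consonant + y: city -> cities
  [/(s|x|z|ch|sh)\z/i, '\1es'],  # sibilant endings: box -> boxes
  [/\z/, 's']                    # default: dog -> dogs
].freeze

# Apply the first matching rule to the word.
def pluralize(word)
  PLURAL_RULES.each do |pattern, replacement|
    return word.sub(pattern, replacement) if word.match?(pattern)
  end
  word
end

puts pluralize('city')
puts pluralize('church')
puts pluralize('day')
```

Rule tables like this are easy to read and extend, but irregular words ("child", "mouse", "quiz") each need their own exception, which is one reason fixing one case can break another.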
00:12:40.640 Most successful NLP systems rely on a 'human in the loop' model, where user feedback helps refine the system. This principle is evident across various platforms. Examples include how Facebook gathers likes to enhance ad targeting and user experience. Understanding how to train models in NLP remains important, and I will share some basic building blocks to help you build your own systems.
00:13:20.720 Key elements of NLP include part-of-speech tagging, which labels each word as a noun, verb, and so on. Tokenization breaks down text into manageable pieces, while stemming seeks to find root words to facilitate search and indexing. Named entity recognition aims to extract specific names and places from text, which is critical for context. Understanding these core components can significantly help you navigate the NLP landscape.
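To make a few of these building blocks concrete, the following pure-Ruby sketch shows sentence splitting, tokenization, and a deliberately naive stemmer. All method names and the `SUFFIXES` list are assumptions made for illustration; the stemmer only strips a handful of suffixes and is not the Snowball algorithm or any library's implementation.

```ruby
# Split text into sentences at ., ! or ? followed by whitespace.
# Assumption: abbreviations such as "Dr." are not handled.
def split_sentences(text)
  text.strip.split(/(?<=[.!?])\s+/)
end

# Tokenize into lowercase word tokens, keeping internal apostrophes.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+(?:'[a-z]+)*/)
end

# Hypothetical suffix list for a very naive stemmer.
SUFFIXES = %w[ing ed es s].freeze

# Strip the first matching suffix, but only when at least
# three characters of the word would remain.
def naive_stem(token)
  SUFFIXES.each do |suffix|
    if token.end_with?(suffix) && token.length - suffix.length >= 3
      return token[0...-suffix.length]
    end
  end
  token
end

tokens = tokenize("Dogs don't like searching indexed boxes.")
puts tokens.map { |t| naive_stem(t) }.inspect
```

A real stemmer applies many more rules and conditions; the point here is only that "searching" and "search" can be reduced to a shared root so a search index treats them as the same term.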
00:14:54.368 As we dive into what’s available in the Ruby community regarding NLP, unfortunately, there isn't much. One useful library is Chronic, which parses date and time expressions, helping to convert them into a usable format. Linguistics is another interesting gem that offers functionality for pluralization and verb conjugation, close to what ActiveSupport provides. On the other hand, the Punkt Segmenter library relies heavily on a C extension and can help break up text into sentences, which is useful for many NLP tasks.
00:16:15.359 Ruby Stemmer is another library, albeit less actively maintained. It uses the Snowball Stemming API to identify root words effectively. Finally, Treat is a promising library aiming to create a Ruby counterpart to Python's Natural Language Toolkit (NLTK). It’s an active project, gathering various tools to handle common NLP tasks, helping illuminate the Ruby landscape in this domain.
00:17:46.639 What if you need to go beyond basic functionality in NLP? The answer lies in JRuby. For those unfamiliar, JRuby is Ruby on the Java Virtual Machine (JVM). Through JRuby, you can leverage well-established, mature Java libraries within your Ruby code, significantly enhancing what can be achieved—especially in the realm of NLP.
00:18:54.679 Let’s briefly walk through an example of summarizing text using JRuby. This involves tokenizing the received text, removing stop words (common but semantically meaningless words), ranking the relevant words based on their frequency, and then extracting relevant sentences based on key content. Essentially, you’ll pull together the most relevant sentences to create a coherent summary.
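The steps just described can be sketched in plain Ruby. This is not the JRuby and Java-library code from the talk; it is a minimal frequency-based extractive summarizer written under stated assumptions. The `STOP_WORDS` list is a tiny illustrative sample, and `split_sentences`, `content_words`, and `summarize` are hypothetical names.

```ruby
# Tiny illustrative stop-word list (an assumption; real lists are longer).
STOP_WORDS = %w[a an and are as at be by for from in is it of on or that the to was with].freeze

# Split text into sentences at ., ! or ? followed by whitespace.
def split_sentences(text)
  text.strip.split(/(?<=[.!?])\s+/)
end

# Lowercase word tokens with stop words removed.
def content_words(text)
  text.downcase.scan(/[a-z0-9']+/) - STOP_WORDS
end

# Return the `count` highest-scoring sentences in their original order.
# A sentence's score is the sum of the document-wide frequencies of its
# content words; ties go to the earlier sentence.
def summarize(text, count: 1)
  frequencies = content_words(text).each_with_object(Hash.new(0)) do |word, freq|
    freq[word] += 1
  end

  scored = split_sentences(text).each_with_index.map do |sentence, index|
    score = content_words(sentence).sum { |word| frequencies[word] }
    [score, index, sentence]
  end

  scored.max_by(count) { |score, index, _| [score, -index] }
        .sort_by { |_, index, _| index }
        .map { |_, _, sentence| sentence }
end

text = 'Ruby is a language. Ruby developers love Ruby tools. Dogs bark.'
puts summarize(text, count: 2)
```

Summing raw frequencies favors long sentences, so a more careful version might normalize by sentence length or stem the words before counting.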
00:20:40.799 We’ll utilize libraries such as OpenNLP and Snowball for stemming. By incorporating these tools, we're able to streamline our processing of text into meaningful outputs while still relying on user-friendly Ruby syntax. While this example may seem basic, it illustrates the ease of integrating Java libraries into your Ruby project using JRuby.
00:21:47.439 As we consider next steps, I challenge you to engage deeply with the potential of NLP within the Ruby community. This field is on the cutting edge of computer science and presents opportunities to make significant impacts. Take the time to learn about existing solutions, and consider contributing to libraries that could benefit the community.
00:23:04.480 Don’t shy away from exploring other programming languages as well, like Python, which has the robust Natural Language Toolkit (NLTK), or Java, with its strengths in the NLP arena. There are also great resources available, such as online courses from Stanford and MIT that provide excellent foundations in these subjects. I recommend checking out Machine Learning for Hackers and Natural Language Processing with Python if you are keen on diving deeper into these topics.
00:25:48.160 Lastly, I encourage you to take action! Whether by contributing code, expanding the functionality of existing Ruby libraries, or simply sharing your experiences at local meetups and tech talks, your involvement can help grow and nurture the Ruby community.
00:27:45.120 To summarize, don't underestimate Ruby's capacity; it can tackle big problems beyond web applications and small scripts. Ruby is a powerful and expressive language, capable of confronting complex issues like NLP.
00:28:15.760 So, that's it. Thank you for your attention, and I challenge you to join me in exploring these possibilities.