Going The Distance

Ruby

Going The Distance

Richard Schneeman

Going The Distance

by Richard Schneeman

In the talk 'Going The Distance', Richard Schneeman dives into the significance of distance algorithms, particularly in the context of programming with Ruby. He opens with a personal introduction, mentioning his experience as a Ruby enthusiast and his work with Heroku on various projects related to open-source and documentation. The main focus of the presentation is to explore distance measurements, specifically Hamming and Levenshtein distances, and their real-world applications.

Key points discussed include:

Introduction to Algorithms: Richard emphasizes the beauty and practicality of algorithms, relating them to everyday programming challenges, especially in spell-checking and error correction.
Hamming Distance: This distance metric, which only works for equal-length strings, is briefly explained. Richard notes its limitations, particularly with variable-length inputs.
Levenshtein Distance: Transitioning to a more suitable algorithm for spell-check applications, Richard describes how Levenshtein distance efficiently calculates edit costs for string transformations by taking into account insertions, deletions, and substitutions. He illustrates this using practical examples like transforming 'Saturday' to 'Sunday'.
Implementation in Ruby: Richard details how to implement a recursive version of the Levenshtein algorithm in Ruby, discussing the efficiency gains achieved by caching substring costs to avoid redundant calculations. He utilizes a matrix-based approach to visually represent transformation costs, which aids in understanding the algorithm's performance.
Practical Applications: The talk connects these algorithms to real-world applications, such as Google’s search enhancements and intelligent coding recommendations through a gem called 'Did You Mean'. This underscores the importance of algorithmic efficiency in user experience.
Encouragement for Exploration: Richard encourages the audience to explore algorithm implementations and resources available on platforms like GitHub and Wikipedia to further their understanding of algorithms beyond formal education.

Conclusion: Richard concludes with an affirmation that algorithms are crucial tools for programmers, urging attendees to appreciate their applications in everyday programming scenarios. He invites questions and encourages an ongoing quest for knowledge regarding distance algorithms and their adaptations in various programming contexts. The overarching message is that algorithms serve as essential instruments for improving coding practices and solving complex problems efficiently.

00:00:17.460 All right, hello everyone! My name is Richard Schneeman, or 'Action Eames'. If you came to the lightning talk, you already know the most important part about me: I'm incredibly committed to Ruby.

00:00:24.010 Sadly, my dog could not come to the conference, but you will also know that Ruby is, in fact, the favorite language of a Python programmer. It's okay; I've gotten over it.

00:00:30.600 I work for a small company based out of San Francisco that you may have heard of; it's called Heroku. They give me some free time and allow me to work on projects such as CodeTriage.com, which allows you to get an open-source issue in your inbox every single day.

00:00:41.920 Another project I recently announced is DocsDr.org, which is essentially the same thing, but with methods. So, if you're looking for documentation, you can get documented or undocumented method information directly in your inbox.

00:01:07.780 A little bit more about me: I am in the top 50 of the Rails contributors, so that makes me a big deal. I do not have any cats in this presentation, but this is a photo of my dog. You may have seen a couple of people wearing 'Keep Ruby Weird' shirts; this is a conference that we organized for the first time ever in Austin, Texas.

00:01:33.340 Austin is known for its motto, 'Keep Austin Weird'. Ruby is amazing, dynamic, and powerful, and in the spirit of why we wanted to preserve that, we gathered together and thought about how we could keep Ruby weird at RubyConf as well.

00:01:47.740 We brainstormed various ideas, and I realized I needed a much longer segue. We considered doing the can-can dance, but the audience seemed hesitant.

00:02:06.940 So, thank you very much to those who stayed — you all volunteered for the conference can-can! I need everyone to stand up and stay weird. The conference can-can is similar to the actual can-can, except it’s more like pretending to be a really shy person doing the can-can. We have actual music that is perhaps a little better than mine.

00:02:46.460 Are you ready for this? The song lasts for an entire minute. Thank you very much! I hope you’re completely weirded out. Now that I have your attention and you're awake, I’d like to talk about algorithms.

00:03:34.600 But before delving into algorithms, I want to share some background about myself. I attended Georgia Tech and studied mechanical engineering; I don’t have a computer science degree and have been teaching myself programming over the last eight years.

00:03:51.500 It's been a super fun experience, but I've found computer science to be quite boring. I hope those of you with CS degrees aren’t offended, but I find programming really interesting because I enjoy building things.

00:04:05.480 The mechanical engineers of the world often miss that algorithms are absolutely beautiful! They solve real-world problems that someone else has invested time into learning.

00:04:34.620 I, however, face a major problem: I cannot spell at all. This is one of the reasons why I am a programmer; if I misspell a variable name, as long as I maintain consistency, it’s okay. I’m kind of joking, but spelling becomes even more challenging when I’m tired or distracted. I often find myself doing something like 'git commit', but entering something incorrect, like 'gets'. What's fascinating is how Git responds by suggesting a probable command, revealing that it knows what I likely meant.

00:05:52.790 The method Git uses to determine this is called edge distance, which is the cost of transforming one word into another. The lower the cost of changing one word to another, the more likely it is that the second word was what I meant; if there’s zero cost, it’s the same word. For instance, converting 'zat' to 'bat' has a cost of one, while 'zzzz' to 'at-bat' has a cost of two. But how do we implement this into code?

00:07:24.400 Initially, I thought it would be simple and jotted down a few lines of code, calculating the distance between two strings. When I tested this simple use case, I got a cost of one. The method employed was basic: iterating over each character in one string and checking if it matched the corresponding character in the other string. It seemed straightforward until I realized that the method had limitations, especially with strings of unequal lengths.

00:08:28.190 In my naivety, I unintentionally recreated a concept known as Hamming distance, which measures errors in a string, but only works for strings of the same length. For spelling, this isn’t useful, as it only accounts for substitutions, not insertions or deletions. This leads us to another algorithm: Levenshtein distance.

00:09:46.160 To understand Levenshtein distance, we need to figure out how to account for deletions and insertions. If we look at two similar strings, like 'Shin Eames' and 'zSHnimes', we can identify that the first character is the main difference. If everything except the first character matches, then this suggests a deletion.

00:10:37.940 In Ruby code, if those match, we can delete that character. For insertions, we can apply the same logic by checking if only one character is missing. If we can knock out the first character of the first word, we can proceed with the insertion process. Similarly, for substitutions, we compare characters in the same position.

00:11:54.350 Once we gather all of these rules, we can calculate the distance. If we assume that we already have a distance measure for strings, we can calculate the distance for strings of different lengths. Transforming an empty string to a string of length three would incur a cost of three because you’d be adding three characters.

00:12:54.600 Next, we can proceed to calculate distances by recursively implementing our distance measurement. At some point, if we hit two characters that match, the required cost is zero. When combining all of these calculations, we end up with a recursive version of Levenshtein distance, which is elegant and effective.

00:13:54.700 I tested this further and found out that transforming 'Saturday' into 'Sunday' only required three edits, rather than a previously guessed seven. However, I discovered that while recursive calculations worked, they came with significant inefficiencies.

00:15:06.860 The previously computed distances of substrings presented an opportunity for optimization. By caching the distance of all substrings, we could efficiently compute the distance for longer strings.

00:15:57.420 Let's illustrate this with a matrix. We position our empty string vertically and 'Saturday' horizontally, with the respective costs laid out. Changing an 'S' to an 'S' carries no cost, while adding unnecessary characters results in an increased cost.

00:17:18.670 Through systematic calculation, we can discover the insertion, deletion, and substitution costs as we build our matrix.

00:17:35.530 Once we have our constructed matrix, we can derive the cost effectively for transforming words directly, confirming that it takes three edits from 'Saturday' to 'Sunday', achieving a significant efficiency compared to previous calculations.

00:18:29.730 This method proves its value by allowing us to calculate substring costs as well. As an example, finding the cost to transform 'Sun' to 'Sat' simply involves looking up our previously established matrix.

00:19:39.640 I encourage you to play with all these examples available on GitHub. Beyond this talk, you might find useful applications, especially regarding algorithm implementations and programming practices.

00:21:15.460 As you explore further, give special attention to the practical applications of algorithms, like Google's search suggestions, which utilize these distance calculations for better accuracy and user experience.

00:22:19.640 Incorporating Levenshtein distance into software can provide intelligent recommendations, improving user interaction. Just as Google enhances its service using extensive databases and mathematical models, similarly, we can apply these principles within our own programming environments.

00:23:06.520 Lastly, I want to highlight a useful gem called 'Did You Mean', which suggests correct methods in your Ruby code similar to Google’s spell-check. This kind of practical effort is broadening the horizons of programming beyond theoretical frameworks.

00:24:06.130 And while tedious, it’s vital for us programmers to hone this skill since we frequently encounter spelling mishaps. As we delve into these algorithms, we enlighten ourselves on how to efficiently share knowledge.

00:25:13.600 There is much to explore in terms of distance algorithms, including types like longest common subsequence, Manhattan distance, or others appropriate for large-scale spelling corrections. As we evolve in our programming skills, let’s embrace these algorithms and generate innovative applications for diverse needs.

00:26:51.390 To summarize, algorithms aren't just for computer science undergrads anymore; they are essential tools. For anyone seeking resources to learn more about them, Wikipedia and Rosetta Code offer valuable insights, and engaging in algorithm talks can ignite new ideas.

00:29:09.150 Thank you for coming! If you have any questions, feel free to ask. I'm available here for the next few minutes, and all my codes and references are uploaded for you to explore.

Richard Schneeman

@schneems

RubyConf 2014