Talks

Going the Distance

Going the Distance

by Richard Schneeman

In the talk titled "Going the Distance," presented by Richard Schneeman at Rails Pacific 2014, the focus is on the importance of algorithms in programming, specifically highlighting the Levenshtein distance algorithm for measuring edit distance between words. Richard, a committed Ruby developer, introduces himself humorously, sharing a bit about his personal life and his work at Heroku, including his passion for open-source contributions.

Key Points Discussed:

  • Personal Background: Richard shares his unconventional background in mechanical engineering and his fascination with programming, which he finds more engaging than theoretical discussions in computer science.
  • Edit Distance Concept: The talk introduces the concept of edit distance, illustrating how the Levenshtein distance algorithm can be utilized to determine the cost of converting one string into another through insertions, deletions, and substitutions.
  • Initial Algorithm Attempt: Richard describes his initial failed attempt at calculating edit distance using a simple character-by-character comparison and how it led to inaccurate results, prompting him to seek improved methods.
  • Hamming vs. Levenshtein Distance: He contrasts Hamming distance, which strictly requires strings of equal length, with the more versatile Levenshtein distance that accommodates various string lengths and operations.
  • Implementation and Optimization: Richard explains the implementation of the recursive Levenshtein distance and its performance, mentioning the high number of iterations it required. He then introduces an optimized version using a matrix approach, significantly reducing the computational steps.
  • Community Engagement: Throughout the talk, Richard emphasizes the value of sharing knowledge within the programming community and encourages others to explore and present their findings.

Conclusion:

Richard concludes by encouraging exploration of algorithms and metrics in programming, underscoring the importance of collaboration and continuous learning in software development. He invites the audience to ask questions and share their experiences in coding and algorithms, reinforcing the idea that learning extends beyond formal education.

Overall, the talk highlights the relevance of algorithms in software engineering, particularly those that enhance text processing capabilities, and encourages developers to engage with the community and share their learning journeys.

00:00:14.920 Hello everyone, my name is Richard Schneeman. You can call me Schnees. A little fact about me—if I need to use the bathroom, I just ask where it is, and people point me in the right direction. On the internet, people call me Schnees.
00:00:24.720 I am very committed to Ruby; in fact, I married her. This is my wife, her name is Ruby. I met her at a Rails Meetup. Interestingly enough, she also programs in Ruby, but recently she's been stressing our relationship a bit by learning Python.
00:00:42.239 I work for Heroku, where I have the opportunity to engage in open-source projects. Later in today’s talk, you will hear about one of my contributions that I’m particularly excited to share.
00:00:55.399 There’s a service called CodeTri.com that I bet none of you have ever heard of. CodeTri.com is an application designed to get you up and running in open source, and it is itself open source. This is one of the applications I use to experiment and test new things on Heroku.
00:01:06.400 I've been working on a new application that’s in early alpha or beta stage. If you're more interested in documentation than issues, I have a project called Doc.org, which sends you one piece of documentation a day. You can either receive a documented method for more insight or an undocumented method if you're looking for easier commits.
00:01:19.000 I've noticed that the number of undocumented methods in frameworks like Rails is quite large, making it easy to get commits. Check it out if you're interested! Additionally, I'm an organizer of a conference called Keep Ruby Weird and I have some stickers to give away.
00:01:44.840 Today, I will primarily be discussing one of my contributions to Rails. I am part of the Rails issue team and, more recently, among the top 50 Rails contributors. The title of my talk is 'Going the Distance,' which is actually a reference to a Cake song. I understand if some of you may not catch the lyrics references.
00:02:14.040 I attended college at Georgia Tech, where I studied mechanical engineering instead of computer science. I was more involved with hands-on projects, like working on engines and developing models in Excel. I had a role at GE designing refrigerators, where our programming language of choice was Microsoft Excel.
00:02:38.640 I’ve been programming for about eight years now. I didn't study computer science in college because I found it rather uninteresting, yet programming, to me, is fascinating. The key difference lies in creation; programming for me involves accomplishing tasks rather than just theorizing.
00:03:05.200 Over the years in programming, I've come to realize there were several concepts I overlooked in computer science, one of which is algorithms. So, I want to share that algorithms are, in fact, beautiful! Algorithms solve problems. I had an interesting problem: spelling.
00:03:33.279 I’ve never been a good speller. My parents joked that it was lucky I chose engineering as a career! In programming, it's fine to misspell a variable name—as long as you spell it incorrectly consistently. However, as time passes and exhaustion sets in, spelling becomes even tougher.
00:03:57.760 One of my favorite tools, Git, helps with this—if you mistype a command, it suggests corrections. I thought, 'That's neat functionality! How can I add this to my own programs?' Especially in Rails. To implement this, I wanted to introduce the concept of 'Edit Distance'.
00:04:19.680 Edit Distance measures the cost of converting one word to another. The lower the cost of conversion, the more likely the words are a match. For example, consider 'Z' and 'bat.' The distance from 'Z' to 'bat' is one change; simply substitute 'Z' with 'B.' Alternatively, from 'zat' to 'bat' would require two changes. While this is easy to rationalize, programming it can be quite challenging.
00:04:57.520 Initially, I attempted to create a simple function that iterates through each character of the strings to find matches. My first algorithm determined that changing 'zat' to 'bat' had a distance of one. I thought, 'Great, I’m done!' However, later, when I compared 'Saturday' to 'Sunday', the algorithm suggested a distance of seven changes, which seemed excessive. Clearly, it needed improvement.
00:05:23.040 In the process, I found out that I inadvertently replicated Hamming distance. Hamming distance requires the strings to be of the same length to avoid raising an error, and it measures the error between two strings.
00:05:37.760 Hamming distance can be useful for telecommunications and error correction, but it’s not ideal for spelling because it doesn't account for insertions or deletions. To improve my algorithm, I needed to incorporate these aspects.
00:05:57.040 I researched this on Wikipedia and discovered the Levenshtein distance algorithm. This algorithm perfectly suited my needs, as it incorporates deletions, insertions, and substitutions.
00:06:16.880 I'll explain how it works. For example, if comparing two strings where one is a substring of the other, we establish which characters need to be inserted or deleted to form a match. If we remove a character from the first string, or compare it to the second string minus the first character, we can evaluate our findings.
00:06:46.920 By performing these operations recursively, we determine the edit distance with minimal cost using substitution if needed. Thus, when writing code, it's crucial to account for all possible character comparisons, ensuring we keep track of insertion, deletion, and substitution.
00:07:23.600 Eventually, I put together the recursive Levenshtein distance algorithm. The process evaluates whether strings are empty, whether they match, and computes the costs for each operation as necessary. The result allows for precise calculation of distances.
00:07:51.679 I ran the algorithm on 'Saturday' and 'Sunday', and it returned a distance of three, which was correct. To give further context, I prepared a video demonstrating the calculations and iterations.
00:08:29.360 This process took 1,647 substring comparisons but returned the correct distance of three. If anyone is interested in my examples, they are available on GitHub, and I’ve also noted everything there for you to review further.
00:09:00.200 I explored alternative algorithms like Hamming distance, which I initially called 'dirty distance.' It performed quickly, requiring only eight comparisons but was incorrect. On the other hand, the Levenshtein algorithm took much longer to iterate yet provided the correct answer.
00:09:51.520 Is there a way to optimize beyond Levenshtein? It turns out there is! This means caching substring comparisons to avoid redundant calculations. This led to the implementation of the Matrix Levenshtein distance algorithm.
00:10:23.600 For example, transitioning from a blank string to 'Saturday' incurs a cost of seven changes through insertions. Each operation can be represented in a matrix, showing how many steps involve moving from one character to another.
00:10:54.480 As we build our operation matrix, we determine costs for each type of operation—deletion, insertion, and substitution—while calculating matches when applicable. This algorithm loops through the characters and determines each operation's total cost.
00:11:29.040 Eventually, I streamlined everything into this looping based algorithm that optimizes performance and accuracy. As noted, there are notable differences in approaches when consulting resources like Wikipedia.
00:12:01.600 After running the final computations, I obtained confirmation of my results, and the algorithm indicated the distance accurately. The full benefits of this method allow for instant retrieval of substring distances within the proposed algorithm, making it efficient.
00:12:38.720 In total, the matrix implementation only took forty-eight iterations compared to nearly 1,647 iterations with the recursive approach, which is significant and truly advantageous.
00:13:20.000 In conclusion, I encourage you all to explore these concepts and consider different perspectives on familiar algorithms. Embrace the chance to learn and share with others, as it solidifies understanding.
00:13:56.240 I occasionally send links, such as to GitHub examples, to ensure everyone stays together through coding explorations. Moreover, it’s essential to build out not only understanding from algorithms but also significant connections within community development.
00:14:37.760 Learning algorithms isn’t confined solely to formal education. Think about ways to discuss with colleagues or comprehend widely used technologies better. I propose that if you've never presented on algorithms or similar yet enjoy sharing knowledge, then take that chance. Share something you discover!
00:15:15.360 As you move forward, keep in mind the importance algorithms hold, not just for coding but for sharing knowledge as a standardized method of developing processes. With that said, I’ll wrap up my talk and see if anyone has questions.
00:15:49.280 Thank you for your time, and I hope you enjoyed this exploration into Levenshtein distance and other algorithms!