00:00:14.920
Hello everyone, my name is Richard Schneeman. You can call me Schnees. A little fact about me—if I need to use the bathroom, I just ask where it is, and people point me in the right direction. On the internet, people call me Schnees.
00:00:24.720
I am very committed to Ruby; in fact, I married her. This is my wife, her name is Ruby. I met her at a Rails Meetup. Interestingly enough, she also programs in Ruby, but recently she's been stressing our relationship a bit by learning Python.
00:00:42.239
I work for Heroku, where I have the opportunity to engage in open-source projects. Later in today’s talk, you will hear about one of my contributions that I’m particularly excited to share.
00:00:55.399
There’s a service called CodeTri.com that I bet none of you have ever heard of. CodeTri.com is an application designed to get you up and running in open source, and it is itself open source. This is one of the applications I use to experiment and test new things on Heroku.
00:01:06.400
I've been working on a new application that’s in early alpha or beta stage. If you're more interested in documentation than issues, I have a project called Doc.org, which sends you one piece of documentation a day. You can either receive a documented method for more insight or an undocumented method if you're looking for easier commits.
00:01:19.000
I've noticed that the number of undocumented methods in frameworks like Rails is quite large, making it easy to get commits. Check it out if you're interested! Additionally, I'm an organizer of a conference called Keep Ruby Weird and I have some stickers to give away.
00:01:44.840
Today, I will primarily be discussing one of my contributions to Rails. I am part of the Rails issue team and, more recently, among the top 50 Rails contributors. The title of my talk is 'Going the Distance,' which is actually a reference to a Cake song. I understand if some of you may not catch the lyrics references.
00:02:14.040
I attended college at Georgia Tech, where I studied mechanical engineering instead of computer science. I was more involved with hands-on projects, like working on engines and developing models in Excel. I had a role at GE designing refrigerators, where our programming language of choice was Microsoft Excel.
00:02:38.640
I’ve been programming for about eight years now. I didn't study computer science in college because I found it rather uninteresting, yet programming, to me, is fascinating. The key difference lies in creation; programming for me involves accomplishing tasks rather than just theorizing.
00:03:05.200
Over the years in programming, I've come to realize there were several concepts I overlooked in computer science, one of which is algorithms. So, I want to share that algorithms are, in fact, beautiful! Algorithms solve problems. I had an interesting problem: spelling.
00:03:33.279
I’ve never been a good speller. My parents joked that it was lucky I chose engineering as a career! In programming, it's fine to misspell a variable name—as long as you spell it incorrectly consistently. However, as time passes and exhaustion sets in, spelling becomes even tougher.
00:03:57.760
One of my favorite tools, Git, helps with this—if you mistype a command, it suggests corrections. I thought, 'That's neat functionality! How can I add this to my own programs?' Especially in Rails. To implement this, I wanted to introduce the concept of 'Edit Distance'.
00:04:19.680
Edit Distance measures the cost of converting one word to another. The lower the cost of conversion, the more likely the words are a match. For example, consider 'Z' and 'bat.' The distance from 'Z' to 'bat' is one change; simply substitute 'Z' with 'B.' Alternatively, from 'zat' to 'bat' would require two changes. While this is easy to rationalize, programming it can be quite challenging.
00:04:57.520
Initially, I attempted to create a simple function that iterates through each character of the strings to find matches. My first algorithm determined that changing 'zat' to 'bat' had a distance of one. I thought, 'Great, I’m done!' However, later, when I compared 'Saturday' to 'Sunday', the algorithm suggested a distance of seven changes, which seemed excessive. Clearly, it needed improvement.
00:05:23.040
In the process, I found out that I inadvertently replicated Hamming distance. Hamming distance requires the strings to be of the same length to avoid raising an error, and it measures the error between two strings.
00:05:37.760
Hamming distance can be useful for telecommunications and error correction, but it’s not ideal for spelling because it doesn't account for insertions or deletions. To improve my algorithm, I needed to incorporate these aspects.
00:05:57.040
I researched this on Wikipedia and discovered the Levenshtein distance algorithm. This algorithm perfectly suited my needs, as it incorporates deletions, insertions, and substitutions.
00:06:16.880
I'll explain how it works. For example, if comparing two strings where one is a substring of the other, we establish which characters need to be inserted or deleted to form a match. If we remove a character from the first string, or compare it to the second string minus the first character, we can evaluate our findings.
00:06:46.920
By performing these operations recursively, we determine the edit distance with minimal cost using substitution if needed. Thus, when writing code, it's crucial to account for all possible character comparisons, ensuring we keep track of insertion, deletion, and substitution.
00:07:23.600
Eventually, I put together the recursive Levenshtein distance algorithm. The process evaluates whether strings are empty, whether they match, and computes the costs for each operation as necessary. The result allows for precise calculation of distances.
00:07:51.679
I ran the algorithm on 'Saturday' and 'Sunday', and it returned a distance of three, which was correct. To give further context, I prepared a video demonstrating the calculations and iterations.
00:08:29.360
This process took 1,647 substring comparisons but returned the correct distance of three. If anyone is interested in my examples, they are available on GitHub, and I’ve also noted everything there for you to review further.
00:09:00.200
I explored alternative algorithms like Hamming distance, which I initially called 'dirty distance.' It performed quickly, requiring only eight comparisons but was incorrect. On the other hand, the Levenshtein algorithm took much longer to iterate yet provided the correct answer.
00:09:51.520
Is there a way to optimize beyond Levenshtein? It turns out there is! This means caching substring comparisons to avoid redundant calculations. This led to the implementation of the Matrix Levenshtein distance algorithm.
00:10:23.600
For example, transitioning from a blank string to 'Saturday' incurs a cost of seven changes through insertions. Each operation can be represented in a matrix, showing how many steps involve moving from one character to another.
00:10:54.480
As we build our operation matrix, we determine costs for each type of operation—deletion, insertion, and substitution—while calculating matches when applicable. This algorithm loops through the characters and determines each operation's total cost.
00:11:29.040
Eventually, I streamlined everything into this looping based algorithm that optimizes performance and accuracy. As noted, there are notable differences in approaches when consulting resources like Wikipedia.
00:12:01.600
After running the final computations, I obtained confirmation of my results, and the algorithm indicated the distance accurately. The full benefits of this method allow for instant retrieval of substring distances within the proposed algorithm, making it efficient.
00:12:38.720
In total, the matrix implementation only took forty-eight iterations compared to nearly 1,647 iterations with the recursive approach, which is significant and truly advantageous.
00:13:20.000
In conclusion, I encourage you all to explore these concepts and consider different perspectives on familiar algorithms. Embrace the chance to learn and share with others, as it solidifies understanding.
00:13:56.240
I occasionally send links, such as to GitHub examples, to ensure everyone stays together through coding explorations. Moreover, it’s essential to build out not only understanding from algorithms but also significant connections within community development.
00:14:37.760
Learning algorithms isn’t confined solely to formal education. Think about ways to discuss with colleagues or comprehend widely used technologies better. I propose that if you've never presented on algorithms or similar yet enjoy sharing knowledge, then take that chance. Share something you discover!
00:15:15.360
As you move forward, keep in mind the importance algorithms hold, not just for coding but for sharing knowledge as a standardized method of developing processes. With that said, I’ll wrap up my talk and see if anyone has questions.
00:15:49.280
Thank you for your time, and I hope you enjoyed this exploration into Levenshtein distance and other algorithms!