00:00:17.460
All right, hello everyone! My name is Richard Schneeman, or 'Action Eames'. If you came to the lightning talk, you already know the most important part about me: I'm incredibly committed to Ruby.
00:00:24.010
Sadly, my dog could not come to the conference, but you will also know that Ruby is, in fact, the favorite language of a Python programmer. It's okay; I've gotten over it.
00:00:30.600
I work for a small company based out of San Francisco that you may have heard of; it's called Heroku. They give me some free time and allow me to work on projects such as CodeTriage.com, which allows you to get an open-source issue in your inbox every single day.
00:00:41.920
Another project I recently announced is DocsDr.org, which is essentially the same thing, but with methods. So, if you're looking for documentation, you can get documented or undocumented method information directly in your inbox.
00:01:07.780
A little bit more about me: I am in the top 50 of the Rails contributors, so that makes me a big deal. I do not have any cats in this presentation, but this is a photo of my dog. You may have seen a couple of people wearing 'Keep Ruby Weird' shirts; this is a conference that we organized for the first time ever in Austin, Texas.
00:01:33.340
Austin is known for its motto, 'Keep Austin Weird'. Ruby is amazing, dynamic, and powerful, and in the spirit of why we wanted to preserve that, we gathered together and thought about how we could keep Ruby weird at RubyConf as well.
00:01:47.740
We brainstormed various ideas, and I realized I needed a much longer segue. We considered doing the can-can dance, but the audience seemed hesitant.
00:02:06.940
So, thank you very much to those who stayed — you all volunteered for the conference can-can! I need everyone to stand up and stay weird. The conference can-can is similar to the actual can-can, except it’s more like pretending to be a really shy person doing the can-can. We have actual music that is perhaps a little better than mine.
00:02:46.460
Are you ready for this? The song lasts for an entire minute. Thank you very much! I hope you’re completely weirded out. Now that I have your attention and you're awake, I’d like to talk about algorithms.
00:03:34.600
But before delving into algorithms, I want to share some background about myself. I attended Georgia Tech and studied mechanical engineering; I don’t have a computer science degree and have been teaching myself programming over the last eight years.
00:03:51.500
It's been a super fun experience, but I've found computer science to be quite boring. I hope those of you with CS degrees aren’t offended, but I find programming really interesting because I enjoy building things.
00:04:05.480
The mechanical engineers of the world often miss that algorithms are absolutely beautiful! They solve real-world problems that someone else has invested time into learning.
00:04:34.620
I, however, face a major problem: I cannot spell at all. This is one of the reasons why I am a programmer; if I misspell a variable name, as long as I maintain consistency, it’s okay. I’m kind of joking, but spelling becomes even more challenging when I’m tired or distracted. I often find myself doing something like 'git commit', but entering something incorrect, like 'gets'. What's fascinating is how Git responds by suggesting a probable command, revealing that it knows what I likely meant.
00:05:52.790
The method Git uses to determine this is called edge distance, which is the cost of transforming one word into another. The lower the cost of changing one word to another, the more likely it is that the second word was what I meant; if there’s zero cost, it’s the same word. For instance, converting 'zat' to 'bat' has a cost of one, while 'zzzz' to 'at-bat' has a cost of two. But how do we implement this into code?
00:07:24.400
Initially, I thought it would be simple and jotted down a few lines of code, calculating the distance between two strings. When I tested this simple use case, I got a cost of one. The method employed was basic: iterating over each character in one string and checking if it matched the corresponding character in the other string. It seemed straightforward until I realized that the method had limitations, especially with strings of unequal lengths.
00:08:28.190
In my naivety, I unintentionally recreated a concept known as Hamming distance, which measures errors in a string, but only works for strings of the same length. For spelling, this isn’t useful, as it only accounts for substitutions, not insertions or deletions. This leads us to another algorithm: Levenshtein distance.
00:09:46.160
To understand Levenshtein distance, we need to figure out how to account for deletions and insertions. If we look at two similar strings, like 'Shin Eames' and 'zSHnimes', we can identify that the first character is the main difference. If everything except the first character matches, then this suggests a deletion.
00:10:37.940
In Ruby code, if those match, we can delete that character. For insertions, we can apply the same logic by checking if only one character is missing. If we can knock out the first character of the first word, we can proceed with the insertion process. Similarly, for substitutions, we compare characters in the same position.
00:11:54.350
Once we gather all of these rules, we can calculate the distance. If we assume that we already have a distance measure for strings, we can calculate the distance for strings of different lengths. Transforming an empty string to a string of length three would incur a cost of three because you’d be adding three characters.
00:12:54.600
Next, we can proceed to calculate distances by recursively implementing our distance measurement. At some point, if we hit two characters that match, the required cost is zero. When combining all of these calculations, we end up with a recursive version of Levenshtein distance, which is elegant and effective.
00:13:54.700
I tested this further and found out that transforming 'Saturday' into 'Sunday' only required three edits, rather than a previously guessed seven. However, I discovered that while recursive calculations worked, they came with significant inefficiencies.
00:15:06.860
The previously computed distances of substrings presented an opportunity for optimization. By caching the distance of all substrings, we could efficiently compute the distance for longer strings.
00:15:57.420
Let's illustrate this with a matrix. We position our empty string vertically and 'Saturday' horizontally, with the respective costs laid out. Changing an 'S' to an 'S' carries no cost, while adding unnecessary characters results in an increased cost.
00:17:18.670
Through systematic calculation, we can discover the insertion, deletion, and substitution costs as we build our matrix.
00:17:35.530
Once we have our constructed matrix, we can derive the cost effectively for transforming words directly, confirming that it takes three edits from 'Saturday' to 'Sunday', achieving a significant efficiency compared to previous calculations.
00:18:29.730
This method proves its value by allowing us to calculate substring costs as well. As an example, finding the cost to transform 'Sun' to 'Sat' simply involves looking up our previously established matrix.
00:19:39.640
I encourage you to play with all these examples available on GitHub. Beyond this talk, you might find useful applications, especially regarding algorithm implementations and programming practices.
00:21:15.460
As you explore further, give special attention to the practical applications of algorithms, like Google's search suggestions, which utilize these distance calculations for better accuracy and user experience.
00:22:19.640
Incorporating Levenshtein distance into software can provide intelligent recommendations, improving user interaction. Just as Google enhances its service using extensive databases and mathematical models, similarly, we can apply these principles within our own programming environments.
00:23:06.520
Lastly, I want to highlight a useful gem called 'Did You Mean', which suggests correct methods in your Ruby code similar to Google’s spell-check. This kind of practical effort is broadening the horizons of programming beyond theoretical frameworks.
00:24:06.130
And while tedious, it’s vital for us programmers to hone this skill since we frequently encounter spelling mishaps. As we delve into these algorithms, we enlighten ourselves on how to efficiently share knowledge.
00:25:13.600
There is much to explore in terms of distance algorithms, including types like longest common subsequence, Manhattan distance, or others appropriate for large-scale spelling corrections. As we evolve in our programming skills, let’s embrace these algorithms and generate innovative applications for diverse needs.
00:26:51.390
To summarize, algorithms aren't just for computer science undergrads anymore; they are essential tools. For anyone seeking resources to learn more about them, Wikipedia and Rosetta Code offer valuable insights, and engaging in algorithm talks can ignite new ideas.
00:29:09.150
Thank you for coming! If you have any questions, feel free to ask. I'm available here for the next few minutes, and all my codes and references are uploaded for you to explore.