Summarized using AI

'Did you mean?' experience in Ruby and beyond

Yuki Nishijima • June 05, 2015 • Singapore • Talk

In the presentation titled 'Did you mean?' experience in Ruby and beyond, Yuki Nishijima discusses the did_you_mean gem, which brings a Google-like "did you mean?" suggestion feature to Ruby. Nishijima gives a quick introduction to how the gem works and then outlines the improvements made since its first release in February 2014.

Key Points:

  • Introduction to the did_you_mean gem: The gem suggests the likely intended name when a Ruby programmer misspells a method name, saving the time otherwise spent searching documentation for the correct one.
  • Working Mechanism: The presenter illustrates how the gem operates with a common scenario, calling starts_with? on a String when plain Ruby only defines start_with?, and shows that the gem looks up and suggests the intended method as soon as the error is raised.
  • Spell Checker Fundamentals: Nishijima explains the three components of a spell checker: a dictionary, a control mechanism, and optimizations that improve accuracy and performance. Examples from natural-language spell checking (British versus American English, Japanese text mixed with English words) illustrate each component's role.
  • Accuracy Evaluation: The speaker evaluates did_you_mean against a Wiktionary word list and a published spelling error corpus. Accuracy was initially only about 54%, which prompted adding the Jaro-Winkler distance metric to handle misspellings in which the beginning of a name is remembered correctly. Similar-sounding words remain a limitation.
  • Custom Finders: Nishijima shows how developers can write their own finders so that other exception classes become correctable, for example suggesting attribute names when Active Record raises an unknown attribute error.
  • Future Directions: The talk wraps up with the work remaining before the gem is bundled with Ruby 2.3, such as removing monkey patching, and closes by noting that the gem also works with emojis.

The presentation is filled with technical insights that can help developers improve their programming experience in Ruby with did_you_mean. Nishijima covers not only the gem's utility but also the string-similarity metrics behind spelling correction, giving a clear picture of how the gem works and where its development is heading.

'Did you mean?' experience in Ruby and beyond by Yuki Nishijima

The did_you_mean gem adds a Google-like suggestion feature to Ruby. Whenever you misspell a method name, it'll read your mind and tell you the right one.

Although the gem's history isn't long, it has received many improvements since it was first released back in February 2014. In this talk, I'll cover the improvements that have been made, after a quick introduction to how it works.

Have a custom Exception class and want to make it 'correctable'? Let's learn how to create your own finder so you can improve your coding experience in Ruby.

Red Dot Ruby Conference 2015

00:00:19.359 hi thank you for welcoming me uh here in singapore um my name is yuki uh today
00:00:24.800 i'm gonna talk about the did you mean experience and beyond uh this is me uh my name is yuki
00:00:30.880 nishijima um you can find me on github and twitter
00:00:36.880 and i think yesterday matz called me nishida-san which is actually not the right one
00:00:42.960 my family name is nishijima so i wanted to ask him did you mean nishijima
00:00:52.160 so yeah this is me this is me on github i like ward um
00:00:58.800 i'm the creator of the did_you_mean gem and i also maintain the kaminari gem which is a rails pagination gem
00:01:07.439 i was born in japan and i live in new york right now and i also used to live
00:01:13.280 in the philippines before and then i actually came to singapore for reddot
00:01:18.560 ruby conf in 2011 and 2013 and i'm really excited
00:01:23.920 to be here as a speaker this year um i work for a company called pivotal
00:01:29.280 labs pivotal labs is a consultancy that helps build software with agile
00:01:34.479 practices like pair programming tdd and things like that
00:01:39.520 and i think some of you already know pivotal labs because there used to be a branch in
00:01:45.840 singapore so this is stand-up pair programming
00:01:50.880 and ping pong we're going to also open a branch in tokyo this summer so if you are
00:01:57.280 interested in working with us just let me know all right um so let's talk about today's main
00:02:03.920 topic the did you mean experience um
00:02:09.039 matz actually uh talked about it yesterday already but um let me just give you a quick introduction because uh
00:02:15.520 some of you might not have used it before so let's say you want to check if a string object starts with a specific letter so
00:02:23.040 maybe you can use a starts_with? method but sadly it doesn't work because it
00:02:28.640 only responds to start_with? not starts_with?
00:02:36.239 it often happens because you are now using rails and more specifically um active support because active
00:02:42.720 support has an alias starts_with? but if you use the did_you_mean gem as soon as
00:02:48.959 you get an error it will automatically look for what you really wanted to call and then suggest it to you so you don't
00:02:55.599 have to waste your time like you don't have to google you don't have to go to like rubydoc it's going to just look for what
00:03:02.560 you really wanted to call so since last year i started getting many questions about this gem um
00:03:10.239 and one of them is like um it's great but sometimes it doesn't
00:03:15.920 correct when for example uh you type like a long method name and
00:03:22.239 then you realize that uh you just don't remember the correct one and then did_you_mean fails
00:03:29.440 to search for it and then sometimes it displays too many corrections
00:03:34.480 because it's not super smart so sometimes like for example a rails path helper and then you type
00:03:40.879 uh like api something something path and it's gonna suggest a lot of things
00:03:47.120 and another example could be i misspelled a table name in the database and then did_you_mean just doesn't suggest
00:03:53.920 anything this is actually interesting because did_you_mean is designed to work with NameError
00:04:00.080 and NoMethodError but if you type uh like a column name or table name then
00:04:06.159 did_you_mean doesn't actually suggest anything so let's talk about let's learn about
00:04:12.400 how to write a spell checker in general first so that you can understand easily uh this is not actually about
00:04:19.120 ruby specific topics but it's about computer science and the history of building a computer
00:04:24.720 spell checker is actually quite long even longer than my age so let's learn from the history so that
00:04:30.639 you can easily understand how it works so first of all so what's a spell
00:04:36.000 checker basically it behaves like a function that takes a user input
00:04:41.040 like this and obviously the input is uh what people actually typed
00:04:47.120 and the input usually has noise like you make a typo you misspell something or you don't remember a method
00:04:52.960 name so basically it's just noise
00:04:58.400 and the spell checker gives us the output back uh which is what is most likely to be
00:05:03.440 intended so it is actually pretty simple it takes an input it gives us back the correct one
00:05:12.080 but what's inside the blue box in this case so usually a spell checker consists of
00:05:18.000 three things the first one is dictionary which is basically just a set of words
00:05:24.320 the second one is control mechanism which decides what to return as the correction and the last one is optimization
00:05:32.400 and technically a spell checker can work with only a dictionary and a
00:05:37.520 control mechanism but usually some optimization techniques are applied to improve the
00:05:43.440 spell checker like performance and accuracy so what is dictionary again it is
00:05:50.720 basically just a set of words and it usually comes from from an actual dictionary which is why it is called
00:05:55.759 dictionary and a spell checker can have multiple dictionaries
00:06:00.960 uh for example a spell checker can have both the oxford dictionary and wiktionary
00:06:07.120 which is available on the web and sometimes even more
00:06:12.240 and this is because every single dictionary out there has different characteristics and then one dictionary
00:06:18.400 can't always cover everything for example british english is a bit different from american english
00:06:25.280 and then you want to implement a spell checker that can correct both or sometimes it shouldn't because for
00:06:30.880 example optimization is sometimes spelled with z and sometimes with s and if you were in america then the spell checker shouldn't suggest
00:06:37.280 it and then another example could be um
00:06:45.280 a spell checker for japanese i think this is also true for other languages like chinese
00:06:50.880 but the reason why it needs multiple dictionaries is that it is typical that when you're writing japanese
00:06:57.120 you always use english words and then you probably need two different dictionaries so that the spell checker can
00:07:03.759 correct um even english words while you are writing something
00:07:11.199 the next one is control mechanism and mathematically it is just a formula it
00:07:16.240 usually checks uh whether or not each one of the words in the dictionary is the right one
00:07:22.639 a spell checker can have multiple formulas because one formula doesn't always cover everything for the same
00:07:28.000 reason as why dictionaries can't and since it is just a formula uh
00:07:33.759 some metrics must be calculated based on the user input and um and the words in the dictionary
00:07:41.039 there are many types of metrics out there like um calculating similarity between two
00:07:46.400 strings with levenshtein and jaro-winkler and hamming and there's a lot more
00:07:52.720 and naively scanning all the words in the dictionary will be painfully slow because let's say english
00:07:59.360 there are like four million words and then you really don't want to scan everything because it takes time
00:08:06.800 so optimization it basically improves performance or accuracy sometimes both
00:08:14.479 there are many many many optimization techniques out there but they're actually just
00:08:19.840 they're usually context specific like for example this spell checker has an optimization technique but if you have another one
00:08:26.720 then this one can't use the other one but it's really powerful because very often optimization is what makes
00:08:33.760 spell checkers great so i think we really should
00:08:40.000 so if you are writing a spell checker then you really should optimize
00:08:46.160 now let's take an example let's take a look at some examples um
00:08:52.480 obviously the second one is wrong they were traveling and it is it is easy for humans to
00:08:59.200 choose the right one yeah it is clear that one is the correct one but for computers it is really hard to choose uh
00:09:06.000 which one is the right one because computers are not smart and in this case you can just implement
00:09:11.519 some kind of grammar analysis because uh were can come between they and
00:09:17.920 traveling but where can't stay in the middle of they and traveling
00:09:27.760 another example will be something like this and then this sentence looks really weird but
00:09:32.880 spell checkers are not only for humans for example if you talk to siri it can't recognize the difference between know as a
00:09:39.600 verb and no as an answer in this case you can just create a dictionary for
00:09:46.399 words with the same sound so that we can tell the spell checker
00:09:52.080 to pick up the right one now we know that a spell checker can
00:09:59.120 have three things uh dictionary control mechanism and optimization
00:10:06.000 so what's the dictionary of the did_you_mean gem and what control mechanisms does it use and are there any optimizations
00:10:12.240 in it the dictionary of the did_you_mean gem is simply just a list of symbols
00:10:19.920 and it calculates the levenshtein distance for each word in the dictionary
00:10:25.120 against the user input and then suggests the ones that are within the
00:10:30.839 threshold so what is levenshtein distance it is actually quite simple let's say you have two strings
00:10:37.440 in the previous example we talked about start_with? and starts_with?
00:10:53.600 so here you can see start_with? and starts_with? and obviously there's a one character difference between them
00:11:01.360 the letter s between t and underscore which means if you remove one letter
00:11:06.640 uh from the second one then they will be identical that means the levenshtein distance
00:11:12.959 will be one now let's take a look at this example of
00:11:18.000 first_name and full_name obviously there's a three letter
00:11:24.320 difference as well as one extra letter in first_name
00:11:29.760 so the levenshtein distance will be four
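
The two distances quoted above can be checked with a small dynamic-programming sketch. This is an illustrative implementation written for this transcript, not the gem's own code; the method name `levenshtein` is an assumption.

```ruby
# Minimal Levenshtein distance sketch (illustrative, not the gem's code).
# Counts the fewest single-character insertions, deletions and
# substitutions needed to turn string a into string b.
def levenshtein(a, b)
  prev = (0..b.length).to_a            # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                         # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j] + 1,                   # delete ca
        curr[j - 1] + 1,               # insert cb
        prev[j - 1] + cost             # substitute (or keep) the character
      ].min
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("start_with?", "starts_with?")  # one extra "s"
puts levenshtein("first_name", "full_name")      # three substitutions plus one extra letter
```

Keeping only the previous row keeps memory proportional to the length of the second string, which matters when the same input is compared against every word in a dictionary.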
00:11:35.120 and the did_you_mean gem has just one optimization which is a context-based dictionary
00:11:41.440 i'm not sure if i should call it an optimization because this is what i did since the beginning but
00:11:48.720 basically if you want to get the list of all the names then you can just call
00:11:54.399 Symbol.all_symbols and so how many symbols are defined in a
00:12:01.279 ruby process as you can see there are about uh 2500 symbols if you
00:12:08.639 just run ruby -e with Symbol.all_symbols.size
00:12:16.399 so how many symbols are defined when you just do rails new and then rails g scaffold and rake db:migrate and then rails
00:12:23.600 c i'm going to ask you guys raise your hand if you think it's
00:12:29.360 five thousand ten thousand
00:12:35.839 twenty thousand fifty thousand
00:12:44.000 a hundred thousand okay so the answer is uh about 20
00:12:51.440 thousand symbols which is quite a lot because
00:12:56.560 every time you get an error you really don't want to scan 20 thousand words because it takes time
00:13:02.959 and as you can see here the number of methods available are relatively small like for for string we have um 26 edge
00:13:10.320 methods for f1 red we have 22 346 methods and one hash
00:13:19.720 236 and for user model it has about 600 methods and then a user
00:13:27.360 object has about 400. so every time you get an error problem
00:13:34.240 like hash it doesn't have to scan all the symbols which is about uh
00:13:39.920 20 20 000. in this case you can just scan about
00:13:45.040 20 50 300 methods
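
The size gap described here can be observed directly in any Ruby process. The snippet below is a rough sketch; the helper name `context_dictionary` is hypothetical, and the exact counts vary by Ruby version and loaded libraries, so no numbers are asserted.

```ruby
# Compare the global symbol table with the method list of one receiver.
all_symbols  = Symbol.all_symbols
hash_methods = Hash.new.methods

puts "symbols in this process: #{all_symbols.size}"
puts "methods on a Hash:       #{hash_methods.size}"

# Hypothetical helper: a context-based dictionary only contains the names
# that are valid for the object the failed call was sent to.
def context_dictionary(receiver)
  receiver.methods
end

puts context_dictionary("ruby").include?(:start_with?)
```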
00:13:50.800 and the did_you_mean gem uses a pattern called the finder pattern that means um
00:13:57.279 for example you get a NameError and it says uninitialized constant then it's going to use the constant finder which only knows
00:14:03.760 about the list of constant names and then if the error is a NoMethodError then it's going to use
00:14:11.600 the method finder which only knows about method names that you can call
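
The finder pattern described above can be sketched as a lookup from exception class to a small class that supplies the dictionary. All class, constant and method names below are hypothetical and chosen for illustration; the gem's real finders may be organised differently. The sketch assumes Ruby 2.3 or later, where `NameError#receiver` is available.

```ruby
# Hypothetical finder classes; each one only knows its own kind of names.
class ConstantFinderSketch
  def initialize(error)
    @error = error
  end

  # Dictionary: constant names only (top-level ones, for brevity).
  def dictionary
    Object.constants
  end
end

class MethodFinderSketch
  def initialize(error)
    @error = error
  end

  # Dictionary: only the methods the receiver actually responds to.
  def dictionary
    @error.receiver.methods
  end
end

# Registry from exception class to finder class.
FINDERS_SKETCH = {
  NoMethodError => MethodFinderSketch,
  NameError     => ConstantFinderSketch
}.freeze

def finder_for(error)
  FINDERS_SKETCH.fetch(error.class).new(error)
end

begin
  "ruby".starts_with?("r")   # plain Ruby only defines start_with?
rescue NoMethodError => e
  puts finder_for(e).dictionary.include?(:start_with?)
end
```

Because each finder returns a narrow dictionary, the distance calculation runs over a few hundred names instead of the whole symbol table.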
00:14:19.040 so this is how the did_you_mean gem works and this is actually the like how the
00:14:24.959 latest version of did_you_mean works and um but um it is actually different from um
00:14:32.160 the one that is available on github now we now we know um how it works
00:14:40.160 but we don't know how accurate it is because sometimes it doesn't suggest anything
00:14:46.480 sometimes it suggests too many methods and we want to know how accurate it is
00:14:54.240 how are we going to do this we can't just test it with ruby symbols because it's hard to collect uh
00:14:59.600 typing data while you're programming so i'm gonna just use uh existing um data
00:15:05.519 that is available on the internet and i'm gonna use um wiktionary simple
00:15:12.160 english as a dictionary and simple english is a dictionary that only contains essential english words
00:15:19.120 uh because wiktionary also has a whole english dictionary but it
00:15:25.199 has 4 million words so i don't want to use it because it takes time and then while you're programming i don't think
00:15:30.959 we use really hard to remember words because you want to make it simple which means you
00:15:37.920 only use essential words i'm going to also use
00:15:43.759 a list of correct and incorrect pairs from the birkbeck spelling error corpus
00:15:51.920 there was a study and that data was used by that study and everything is available on the web
00:15:59.120 so i'm gonna upload this slide later so you can check them out
00:16:06.240 and this is a result of the evaluation as you can see here the accuracy right now is about 54 percent which
00:16:13.920 is actually not high so why is it low uh what kind of names
00:16:19.279 can the spell checker not correct
00:16:24.560 obviously there are many cases where i remember a method name incorrectly and
00:16:31.199 the current spell checker doesn't catch it so let's just optimize it
00:16:39.360 and you may already realize it but sometimes i say mistype and then sometimes i say misspell
00:16:45.120 and they are actually different a study said that spell checkers that
00:16:50.320 can correct mistypes can't always correct misspells and it is easy to correct mistypes
00:16:56.800 because you can just calculate edit distance with levenshtein distance and then if you
00:17:02.720 just make a typo like for example you try to hit a and accidentally hit s then there will be just one character
00:17:08.959 difference and then the spell checker can correct that mistake
00:17:14.959 but when it comes to uh like spelling mistakes like you don't remember the method name correctly and
00:17:22.880 you don't know how to type it and you try
00:17:28.880 to guess but it doesn't always
00:17:33.919 catch the right one and another study says that um you always remember uh the first part of the
00:17:41.840 method name or not just method names but names in general like yesterday matz called me nishida-san
00:17:48.000 and he remembered uh the first part of my name but didn't remember the
00:17:53.600 last part so i i guess it's a good
00:17:58.840 example and now it's time to use jaro-winkler distance um
00:18:05.200 what is jaro-winkler distance it is basically jaro distance plus a
00:18:10.559 prefix bonus um there's another distance metric called jaro and the prefix bonus has been added because you
00:18:17.520 always remember the first part of the uh name and if you add a prefix bonus
00:18:22.559 then you can pick up the right one because it has the bonus
00:18:29.600 so what is the jaro distance there are two important metrics uh m and
00:18:36.000 t and the first one is the number of matching letters and the second one is
00:18:41.200 half the number of transpositions
00:18:46.640 take a look at this example here you can find first_name and the second one has
00:18:53.760 a wrong character and to calculate m it checks if each letter appears in the
00:19:00.720 first and the other one and here you can see just four arrows
00:19:07.120 and then the question is does it actually have to scan everything
00:19:12.320 and the answer is actually no because it has a matching window which means let's say you have two long
00:19:18.640 strings and the first one has the character a in the first place and then the second one has the a in the last place but
00:19:25.600 it doesn't make sense to match these two because they are too far apart
00:19:33.360 so we don't have to check these ones and as you can see every letter here
00:19:39.679 appears in the other one which means the number of matches will be just 10.
00:19:47.440 and there are two transpositions meaning um t is gonna be just one because it's
00:19:54.400 gonna be half the number of transpositions here now we know m and t
00:20:01.520 and you can calculate the distance with this formula and it's going to be
00:20:07.640 0.9666 the next thing we have to do is to calculate a prefix bonus
00:20:17.039 to calculate it we only care about the first four letters in the strings
00:20:23.280 so let's forget about the rest we only look at the first four
00:20:28.400 and check if each one matches the other one
00:20:35.039 and obviously the first one matches but the second one doesn't
00:20:40.320 and even if the second one appears in the third place in the other string it should stop counting if it doesn't
00:20:47.440 match so in this case the letter i should appear in the second place but it doesn't so it should
00:20:52.960 just stop counting which means in this case the prefix
00:20:58.480 match is gonna be just one
00:21:03.520 and we're going to use this formula to calculate the bonus where
00:21:09.200 w is a weight which is usually 0.1 and mp is the number of prefix matches which is just one
00:21:16.000 and then here dj is the jaro distance
00:21:21.919 which means the prefix bonus is going to be zero point zero zero three three three three
00:21:29.679 and since jaro-winkler distance is just jaro distance plus prefix bonus we can just uh combine these two
00:21:35.919 so we're gonna get 0.96999999
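
The calculation walked through above can be reproduced with a short sketch. The slide itself is not part of the transcript, so the pair `first_name` / `frist_name` is an assumption; it is consistent with the quoted figures (10 matches, 2 transposed letters, 1 matching prefix letter). The method names `jaro` and `jaro_winkler` are likewise illustrative and this is not the gem's own code.

```ruby
# Jaro similarity sketch: (m/|a| + m/|b| + (m - t)/m) / 3, where m is the
# number of matching letters and t is half the number of transposed letters.
def jaro(a, b)
  return 1.0 if a == b
  return 0.0 if a.empty? || b.empty?

  # Matching window: letters further apart than this are never matched.
  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)
  m = 0

  a.each_char.with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != ch
      a_matched[i] = b_matched[j] = true
      m += 1
      break
    end
  end
  return 0.0 if m.zero?

  # Matched letters in order; positions that disagree are transposed.
  a_seq = a.chars.select.with_index { |_, i| a_matched[i] }
  b_seq = b.chars.select.with_index { |_, j| b_matched[j] }
  t = a_seq.zip(b_seq).count { |x, y| x != y } / 2.0

  (m / a.length.to_f + m / b.length.to_f + (m - t) / m) / 3.0
end

# Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to four letters.
def jaro_winkler(a, b, weight: 0.1, max_prefix: 4)
  dj = jaro(a, b)
  prefix = 0
  a.chars.zip(b.chars).first(max_prefix).each do |x, y|
    break unless x == y        # stop counting at the first mismatch
    prefix += 1
  end
  dj + prefix * weight * (1 - dj)
end

puts jaro("first_name", "frist_name").round(4)          # about 0.9667
puts jaro_winkler("first_name", "frist_name").round(4)  # about 0.97
```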
00:21:43.600 so yeah um as you can see they are pretty close which is why we get a value which is really close to
00:21:49.840 one if everything's the same the jaro-winkler distance is going to be just one
00:21:56.400 because they are the same now let's talk about the misspell
00:22:01.600 correction in the did_you_mean gem it uses jaro-winkler distance and then picks up the closest one only if no
00:22:09.520 mistype corrections are made and then the levenshtein distance
00:22:14.720 i'm sorry the levenshtein distance should be lower than the length of the shorter word
00:22:20.720 because um sometimes jaro-winkler distance could be really high even if the levenshtein distance is um
00:22:28.080 really high um so uh levenshtein distance should be low
00:22:33.280 otherwise it's going to suggest something that is not related
00:22:40.480 now let's rerun the evaluation um we can use the same script and see
00:22:45.679 uh how much it is improved and now it's actually better the
00:22:50.880 accuracy increased by about seven percent and that's about eighty percent accurate
00:22:56.720 which is great but it's but it is also true that twenty
00:23:02.240 percent of the time it is wrong so what are the corrections that didn't go well
00:23:08.240 uh this is just one of them face fate and space
00:23:13.600 the reason why they are misspelled is that um they sound quite similar so like
00:23:18.640 for example if you say phase then somebody thinks that oh it is f a z e but it's actually p h a s e but
00:23:25.840 um both levenshtein and jaro-winkler distance cannot catch it because the distance is quite low and and uh
00:23:33.520 and the first character is actually different another example like this female
00:23:39.840 email female and then night unite night same problem
00:23:45.600 and then the last one uh this is interesting because it always happens to me like you really don't know it is s or
00:23:54.240 c and then i don't know like how many s do i have to type how many cs do i have to type it's really confusing but
00:24:02.400 as you can see most of the errors are coming from the fact that they sound quite similar
00:24:08.480 but have different letters like c or s uh ph or f
00:24:14.480 in other words if i have to improve the did_you_mean gem even more
00:24:19.760 i should probably apply like a pronunciation based optimization
00:24:26.320 now let's talk about writing a finder this is the last section of this talk but yeah it's really great so
00:24:33.360 the reason why you want to write a finder is that let's say you have a you're writing a rails app and then
00:24:38.960 sometimes you do something like this for example you use uh active record and then mistype
00:24:46.000 an attribute name in the database and you will get an unknown attribute error
00:24:51.120 but it doesn't correct our mistake because it's not NameError it's not NoMethodError so it doesn't know how to
00:24:57.840 correct this how to correct the
00:25:03.279 mistake here i want something like this so if i make a typo in the hash
00:25:10.480 then it should suggest a name so that i can easily realize that
00:25:15.760 i'm doing something wrong as i said earlier did_you_mean has a
00:25:21.360 couple of finders by default but you can also add a new one if you want which is great
00:25:28.640 so let's just implement it here you can see a class called AttributeNameFinder which includes DidYouMean::BaseFinder um
00:25:36.000 i'm not actually sure if it's a good name i should probably change it to something else if i come up with a better one
00:25:45.600 and what you really have to implement is just two methods an initializer and a searches method
00:25:53.600 the initializer takes an exception object and then you can grab things like a binding object and
00:25:59.200 original_message and what's important here is that you really have to call original_message
00:26:04.559 because this finder is evaluated while it is trying to generate a message so if you call just message then it's
00:26:11.279 going to be a stack overflow error because it tries to generate a message it tries
00:26:16.320 to call the finder and the finder calls the message method so it's not good so it's really important to call original_message
00:26:24.960 and the searches method should return a hash where the key should be a user input and
00:26:30.000 the value should be a dictionary and if it has attribute
00:26:36.720 methods sorry attribute_name and column_names then you can just implement it like
00:26:41.840 this attribute_name is coming from the original message and then column_names
00:26:49.360 is coming from column_names so here eval self.class is actually an
00:26:55.440 active record class and then if you say column_names you can get a list of names
00:27:03.840 and don't forget to add the new finder to the list of finders
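
The shape of such a finder, an initializer that takes the exception plus a `searches` method returning `{ user_input => dictionary }`, can be illustrated without loading the gem or Rails. Everything below is a stand-alone sketch with hypothetical names (`UnknownAttributeErrorSketch`, `UserSketch`, `AttributeNameFinderSketch`); it does not include the gem's base module or register itself with the gem, and the error message format is assumed.

```ruby
# Stand-in for an Active Record style error that knows which model raised it.
class UnknownAttributeErrorSketch < StandardError
  attr_reader :record_class

  def initialize(message, record_class)
    super(message)
    @record_class = record_class
  end
end

# Stand-in model exposing its column names.
class UserSketch
  def self.column_names
    %w[id first_name last_name email]
  end
end

# Hypothetical finder mirroring the two methods described in the talk.
class AttributeNameFinderSketch
  def initialize(exception)
    # In the gem, per the talk, the text would be read from original_message
    # so that building the message does not recurse into the finder again.
    @attribute_name = exception.message[/unknown attribute '(\w+)'/, 1]
    @column_names   = exception.record_class.column_names
  end

  # Key: what the user typed. Value: the dictionary to search.
  def searches
    { @attribute_name => @column_names }
  end
end

error  = UnknownAttributeErrorSketch.new("unknown attribute 'frist_name' for UserSketch", UserSketch)
finder = AttributeNameFinderSketch.new(error)
p finder.searches
```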
00:27:10.000 so before we got something like this but after you implement the finder then
00:27:15.279 you're gonna get something like this which is great uh it is available on github so check it
00:27:21.919 out so as matz said it uh it's going
00:27:28.159 to be bundled when ruby 2.3 is coming out
00:27:33.200 but um there's still a lot of things that i have to do like removing support for non-mri rubies
00:27:40.480 like jruby and rubinius um if it is going to be bundled as a
00:27:46.399 part of ruby then it shouldn't know about jruby it shouldn't know about rubinius it shouldn't know about old
00:27:51.440 rubies it shouldn't know about rails bundler or rubygems and then the next thing i have to do is
00:27:58.640 stop monkey patching right now the did_you_mean gem has monkey patching and also a c extension and then
00:28:07.279 i'm expecting the next version of ruby uh ruby version two
00:28:13.360 to include the extension and also hopefully i don't have to
00:28:18.720 do monkey patching anymore so so yeah
00:28:24.880 there's still a lot of things that i have to do but hopefully i can ship it with the next version
00:28:31.360 of ruby and then one last thing i want to tell you today is that
00:28:37.120 did_you_mean totally works with emojis
00:28:49.760 all right that's pretty much it
00:28:59.520 thank you so much uh we have time for a couple of quick questions if you have any if you do please come up to the mic
00:29:12.480 hi thanks for the speech um i have a question i'm trying to implement tfidf the problem i face is with the tf with
00:29:19.760 the idf whereby i'm trying to look for a corpus with the document frequencies and also if i get
00:29:27.440 it because it's quite large how what data format would i best put it in to actually do a fast query
00:29:34.960 uh can you say again so your question is uh you want to implement the finder but you want to change the format
00:29:40.640 i'm trying to implement tf idf uh whereby it tries to find the importance of a word
00:29:46.240 in document so i'm trying to find a good corpus and also if i find it how would i best
00:29:54.080 store it so that i could do a quick query
00:29:59.279 um i don't know actually what i can think of is to implement like a vim plugin or emacs plugin or rubymine
00:30:06.720 plugin to automatically capture what you type and then send it to somewhere else so that you can
00:30:13.520 collect uh like what you type and where you actually mistype or misspell so yeah it's a
00:30:19.919 it's a difficult question because i use a corpus that is available
00:30:25.679 on the internet but it was created like back in 1980 and it could be really old
00:30:32.480 so yeah so the the evaluation script is not actually good
00:30:38.720 enough okay thank you