00:00:19.359
hi thank you for welcoming me uh here in singapore um my name is yuki uh today
00:00:24.800
i'm gonna talk about the did_you_mean experience and beyond uh this is me uh my name is yuki
00:00:30.880
nishijima um you can find me on github and twitter
00:00:36.880
and i think yesterday matz called me nishida-san which is actually not the right one
00:00:42.960
my family name is nishijima so i wanted to ask him did you mean nishijima
00:00:52.160
so yeah this is me this is me on github i like ward um
00:00:58.800
i'm the creator of the did_you_mean gem and i also maintain the kaminari gem which is a rails pagination gem
00:01:07.439
i was born in japan and i live in new york right now and i also used to live in
00:01:13.280
the philippines before and then i actually came to singapore for reddot
00:01:18.560
ruby conf in 2011 and 2013 and i'm really excited
00:01:23.920
to be here as a speaker this year um i work for a company called pivotal
00:01:29.280
labs pivotal labs is a consultancy that helps build software with agile
00:01:34.479
practices like pair programming tdd and things like that
00:01:39.520
and i think some of you already know pivotal labs because there used to be a branch in
00:01:45.840
singapore so this is standup pair programming
00:01:50.880
and ping pong we're going to also open a branch in tokyo this summer so if you are
00:01:57.280
interested in working with us just let me know all right um so let's talk about today's main
00:02:03.920
topic the did_you_mean experience um
00:02:09.039
matz actually uh talked about it yesterday already but um let me just give you a quick introduction because uh
00:02:15.520
some of you might not have used it before so let's say you want to check if a string object starts with a specific letter so
00:02:23.040
maybe you can use a starts_with? method but sadly it doesn't work because it
00:02:28.640
only responds to start_with? not starts_with?
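The situation described here can be reproduced in a few lines. This is a minimal sketch that assumes plain Ruby with Active Support not loaded; the example string is made up.

```ruby
# Plain Ruby (no Active Support loaded): String defines start_with?,
# but not the starts_with? alias that Active Support adds.
word = "singapore"

puts word.start_with?("s") # the core method works

begin
  word.starts_with?("s")
rescue NoMethodError => e
  # On Rubies that bundle did_you_mean, the message may also carry a suggestion.
  puts e.message
end
```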
00:02:36.239
it often happens because you are not using rails and more specifically um active support and because active
00:02:42.720
support has an alias starts_with? but if you use the did_you_mean gem as soon as
00:02:48.959
you get an error it will automatically look for what you really wanted to call and then suggest it to you so you don't
00:02:55.599
have to waste your time like you don't have to google you don't have to go to like rubydoc it's going to just look for what
00:03:02.560
you really wanted to call so since last year i started getting many questions about this gem um
00:03:10.239
and one of them is like um it's great but sometimes it doesn't
00:03:15.920
correct when it's like for example uh you type like a long method name and
00:03:22.239
then you realize that uh you just don't remember the correct one and then did_you_mean fails
00:03:29.440
to search for it and then sometimes it displays too many corrections
00:03:34.480
because it's not super smart so sometimes like for example a rails path helper and then you type
00:03:40.879
uh like api something something path and it's gonna suggest a lot of things
00:03:47.120
and another example could be i misspelled a table name in the database and then did_you_mean just doesn't suggest
00:03:53.920
that this is actually interesting because did_you_mean is designed to work with NameError
00:04:00.080
and NoMethodError but if you type uh like a column name table name then
00:04:06.159
did_you_mean doesn't actually suggest anything so let's talk about let's learn about
00:04:12.400
how to write a spell checker in general first so that you can you can understand easily uh this is not actually about
00:04:19.120
ruby specific topics but it's about computer science and the history of building a computer
00:04:24.720
spell checker is actually quite long even longer than my age so let's learn from the history so that
00:04:30.639
you can easily understand how it works so first of all so what's a spell
00:04:36.000
checker basically it behaves like a function that takes a user input
00:04:41.040
like this and obviously the input is uh what people actually typed
00:04:47.120
and the input usually has noise like you make a typo you misspell something and you don't remember a method
00:04:52.960
name so basically it's just noise
00:04:58.400
and a spell checker gives us the output back uh which is what is most likely to be
00:05:03.440
intended so it is actually pretty simple it takes an input it gives us back the correct one
00:05:12.080
but what's inside the blue box in this case so usually a spell checker consists of
00:05:18.000
three things the first one is dictionary which is basically just a set of words
00:05:24.320
the second one is control mechanism which decides what to return as the correction and the last one is optimization
00:05:32.400
and technically a spell checker can work with only a dictionary and a
00:05:37.520
control mechanism but usually some optimization techniques are applied to improve the
00:05:43.440
checker like performance accuracy so what is dictionary again it is
00:05:50.720
basically just a set of words and it usually comes from from an actual dictionary which is why it is called
00:05:55.759
dictionary and a spell checker can have multiple dictionaries
00:06:00.960
uh for example sometimes a spell checker can have both the oxford dictionary and then sometimes wiktionary
00:06:07.120
which is available on the web and sometimes even more
00:06:12.240
and this is because every single dictionary out there has different characteristics and then one dictionary
00:06:18.400
can't always cover everything for example british english is a bit different from american english
00:06:25.280
and then you want to implement a spell checker that can correct both or sometimes it shouldn't because for
00:06:30.880
example optimization has sometimes z sometimes s and if you are in america then the spell checker shouldn't suggest
00:06:37.280
it and then another example could be um
00:06:45.280
a spell checker for japanese i think this is also true for other languages like chinese
00:06:50.880
but the reason why it needs multiple dictionaries is that it is typical that when you're writing japanese
00:06:57.120
you always use english words and then you probably need two different dictionaries so that the spell checker can
00:07:03.759
correct um even english words while you are writing something
00:07:11.199
the next one is control mechanism and mathematically it is just a formula it
00:07:16.240
usually checks uh whether or not each one of the words in the dictionary is the right one
00:07:22.639
a spell checker can have multiple formulas because one formula doesn't always cover everything for the same
00:07:28.000
reason as why dictionaries can't and since it is just a formula uh
00:07:33.759
some metrics must be calculated based on the user input and um and the words in the dictionary
00:07:41.039
there are many types of metrics out there like um ones used to calculate similarity between two
00:07:46.400
strings like levenshtein and jaro-winkler and hamming and there's a lot more
00:07:52.720
and naively scanning all the words in the dictionary will be painfully slow because let's say english
00:07:59.360
there are like four million words and then you really don't want to scan everything because it takes time
00:08:06.800
so optimization it basically improves performance or accuracy sometimes both
00:08:14.479
there are many many many optimization techniques out there but they're actually just
00:08:19.840
they're usually context specific like for example this spell checker has an optimization technique but if you have another one
00:08:26.720
then this one can't use the other one but it's really powerful because really often optimization is what makes
00:08:33.760
spell checkers great so i think you really should
00:08:40.000
so if you are writing a spell checker then you really should optimize
00:08:46.160
now let's take an example let's take a look at some examples um
00:08:52.480
obviously the second one is wrong they were traveling and it is it is easy for humans to
00:08:59.200
choose the right one yeah it is clear that one is the correct one but for computers it is really hard to choose uh
00:09:06.000
which one is the right one because computers are not smart and in this case you can just implement
00:09:11.519
some kind of grammar analysis because uh were can come between they and
00:09:17.920
traveling but where can't stay in the middle of they and traveling
00:09:27.760
another example will be something like this and then this sentence looks really weird but
00:09:32.880
spell checkers are not only for humans for example if you talk to siri it can't recognize the difference between know as a
00:09:39.600
verb and no as an answer in this case you can just create a dictionary for the same
00:09:46.399
sound of a word so that we can tell a spell checker
00:09:52.080
to pick up the right one now we know that a spell checker can
00:09:59.120
have three things uh dictionary control mechanism and optimization
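The first two of those three parts can be sketched as a toy spell checker. `ToySpellChecker`, the word list and the placeholder metric are all made up for illustration, and no optimization is applied, so every word is scanned.

```ruby
# A toy spell checker: a dictionary plus a control mechanism, no optimization.
class ToySpellChecker
  def initialize(dictionary, &metric)
    @dictionary = dictionary # the set of known words
    @metric = metric         # control mechanism: scores a (word, input) pair, lower is closer
  end

  # Scans every word in the dictionary and returns the closest one.
  def correct(input)
    @dictionary.min_by { |word| @metric.call(word, input) }
  end
end

# Placeholder metric (an assumption for this sketch): the number of positions
# that differ, plus the difference in length.
naive_metric = lambda do |word, input|
  shorter, longer = [word, input].sort_by(&:length)
  mismatches = shorter.chars.each_index.count { |i| shorter[i] != longer[i] }
  mismatches + (longer.length - shorter.length)
end

checker = ToySpellChecker.new(%w[travel traveling were where], &naive_metric)
puts checker.correct("wher")
```

A real checker would swap the placeholder metric for something like an edit distance and would avoid scanning the whole dictionary.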
00:10:06.000
so what's the dictionary of the did_you_mean gem and what control mechanisms does it use and are there any optimizations
00:10:12.240
in it the dictionary of the did_you_mean gem is simply just a list of symbols
00:10:19.920
and it calculates levenshtein distance for each word in the dictionary
00:10:25.120
and the user input and then suggests the ones that are within the
00:10:30.839
threshold so what is levenshtein distance it is actually quite simple let's say you have two strings
00:10:37.440
in the previous example we talked about start_with and starts_with
00:10:53.600
so here you can see start_with and starts_with and obviously there's a one character difference between them
00:11:01.360
the s letter between t and underscore which means if you remove one letter
00:11:06.640
uh from the second one then they will be identical that means the levenshtein distance
00:11:12.959
will be one now let's take a look at this example of
00:11:18.000
first_name and full_name obviously there's a three letter
00:11:24.320
difference as well as one extra letter in first_name
00:11:29.760
so the levenshtein distance will be four
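Both examples can be checked with a textbook dynamic-programming version of Levenshtein distance. This is a sketch of the general algorithm, not the gem's own implementation, and the method name `levenshtein` is an assumption.

```ruby
# Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
def levenshtein(a, b)
  # previous[j] is the distance between the prefix of a handled so far and the first j chars of b
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [previous[j] + 1,            # deletion
                  current[j - 1] + 1,         # insertion
                  previous[j - 1] + cost].min # substitution or match
    end
    previous = current
  end
  previous.last
end

puts levenshtein("start_with", "starts_with") # 1
puts levenshtein("first_name", "full_name")   # 4
```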
00:11:35.120
and the did_you_mean gem has just one optimization which is a context-based dictionary
00:11:41.440
i'm not sure if i should call it an optimization because this is what i did since the beginning but
00:11:48.720
basically if you want to get the list of all the names then you can just call Symbol dot
00:11:54.399
all_symbols and so how many symbols are defined in a
00:12:01.279
ruby process as you can see there are about uh 2500 symbols if you
00:12:08.639
just call it with ruby -e and Symbol.all_symbols.size
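A quick way to see the size of that global dictionary. The exact count depends on the Ruby version and on what has been loaded, so no particular number is assumed here.

```ruby
# Every symbol known to the current Ruby process.
all_names = Symbol.all_symbols
puts all_names.size

# For comparison: the methods a single String responds to.
puts "hello".methods.size
```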
00:12:16.399
so how many symbols are defined when you just do rails new and then rails g scaffold rake db:migrate and then rails
00:12:23.600
c i'm going to ask you guys raise your hand if you think it's
00:12:29.360
five thousand ten thousand
00:12:35.839
twenty thousand fifty thousand
00:12:44.000
a hundred thousand okay so the answer is uh about 20
00:12:51.440
thousand symbols which is quite a lot because
00:12:56.560
every time you get an error you really don't want to scan 20000 words because it takes time
00:13:02.959
and as you can see here the number of methods available is relatively small like for string we have um 260-ish
00:13:10.320
methods for array we have 22 346 methods and for hash
00:13:19.720
236 and for user model it has about 600 methods and then a user
00:13:27.360
object has about 400. so every time you get an error from
00:13:34.240
like hash it doesn't have to scan all the symbols which is about uh
00:13:39.920
20 000. in this case you can just scan about
00:13:45.040
250 300 methods
00:13:50.800
and the did_you_mean gem uses a pattern called the finder pattern that means um
00:13:57.279
for example you get a NameError and it says uninitialized constant then it's going to use the constant finder which only knows
00:14:03.760
about the list of constant names and then if the error is NoMethodError then it's going to use
00:14:11.600
the method finder which only knows about method names that you can call
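The idea of picking a dictionary per error type can be sketched as below. `dictionary_for` is a hypothetical helper, not the gem's finder classes, and it assumes Ruby 2.3 or later, where `NameError#receiver` is available.

```ruby
# Hypothetical helper: pick a dictionary based on the kind of error.
# NoMethodError is a subclass of NameError, so it has to be checked first.
def dictionary_for(error)
  case error
  when NoMethodError
    error.receiver.methods # only the methods callable on the receiver
  when NameError
    Object.constants       # only top-level constant names
  else
    []
  end
end

begin
  "hello".starts_with?("h")
rescue NoMethodError => e
  dictionary = dictionary_for(e)
  puts dictionary.size
  puts dictionary.include?(:start_with?)
end
```

Either dictionary is far smaller than the full symbol table, which is the point of the context-based approach.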
00:14:19.040
so this is how the did_you_mean gem works and this is actually how the
00:14:24.959
latest version of did_you_mean works and um but um it is actually different from um
00:14:32.160
the one that is available on github now we now we know um how it works
00:14:40.160
but we don't know how accurate it is because sometimes it doesn't suggest that
00:14:46.480
sometimes it suggests too many too many methods and we want to know how accurate is it
00:14:54.240
how are we going to do this we can't just test it with ruby symbols because it's hard to collect uh
00:14:59.600
typing data while you're programming so i'm gonna just use uh existing um data
00:15:05.519
that is available on the internet and i'm gonna use um wiktionary simple
00:15:12.160
english as a dictionary and simple english is a dictionary that only contains essential english words
00:15:19.120
uh because wiktionary also has a whole english dictionary but it
00:15:25.199
has 4 million words so i don't want to use it because it takes time and then while you're programming i don't think
00:15:30.959
we use really hard to remember words because you want to make it simple which means you
00:15:37.920
only use essential words i'm going to also use
00:15:43.759
a list of correct and incorrect pairs from the birkbeck spelling error corpus
00:15:51.920
there was a study and that data was used by that study and everything is available on the web
00:15:59.120
so i'm gonna upload this slide later so you can check them out
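An evaluation of this kind boils down to counting how often the checker returns the intended word. The sketch below is only an illustration: `accuracy` is a hypothetical helper, the pairs are made up rather than taken from the corpus, and the checker is a stand-in lambda.

```ruby
# Hypothetical evaluation helper: the fraction of misspellings for which the
# checker's suggestion is the intended word.
def accuracy(pairs, checker)
  hits = pairs.count { |misspelling, intended| checker.call(misspelling) == intended }
  hits.to_f / pairs.size
end

# Made-up pairs for illustration only; these are not taken from the corpus.
pairs = { "recieve" => "receive", "seperate" => "separate", "adress" => "address" }

# Stand-in checker that only knows one correction, so 1 of 3 pairs is a hit.
checker = ->(word) { { "recieve" => "receive" }[word] }

puts accuracy(pairs, checker)
```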
00:16:06.240
and this is a result of the evaluation as you can see here the accuracy right now is about 54 percent which
00:16:13.920
is actually not high so why is it low uh what kind of names
00:16:19.279
can the spell checker not correct
00:16:24.560
obviously there are many cases where i remember a method name incorrectly and
00:16:31.199
the current spell checker doesn't catch it so let's just optimize it
00:16:39.360
and you may already realize it but sometimes i say mistype and then sometimes i say misspell
00:16:45.120
and then they are actually different a study said that spell checkers that
00:16:50.320
correct mistypes can't always correct misspellings and it is easy to correct mistypes
00:16:56.800
because you can just calculate edit distance with levenshtein distance and then if you
00:17:02.720
just make a typo like for example you try to hit a and accidentally hit s then there will be just one character
00:17:08.959
difference and that's why a spell checker can correct that mistake
00:17:14.959
but when it comes to uh like spelling mistakes like you don't remember the method name correctly
00:17:22.880
you don't know how to spell it and you try
00:17:28.880
to guess but it doesn't always
00:17:33.919
catch the right one and the other study says that um you always remember uh the first part of the
00:17:41.840
method name or not just method names but names in general like yesterday matz called me nishida-san
00:17:48.000
but he remembered uh the first part of my name but didn't remember the
00:17:53.600
last one so i i guess it's a good
00:17:58.840
example and now it's time to use jaro-winkler distance um
00:18:05.200
what is jaro-winkler distance it is basically jaro distance plus
00:18:10.559
prefix bonus um there's another distance metric called jaro and the prefix bonus has been added because you
00:18:17.520
always remember the first part of that uh name and if you add a prefix bonus
00:18:22.559
then you can you can pick up the the right one because it has bonus
00:18:29.600
so what is the jaro distance there are two important metrics uh m and
00:18:36.000
t and the first one is the number of matching letters and the second one is
00:18:41.200
half the number of transpositions
00:18:46.640
take a look at this example here you can find first_name and the second one has
00:18:53.760
a wrong character and to calculate m it checks if the letter appears in the
00:19:00.720
first and the other one and here you can see just four arrows
00:19:07.120
and then the question is does it actually have to scan everything
00:19:12.320
and the answer is actually no because it has a matching window which means let's say you have two long
00:19:18.640
strings and the first one has the character a in the first place and then the second one has the character a in the last place but
00:19:25.600
it doesn't make sense if you compare these two because it's too far
00:19:33.360
so we don't have to check these ones and as you can see every letter here
00:19:39.679
appears in the other one which means the matching number will be just 10.
00:19:47.440
and there are two transposed letters meaning um t is gonna be just one because it's
00:19:54.400
gonna be half the number of the transpositions here now we know m and t
00:20:01.520
and you can calculate the distance with this formula and it's going to be
00:20:07.640
0.9666 the next thing we have to do is to calculate the prefix bonus
00:20:17.039
to calculate it let's only care about the first four letters in the strings
00:20:23.280
so let's forget about the rest we only care about the first four
00:20:28.400
and check if each one matches the other one
00:20:35.039
and obviously the first one matches but the second one doesn't
00:20:40.320
and even if the second one appears in the third place in the other string it should stop counting if it doesn't
00:20:47.440
match so in this case i should appear in the second place but it doesn't so it should
00:20:52.960
just stop counting which means in this case the prefix
00:20:58.480
match is gonna be just one
00:21:03.520
and we're going to use this formula to calculate the bonus where
00:21:09.200
w is a weight which is usually 0.1 and mp is the number of prefix matches which is just one
00:21:16.000
and then here j is the jaro distance
00:21:21.919
which means the prefix bonus is going to be zero point zero zero three three three three
00:21:29.679
and since jaro-winkler distance is just jaro distance plus prefix bonus we can just uh combine these two
00:21:35.919
so we're gonna get 0.96999999
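The whole calculation can be sketched in Ruby. This follows the textbook definitions of Jaro and Jaro-Winkler rather than the gem's source. The strings on the slide are not in the transcript, so "frist_name" and "first_name" are assumed here because they fit the numbers quoted (m = 10, t = 1, one prefix match).

```ruby
# Jaro score: 1.0 means identical, 0.0 means nothing in common.
# The talk calls this a distance; the value behaves like a similarity.
def jaro(a, b)
  return 1.0 if a == b
  return 0.0 if a.empty? || b.empty?

  # Matching window: letters only count as matching if their positions are this close.
  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)

  a.each_char.with_index do |char, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != char
      a_matched[i] = b_matched[j] = true
      break
    end
  end

  m = a_matched.count(true)
  return 0.0 if m.zero?

  # t is half the number of matched letters that appear in a different order.
  a_seq = a.chars.select.with_index { |_, i| a_matched[i] }
  b_seq = b.chars.select.with_index { |_, j| b_matched[j] }
  t = a_seq.zip(b_seq).count { |x, y| x != y } / 2.0

  (m.to_f / a.length + m.to_f / b.length + (m - t) / m) / 3
end

# Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to four letters.
def jaro_winkler(a, b, weight: 0.1)
  j = jaro(a, b)
  prefix = a.chars.first(4).zip(b.chars.first(4)).take_while { |x, y| x == y }.size
  j + prefix * weight * (1 - j)
end

puts jaro("frist_name", "first_name")         # (10/10 + 10/10 + 9/10) / 3
puts jaro_winkler("frist_name", "first_name") # adds 1 * 0.1 * (1 - jaro)
```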
00:21:43.600
so yeah um as you can see they are pretty close which is why we get a value which is really close to
00:21:49.840
one if everything's same the jaro-winkler distance is going to be just one
00:21:56.400
because they are same now let's talk about the misspell
00:22:01.600
correction in the did_you_mean gem it uses jaro-winkler distance and then picks up the closest one only if no
00:22:09.520
mistype corrections are made and
00:22:14.720
then the levenshtein distance should be lower than the length of the shorter string
00:22:20.720
because um sometimes jaro-winkler distance could be really high even if the levenshtein distance is really um
00:22:28.080
really high um so uh levenshtein distance should be low
00:22:33.280
and then otherwise it's going to suggest something that is not related
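The two-pass strategy described here could look roughly like this. It is a sketch, not the gem's code: `correct` is a hypothetical helper, the metrics are passed in as callables so the block stays short, the 25 percent mistype threshold is an assumption that the talk does not state, and the demo uses table-backed stand-ins instead of real metrics.

```ruby
# Hypothetical helper: corrects mistypes first, and only falls back to
# misspelling correction when the first pass finds nothing.
def correct(input, dictionary, levenshtein:, jaro_winkler:)
  # Pass 1 (mistypes): keep words within a small edit distance of the input.
  # The 25% threshold is an assumption for this sketch.
  threshold = (input.length * 0.25).ceil
  mistypes = dictionary.select { |word| levenshtein.call(word, input) <= threshold }
  return mistypes unless mistypes.empty?

  # Pass 2 (misspellings): take the single closest word by Jaro-Winkler, but only
  # among words whose Levenshtein distance is below the shorter string's length.
  candidates = dictionary.select do |word|
    levenshtein.call(word, input) < [word.length, input.length].min
  end
  best = candidates.max_by { |word| jaro_winkler.call(word, input) }
  best ? [best] : []
end

# Stand-in metrics backed by small tables, for demonstration only.
demo_lev = ->(word, _input) { { "phase" => 3, "zzzz" => 4 }.fetch(word) }
demo_jw  = ->(word, _input) { word == "phase" ? 0.6 : 0.1 }

p correct("faze", %w[phase zzzz], levenshtein: demo_lev, jaro_winkler: demo_jw)
```

In practice the two callables would be real Levenshtein and Jaro-Winkler implementations.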
00:22:40.480
now let's rerun the evaluation um we can use the same script and see
00:22:45.679
uh how much it is improved and now it's actually better the
00:22:50.880
accuracy increased by about seven percent and then and that's about eighty percent accurate
00:22:56.720
which is great but it's but it is also true that twenty
00:23:02.240
percent of the time it is wrong so what are the corrections that didn't go well
00:23:08.240
uh this is just one of them face fate and phase
00:23:13.600
the reason why they are misspelled is that um they sound quite similar so like
00:23:18.640
for example if you say phase then somebody thinks that oh it is f a z e but it's actually not but
00:23:25.840
um both levenshtein and jaro-winkler distance cannot catch it because the distance is quite low and and uh
00:23:33.520
and the first character is actually different and another example like this female
00:23:39.840
email female and then night unite night same problem
00:23:45.600
and then the last one uh this is interesting because it always happens to me like you really don't know it is s or
00:23:54.240
c and then i don't know like how many s do i have to type how many cs do i have to type it's really confusing but
00:24:02.400
as you can see most of the errors are coming from the fact that they sound quite similar
00:24:08.480
but have different letters like c or s uh ph or f
00:24:14.480
in other words if i have to improve the did_you_mean gem even more
00:24:19.760
i should probably apply like a pronunciation based optimization
00:24:26.320
now let's talk about writing a finder this is the last section of this talk but yeah it's really great so
00:24:33.360
the reason why you want to write a finder is that let's say you're writing a rails app and then
00:24:38.960
sometimes you write something like this for example you use uh active record then mistype
00:24:46.000
an attribute name in the database and you will get an unknown attribute error
00:24:51.120
but it doesn't correct our mistake because it's not NameError it's not NoMethodError so it doesn't know how to
00:24:57.840
correct this how to correct the
00:25:03.279
mistake here i want something like this so if i make a typo in the hash
00:25:10.480
and then it should suggest name so that i can easily realize that
00:25:15.760
i'm doing something wrong as i talked earlier did_you_mean has a
00:25:21.360
couple of finders by default but you can also add a new one if you want which is great
00:25:28.640
so let's just implement it here you can see a class called attribute name finder which includes DidYouMean::BaseFinder um
00:25:36.000
i'm not actually sure if it's a good name i should probably change it to something else if i come up with a new one
00:25:45.600
and what you really have to implement is just two methods an initializer and a searches method
00:25:53.600
the initializer takes an exception object and then you can grab things like a binding object and
00:25:59.200
original_message and what's important here is that you really have to call original_message
00:26:04.559
because this finder is evaluated while it is trying to generate a message so if you call just message then it's
00:26:11.279
going to be a stack overflow error because it tries to generate a message it tries
00:26:16.320
to call the finder it tries to call the message method so it's not good so it's really important to call original_message
00:26:24.960
and the searches method should return a hash where the key should be a user input and
00:26:30.000
the value should be a dictionary and if it responds to attribute
00:26:36.720
methods sorry attribute_name and column_names then you can just implement like
00:26:41.840
this attribute_name is coming from the original message and then column_names
00:26:49.360
is coming from columns so here eval self.class is actually an
00:26:55.440
active record class and then if you say column_names you can get a list of names
00:27:03.840
and don't forget to add the new finder to the list of finders
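The shape of such a finder can be sketched without the gem or a database. Everything below is a stand-in: the exception is faked with a Struct, the message format is an assumption, and including the gem's base finder module and registering the finder are left out so that the block runs on its own.

```ruby
# Stand-in for the exception the finder receives. In the talk the real exception
# exposes original_message and a binding; here both are faked for illustration.
fake_error_class = Struct.new(:original_message, :column_names)

class AttributeNameFinder
  attr_reader :attribute_name, :column_names

  def initialize(exception)
    # Read original_message, not message: as the talk explains, calling message
    # here would run the finder again and recurse.
    @attribute_name = exception.original_message[/unknown attribute:? '?(\w+)/, 1]
    @column_names   = exception.column_names
  end

  # Maps what the user typed to the dictionary it should be checked against.
  def searches
    { attribute_name => column_names }
  end
end

error = fake_error_class.new("unknown attribute 'nmae' for User.", %w[id name email])
p AttributeNameFinder.new(error).searches
```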
00:27:10.000
so before we got something like this but after you implement the finder then
00:27:15.279
you're gonna get something like this which is great uh it is available on github so check it
00:27:21.919
out so actually as matz said it uh it's going
00:27:28.159
to be bundled when ruby 2.3 is coming out
00:27:33.200
but um there's still a lot of things that i have to do like removing support for older mris
00:27:40.480
jruby rubinius um if it is going to be bundled as
00:27:46.399
part of ruby then it shouldn't know about jruby it shouldn't know about rubinius it shouldn't know about old
00:27:51.440
rubies it shouldn't know about rails bundler rubygems and then the next thing i have to do is
00:27:58.640
stop monkey patching right now the did_you_mean gem has monkey patching and also a c extension and then
00:28:07.279
i'm expecting the next version of ruby uh ruby version two
00:28:13.360
to include the extension and also hopefully i don't have to
00:28:18.720
do monkey patching anymore so so yeah
00:28:24.880
there's still a lot of things that i have to do but hopefully i can ship it with the next version
00:28:31.360
of ruby and then one last thing i want to tell you today is that
00:28:37.120
did_you_mean totally works with emojis
00:28:49.760
all right that's pretty much it
00:28:59.520
thank you so much uh we have time for a couple of quick questions if you have any if you do please come up to the mic
00:29:12.480
hi thanks for the speech um i have a question i'm trying to implement tfidf the problem i face is with the tf with
00:29:19.760
the idf whereby i'm trying to look for a corpus with the document frequencies and also if i get
00:29:27.440
it because it's quite large how what data format would i best put it in to actually do a fast query
00:29:34.960
uh can you say again so your question is uh you want to implement the finder but you want to change the format
00:29:40.640
i'm trying to implement tf idf uh whereby it tries to find the importance of a word
00:29:46.240
in document so i'm trying to find a good corpus and also if i find it how would i best
00:29:54.080
store it so that i could do a quick query
00:29:59.279
um i don't know actually what i can think of is to implement like a vim plugin or emacs plugin or rubymine
00:30:06.720
plugin to automatically capture what you type and then send it to somewhere else so that you can
00:30:13.520
collect uh like what you type and when you actually mistype or misspell so yeah it's a
00:30:19.919
it's a difficult question because i use some um some corpus that is available
00:30:25.679
on the internet but it is used like back in 1980 like and it could be really old
00:30:32.480
so yeah so the the evaluation script is not actually good
00:30:38.720
enough okay thank you