MS Word and other programs do spelling correction. Have you ever wondered how they do it? In this class we'll see how to write a simple spelling corrector.
What does it mean for two words or two names to be "close" to one another? There are many possible definitions, but an easy one to implement is called "edit distance": the number of editing operations needed to convert the first word or name into the second. The operations are: copy the first character of the first string to the output (which can be done for free); insert a character not in the first string into the output; delete the first character of the first string without outputting it; replace the first character of the first string with a different character in the output; or swap the first two characters of the first string and output them (so "thier" can be converted to "their" in one operation). The file EditDistance.java contains two methods that solve this problem recursively.
Each method looks at the first characters of s1 and s2. If they match, it finds the edit distance between the remaining strings. If not, it tries each of the possible operations that would make the next character of the output match the next character of the second string, and then solves the remaining subproblem. (Note that copy and replace lead to the same subproblem, but their contributions to the edit distance differ by one.) Whichever operation solves the problem in the fewest edits is chosen. We first look at a simple recursive implementation of this idea:
public static int naiveEditDistance(String s1, String s2) {
    int matchDist;   // Edit distance if first characters match or we replace
    int insertDist;  // Edit distance if we insert the first char of s2 in front of s1
    int deleteDist;  // Edit distance if we delete the first char of s1
    int swapDist;    // Edit distance for a swap (first two chars must be transposed)

    if (s1.length() == 0)
        return s2.length();   // Insert the remainder of s2
    else if (s2.length() == 0)
        return s1.length();   // Delete the remainder of s1
    else {
        matchDist = naiveEditDistance(s1.substring(1), s2.substring(1));
        if (s1.charAt(0) != s2.charAt(0))
            matchDist++;      // If first characters don't match, must replace

        deleteDist = naiveEditDistance(s1.substring(1), s2) + 1;
        insertDist = naiveEditDistance(s1, s2.substring(1)) + 1;

        if (s1.length() > 1 && s2.length() > 1 &&
                s1.charAt(0) == s2.charAt(1) && s1.charAt(1) == s2.charAt(0))
            swapDist = naiveEditDistance(s1.substring(2), s2.substring(2)) + 1;
        else
            swapDist = Integer.MAX_VALUE;  // Can't swap unless first two chars are transposed

        return Math.min(matchDist, Math.min(insertDist, Math.min(deleteDist, swapDist)));
    }
}
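As a quick sanity check, here are a couple of hypothetical test calls (assuming the method lives in a class named EditDistance; these calls are my illustration, not part of the course file):

// Hypothetical test calls, not part of EditDistance.java
System.out.println(EditDistance.naiveEditDistance("thier", "their")); // 1 (one swap)
System.out.println(EditDistance.naiveEditDistance("catz", "cots"));   // 2 (two replacements)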
This works great on small strings, but the running time grows exponentially. Look at the call tree for "catz" and "cots": that case works fine, but "Kate Blanchet" vs. "Cate Blanchett" takes minutes. Ignoring swap (which usually is not possible), each problem makes 3 recursive calls. The shortest root-to-leaf path in the call tree has length min(|s1|, |s2|), when every call is a "match" call, and the longest has length |s1| + |s2|, because each delete or insert call removes only one letter from one of the strings. For strings that are 14 or 15 characters long this means much worse than 3^14 calls, which is pretty bad.
So what can we do? Avoid re-computing subproblems by memoizing: we keep a map of solved subproblems. To solve a subproblem, we first look it up in the map. If it is there, we use the stored answer. If not, we solve the subproblem and add the solution to the map.
We do this in the method memoizedEditDist and its helper function editDist. We also use the class StringPair to hold the pair of strings in a subproblem. (Note that StringPair overrides equals, so it must also override hashCode.)
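The notes don't show StringPair itself; here is a minimal sketch of what such a class might look like (the actual version is in EditDistance.java and may differ):

// A sketch of StringPair; the real version is in EditDistance.java
class StringPair {
    private final String s1, s2;

    public StringPair(String s1, String s2) {
        this.s1 = s1;
        this.s2 = s2;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof StringPair))
            return false;
        StringPair p = (StringPair) other;
        return s1.equals(p.s1) && s2.equals(p.s2);
    }

    @Override
    public int hashCode() {
        // Combine the two strings' hash codes, as required when overriding equals
        return 31 * s1.hashCode() + s2.hashCode();
    }
}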
private HashMap<StringPair, Integer> solvedProblems;  // Cache of already-solved subproblems

public int memoizedEditDist(String s1, String s2) {
    solvedProblems = new HashMap<StringPair, Integer>();
    return editDist(s1, s2);
}

private int editDist(String s1, String s2) {
    int matchDist;   // Edit distance if first characters match or we replace
    int insertDist;  // Edit distance if we insert the first char of s2 in front of s1
    int deleteDist;  // Edit distance if we delete the first char of s1
    int swapDist;    // Edit distance for a swap (first two chars must be transposed)

    if (s1.length() == 0)
        return s2.length();   // Insert the remainder of s2
    else if (s2.length() == 0)
        return s1.length();   // Delete the remainder of s1
    else {
        StringPair pair = new StringPair(s1, s2);
        Integer result = solvedProblems.get(pair);

        if (result != null)   // Did we find the subproblem in the map?
            return result;    // If so, return the answer
        else {
            matchDist = editDist(s1.substring(1), s2.substring(1));
            if (s1.charAt(0) != s2.charAt(0))
                matchDist++;  // If first characters don't match, must replace

            deleteDist = editDist(s1.substring(1), s2) + 1;
            insertDist = editDist(s1, s2.substring(1)) + 1;

            if (s1.length() > 1 && s2.length() > 1 &&
                    s1.charAt(0) == s2.charAt(1) && s1.charAt(1) == s2.charAt(0))
                swapDist = editDist(s1.substring(2), s2.substring(2)) + 1;
            else
                swapDist = Integer.MAX_VALUE;  // Can't swap unless first two chars are transposed

            int dist = Math.min(matchDist, Math.min(insertDist, Math.min(deleteDist, swapDist)));
            solvedProblems.put(pair, dist);  // Save the result before returning it
            return dist;
        }
    }
}
Much faster! The subproblems are pairs of suffixes of the two strings, so there are only O(mn) of them (where m and n are the lengths of the input strings), and each is solved just once. For two 14-character strings that is 15 × 15 = 225 subproblems instead of millions of recursive calls. Each map lookup takes O(1) expected time.
The general idea is called "dynamic programming". You will see ways to do it without using a Map in CS 31. The basic idea is to keep a matrix of subproblem answers and fill it in using an order that guarantees that a subproblem's solution is known before you use it to fill in another entry in the table.
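To make the matrix idea concrete, here is a minimal bottom-up sketch (my example, not code from the course files); it ignores the swap operation for brevity:

// A bottom-up version (a sketch, not from EditDistance.java); ignores swaps.
// dist[i][j] holds the edit distance between the first i chars of s1
// and the first j chars of s2.
public static int dpEditDistance(String s1, String s2) {
    int m = s1.length(), n = s2.length();
    int[][] dist = new int[m + 1][n + 1];

    for (int i = 0; i <= m; i++)
        dist[i][0] = i;   // delete all i chars of s1
    for (int j = 0; j <= n; j++)
        dist[0][j] = j;   // insert all j chars of s2

    // Fill row by row, so every entry we need has already been computed
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int match = dist[i - 1][j - 1]
                      + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1);
            int delete = dist[i - 1][j] + 1;   // delete a char of s1
            int insert = dist[i][j - 1] + 1;   // insert a char of s2
            dist[i][j] = Math.min(match, Math.min(delete, insert));
        }
    return dist[m][n];
}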
Norvig worked up a quick spell corrector while on a plane ride! It is written in Python and is 21 lines long. The description is at http://www.norvig.com/spell-correct.html.
Rael Cunha converted it to Java. He was trying to keep it short and got it down to 35 lines of code. I have added comments to it in Spelling.java.
First, some theory. There are two things to consider. One is how far away you are from a correctly spelled word. Edit distance is a reasonable choice here, although other measures that take into account common typos and misspellings would be more accurate.
The other is which nearby word to pick. Norvig looks only at words at the minimum edit distance, and among those he picks the one that is most common in English. How does he determine this? He creates big.txt, a file built from a number of books from Project Gutenberg (128,000 lines, about a million words, with lots of Sherlock Holmes, War and Peace, etc.) plus lists of the most frequent words in Wiktionary and the British National Corpus, and then runs statistics on it. It was missing some common words (e.g. kangaroo), so I appended the Unix spelling dictionary /usr/share/dict/words that we used in WordReduction to get bigger.txt.
So his algorithm: if the word is in the dictionary, it is spelled correctly, so return it. If not, find all dictionary words within edit distance 1 of it and return the most frequent one. If there are none, do the same for edit distance 2. If that also fails, give up.
Here is the Java implementation in Spelling.java:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Spelling {
    private HashMap<String, Integer> nWords;   // word -> frequency in the training text

    /**
     * Constructs a new spell corrector. Builds up a map of correct words with
     * their frequencies, based on the words in the given file.
     *
     * @param file the text to process
     * @throws IOException
     */
    public Spelling(String file) throws IOException {
        nWords = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(file));
        // This pattern matches any run of word characters (letters or digits)
        Pattern p = Pattern.compile("\\w+");

        for (String temp = ""; temp != null; temp = in.readLine()) {
            Matcher m = p.matcher(temp.toLowerCase());
            // find looks for the next match for pattern p (in this case a word). True if found.
            // group then returns the last thing matched.
            // The ? : is a conditional expression.
            // (Reassigning temp here is a line-saving trick from the original 35-line version.)
            while (m.find())
                nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
        }
        in.close();
    }

    /**
     * Constructs a list of all words within edit distance 1 of the given word.
     * @param word the word to construct the list from
     * @return a list of words within edit distance 1 of word
     */
    private ArrayList<String> edits(String word) {
        ArrayList<String> result = new ArrayList<String>();

        // All deletes of a single letter
        for (int i = 0; i < word.length(); ++i)
            result.add(word.substring(0, i) + word.substring(i + 1));
        // All swaps of adjacent letters
        for (int i = 0; i < word.length() - 1; ++i)
            result.add(word.substring(0, i) + word.substring(i + 1, i + 2) +
                    word.substring(i, i + 1) + word.substring(i + 2));
        // All replacements of a letter
        for (int i = 0; i < word.length(); ++i)
            for (char c = 'a'; c <= 'z'; ++c)
                result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i + 1));
        // All insertions of a letter
        for (int i = 0; i <= word.length(); ++i)
            for (char c = 'a'; c <= 'z'; ++c)
                result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));

        return result;
    }

    /**
     * Corrects the spelling of a word, if it is within edit distance 2.
     * @param word the word to check/correct
     * @return word if correct or too far from any word; corrected word otherwise
     */
    public String correct(String word) {
        // If in the dictionary, return it as correctly spelled
        if (nWords.containsKey(word))
            return word;

        ArrayList<String> list = edits(word);  // Everything edit distance 1 from word
        HashMap<Integer, String> candidates = new HashMap<Integer, String>();

        // Find all things edit distance 1 that are in the dictionary. Also remember
        // their frequency count from nWords.
        // (Note if equal frequencies the last one will be the one remembered.)
        for (String s : list)
            if (nWords.containsKey(s))
                candidates.put(nWords.get(s), s);

        // If found something edit distance 1, return the most frequent word
        if (candidates.size() > 0)
            return candidates.get(Collections.max(candidates.keySet()));

        // Find all things edit distance 1 from everything of edit distance 1. These
        // will be all things of edit distance 2 (plus original word). Remember frequencies
        for (String s : list)
            for (String w : edits(s))
                if (nWords.containsKey(w))
                    candidates.put(nWords.get(w), w);

        // If found something edit distance 2, return the most frequent word.
        // If not, return the word with a "?" prepended. (Original just returned the word.)
        return candidates.size() > 0 ?
                candidates.get(Collections.max(candidates.keySet())) : "?" + word;
    }

    /**
     * Original version read a single word to correct from the command line.
     * @throws IOException
     */
    public static void main(String args[]) throws IOException {
        Spelling corrector = new Spelling("bigger.txt");
        Scanner input = new Scanner(System.in);
        System.out.println("Enter words to correct");
        String word = input.next();
        while (true) {
            System.out.println(word + " is corrected to " + corrector.correct(word));
            word = input.next();
        }
    }
}
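A sample run (assuming bigger.txt is in the working directory) might look like this:

Enter words to correct
speling
speling is corrected to spelling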
Let's see what it does. First, nWords is a hash map from words (converted to lower case) to their frequencies. It is initialized in the Spelling class constructor. A BufferedReader is used to read in the file line by line. The Pattern class allows us to look for regular expressions, and the particular pattern \w+ matches any run of letters and digits. It is compiled and saved in p.
The for loop reads each line in the file, converts it to lower case, and calls the method matcher on it. This method creates a Matcher object from the pattern that can be used to identify the words in the line. The method find finds the next word in the line, returning true as long as there is one. The method group retrieves the word that find found. Finally, there is a very compact line that updates the frequency map. It calls put with the word as the key, and with a value that is one more than the count associated with the word if the word is already in the map, or 1 if it is not. The ? is a conditional expression: it evaluates the boolean expression before the ? and uses the result to select the expression between the ? and the : if it is true, or the expression after the : if it is false.
We now look at the correct method. It would be too slow to compute the edit distance to hundreds of thousands of words, even with memoization or dynamic programming. So Norvig reverses the process! He first sees if the word is correct; if so, he returns it. If not, he computes all strings within edit distance 1 of the word and looks them up in nWords. If he finds a word, he puts it into a new map, candidates, but with the frequency as the key and the word as the value. Note that this is the reverse of the nWords map, and it results in only the last word of a given frequency being saved. If he finds valid words, he returns the one that occurred most frequently.
If not, then for each string that is edit distance 1 from word he finds all strings that are an additional edit distance of 1 away. In this way he finds everything of edit distance 2 from the original word (plus the original word, but we know that is not in the dictionary). Just as with the list of strings of edit distance 1, he looks up each new string in the dictionary and adds matches to candidates. If any words are found, he returns the most frequently occurring one. If not, he returns the original word. I found it awkward to return the same thing for a correct word and for a word not within edit distance 2 of anything in the dictionary, so I prepend a "?" to the word if it is not found.
Norvig did some experiments showing that 99% of the time misspellings are within edit distance 2 of the correct word. His spell corrector gets about 67% of the words correct. The first version of his write-up reported much better results, but there was a subtle bug in his experiment that made the results look better than they were; see the link above.
So how does he generate all words of edit distance 1 from a given word? This is done by the method edits. He creates a list of everything he can get by dropping a letter from word (delete), swapping each pair of adjacent letters, replacing each letter in word by all possible letters, and inserting each possible letter in each possible position.
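For an n-letter word, that is n deletions, n-1 adjacent swaps, 26n replacements, and 26(n+1) insertions, or 54n+25 candidate strings in all (for a 9-letter word, 511 strings). Looking up a few hundred strings in a hash map, or even the few hundred thousand produced at edit distance 2, is far cheaper than computing the edit distance to every word in the dictionary.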
Final questions: how might we improve the spell corrector?