MS Word and other programs do spelling correction. Have you ever wondered how they do it? In this class we'll see how to write a simple spelling corrector.
What does it mean for two words or two names to be "close" to one another? There are many possible definitions, but an easy one to implement is called "edit distance": the number of editing operations needed to convert the first word or name into the second. The operations are: copy the first character of the first string to the output (which can be done for free); insert a character not in the first string into the output; delete the first character of the first string without outputting it; replace the first character of the first string with a different character in the output; or swap the first two characters of the first string and output them (so "thier" can be converted to "their" in one operation). The file EditDistance.java contains two methods that solve this problem recursively.
Each method looks at the first characters of s1 and s2. If they match, it finds the edit distance between the remaining strings. If not, it tries each of the possible operations that would make the next character of the output match the next character of the second string, and then solves the remaining subproblem. (Note that copy and replace lead to the same subproblem, but their contributions to the edit distance differ by one.) Whichever operation solves the problem in the fewest edits is chosen. We first look at a simple recursive implementation of this idea:
public static int naiveEditDistance(String s1, String s2) {
    int matchDist;   // Edit distance if first characters match or we replace
    int insertDist;  // Edit distance if we insert the first char of s2 in front of s1
    int deleteDist;  // Edit distance if we delete the first char of s1
    int swapDist;    // Edit distance for a swap (first two chars must be transposed)

    if (s1.length() == 0)
        return s2.length();   // Insert the remainder of s2
    else if (s2.length() == 0)
        return s1.length();   // Delete the remainder of s1
    else {
        matchDist = naiveEditDistance(s1.substring(1), s2.substring(1));
        if (s1.charAt(0) != s2.charAt(0))
            matchDist++;      // If first characters don't match, must replace

        deleteDist = naiveEditDistance(s1.substring(1), s2) + 1;
        insertDist = naiveEditDistance(s1, s2.substring(1)) + 1;

        if (s1.length() > 1 && s2.length() > 1 &&
                s1.charAt(0) == s2.charAt(1) && s1.charAt(1) == s2.charAt(0))
            swapDist = naiveEditDistance(s1.substring(2), s2.substring(2)) + 1;
        else
            swapDist = Integer.MAX_VALUE;  // Can't swap unless first two chars are transposed

        return Math.min(matchDist, Math.min(insertDist, Math.min(deleteDist, swapDist)));
    }
}
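As a quick sanity check, here are a couple of hypothetical test calls (assuming the method lives in a class named EditDistance; these calls are my illustration, not part of the course file):

// Hypothetical test calls, not part of EditDistance.java
System.out.println(EditDistance.naiveEditDistance("thier", "their")); // 1 (one swap)
System.out.println(EditDistance.naiveEditDistance("catz", "cots"));   // 2 (two replacements)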
This works great on small strings, but the running time grows exponentially. Look at the call tree for "catz" and "cots": that case works fine, but "Kate Blanchet" vs. "Cate Blanchett" takes minutes. Ignoring swap (which usually is not possible), each problem makes 3 recursive calls. The shortest root-to-leaf path in the call tree has length min(|s1|, |s2|), when every call is a "match" call, and the longest has length |s1| + |s2|, because each delete or insert call removes only one letter from one of the strings. For strings that are 14 or 15 characters long this means much worse than 3^14 calls, which is pretty bad.
So what can we do? Avoid re-computing subproblems by memoizing: we keep a map of solved subproblems. To solve a subproblem, we first look it up in the map. If it is there, we use the stored answer. If not, we solve the subproblem and add the solution to the map.
We do this in the method memoizedEditDist and its helper function editDist. We also use the class StringPair to hold the pair of strings in a subproblem. (Note that StringPair overrides equals, so it must also override hashCode.)
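The notes don't show StringPair itself; here is a minimal sketch of what such a class might look like (the actual version is in EditDistance.java and may differ):

// A sketch of StringPair; the real version is in EditDistance.java
class StringPair {
    private final String s1, s2;

    public StringPair(String s1, String s2) {
        this.s1 = s1;
        this.s2 = s2;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof StringPair))
            return false;
        StringPair p = (StringPair) other;
        return s1.equals(p.s1) && s2.equals(p.s2);
    }

    @Override
    public int hashCode() {
        // Combine the two strings' hash codes, as required when overriding equals
        return 31 * s1.hashCode() + s2.hashCode();
    }
}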
private HashMap<StringPair, Integer> solvedProblems;  // Cache of already-solved subproblems

public int memoizedEditDist(String s1, String s2) {
    solvedProblems = new HashMap<StringPair, Integer>();
    return editDist(s1, s2);
}

private int editDist(String s1, String s2) {
    int matchDist;   // Edit distance if first characters match or we replace
    int insertDist;  // Edit distance if we insert the first char of s2 in front of s1
    int deleteDist;  // Edit distance if we delete the first char of s1
    int swapDist;    // Edit distance for a swap (first two chars must be transposed)

    if (s1.length() == 0)
        return s2.length();   // Insert the remainder of s2
    else if (s2.length() == 0)
        return s1.length();   // Delete the remainder of s1
    else {
        StringPair pair = new StringPair(s1, s2);
        Integer result = solvedProblems.get(pair);

        if (result != null)   // Did we find the subproblem in the map?
            return result;    // If so, return the answer
        else {
            matchDist = editDist(s1.substring(1), s2.substring(1));
            if (s1.charAt(0) != s2.charAt(0))
                matchDist++;  // If first characters don't match, must replace

            deleteDist = editDist(s1.substring(1), s2) + 1;
            insertDist = editDist(s1, s2.substring(1)) + 1;

            if (s1.length() > 1 && s2.length() > 1 &&
                    s1.charAt(0) == s2.charAt(1) && s1.charAt(1) == s2.charAt(0))
                swapDist = editDist(s1.substring(2), s2.substring(2)) + 1;
            else
                swapDist = Integer.MAX_VALUE;  // Can't swap unless first two chars are transposed

            int dist = Math.min(matchDist, Math.min(insertDist, Math.min(deleteDist, swapDist)));
            solvedProblems.put(pair, dist);  // Save the result before returning it
            return dist;
        }
    }
}
Much faster! The subproblems are pairs of suffixes of the two strings, so there are only O(mn) of them (where m and n are the lengths of the input strings), and each is solved just once. For two 14-character strings that is 15 × 15 = 225 subproblems instead of millions of recursive calls. Each map lookup takes O(1) expected time.
The general idea is called "dynamic programming". You will see ways to do it without using a Map in CS 31. The basic idea is to keep a matrix of subproblem answers and fill it in using an order that guarantees that a subproblem's solution is known before you use it to fill in another entry in the table.
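To make the matrix idea concrete, here is a minimal bottom-up sketch (my example, not code from the course files); it ignores the swap operation for brevity:

// A bottom-up version (a sketch, not from EditDistance.java); ignores swaps.
// dist[i][j] holds the edit distance between the first i chars of s1
// and the first j chars of s2.
public static int dpEditDistance(String s1, String s2) {
    int m = s1.length(), n = s2.length();
    int[][] dist = new int[m + 1][n + 1];

    for (int i = 0; i <= m; i++)
        dist[i][0] = i;   // delete all i chars of s1
    for (int j = 0; j <= n; j++)
        dist[0][j] = j;   // insert all j chars of s2

    // Fill row by row, so every entry we need has already been computed
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++) {
            int match = dist[i - 1][j - 1]
                      + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1);
            int delete = dist[i - 1][j] + 1;   // delete a char of s1
            int insert = dist[i][j - 1] + 1;   // insert a char of s2
            dist[i][j] = Math.min(match, Math.min(delete, insert));
        }
    return dist[m][n];
}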
Norvig worked up a quick spell corrector while on a plane ride! It is written in Python and is 21 lines long. The description is at http://www.norvig.com/spell-correct.html.
Rael Cunha converted it to Java. He was trying to keep it short and got it down to 35 lines of code. I have added comments to it in Spelling.java.
First, some theory. There are two things to consider. One is how far away you are from a correctly spelled word. Edit distance is a reasonable choice here, although other measures that take into account common typos and misspellings would be more accurate.
The other is which nearby word to pick. Norvig looks only at words at the minimum edit distance, and among those he picks the one that is most common in English. How does he determine this? He creates big.txt, a file built from a number of books from Project Gutenberg (128,000 lines, about a million words, with lots of Sherlock Holmes, War and Peace, etc.) plus lists of the most frequent words in Wiktionary and the British National Corpus, and then runs statistics on it. It was missing some common words (e.g. kangaroo), so I appended the Unix spelling dictionary /usr/share/dict/words that we used in WordReduction to get bigger.txt.
So his algorithm: if the word is in the dictionary, it is spelled correctly, so return it. If not, find all dictionary words within edit distance 1 of it and return the most frequent one. If there are none, do the same for edit distance 2. If that also fails, give up.
Here is the Java implementation in Spelling.java:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Spelling {
    private HashMap<String, Integer> nWords;   // word -> frequency in the training text

    /**
     * Constructs a new spell corrector. Builds up a map of correct words with
     * their frequencies, based on the words in the given file.
     *
     * @param file the text to process
     * @throws IOException
     */
    public Spelling(String file) throws IOException {
        nWords = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(file));
        // This pattern matches any run of word characters (letters or digits)
        Pattern p = Pattern.compile("\\w+");

        for (String temp = ""; temp != null; temp = in.readLine()) {
            Matcher m = p.matcher(temp.toLowerCase());
            // find looks for the next match for pattern p (in this case a word). True if found.
            // group then returns the last thing matched.
            // The ? : is a conditional expression.
            // (Reassigning temp here is a line-saving trick from the original 35-line version.)
            while (m.find())
                nWords.put((temp = m.group()), nWords.containsKey(temp) ? nWords.get(temp) + 1 : 1);
        }
        in.close();
    }

    /**
     * Constructs a list of all words within edit distance 1 of the given word.
     * @param word the word to construct the list from
     * @return a list of words within edit distance 1 of word
     */
    private ArrayList<String> edits(String word) {
        ArrayList<String> result = new ArrayList<String>();

        // All deletes of a single letter
        for (int i = 0; i < word.length(); ++i)
            result.add(word.substring(0, i) + word.substring(i + 1));
        // All swaps of adjacent letters
        for (int i = 0; i < word.length() - 1; ++i)
            result.add(word.substring(0, i) + word.substring(i + 1, i + 2) +
                    word.substring(i, i + 1) + word.substring(i + 2));
        // All replacements of a letter
        for (int i = 0; i < word.length(); ++i)
            for (char c = 'a'; c <= 'z'; ++c)
                result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i + 1));
        // All insertions of a letter
        for (int i = 0; i <= word.length(); ++i)
            for (char c = 'a'; c <= 'z'; ++c)
                result.add(word.substring(0, i) + String.valueOf(c) + word.substring(i));

        return result;
    }

    /**
     * Corrects the spelling of a word, if it is within edit distance 2.
     * @param word the word to check/correct
     * @return word if correct or too far from any word; corrected word otherwise
     */
    public String correct(String word) {
        // If in the dictionary, return it as correctly spelled
        if (nWords.containsKey(word))
            return word;

        ArrayList<String> list = edits(word);  // Everything edit distance 1 from word
        HashMap<Integer, String> candidates = new HashMap<Integer, String>();

        // Find all things edit distance 1 that are in the dictionary. Also remember
        // their frequency count from nWords.
        // (Note if equal frequencies the last one will be the one remembered.)
        for (String s : list)
            if (nWords.containsKey(s))
                candidates.put(nWords.get(s), s);

        // If found something edit distance 1, return the most frequent word
        if (candidates.size() > 0)
            return candidates.get(Collections.max(candidates.keySet()));

        // Find all things edit distance 1 from everything of edit distance 1. These
        // will be all things of edit distance 2 (plus original word). Remember frequencies
        for (String s : list)
            for (String w : edits(s))
                if (nWords.containsKey(w))
                    candidates.put(nWords.get(w), w);

        // If found something edit distance 2, return the most frequent word.
        // If not, return the word with a "?" prepended. (Original just returned the word.)
        return candidates.size() > 0 ?
                candidates.get(Collections.max(candidates.keySet())) : "?" + word;
    }

    /**
     * Original version read a single word to correct from the command line.
     * @throws IOException
     */
    public static void main(String args[]) throws IOException {
        Spelling corrector = new Spelling("bigger.txt");
        Scanner input = new Scanner(System.in);
        System.out.println("Enter words to correct");
        String word = input.next();
        while (true) {
            System.out.println(word + " is corrected to " + corrector.correct(word));
            word = input.next();
        }
    }
}
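A sample run (assuming bigger.txt is in the working directory) might look like this:

Enter words to correct
speling
speling is corrected to spelling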
Let's see what it does. First, nWords is a hash map from words (converted to lower case) to their frequencies. It is initialized in the Spelling class constructor. A BufferedReader is used to read in the file line by line. The Pattern class allows us to look for regular expressions, and the particular pattern \w+ matches any run of letters and digits. It is compiled and saved in p.
The for loop reads each line in the file, converts it to lower case, and calls the method matcher on it. This method creates a Matcher object from the pattern that can be used to identify the words in the line. The method find finds the next word in the line, returning true as long as there is one. The method group retrieves the word that find found. Finally, there is a very compact line that updates the frequency map. It calls put with the word as the key, and with a value that is one more than the count associated with the word if the word is already in the map, or 1 if it is not. The ? is a conditional expression: it evaluates the boolean expression before the ? and uses the result to select the expression between the ? and the : if it is true, or the expression after the : if it is false.
We now look at the correct method. It would be too slow to compute the edit distance to hundreds of thousands of words, even with memoization or dynamic programming. So Norvig reverses the process! He first sees if the word is correct; if so, he returns it. If not, he computes all strings within edit distance 1 of the word and looks them up in nWords. If he finds a word, he puts it into a new map, candidates, but with the frequency as the key and the word as the value. Note that this is the reverse of the nWords map, and it results in only the last word of a given frequency being saved. If he finds valid words, he returns the one that occurred most frequently.
If not, then for each string that is edit distance 1 from word he finds all strings that are an additional edit distance of 1 away. In this way he finds everything of edit distance 2 from the original word (plus the original word, but we know that is not in the dictionary). Just as with the list of strings of edit distance 1, he looks up each new string in the dictionary and adds matches to candidates. If any words are found, he returns the most frequently occurring one. If not, he returns the original word. I found it awkward to return the same thing for a correct word and for a word not within edit distance 2 of anything in the dictionary, so I prepend a "?" to the word if it is not found.
Norvig did some experiments showing that 99% of the time misspellings are within edit distance 2 of the correct word. His spell corrector gets about 67% of the words correct. The first version of his write-up reported much better results, but there was a subtle bug in his experiment that made the results look better than they were; see the link above.
So how does he generate all words of edit distance 1 from a given word? This is done by the method edits. He creates a list of everything he can get by dropping a letter from word (delete), swapping each pair of adjacent letters, replacing each letter in word by all possible letters, and inserting each possible letter in each possible position.
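For an n-letter word, that is n deletions, n-1 adjacent swaps, 26n replacements, and 26(n+1) insertions, or 54n+25 candidate strings in all (for a 9-letter word, 511 strings). Looking up a few hundred strings in a hash map, or even the few hundred thousand produced at edit distance 2, is far cheaper than computing the edit distance to every word in the dictionary.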
Final questions: how might we improve the spell corrector?