Mathematics Optical Character Recognition
Matthew Elkherj

Task

I am interested in working on the math OCR (optical character recognition) problem. According to the Infty project, a solution to this task can target one of several levels:
  • "Level 1: bitmap images of printed materials (e.g. GIF, TIFF, PNG). This is the input to be processed.
  • Level 2: searchable digitized document (e.g. PS, PDF),
  • Level 3: logically structured document with links (e.g. HTML, MathML, LaTeX),
  • Level 4: partially executable document (e.g. Mathematica, Maple),
  • Level 5: formally presented document. (e.g. Mizar, OMDoc)"
I would like to solve the problem at level 2 and, if successful, make some primitive transitions to level 3. That is, I would like to extract the characters from a document and, if that succeeds, extract some primitive spatial relationships between characters.
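
As a rough illustration of what a primitive spatial relationship between characters could look like, the sketch below compares the bounding boxes of two already-extracted characters and guesses whether the second is a superscript, a subscript, or inline with the first. The Box class, the relation function, and the shift and shrink thresholds are all hypothetical names and values chosen for this example. They are not part of the Infty project or of any planned implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates; y grows downward."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def center_y(self) -> float:
        return (self.top + self.bottom) / 2.0


def relation(base: Box, other: Box, shift: float = 0.25, shrink: float = 0.8) -> str:
    """Guess how `other` sits relative to `base`, which lies to its left.

    `shift` and `shrink` are hypothetical thresholds, not tuned values.
    """
    if other.left < base.right:
        return "overlapping"
    # Vertical offset of the two centers, as a fraction of the base height.
    offset = (other.center_y - base.center_y) / base.height
    smaller = other.height <= shrink * base.height
    if smaller and offset <= -shift:
        return "superscript"
    if smaller and offset >= shift:
        return "subscript"
    return "inline"


# Assumed boxes for the "x" and the "2" in "x^2".
x_box = Box(left=0, top=10, right=10, bottom=30)
two_box = Box(left=11, top=4, right=17, bottom=14)
print(relation(x_box, two_box))
```

A rule this simple would misjudge tall symbols such as fractions or integrals, which is one reason to treat such relationships as a stretch goal rather than part of the level 2 work.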

Method

There are a few approaches I could take, but for now I am interested in one in particular. I think a latent variable probabilistic model could work, with some useful latent variables:
  • how close together characters are, and the scale of characters relative to pixels
  • how frequently characters overlap
  • whether we are in a text field or math field
  • how thick characters are (are we in a bolded section of text? if so, look for bold features)
First, I will extract the "skeleton" of potential characters, possibly consisting of just line segments, or of curves as well in a more complicated model. In addition, I will compute some measure of image quality, i.e. how sharp the edges of characters are. From there, I plan to select groups of line segments and, if a group "probably" matches a character, search for other attributes of that character. The probability that a group of line segments is a certain character will depend on the classification of nearby characters and on the quality of that region of the image. This is where I think a latent variable probabilistic model is useful.
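
To make the role of a latent variable concrete, here is a minimal sketch that combines a shape-based likelihood with one latent variable from the list above, namely whether we are in a text field or a math field. It computes p(char | shape) by summing over the unknown region type. The function name character_posterior and every number in the example are made up for illustration. The real model would have more latent variables and learned parameters.

```python
def character_posterior(shape_likelihood, region_prior, char_given_region):
    """Return p(char | shape), marginalizing the latent region type.

    shape_likelihood:  {char: p(shape | char)}
    region_prior:      {region: p(region)}
    char_given_region: {region: {char: p(char | region)}}
    """
    scores = {}
    for char, likelihood in shape_likelihood.items():
        # p(char) = sum over regions of p(region) * p(char | region)
        prior = sum(
            p_region * char_given_region[region].get(char, 0.0)
            for region, p_region in region_prior.items()
        )
        scores[char] = likelihood * prior
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("no candidate character has nonzero probability")
    return {char: score / total for char, score in scores.items()}


# Made-up numbers: a single vertical stroke that could be "l", "1", or "|".
shape_likelihood = {"l": 0.4, "1": 0.35, "|": 0.25}
char_given_region = {
    "text": {"l": 0.8, "1": 0.15, "|": 0.05},
    "math": {"l": 0.1, "1": 0.5, "|": 0.4},
}

in_text = character_posterior(shape_likelihood, {"text": 0.9, "math": 0.1}, char_given_region)
in_math = character_posterior(shape_likelihood, {"text": 0.1, "math": 0.9}, char_given_region)
print(max(in_text, key=in_text.get))  # prints l
print(max(in_math, key=in_math.get))  # prints 1
```

With these numbers the same stroke is read as "l" when the surrounding region is probably text and as "1" when it is probably math, even though the shape evidence alone slightly favors "l" in both cases.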

Datasets

  1. I have downloaded the first dataset from the Infty project. It is a collection of 30 mathematical and scientific articles with annotations of symbols and areas (text or math).
  2. The MNIST database of handwritten digits.

Milestone Goal

By the milestone date, I would like to be able to extract some characters from the Infty dataset. This will involve preprocessing work (one of the areas I want to concentrate on) and running an early version of the character extraction.

References

  1. Infty Project. http://www.inftyproject.org/
  2. Nakagawa, Nomura, Suzuki. Extraction of Logical Structures from Articles in Mathematics. http://www.springerlink.com/content/x4t3xc9l13pt6f5g/fulltext.pdf
  3. LeCun, Cortes. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  4. Belongie, Malik, Puzicha. Shape Matching and Object Recognition Using Shape Contexts. http://www.cs.berkeley.edu/~malik/papers/BMP-shape.pdf
  5. Sun, Xu et al. A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information.