Mathematics Optical Character Recognition

Task

I am interested in working on the math OCR (optical character recognition) problem. According to the Infty project, there are multiple levels of solutions to this task:

"Level 1: bitmap images of printed materials (e.g. GIF, TIFF, PNG). This is the input to be processed.
Level 2: searchable digitized document (e.g. PS, PDF),
Level 3: logically structured document with links (e.g. HTML, MathML, LaTeX),
Level 4: partially executable document (e.g. Mathematica, Maple),
Level 5: formally presented document. (e.g. Mizar, OMDoc)"

I would like to solve the problem at level 2 and, if successful, make some primitive transitions to level 3. That is, I would like to extract characters from the document and, if that works, extract some primitive spatial relationships between those characters.
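As an illustration of what a "primitive spatial relationship" could look like, here is a minimal sketch that guesses whether one character is a superscript, a subscript, or inline relative to the character before it, using only bounding boxes. The Box class, the relation function, and the shrink and shift thresholds are all assumptions made for this example, not part of the Infty project or of any tuned system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top

    @property
    def center_y(self) -> float:
        return (self.top + self.bottom) / 2.0


def relation(base: Box, other: Box, shrink: float = 0.8, shift: float = 0.25) -> str:
    """Guess how `other` sits relative to `base`, assumed to be its left neighbour.

    `shrink` and `shift` are hypothetical thresholds, not tuned values.
    """
    if other.left < base.left:
        return "unknown"  # only left-to-right neighbours are handled here
    # A script is expected to be noticeably smaller than its base character.
    smaller = other.height <= shrink * base.height
    # Vertical offset of the centres, in units of the base character's height.
    offset = (other.center_y - base.center_y) / base.height
    if smaller and offset <= -shift:
        return "superscript"
    if smaller and offset >= shift:
        return "subscript"
    return "inline"
```

For example, with a base box Box(0, 10, 10, 30), a smaller box sitting higher on the page, such as Box(11, 5, 17, 15), is labelled a superscript. A real system would need to handle fractions, radicals, and accents, which this sketch ignores.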

Method

Of the several approaches I could take, the one I am most interested in right now is a latent variable probabilistic model. Some latent variables that could be useful:

some measure of how closely characters are spaced
the scale of various characters relative to pixels
how frequently characters overlap
whether we are in a text field or math field
how thick characters are (are we in a bold section of text? if so, look for bold features)
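To make one of these latent variables concrete, here is a minimal sketch that treats "text field or math field" as a hidden two-state variable over a character sequence and decodes it with the Viterbi algorithm. This is a plain hidden Markov model chosen for illustration, not necessarily the model I will end up using. The start probabilities, the STAY probability, and the toy emission function are made-up numbers, not estimates from the Infty data.

```python
import math
from typing import Dict, List

STATES = ("text", "math")

# Hypothetical parameters chosen for illustration only, not estimated from data.
START = {"text": 0.7, "math": 0.3}
STAY = 0.9  # probability that the next character is in the same kind of region


def emission(state: str, ch: str) -> float:
    """Toy likelihood of observing `ch` inside a text or math region."""
    mathy = ch.isdigit() or ch in "+-=^_()<>/*"
    if state == "math":
        return 0.7 if mathy else 0.3
    return 0.1 if mathy else 0.9


def viterbi(chars: str) -> List[str]:
    """Return the most probable text/math label for every character."""
    if not chars:
        return []
    # score[s] = best log-probability of any labelling that ends in state s
    score = {s: math.log(START[s]) + math.log(emission(s, chars[0])) for s in STATES}
    back: List[Dict[str, str]] = []
    for ch in chars[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in STATES:
            # Best previous state to come from when the current state is s.
            best_prev = max(
                STATES,
                key=lambda p: score[p] + math.log(STAY if p == s else 1 - STAY),
            )
            trans = STAY if best_prev == s else 1 - STAY
            new_score[s] = score[best_prev] + math.log(trans) + math.log(emission(s, ch))
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Because staying in the same region is much more likely than switching, a single odd character does not flip the label by itself. In the full model the observations would be image features rather than already-recognized characters.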

First I will extract the "skeleton" of potential characters, possibly consisting of just line segments, or of curves as well in a more complicated model. In addition, I will compute some measure of image quality, i.e. how sharp the edges of characters are. From there, I plan to select groups of line segments and, if a group "probably" matches a character, search for other attributes of that character. The probability that a group of line segments is a certain character will be affected by the classification of nearby characters and by the quality of that region of the image. This is where I think a latent variable probabilistic model is useful.
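One simple candidate for the image quality measure is the average size of the intensity steps between neighbouring pixels, counting only the places where the intensity changes at all. The sketch below assumes a rectangular grayscale image stored as a list of rows with values from 0 to max_value. The function name edge_sharpness and its parameters are hypothetical, and this particular measure is only one of several I could try.

```python
from typing import List, Sequence


def edge_sharpness(
    image: Sequence[Sequence[float]],
    max_value: float = 255.0,
    min_step: float = 1e-9,
) -> float:
    """Mean nonzero intensity step between adjacent pixels, scaled to [0, 1].

    A crisp black-to-white edge changes in a single step and scores near 1;
    a blurred edge spreads the same change over several smaller steps.
    Assumes every row of `image` has the same length.
    """
    steps: List[float] = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if x + 1 < len(row):  # step to the right-hand neighbour
                steps.append(abs(row[x + 1] - value))
            if y + 1 < len(image):  # step to the neighbour below
                steps.append(abs(image[y + 1][x] - value))
    # Ignore flat areas so that large blank margins do not dilute the score.
    edges = [s for s in steps if s > min_step]
    if not edges:
        return 0.0
    return sum(edges) / (len(edges) * max_value)
```

Computed over a local window, a score like this could serve as the per-region quality value that feeds into the character probabilities. Noise also produces small nonzero steps, so in practice a larger min_step or some smoothing would probably be needed.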

Datasets

I have downloaded the first Infty project dataset. It is a collection of 30 mathematical and scientific articles with annotations of symbols and areas (text or math).
The MNIST database of handwritten digits.

Milestone Goal

By the milestone date, I would like to be able to extract some characters from the Infty dataset. This will involve preprocessing work (one of the areas I want to concentrate on) and running an early version of the character extraction.

References

Infty Project. http://www.inftyproject.org/
Nakagawa, Nomura, Suzuki. Extraction of Logical Structures from Articles in Mathematics. http://www.springerlink.com/content/x4t3xc9l13pt6f5g/fulltext.pdf
LeCun, Cortes. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
Belongie, Malik, Puzicha. Shape Matching and Object Recognition Using Shape Contexts. http://www.cs.berkeley.edu/~malik/papers/BMP-shape.pdf
Sun, Xu et al. A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information.