Task
I am interested in working on the math OCR (optical character recognition) problem. According to the Infty project, there are multiple levels of solutions to this task:
- "Level 1: bitmap images of printed materials (e.g. GIF, TIFF, PNG). This is the input to be processed.
- Level 2: searchable digitized document (e.g. PS, PDF),
- Level 3: logically structured document with links (e.g. HTML, MathML, LaTeX),
- Level 4: partially executable document (e.g. Mathematica, Maple),
- Level 5: formally presented document. (e.g. Mizar, OMDoc)"
Method
Several approaches are possible, but for now I am focusing on one.
I think a probabilistic latent variable model could work. Useful latent variables might include:
- how closely characters are spaced, and the scale of characters relative to pixels
- how frequently characters overlap
- whether we are in a text field or math field
- how thick character strokes are (if we are in a bold section of text, look for bold features)
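As a rough illustration of how two of these quantities might be measured directly from pixels, here is a minimal sketch. It assumes the page has already been binarized into a list of rows of 0/1 values (1 = ink), which is my assumption and not something the Infty data guarantees. The function names `stroke_thickness` and `column_gaps` are hypothetical, and these are simple observable proxies that could inform the latent variables, not the latent variable model itself.

```python
from statistics import mean


def horizontal_runs(row):
    """Return the lengths of consecutive runs of ink (1) pixels in one row."""
    runs, count = [], 0
    for pixel in row:
        if pixel:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


def stroke_thickness(image):
    """Mean horizontal ink-run length: a crude proxy for stroke thickness."""
    runs = [r for row in image for r in horizontal_runs(row)]
    return mean(runs) if runs else 0.0


def column_gaps(image):
    """Widths of blank column runs lying between inked columns.

    A crude proxy for character spacing; leading and trailing
    margins are ignored.
    """
    if not image:
        return []
    ink_per_column = [sum(col) for col in zip(*image)]
    gaps, count, seen_ink = [], 0, False
    for ink in ink_per_column:
        if ink:
            if seen_ink and count:
                gaps.append(count)
            seen_ink, count = True, 0
        elif seen_ink:
            count += 1
    return gaps


# Tiny hand-made example: a 2-pixel-wide stroke, a gap, a 1-pixel-wide stroke.
sample = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
]
thickness = stroke_thickness(sample)
gaps = column_gaps(sample)
```

A bold passage would be expected to show a larger mean run length than the surrounding text at the same scale, so a statistic like this could serve as evidence for the "bold" latent variable.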
Datasets
- I have downloaded the first dataset from the Infty project. It is a collection of 30 mathematical and scientific articles with annotations for symbols and for areas (text or math).
- The MNIST database of handwritten digits.
Milestone Goal
By the milestone date, I would like to be able to extract some characters from the Infty dataset. This will involve preprocessing work (one of the areas I want to concentrate on) and running an early version of the character extraction.
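One plausible starting point for an early character extractor is connected-component labeling on a binarized page. The sketch below is only an illustration under that assumption: the input format (a list of rows of 0/1 values, 1 = ink) and the function name `extract_components` are hypothetical, and this is a generic baseline, not a method taken from the Infty project.

```python
from collections import deque


def extract_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink blobs.

    Boxes are sorted by their left edge, i.e. roughly in reading order
    within a single line of text.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if not image[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda box: box[1])


# Two blobs: a vertical bar on the left and a diagonal pair on the right.
page = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]
boxes = extract_components(page)
```

A known limitation of this baseline is that symbols made of several disconnected parts (such as "i" or "=") come out as separate boxes, and touching characters come out merged, so a later step would need to regroup or split them.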
References
- Infty Project. http://www.inftyproject.org/
- Nakagawa, Nomura, Suzuki. Extraction of Logical Structures from Articles in Mathematics. http://www.springerlink.com/content/x4t3xc9l13pt6f5g/fulltext.pdf
- LeCun, Cortes. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
- Belongie, Serge. Shape Matching and Object Recognition Using Shape Contexts. http://www.cs.berkeley.edu/~malik/papers/BMP-shape.pdf
- Sun, Xu et al. A Discriminative Latent Variable Chinese Segmenter with Hybrid Word/Character Information.