

A printed scientific document recognition system

Warning: MathOCR is still in the pre-alpha stage; recognition results may not be good enough for practical use.


MathOCR is a printed scientific document recognition system written in pure Java. It is released under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

MathOCR provides image preprocessing, layout analysis and character recognition, and in particular it can recognize mathematical expressions. It works without any external libraries beyond the standard Java distribution, but it can also be used as a front end to OCR systems such as Tesseract, GNU Ocrad or GOCR.

The MathOCR project started in March 2014 as an undergraduate research project at Sun Yat-Sen University to develop a printed mathematical formula recognition system, and it was first released in September 2014. Development later continued as the developer's undergraduate thesis project, and MathOCR grew into a document recognition system.

Get MathOCR

MathOCR can be downloaded from the project's download page on SourceForge.

Release notes

MathOCR 0.0.3 released [2015-05-07]

Major changes:

  • Logical layout analysis added
  • New structural analysis algorithm for mathematical expressions
  • Output format can be LaTeX or HTML
  • New graphical user interface
  • A built-in command-line interface
  • The PNM image format is supported

MathOCR 0.0.2 released [2014-11-29]

Minor changes to improve the structural analysis algorithm.

MathOCR 0.0.1 released [2014-09-29]

This is the first release of MathOCR. Features:

  • Input formats: PNG, JPEG, GIF, BMP
  • Output format: LaTeX
  • GUI provided
  • Basic image preprocessing tools
  • Original character recognition system for mathematical symbols
  • Users can extend the symbol set
  • Original structural analysis system using a bottom-up approach

Technique summary

Image preprocessing

Standard approaches are used. The steps are:

  1. Convert input image into gray-scale image
  2. Convert gray-scale image into binarized image
  3. Apply filter(s) (optional)
  4. Skew detection and correction (optional)
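
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not MathOCR's actual code: the class name `PreprocessSketch`, the luminance weights and the fixed global threshold are all assumptions, since the page does not say which grayscale formula or binarization method MathOCR uses.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper (not part of MathOCR) illustrating steps 1 and 2. */
final class PreprocessSketch {

    /** Step 1: convert each RGB pixel to a gray level in 0..255. */
    static int[][] toGray(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int[][] gray = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Assumed weights: integer form of the common 0.299/0.587/0.114 luma formula.
                gray[y][x] = (299 * r + 587 * g + 114 * b) / 1000;
            }
        }
        return gray;
    }

    /** Step 2: pixels darker than the (assumed, fixed) threshold become foreground. */
    static boolean[][] binarize(int[][] gray, int threshold) {
        boolean[][] ink = new boolean[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            ink[y] = new boolean[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                ink[y][x] = gray[y][x] < threshold;
            }
        }
        return ink;
    }
}
```

A call such as `PreprocessSketch.binarize(PreprocessSketch.toGray(image), 128)` yields a boolean matrix in which `true` marks ink. A real system would more likely choose the threshold adaptively rather than hard-code it.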

Layout analysis

The steps are:

  1. Connected component analysis based on a disjoint-set data structure
  2. Page segmentation based on recursive XY-cut
  3. Reading order sorting based on topological sort
  4. Text/graphics classification using component height
  5. Text line extraction using projection
  6. Logical block classification using alignment and OCR results
  7. Paragraph growing using alignment
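
As an illustration of step 1, the sketch below labels connected components of a binarized image with a disjoint-set (union-find) structure. The class name `ComponentLabeler`, the use of 4-connectivity and the rectangular `boolean[][]` input are assumptions made for this example; the page does not describe MathOCR's own data layout.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not MathOCR's code): connected components via union-find. */
final class ComponentLabeler {
    private final int[] parent;

    private ComponentLabeler(int size) {
        parent = new int[size];
        for (int i = 0; i < size; i++) {
            parent[i] = i; // every pixel starts in its own set
        }
    }

    private int find(int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]]; // path halving keeps trees shallow
            i = parent[i];
        }
        return i;
    }

    private void union(int a, int b) {
        int rootA = find(a);
        int rootB = find(b);
        if (rootA != rootB) {
            parent[rootB] = rootA;
        }
    }

    /** Returns labels 1..n for foreground pixels (4-connectivity assumed), 0 for background. */
    static int[][] label(boolean[][] ink) {
        int height = ink.length;
        int width = height == 0 ? 0 : ink[0].length;
        ComponentLabeler sets = new ComponentLabeler(height * width);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (!ink[y][x]) continue;
                // Merge with the left and upper neighbours if they are also ink.
                if (x > 0 && ink[y][x - 1]) sets.union(y * width + x - 1, y * width + x);
                if (y > 0 && ink[y - 1][x]) sets.union((y - 1) * width + x, y * width + x);
            }
        }
        // Second pass: map each set representative to a compact label.
        Map<Integer, Integer> ids = new HashMap<>();
        int[][] labels = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (!ink[y][x]) continue;
                int root = sets.find(y * width + x);
                Integer id = ids.get(root);
                if (id == null) {
                    id = ids.size() + 1;
                    ids.put(root, id);
                }
                labels[y][x] = id;
            }
        }
        return labels;
    }
}
```

Labels are handed out in raster-scan order of first appearance, so the bounding box of each component can be collected in the same second pass if later steps such as XY-cut need it.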

Optical character recognition

The usual steps are:

  1. Construct an initial list of candidates for each glyph
  2. Use a sequence of matchers to filter out some candidates
  3. Rank the remaining candidates by template matching based on Hausdorff distance
  4. Combine glyphs to form characters

To match special symbols such as root signs and big delimiters, templates are generated dynamically.
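
To make the ranking step concrete, here is a minimal sketch of the classical (symmetric) Hausdorff distance between two sets of foreground pixel coordinates. The class name `HausdorffSketch` and the point-list representation are assumptions for this example, and MathOCR may well use a modified or faster variant of the distance.

```java
/** Hypothetical sketch (not MathOCR's code): classical Hausdorff distance between point sets. */
final class HausdorffSketch {

    /** Directed distance: the largest distance from a point of 'from' to its nearest point of 'to'. */
    static double directed(int[][] from, int[][] to) {
        double worst = 0;
        for (int[] p : from) {
            double nearest = Double.POSITIVE_INFINITY;
            for (int[] q : to) {
                nearest = Math.min(nearest, Math.hypot(p[0] - q[0], p[1] - q[1]));
            }
            worst = Math.max(worst, nearest);
        }
        return worst;
    }

    /** Symmetric distance: the larger of the two directed distances. Both sets should be non-empty. */
    static double distance(int[][] a, int[][] b) {
        return Math.max(directed(a, b), directed(b, a));
    }
}
```

Each point is an `{x, y}` pair. A glyph would be compared against every remaining candidate template, and the candidates sorted by increasing distance. This brute-force version is quadratic in the number of points, which is acceptable for small glyph images but not for whole pages.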

Optical formula recognition

The steps are:

  1. Fix some misrecognitions using information from other symbols
  2. Construct an initial symbol adjacency graph
  3. Rewrite the symbol adjacency graph using a set of rules
  4. If the graph cannot be reduced to a single vertex, recognition fails
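
The reduction in steps 3 and 4 can be illustrated with a toy rewriter. Everything in it is hypothetical: the class `FormulaReductionSketch`, the three relations and the two contraction rules are invented for this example and are far simpler than whatever rule set MathOCR actually applies. It only shows the general shape of the idea: contract edges until one vertex remains, and report failure if no rule applies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy example (not MathOCR's rules): reduce a symbol graph to one LaTeX string. */
final class FormulaReductionSketch {
    enum Relation { RIGHT, SUPERSCRIPT, SUBSCRIPT }

    /** Directed edge: symbol 'to' stands in the given relation to symbol 'from'. */
    static final class Edge {
        final int from;
        final int to;
        final Relation relation;

        Edge(int from, Relation relation, int to) {
            this.from = from;
            this.relation = relation;
            this.to = to;
        }
    }

    /** Returns LaTeX for the whole expression, or null if the graph cannot be reduced to one vertex. */
    static String reduce(String[] symbols, List<Edge> edges) {
        String[] latex = symbols.clone();
        boolean[] merged = new boolean[latex.length];
        List<Edge> remaining = new ArrayList<>(edges);
        for (int alive = latex.length; alive > 1; alive--) {
            Edge chosen = null;
            for (Edge e : remaining) {
                if (canContract(e, remaining)) { chosen = e; break; }
            }
            if (chosen == null) return null; // no rule applies: recognition fails
            latex[chosen.from] = combine(latex[chosen.from], chosen.relation, latex[chosen.to]);
            merged[chosen.to] = true;
            remaining.remove(chosen);
        }
        for (int i = 0; i < latex.length; i++) {
            if (!merged[i]) return latex[i];
        }
        return null; // empty input
    }

    /** Toy rules: the target must be a leaf with one parent; RIGHT waits until scripts are attached. */
    private static boolean canContract(Edge e, List<Edge> edges) {
        for (Edge other : edges) {
            if (other == e) continue;
            if (other.from == e.to || other.to == e.to) return false;
            if (e.relation == Relation.RIGHT && other.from == e.from) return false;
        }
        return e.from != e.to;
    }

    private static String combine(String base, Relation relation, String other) {
        switch (relation) {
            case SUPERSCRIPT: return base + "^{" + other + "}";
            case SUBSCRIPT:   return base + "_{" + other + "}";
            default:          return base + " " + other;
        }
    }
}
```

For the symbols `x`, `2`, `+`, `y` with edges `x` SUPERSCRIPT `2`, `x` RIGHT `+` and `+` RIGHT `y`, the graph collapses to a single vertex. Two symbols with no edge between them cannot be merged, so `reduce` returns `null`, mirroring the failure case in step 4.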


The default data files bundled with MathOCR are derivative works of amsfonts by the American Mathematical Society. The code used to read PNM files is derived from the JAI library.

In addition, I would like to thank my supervisor, Dr. Peixing Li; this program would not exist without his encouragement.






The training data used to generate the recognition data bundled with MathOCR 0.0.3 is available for download.




The MathOCR homepage, created by 陈颂光, is licensed under a Creative Commons Attribution 4.0 International License.