MathOCR
A printed scientific document recognition system
Warning: MathOCR is still in the pre-alpha stage; recognition results may not be good enough for practical purposes.
Introduction
MathOCR is a printed scientific document recognition system written in pure Java. It is released under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
MathOCR provides image preprocessing, layout analysis and character recognition, and in particular can recognize mathematical expressions. It works without depending on any external libraries beyond the standard Java distribution, but it can also be used as a front-end to OCR systems such as Tesseract, GNU Ocrad or GOCR.
The MathOCR project was started in March 2014 at Sun Yat-Sen University as an undergraduate research project to develop a printed mathematical formula recognition system, and it was first released in September 2014. Development then continued as the developer's undergraduate thesis project, and MathOCR became a document recognition system.
Get MathOCR
MathOCR can be downloaded from SourceForge's download page.
Release notes
MathOCR 0.0.3 released [2015-05-07]
Major changes:
MathOCR 0.0.2 released [2014-11-29]
Minor changes to improve the structural analysis algorithm.
MathOCR 0.0.1 released [2014-9-29]
This is the first release of MathOCR. Features:
Technique summary
Image preprocessing
Standard approaches are used. These are the procedures:
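As an illustration of what a standard preprocessing step can look like, the sketch below binarizes an image with a global Otsu threshold. The choice of Otsu's method, the luma weights, and the names `OtsuSketch`, `otsuThreshold` and `binarize` are all assumptions made for this example; they are not taken from the MathOCR source and may differ from what MathOCR actually does.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper, not part of MathOCR: global binarization with Otsu's threshold. */
final class OtsuSketch {

    /**
     * Returns the gray level t (0..255) that maximizes the between-class variance
     * when pixels with value <= t form one class and the rest form the other.
     * Returns 0 if the histogram contains fewer than two distinct gray levels.
     */
    static int otsuThreshold(int[] histogram) {
        long total = 0;
        long weightedSum = 0;
        for (int i = 0; i < 256; i++) {
            total += histogram[i];
            weightedSum += (long) i * histogram[i];
        }
        long countBelow = 0;      // pixels with value <= t
        long sumBelow = 0;        // sum of their gray values
        double bestVariance = -1.0;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            countBelow += histogram[t];
            sumBelow += (long) t * histogram[t];
            long countAbove = total - countBelow;
            if (countBelow == 0 || countAbove == 0) {
                continue;         // one class is empty, variance undefined
            }
            double meanBelow = (double) sumBelow / countBelow;
            double meanAbove = (double) (weightedSum - sumBelow) / countAbove;
            double diff = meanBelow - meanAbove;
            // Proportional to the between-class variance (constant factor dropped).
            double variance = (double) countBelow * countAbove * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    /** Converts the image to gray levels and marks pixels at or below the threshold as ink (true). */
    static boolean[][] binarize(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int[][] gray = new int[height][width];
        int[] histogram = new int[256];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights (assumed here).
                gray[y][x] = (299 * r + 587 * g + 114 * b) / 1000;
                histogram[gray[y][x]]++;
            }
        }
        int threshold = otsuThreshold(histogram);
        boolean[][] ink = new boolean[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                ink[y][x] = gray[y][x] <= threshold;
            }
        }
        return ink;
    }
}
```

Calling `OtsuSketch.binarize(image)` on a scanned page yields a boolean grid in which dark pixels are `true`, a form that later stages such as connected-component extraction can consume. A global threshold is the simplest choice; unevenly lit scans would likely need a local method instead.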
Layout analysis
These are the procedures:
Optical character recognition
These are the normal procedures:
To match special symbols such as root signs and big delimiters, templates are generated dynamically.
Optical formula recognition
These are the procedures:
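Acknowledgements
The default data files bundled with MathOCR are derivative works of amsfonts by the American Mathematical Society. The code used to read PNM files is derived from the JAI library. In addition, I would like to thank my supervisor Dr. Peixing Li; this program would not be here without his encouragement.
MathOCR
A printed scientific document recognition system
Warning: MathOCR is still at the pre-alpha stage, and its recognition results may not be good enough for practical purposes.
About MathOCR
MathOCR is a printed scientific document recognition system written in Java, released under the GNU General Public License version 3 or (at your option) any later version.
MathOCR has basic image preprocessing, layout analysis and character recognition capabilities, and in particular can recognize mathematical formulas. MathOCR can work on its own without depending on any library other than the standard Java library, but it can also serve as a front-end to OCR systems such as Tesseract, GNU Ocrad or GOCR.
Development of MathOCR began in March 2014 as a by-product of the Sun Yat-Sen University undergraduate innovation training program project "Automatic Recognition of Mathematical Formulas in Images" (《图片中数学公式的自动识别》). The first version was released in September of the same year, making it one of the few printed mathematical formula recognition systems available as free software. Later, from December 2014 to April 2015, logical document layout analysis was added as part of the developer's undergraduate thesis project, extending MathOCR into a printed scientific document recognition system.
Getting MathOCR
MathOCR is available free of charge; please go to the download page.
Release notes
MathOCR 0.0.3 released [2015-05-07]
This version contains major changes, including: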
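MathOCR 0.0.2 released [2014-11-29]
This version mainly makes local improvements to the structural analysis algorithm for mathematical formulas.
MathOCR 0.0.1 released [2014-9-29]
This is the first public release of MathOCR. Its features include: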
Technical reference
The main techniques are described in the following documents:
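The training data used to produce the recognition data bundled with MathOCR 0.0.3 can be obtained here.
Acknowledgements
The data files bundled with MathOCR are derivative works of the American Mathematical Society's amsfonts, and the code used to read PNM files is taken from the JAI library; both are gratefully acknowledged.
In addition, I would like to thank my undergraduate thesis supervisor, Peixing Li; it was his encouragement that turned this program from an idea into reality.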