

Research on Text Recognition Based on Tesseract_OCR (基于 Tesseract_OCR 文字识别的研究)

Computer Technology and Development (《计算机技术与发展》) [ISSN:1006-6977/CN:61-1281/TN]

Volume:: 31
Issue:: No. 11, 2021

Pages:: 76-80

Section:: Graphics and Images

Publication date:: 2021-11-10

Article Info

Title:: Research on Text Recognition Based on Tesseract_OCR

Article ID:: 1673-629X(2021)11-0076-05

Author(s):: ZENG Yue (曾悦)¹; MA Ming-dong (马明栋)²
1. School of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China;
2. School of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China

Keywords:: optical character recognition; text recognition; Tesseract framework; Microsoft Foundation Classes; C++

CLC number:: TP391

DOI:: 10.3969/j.issn.1673-629X.2021.11.013

Abstract:: Optical character recognition (OCR), put simply, uses optical and computer technology to convert printed characters into a black-and-white image file by detecting the light and dark pattern of each pixel, and then applies recognition methods to convert that image file into text a computer can process. This paper is organized into four parts: text information extraction, character recognition, system implementation, and experimental results and analysis. The text information extraction module comprises image preprocessing, cropping and correction of the text region, and character segmentation. It processes the input image to reduce random noise, ensure that the text region contains the complete text, and improve recognition accuracy. The Tesseract OCR engine then recognizes the processed text region and extracts the text in the image. Microsoft Foundation Classes (MFC) is a C++ class library from Microsoft that wraps part of the Windows API functions and offers considerable flexibility. Finally, a text recognition system is implemented with MFC in the VS2015 environment and tested on a library of sample images. The test results indicate that the system achieves a higher recognition rate.
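The abstract names binarization into a black-and-white image and character segmentation as steps of the text extraction module, but gives no algorithmic detail. The sketch below illustrates one common way such steps are done: a fixed-threshold binarization followed by vertical-projection segmentation. It is an assumption for illustration, not the authors' confirmed method. The type `GrayImage` and the functions `binarize` and `segmentColumns` are hypothetical names, and the code does not call Tesseract or MFC.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical minimal image type: row-major, 8-bit grayscale.
struct GrayImage {
    std::size_t width = 0;
    std::size_t height = 0;
    std::vector<std::uint8_t> pixels;  // expected size: width * height
};

// Fixed-threshold binarization (an assumed, simple choice):
// pixels darker than `threshold` are marked as ink (1), the rest as background (0).
std::vector<std::uint8_t> binarize(const GrayImage& img, std::uint8_t threshold) {
    assert(img.pixels.size() == img.width * img.height);
    std::vector<std::uint8_t> ink(img.pixels.size());
    for (std::size_t i = 0; i < img.pixels.size(); ++i) {
        ink[i] = img.pixels[i] < threshold ? 1 : 0;
    }
    return ink;
}

// Vertical-projection segmentation: scan columns left to right and return
// half-open column ranges [start, end) that contain at least one ink pixel.
// Each range is a candidate character in a single horizontal text line.
std::vector<std::pair<std::size_t, std::size_t>> segmentColumns(
    const std::vector<std::uint8_t>& ink, std::size_t width, std::size_t height) {
    assert(ink.size() == width * height);
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    bool inside = false;
    std::size_t start = 0;
    for (std::size_t x = 0; x < width; ++x) {
        bool hasInk = false;
        for (std::size_t y = 0; y < height && !hasInk; ++y) {
            hasInk = ink[y * width + x] != 0;
        }
        if (hasInk && !inside) {
            start = x;  // a run of ink columns begins
            inside = true;
        } else if (!hasInk && inside) {
            spans.emplace_back(start, x);  // the run ended at the blank column
            inside = false;
        }
    }
    if (inside) {
        spans.emplace_back(start, width);  // run touches the right edge
    }
    return spans;
}
```

Each returned column range could then be cropped and passed to a recognizer. Touching or overlapping characters would need a more careful split than this blank-column rule provides.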

