Article Information

  • Title: Optical Character Recognition Engines Performance Comparison in Information Extraction
  • Authors: Tosan Wiar Ramdhani; Indra Budi; Betty Purwandari
  • Journal: International Journal of Advanced Computer Science and Applications (IJACSA)
  • Print ISSN: 2158-107X
  • Electronic ISSN: 2156-5570
  • Publication year: 2021
  • Volume: 12
  • Issue: 8
  • DOI: 10.14569/IJACSA.2021.0120814
  • Language: English
  • Publisher: Science and Information Society (SAI)
  • Abstract: Named Entity Recognition (NER) is often used to acquire important information from text documents as part of the Information Extraction (IE) process. However, the quality of the text documents affects the accuracy of the data obtained, especially for documents produced through Optical Character Recognition (OCR), which never reaches 100% accuracy. This research examines which OCR engine performs best for IE using NER by comparing three OCR engines (Foxit, PDF2GO, Tesseract) over 8,562 government human resources documents across six document categories, two document structures, and four measurements. Several essential entities, such as name, employee ID, document number, document publishing date, employee rank, and family member's name, were targeted for automatic extraction from the documents. The NER processes were implemented in the Python programming language, and the preprocessing tasks were done separately for Foxit, PDF2GO, and Tesseract. In summary, each OCR engine has its benefits and drawbacks; for example, Tesseract offers better NER extraction and conversion time with better accuracy but falls short in the number of entities acquired.
  • Keywords: Named entity recognition; information extraction; optical character recognition; government human resources documents