Article Information

Title: Dataset Generation for OCR
Authors: Aparna Vara Lakshmi Vemuri; T. V. Sai Krishna; Atul Negi et al.
Journal: International Journal of Computer Trends and Technology
Electronic ISSN: 2231-2803
Publication year: 2011
Volume: 2
Issue: 1
Publisher: Seventh Sense Research Group
Abstract: Telugu is one of the prominent scripts in India and Asia, with more than 62 million speakers. While OCR technology is at a mature stage of development for English and other Roman/Latin scripts, OCR for Asian scripts, and Indian scripts in particular, is still at a relatively nascent stage. One reason is the complexity of the orthography, especially in Telugu. Although potentially 10,000 syllables are in frequent use in the language, the orthographic units are composed from combinations of 36 consonants and 16 vowels. A practical OCR system for Telugu script was proposed and developed by Negi et al. [3], who described the complexity of Telugu script and proposed methods for reducing it. Their approach consists of the identification and recognition of connected components. Their recognition step used a modification of the template matching approach, the fringe distance method proposed by Brown [1]. In this paper we propose an improved and robust recognition strategy that first uses the pixel distributions of the script and then exploits the structural information of Telugu orthography. We do not discuss layout-related issues for the isolation of Telugu text regions, which are taken up elsewhere.
Keywords: Dataset Generation for OCR
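
Illustrative note: the abstract mentions template matching with a fringe distance but this record does not give the formulation used. The sketch below shows one common reading of the idea, under stated assumptions: a "fringe map" stores, for every pixel of a binary glyph, the city-block distance to the nearest ink pixel, and two equally sized glyphs are compared by summing the fringe values of one glyph under the ink pixels of the other, in both directions. The function names (`fringe_map`, `fringe_distance`, `classify`), the city-block metric, and the symmetric sum are all assumptions made for illustration, not details taken from the paper.

```python
from collections import deque


def fringe_map(image):
    """City-block distance from every pixel to the nearest ink (1) pixel.

    `image` is a list of equal-length rows of 0/1 values (assumed format).
    """
    rows, cols = len(image), len(image[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    # Seed the breadth-first search with every ink pixel at distance 0.
    for r in range(rows):
        for c in range(cols):
            if image[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    if not queue:
        raise ValueError("image has no ink pixels")
    # Multi-source BFS over 4-connected neighbours gives city-block distances.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist


def fringe_distance(sample, template):
    """Symmetric fringe distance between two same-sized binary glyphs."""
    def one_way(glyph, other_map):
        # Sum the other glyph's fringe values under this glyph's ink pixels.
        return sum(
            other_map[r][c]
            for r, row in enumerate(glyph)
            for c, value in enumerate(row)
            if value
        )

    return one_way(sample, fringe_map(template)) + one_way(
        template, fringe_map(sample)
    )


def classify(sample, templates):
    """Return the label of the template with the smallest fringe distance."""
    return min(templates, key=lambda label: fringe_distance(sample, templates[label]))
```

In a pipeline like the one the abstract describes, each connected component would presumably be size-normalized before being compared against stored templates in this way; that normalization step is not shown here.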