Article Information

Title: HRNet Encoder and Dual-Branch Decoder Framework-Based Scene Text Recognition Model
Authors: Meiling Li; Xiumei Li; Junmei Sun et al.
Journal: International Journal of Antennas and Propagation
Print ISSN: 1687-5869
Electronic ISSN: 1687-5877
Publication year: 2022
Volume: 2022
DOI: 10.1155/2022/2996862
Language: English
Publisher: Hindawi Publishing Corporation
Abstract: Scene text recognition (STR) aims to automatically recognize text content in natural scenes. Unlike regular document text, text in natural scenes has irregular shapes, complex backgrounds, and distorted or blurred content, which makes STR challenging. To address STR for distorted, blurred, and low-resolution text in natural scenes, this paper proposes an STR model based on an HRNet encoder and a dual-branch decoder framework. The model consists mainly of an encoder module and a dual-branch decoder module, the latter composed of a super-resolution branch and a recognition branch in parallel. In the encoder module, HRNet performs cross-parallel aggregation of multi-resolution representations during feature extraction and outputs four kinds of feature maps at different resolutions. In addition, a supervised attention module is used to strengthen the learning of important feature information. In the decoder module, the super-resolution branch takes the highest-resolution feature maps from the encoder as input and restores images by upsampling through transposed convolution. In the recognition branch, the four kinds of feature maps are passed through independent transposed convolution layers for multiscale fusion and then fed into an attention-based decoder for text recognition. To improve recognition accuracy, feature extraction in the encoder is jointly supervised by the super-resolution branch loss and the recognition branch loss. The super-resolution branch is used only during training and is discarded at test time to reduce model complexity. The proposed model is trained on the Synth90K and SynthText datasets and tested on seven natural scene datasets. Compared with classical models such as ASTER, TextSR, and SCGAN, the proposed model achieves higher recognition accuracy, with better results on irregular and blurred datasets such as IC15, SVTP, and CUTE80.
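
The multiscale fusion and joint supervision described in the abstract can be illustrated with a small shape-and-loss sketch. It is not the authors' implementation. The 32x128 input size, the 1/4 to 1/32 branch resolutions, the kernel and stride choices, and the `sr_weight` parameter are all assumptions made for illustration; the abstract does not state them. Only the transposed-convolution output-size formula (dilation 1) is standard.

```python
def transposed_conv_out(size, kernel, stride, padding=0, output_padding=0):
    """Output length along one axis of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Assumed input size and branch downsampling factors (not given in the abstract).
input_hw = (32, 128)
downsample = [4, 8, 16, 32]
branches = [(input_hw[0] // d, input_hw[1] // d) for d in downsample]

# One independent transposed convolution per branch, with kernel = stride =
# upsampling factor, brings every map to the highest branch resolution.
fused_shapes = []
for (h, w), d in zip(branches, downsample):
    factor = d // downsample[0]
    fused_shapes.append((transposed_conv_out(h, factor, factor),
                         transposed_conv_out(w, factor, factor)))


def joint_loss(rec_loss, sr_loss=None, sr_weight=1.0):
    """Recognition loss plus a weighted super-resolution loss.

    sr_loss is None at test time, when the super-resolution branch is dropped.
    sr_weight is a hypothetical balancing factor.
    """
    if sr_loss is None:
        return rec_loss
    return rec_loss + sr_weight * sr_loss


print(branches)      # per-branch (height, width) before fusion
print(fused_shapes)  # per-branch (height, width) after upsampling
print(joint_loss(2.0, 1.0, sr_weight=0.5), joint_loss(2.0))
```

With kernel equal to stride and no padding, each transposed convolution multiplies the spatial size by exactly its stride, so all four maps reach the same size and can be fused by concatenation or summation.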