LAL: linguistically aware learning for scene text recognition

Files
10204780.pdf (3.84 MB)
Accepted manuscript
Date
2020-10-12
Authors
Zheng, Yi
Qin, Wenda
Wijaya, Derry
Betke, Margrit
Version
Published version
Citation
Y. Zheng, W. Qin, D. Wijaya, M. Betke. 2020. "LAL: Linguistically Aware Learning for Scene Text Recognition." Proceedings of the 28th ACM International Conference on Multimedia. MM '20: The 28th ACM International Conference on Multimedia. https://doi.org/10.1145/3394171.3413913
Abstract
Scene text recognition is the task of recognizing character sequences in images of natural scenes. The considerable diversity in the appearance of text in a scene image and potentially highly complex backgrounds make text recognition challenging. Previous approaches employ character sequence generators to analyze text regions and, subsequently, compare the candidate character sequences against a language model. In this work, we propose a bimodal framework that simultaneously utilizes visual and linguistic information to enhance recognition performance. Our linguistically aware learning (LAL) method learns visual embeddings with a rectifier, encoder, and attention decoder, and linguistic embeddings with a deep next-character prediction model. We present an innovative way of combining these two embeddings effectively. Our experiments on eight standard benchmarks show that our method outperforms previous methods by large margins, particularly on rotated, foreshortened, and curved text. We show that the bimodal approach has a statistically significant impact. We also contribute a new dataset and show robust performance when LAL is combined with a text detector in a pipelined text spotting framework.
License
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.