Article

Display-Semantic Transformer for Scene Text Recognition

by Xinqi Yang, Wushour Silamu, Miaomiao Xu and Yanbing Li
1 College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China
2 Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China
3 Xinjiang Multilingual Information Technology Research Center, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Sensors 2023, 23(19), 8159; https://doi.org/10.3390/s23198159
Submission received: 10 September 2023 / Revised: 26 September 2023 / Accepted: 27 September 2023 / Published: 28 September 2023

Abstract

Linguistic knowledge contributes substantially to scene text recognition by providing semantic information that refines the character sequence. A purely visual model focuses only on the visual texture of characters without actively learning linguistic information, which leads to poor recognition rates on noisy (e.g., distorted or blurry) images. To address these issues, this study builds upon recent findings on the Vision Transformer: our approach (called Display-Semantic Transformer, or DST for short) constructs a masked language model and a semantic-visual interaction module. The model mines deep semantic information from images to assist scene text recognition and improve robustness. The semantic-visual interaction module better realizes the interaction between semantic information and visual features, so that the visual features can be enhanced by semantic information and the model achieves a better recognition effect. Experimental results show that our model improves the average recognition accuracy on six benchmark test sets by nearly 2% compared to the baseline, while retaining a small number of parameters and fast inference speed, thereby attaining a better balance between accuracy and speed.

1. Introduction

The first step in the scene text recognition task is to detect text regions in natural images, regardless of their shape. The detected regions are then cropped and further processed to extract their text content. One of the primary goals of Scene Text Recognition (STR) is to accurately recognize consecutive characters from the extracted regions [1,2]. STR is useful in various applications, including reading road signs, billboards, logos, and printed shirts. Its practical significance can be observed in autonomous driving, augmented reality, retail, education, and devices designed for visually impaired individuals [3]. Unlike traditional Optical Character Recognition (OCR), which predominantly deals with uniform text attributes, STR faces numerous challenges, including different font styles, varying orientations, diverse text shapes, uneven illumination, occlusion, and blurring. Furthermore, images obtained from natural environments often exhibit noise, blurriness, distortion, or warping, making STR an important yet highly demanding problem.
STR is primarily a visual task. However, when faced with unreadable portions of an image, e.g., occlusion or distortion, relying solely on image features is insufficient for accurate inference. Scene text images contain two levels of content: visual and linguistic information. A purely visual model lacks linguistic capability and merely focuses on the visual texture of characters, neglecting the linguistic information [4]. Incorporating linguistic knowledge into STR models has therefore become a key research focus, as it enables the model to reason about characters based on context [5]. Numerous approaches have explored how to integrate linguistic knowledge into STR models, drawing inspiration from Natural Language Processing (NLP) techniques [6,7]. Recent studies have shifted their attention towards assisting scene text recognition by acquiring linguistic information [4,8,9,10]. Hence, the prevalent trend in recent methodologies is the adoption of a two-step visual-and-linguistic modeling framework [11] (as shown in Figure 1). In this approach, the visual model exclusively emphasizes the textual appearance of characters, neglecting any linguistic aspects, whereas the language model uses structures such as Recurrent Neural Networks (RNNs) [12], Convolutional Neural Networks (CNNs) [13], and transformers [4] to infer the associations between characters.
Although these approaches have achieved favorable recognition results, certain challenges persist. Language models impose an increased computational burden, and many contemporary methods employ intricate bidirectional inference architectures to acquire more reliable linguistic information; regrettably, this exacerbates the computational load of the models, significantly curbing their efficacy in real-world scenarios [10,12,14]. In addition, an optimal STR model must prioritize not only recognition precision but also inference speed and computational efficiency.
In this paper, we build a DST model based on DeiT-tiny and design a semantic module (a masked language model) and a semantic-visual interaction module that allow the model to learn linguistic knowledge actively. Additionally, our model prioritizes computational efficiency, resulting in higher accuracy, faster speed, and reduced running costs.

2. Related Work

The study of STR has been the focus of extensive research over a considerable period [1,2,15]. Significant advancements have been accomplished in scene text recognition, owing to the advancements in deep learning [16,17,18]. STR techniques can be broadly categorized into semantic-free and semantic-aware methods based on their utilization of linguistic information.

2.1. Semantic-Free Methods

Earlier methods were semantic-free. These methods focused on extracting visual characteristics from an image to identify characters, without explicitly considering the linguistic relationships between them. Bai et al. proposed a novel strategy that utilizes sequences to enhance the conventional approach of character segmentation; this approach effectively avoids the challenges of character segmentation while capturing the relationships between local image regions and their semantic information [19,20]. Building upon this, Shi et al. proposed an approach based on Connectionist Temporal Classification (CTC), in which the visual features extracted by a CNN are reshaped into a feature sequence that is then modeled by an RNN with a CTC loss [21]. In subsequent studies, He et al. and others introduced several techniques to enhance recognition accuracy, including deep convolutional recurrent networks [22], the integration of multiple RNNs [23], and graph convolutional networks that guide CTC decoding instead of relying solely on RNNs [24]. Segmentation-based methods treat recognition as a standard or modified segmentation problem in which every character represents a specific category [7,25,26]. Nevertheless, these methods commonly require character-level annotations, which can be challenging to acquire, and they are sensitive to segmentation noise. In general, the recognition performance of semantic-free methods is limited because they largely ignore the contextual information of the text. In real-world scenarios, accurately understanding scene text usually requires incorporating both semantic comprehension and contextual information. To overcome these limitations, it is necessary to utilize deeper semantic understanding and contextual information to enhance the accuracy and robustness of scene text recognition.

2.2. Semantic-Aware Methods

Semantic-aware methods, on the other hand, use linguistic rules to aid the recognition process [10,27,28,29]. In the study by Li et al., RNNs were employed to learn the sequential patterns within a sequence without requiring manual specification of N-grams [18]. Aster first employs a rectification module prior to recognition and then utilizes RNNs to capture the linguistic dependence on previously predicted characters [12]. However, the sequential and time-dependent nature of RNNs restricts both the computational efficiency and the performance of semantic reasoning [4]. To overcome this limitation, SRN introduces a global semantic inference module for purely linguistic modeling using a transformer unit [30]. This module takes the predictions of the visual model as input and predicts the relationships between characters to refine the recognition outcomes [4]. Fang et al. introduced a novel architecture that integrates visual and linguistic modeling by utilizing CNNs [13]. The R2AM technique applies recursive CNNs to extract relevant features and incorporates LSTMs as implicit language models to enhance linguistic modeling [18]. JVSR proposes a multi-level decoder that refers to visual features multiple times to enhance semantic features [31]; specifically, it is based on a multi-level RNN-attention decoder in which each level generates an output sequence and utilizes visual features to update each hidden state. MATRN introduces cross-modal transformer modules that explore the interaction between visual and language features, fusing semantic and visual features for enhanced interaction and yielding a more comprehensive and accurate text recognition method with better recognition performance [32]. LevOCR further investigates how to effectively fuse visual and language features [33]. To learn internal language models at the visual level, VisionLAN introduces a visual reasoning module that randomly masks the visual features corresponding to characters during training [11]. MGP-STR proposes a novel approach that improves the performance of scene text recognition by learning implicit context to enhance the model's robustness and accuracy [34]. LPV improves the understanding and recognition of scene text by introducing a technique called Linguistic More, which utilizes additional linguistic features and enhances the processing of language information, improving the efficiency and accuracy of scene text recognition [35]. ABINet proposes an autonomous, bidirectional, and iterative language modeling approach that improves recognition accuracy by iteratively refining the language model's predictions [36]. PARSeq proposes a scene text recognition method based on a permuted autoregressive sequence model, whose core idea is to model text sequences autoregressively under permuted orderings, bringing a new perspective to research and applications in scene text recognition [3]. Although these methods have achieved commendable results in scene text recognition tasks, introducing additional language models results in excessive model parameters and increased inference time and complexity. Therefore, finding a better balance between the computational cost of language models and recognition accuracy remains a challenging problem.
Our work relates to semantic-aware methods: we design a semantic module and a semantic-visual interaction module. However, what distinguishes our approach from most semantic-aware methods is our focus on achieving higher accuracy with fewer parameters; we prioritize the balance between model accuracy, speed, and parameter count. Unlike semantic-free methods, we actively introduce semantic information to assist scene text recognition, leveraging contextual cues to enhance character recognition accuracy. The experimental results show that our model excels in inference time, number of parameters, and algorithmic complexity, which are only 15.9 ms, 8.57 M, and 5.919 G FLOPs, respectively, while achieving a relatively better recognition effect.

3. DST Model

To obtain better recognition at a lower computational cost, we designed a DST model based on the Vision Transformer, as shown in Figure 2. To control the parameter size of the model, the visual branch uses a Vision Transformer to extract visual information. In the semantic branch, we designed a masked language model that helps the model actively learn linguistic knowledge, so that it can reason about characters based on context, improving recognition performance and robustness. In addition, we build a semantic-visual interaction module to achieve interactive learning between semantics and vision. In this way, visual features can be enhanced by semantic information, which helps the model locate the characters to be recognized and thus improves the recognition effect of the model.

3.1. Image Processing

The DST model first passes the input image through a linear mapping, implemented as a convolutional layer whose kernel size and stride are both P×P (P×P is the patch size of the ViT (Vision Transformer)); the resulting initial features are summed with a learnable positional encoding of the same dimensions. The resulting sum is used as the input to the encoder.
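To make this concrete, the following PyTorch-style sketch illustrates the patch-embedding step described above; the patch size, embedding dimension, and channel count are assumptions chosen for illustration (DeiT-tiny-like values), not the authors' released implementation.

```python
# Sketch of the linear mapping described above (illustrative, not the authors' code).
# Patch size, embedding dimension, and channel count are assumed values.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_h=32, img_w=128, patch=16, in_ch=3, dim=192):
        super().__init__()
        # A Conv2d whose kernel and stride are both P x P acts as the per-patch projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        num_patches = (img_h // patch) * (img_w // patch)
        # Learnable class token and positional encoding with the same dimensions as the tokens.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                       # x: (B, C, H, W)
        z = self.proj(x)                        # (B, D, H/P, W/P)
        z = z.flatten(2).transpose(1, 2)        # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z = torch.cat([cls, z], dim=1)          # prepend the class token
        return z + self.pos_embed               # sum with the positional encoding

# Example: a 32x128 crop becomes a (1, N+1, 192) token sequence for the encoder.
tokens = PatchEmbedding()(torch.randn(1, 3, 32, 128))
```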

3.2. Visual Model

As shown in Figure 3, the vision model consists of a 12-layer Vision Transformer (ViT), which is a direct extension of the transformer to images. Each ViT layer contains a multi-headed self-attention module, i.e., q = k = v. The input image size of our model is set to 32×128, and each input image $x \in \mathbb{R}^{H \times W \times C}$ is divided into a sequence of 2D patches $x_p \in \mathbb{R}^{N \times (P^2 C)}$, where H×W is the image size, C is the number of channels, and P×P is the size of each patch. The dimension of the generated patch embeddings is 192. Each embedding is added to a learnable positional embedding of the same dimensionality, and their sum is taken as the input to the ViT encoder before being processed by the first ViT layer. The input to this encoder is:
$$Z_0 = [\,x_{\mathrm{class}};\; x_p^1 E;\; x_p^2 E;\; \ldots;\; x_p^N E\,] + E_{pos}$$
where $E \in \mathbb{R}^{(P^2 C) \times D}$ and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$.
Firstly, the input of each ViT layer passes through layer normalization (LN), then through the Multi-headed Self-Attention (MSA) layer, followed by another layer normalization and a Multilayer Perceptron (MLP). The MSA establishes the interdependence between the feature vectors. The MLP consists of two linear layers with a GELU activation function to complete the feature extraction, and its output is combined via a residual connection.
The formula for the MSA module is:
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \ldots, L$$
where L represents the encoder depth of the ViT.
The output of the MLP block is:
$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \ldots, L$$
The final output feature $z_L \in \mathbb{R}^{(N+1) \times D}$ is used as input to the semantic-visual interaction module.
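For reference, one pre-norm encoder layer implementing the two residual equations above can be sketched as follows; the embedding dimension and number of heads follow DeiT-tiny and are assumptions for illustration.

```python
# Minimal sketch of one pre-norm ViT encoder layer (the z'_l and z_l equations above).
import torch
import torch.nn as nn

class ViTLayer(nn.Module):
    def __init__(self, dim=192, heads=3, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),                              # two linear layers with GELU activation
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, z):                           # z: (B, N+1, D)
        h = self.ln1(z)
        attn, _ = self.msa(h, h, h)                 # self-attention: q = k = v
        z = z + attn                                # z'_l = MSA(LN(z_{l-1})) + z_{l-1}
        z = z + self.mlp(self.ln2(z))               # z_l  = MLP(LN(z'_l)) + z'_l
        return z
```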

3.3. Masked Language Model

The model is more robust because it can use semantic information to help interpret visual cues in scene text recognition. Our model therefore includes a masked language model built from transformer units to extract semantic information. As shown in Figure 4, the multi-head attention module of this masked language model uses positional information as the query q. We first generate, for each position index, a vector whose entry at that index is a fixed constant and whose remaining entries are 0. A sinusoidal positional embedding of the same dimensionality is then applied, followed by two MLP layers, to obtain the positional embeddings; the sinusoidal embeddings are computed using the following formulae.
$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$$
where pos represents the character position, i indexes the dimension of the embedding, and $d_{model}$ is the dimension of the positional embedding. The value 10,000 controls the cycle lengths of the sine and cosine functions, giving different input positions unique encodings; it is chosen to maintain a suitable range of periods while accommodating the dimensionality of the hidden layers. In this PE matrix, sine values fill the even dimensions and cosine values fill the odd dimensions. The visual features extracted by the ViT are then passed through a linear layer to obtain the seed text, which, after word embedding, is used as the key k and value v in the multi-head attention module. Meanwhile, we use a mask to prevent information leakage across time steps. The output of the multi-head attention module passes through an MLP layer and a residual connection to obtain the extracted semantic information.
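A minimal sketch of how such positional queries could be built, assuming the description above (a sinusoidal table refined by two MLP layers); the maximum length and dimensions are illustrative assumptions.

```python
# Sketch: sinusoidal positional embeddings refined by two MLP layers to form the
# positional queries of the masked language model. Sizes are illustrative assumptions.
import math
import torch
import torch.nn as nn

def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)   # sine at even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # cosine at odd dimensions
    return pe                            # (max_len, d_model)

class PositionalQuery(nn.Module):
    def __init__(self, max_len=27, d_model=192):
        super().__init__()
        self.register_buffer("pe", sinusoidal_table(max_len, d_model))
        # Two MLP layers map the fixed sinusoidal code to the position queries.
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, batch_size: int) -> torch.Tensor:
        q = self.mlp(self.pe)                        # (max_len, d_model)
        return q.unsqueeze(0).expand(batch_size, -1, -1)
```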
The output of the multi-head attention module is:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_n)\, W^O$$
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)$$
where Q, K, and V represent the query, key, and value, respectively, and $W_i^Q$, $W_i^K$, and $W_i^V$ are the weight matrices of the i-th attention head. The Concat function concatenates the outputs of the different attention heads, and the concatenation is projected by a matrix of trainable parameters $W^O$.
The final output of the multi-head attention module is obtained by the residual connection of the results of the attention module with the input query. The resulting sum is then subjected to a normalization layer.
The output of the residual connection and the output of the normalization layer are:
$$x = Q + \mathrm{Dropout}(\mathrm{MultiHead}(Q, K, V))$$
$$S = \mathrm{LayerNorm}(x) = \gamma\,\frac{x - \mu}{\sigma} + \beta$$
During training, the mean and standard deviation of the feature dimensions within a sample are represented by $\mu$ and $\sigma$, respectively, while $\gamma$ and $\beta$ denote the scaling factor and translation factor obtained through learning.
The FeedForward Network (FFN) module performs feature extraction and the output of the FFN module is:
$$\mathrm{FFN}(S) = W_2\,\mathrm{ReLU}(W_1 S + b_1) + b_2$$
The weight matrices and bias parameters of the FFN, denoted as $W_1$, $W_2$, $b_1$, and $b_2$, are learnable. The final semantic information $S'$ is obtained by passing the result of the FFN layer through a Dropout layer, followed by a residual connection and a final normalization layer.
The output of the layer normalization layer is:
$$S' = \mathrm{LayerNorm}(S + \mathrm{Dropout}(\mathrm{FFN}(S)))$$
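Putting Equations (6)–(11) together, a minimal sketch of the masked language model block is shown below; the sizes, dropout rate, and causal masking strategy are assumptions consistent with the description (positional queries as q, seed-text embeddings as k and v, and a mask preventing leakage across time steps).

```python
# Sketch of the masked language model block (cf. Eqs. (6)-(11)): positional queries
# attend over seed-text embeddings under a mask; sizes and masking are assumptions.
import torch
import torch.nn as nn

class MaskedLanguageBlock(nn.Module):
    def __init__(self, d_model=192, heads=3, d_ff=768, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.drop1 = nn.Dropout(p_drop)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.drop2 = nn.Dropout(p_drop)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, pos_query, seed_emb, attn_mask=None):
        # pos_query: (B, T, D) positional queries; seed_emb: (B, T, D) seed-text embeddings.
        a, _ = self.attn(pos_query, seed_emb, seed_emb, attn_mask=attn_mask)
        x = pos_query + self.drop1(a)                     # x = Q + Dropout(MultiHead(Q, K, V))
        s = self.norm1(x)                                 # S = LayerNorm(x)
        s_out = self.norm2(s + self.drop2(self.ffn(s)))   # S' = LayerNorm(S + Dropout(FFN(S)))
        return s_out                                      # extracted semantic information

# A causal mask built with torch.triu can prevent leakage across time steps:
T = 27
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
```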

3.4. Semantic-Visual Interaction Module (SVIM)

To establish the correspondence between visual and semantic information, we use the semantic features as the query to interact with the visual features, which helps the model locate the characters to be recognized. For this purpose, we designed a semantic-visual interaction module, shown in Figure 4. Its inputs are the features extracted by the visual model and by the masked language model. A multi-head attention layer first processes these inputs, and the resulting features then pass through a multilayer perceptron, residual connections, and layer normalization to obtain the enhanced visual information. In the multi-head attention module, the query is the linguistic information extracted by the semantic module, while the key and value are the visual information extracted by the visual model. As in the language branch, we use a mask to prevent the leakage of semantic information across time steps. The interaction between semantic and visual information is thus a semantic enhancement of the visual information; the enhanced visual features are computed as in Equations (6)–(11).
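Because SVIM shares the attention-plus-MLP structure of the language branch, it can be sketched as a cross-attention block in which the semantic features form the query and the visual tokens form the key and value; the class and sizes below are illustrative assumptions, not the authors' exact implementation.

```python
# SVIM sketch: cross-attention where semantic features are the query and visual
# features are the key/value; structure and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticVisualInteraction(nn.Module):
    def __init__(self, d_model=192, heads=3, d_ff=768, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p_drop)

    def forward(self, semantic, visual, attn_mask=None):
        # Query = semantic information S; Key/Value = visual features from the ViT encoder.
        a, _ = self.attn(semantic, visual, visual, attn_mask=attn_mask)
        x = self.norm1(semantic + self.drop(a))
        return self.norm2(x + self.drop(self.mlp(x)))   # enhanced visual information

# Example: T character positions attend over N+1 visual tokens.
out = SemanticVisualInteraction()(torch.randn(2, 27, 192), torch.randn(2, 17, 192))
```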

3.5. Character Prediction and Loss Calculation

The enhanced visual features are passed through a linear projection to obtain the character predictions:
$$y_i = \mathrm{Linear}(VL_i), \quad i = 1, \ldots, S$$
where $VL_i$ is the enhanced feature at position i, and S is the maximum text length the model can predict plus two, accounting for the [GO] and [s] tokens.
The cross-entropy loss function effectively drives model optimization for classification tasks. When the model's predictions do not match the true labels, the cross-entropy loss produces a larger loss value, directing the model's attention towards incorrect predictions; the backpropagation algorithm then updates the model parameters to reduce the loss. In multi-class classification problems, the cross-entropy loss quantifies the quality of the model's predictions for each class, and the accuracy of the model across different classes can be evaluated by calculating the cross-entropy between the probability distribution output by the model and the true labels. When used with the softmax activation function, it effectively drives the model to predict the correct class with higher probability and continuously improves classification performance during training. We therefore use the cross-entropy loss function to calculate the loss:
$$L_{CE}(y, t) = -\sum_{i=1}^{C} t_i \log y_i$$
where the loss is denoted by $L_{CE}$, the model output by y, the true labels by t, and the number of label categories by C; $t_i$ denotes the true label value of the i-th category and $y_i$ the predicted probability of the i-th category.
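As a quick illustration of the prediction head and loss (all sizes and tensors below are made-up assumptions), note that PyTorch's nn.CrossEntropyLoss applies the softmax and the negative log term of the equation above internally.

```python
# Toy example of the prediction head and cross-entropy loss; all sizes are assumptions.
import torch
import torch.nn as nn

B, S, D, C = 2, 27, 192, 38           # batch, sequence length, feature dim, #classes
head = nn.Linear(D, C)                # linear projection to character predictions
criterion = nn.CrossEntropyLoss()     # softmax + negative log-likelihood

enhanced = torch.randn(B, S, D)       # enhanced visual features VL from SVIM
logits = head(enhanced)               # (B, S, C) character predictions y_i
targets = torch.randint(0, C, (B, S)) # ground-truth character indices t

# CrossEntropyLoss expects (N, C); flatten the batch and sequence dimensions.
loss = criterion(logits.reshape(-1, C), targets.reshape(-1))
loss.backward()                       # gradients drive the model toward correct classes
```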

3.6. Methodology

3.6.1. Datasets

As with most scene text recognition methods, the proposed network is trained on two publicly available synthetic datasets, MJSynth [37] and SynthText [38]. MJSynth contains nine million synthetic text images, which combine fonts and colors to render text into real images in a naturalistic manner, and each image showcases synthetic text with distinct textual content and text styles. SynthText, which is likewise primarily designated for model training, comprises an extensive collection of seven million images and notably includes examples featuring special characters. Each text image includes an authentic vertical text layout; while most characters within the images are in English, there are also genuine vertical columns of text in other languages. Unlike MJSynth, the text in SynthText is integrated into real-life scenes, including billboards, street signs, and road signs. We conducted extensive experiments on six standard benchmark datasets, including three regular text datasets (IIIT5K-Words, Street View Text, and ICDAR2013) and three irregular text datasets (ICDAR2015, Street View Text Perspective, and CUTE80).
IIIT5K-Words (IIIT5K) [39] is a collection of 5000 word images cropped from natural scene images gathered through Google Image Search. This dataset comprises 2000 images designated for training and 3000 for testing. The text within these patches follows a consistent horizontal arrangement.
The Street View Text (SVT) [40] dataset consists of 647 images extracted from Google Street View and cropped for testing purposes. The majority of these images are in a horizontal orientation; however, they are significantly affected by issues such as noise, blurriness, and low resolution.
ICDAR2013 [41] consists of 288 real scene images and additionally includes 1095 images cropped from mall images. However, only 857 images are used for testing; the remaining images are discarded because they contain non-alphanumeric characters or fewer than three characters.
ICDAR2015 [42] consists of word patches extracted from incidental scene images captured from various angles. As a result, most word patches within this dataset exhibit irregularities, such as being rotated, in perspective, or curved. It comprises 4468 training samples and 2077 test samples.
StreetViewTextPerspective (SVTP) [43] is a dataset containing 639 images from side view snapshots in Google Street View. These images have significant perspective distortions and are specifically designed for model testing.
CUTE80 [44] includes 288 high-resolution images specifically designed for testing purposes. Among these images, a substantial portion consists of curved and irregular text.

3.6.2. Experimental Setup

In our experiments, the input images were resized to a height of 32 and a width of 128. For visual feature extraction, we used DeiT-tiny [45] as the backbone of the model, retaining the same configuration as the original except for the different input image size. For data augmentation [46], we applied standard techniques for scene text images, such as blurring, noise, perspective distortion, and rotation. Model training was performed on two RTX A5000 GPUs with a batch size of 360 for one million iterations. The Adadelta optimizer [47] was chosen with an initial learning rate of 1, and the cosine annealing LR [48] strategy was used to adjust the learning rate during training.
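A minimal sketch of this optimization setup (Adadelta with an initial learning rate of 1, cosine annealing over one million iterations, batch size 360); the model and loss here are placeholders, not the DST implementation.

```python
# Sketch of the training setup described above; the model and loss are placeholders.
import torch
from torch.optim import Adadelta
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(192, 38)           # placeholder for the DST model
optimizer = Adadelta(model.parameters(), lr=1.0)
scheduler = CosineAnnealingLR(optimizer, T_max=1_000_000)   # one million iterations

for step in range(3):                      # illustrative loop; real training runs 1M steps
    optimizer.zero_grad()
    loss = model(torch.randn(360, 192)).sum()   # batch size 360 (dummy loss)
    loss.backward()
    optimizer.step()
    scheduler.step()                       # cosine annealing of the learning rate
```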

4. Experiment

4.1. Proof of Concept for the DST Model

Since the DeiT-tiny pure vision model cannot utilize contextual information to assist recognition in some low-quality images, we aim to improve its recognition performance by building a masked language model that helps the DST model use contextual information for scene text recognition. Moreover, to further exploit the semantic information extracted by the masked language model, we build a semantic-visual interaction module that helps the model recognize characters through the interaction between semantics and vision. We conduct experimental analysis and ablation validation of the proposed framework in the following sections; the results show that the proposed framework is feasible and effective.

4.2. Comparison of Accuracy with Existing Methods on Six Standard Benchmarks

Our model is built on DeiT-tiny, a ViT variant trained with knowledge distillation. In our design, DeiT-tiny serves as the visual model that extracts visual information, and it is enhanced with a semantic branch consisting of the masked language model and the semantic-visual interaction module, which effectively improves the robustness and recognition rate of the model. Table 1 compares our model with state-of-the-art scene text recognition methods; we carefully analyzed and summarized the results of each model on the six datasets. Our model achieves the best results on three datasets (ICDAR2013, SVTP, and ICDAR2015) and also performs well in accuracy on the other three. Overall, our model improves by nearly 2% on average over the baseline. The DST model thus effectively improves recognition accuracy by adding the masked language model and the semantic-visual interaction module.

4.3. Ablation Study

Our subsequent experiments are also based on the DeiT-tiny model. The findings presented in Table 2 illustrate that incorporating a module for semantic information into the baseline aids in enhancing the average accuracy by nearly 1%. This implies that the amalgamation of visual and semantic information is a practical approach to recognition. To make the recognition even more effective, we add a semantic-visual interaction module to the model, which enhances the recognition accuracy of the model through the interaction between semantic and visual information. Experiments show that our improvements are practical, and our model’s recognition rate improves by nearly 2% compared to the baseline.

4.4. Comparison of Inference Times and Model Parameters

We compare our model with several classical language-model-based approaches. SRN uses semantic reasoning to improve the accuracy of scene text recognition; by introducing semantic awareness and reasoning modules, the system can better understand the contextual information of the text and thus achieves superior performance. VisionLAN proposes a novel scene text recognizer that unifies visual and language modeling in a single network. ABINet accurately recognizes text in various complex scenarios through comprehensive modeling of contextual information and iterative optimization. MGP-STR uses a multi-granularity prediction approach, which fully exploits local and global information about the text to achieve more accurate and robust scene text recognition. PARSeq uses a permuted autoregressive sequence model for scene text recognition, which enables the system to better understand and recognize complex scene texts by dealing with changes in text order. As shown in Table 3, our model outperforms SRN and VisionLAN in accuracy by 0.28% and 0.45%, respectively, thanks to our designed language model and semantic-visual interaction module, and both the computational cost and the inference speed of our model are much better than those of SRN and VisionLAN. Although our model is slightly lower in recognition accuracy than ABINet, MGP-STR, and PARSeq, the parameter counts of these models are, on average, about 4.3 times that of our model, and the inference time of ABINet is much greater than that of ours. In terms of model complexity, only PARSeq performs better; all other models have greater complexity than ours. Based on these data, our model exhibits a favorable balance between accuracy and computational cost.

5. Conclusions

To enable the scene text recognition model to actively learn linguistic knowledge while reducing computational cost, we designed the DST model based on DeiT-tiny. First, we added a transformer-based language branch and designed a masked language model that enables the model to learn linguistic knowledge actively, thereby addressing the problem of the model focusing only on visual texture without learning linguistic knowledge; the constructed semantic-visual interaction module enhances the visual information through the interaction between language and vision. Second, compared with most language models with sizeable computational costs, our model has a minor computational cost of only 8.57 M parameters while achieving better recognition results and significantly improved performance. Compared to existing baseline models for scene text recognition, our model consistently performs well in terms of inference speed, recognition rate, and number of parameters, making it more practical and feasible for real-world applications in various scenarios.

Author Contributions

X.Y. designed and performed the experiments on the model and wrote the manuscript; the work was funded by W.S.; M.X. and Y.L. reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (NSFC) Joint Fund Project "Research on Basic Theory and Key Technology of Discrete Intelligent Manufacturing Based on Industrial Big Data" (U1911401, 2020.01–2023.12) and the NSFC Joint Fund Project "Research on Key Technology of Uyghur-Chinese Speech Translation System" (U1603262, 2.42 million).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Written informed consent for publication of this paper was obtained from all authors.

Data Availability Statement

The data that support the findings of this study are openly available in the public domain.

Conflicts of Interest

The authors declared no potential conflict of interest concerning this article’s research, authorship, and publication.

References

  1. Long, S.; He, X.; Yao, C. Scene text detection and recognition: The deep learning era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
  2. Zhu, Y.; Yao, C.; Bai, X. Scene text detection and recognition: Recent advances and future trends. Front. Comput. Sci. 2016, 10, 19–36. [Google Scholar] [CrossRef]
  3. Bautista, D.; Atienza, R. Scene text recognition with permuted autoregressive sequence models. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 178–196. [Google Scholar]
  4. Yu, D.; Li, X.; Zhang, C.; Liu, T.; Han, J.; Liu, J.; Ding, E. Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12113–12122. [Google Scholar]
  5. Wang, P.; Da, C.; Yao, C. Multi-granularity prediction for scene text recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 339–355. [Google Scholar]
  6. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025. [Google Scholar]
  7. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  8. Qiao, Z.; Zhou, Y.; Yang, D.; Zhou, Y.; Wang, W. Seed: Semantics enhanced encoder-decoder framework for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13528–13537. [Google Scholar]
  9. Yue, X.; Kuang, Z.; Lin, C.; Sun, H.; Zhang, W. Robustscanner: Dynamically enhancing positional clues for robust text recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 135–151. [Google Scholar]
  10. Zhan, F.; Lu, S. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2059–2068. [Google Scholar]
  11. Wang, Y.; Xie, H.; Fang, S.; Wang, J.; Zhu, S.; Zhang, Y. From two to one: A new scene text recognizer with visual language modeling network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14194–14203. [Google Scholar]
  12. Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Aster: An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048. [Google Scholar] [CrossRef] [PubMed]
  13. Fang, S.; Xie, H.; Zha, Z.J.; Sun, N.; Tan, J.; Zhang, Y. Attention and language ensemble for scene text recognition with convolutional sequence modeling. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 26 October 2018; pp. 248–256. [Google Scholar]
  14. Wang, T.; Zhu, Y.; Jin, L.; Luo, C.; Chen, X.; Wu, Y.; Wang, Q.; Cai, M. Decoupled attention network for text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–8 February 2020; Volume 34, pp. 12216–12224. [Google Scholar]
  15. Chen, X.; Jin, L.; Zhu, Y.; Luo, C.; Wang, T. Text recognition in the wild: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  18. Lee, C.Y.; Osindero, S. Recursive recurrent nets with attention modeling for ocr in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2231–2239. [Google Scholar]
  19. Bai, J.; Chen, Z.; Feng, B.; Xu, B. Chinese image text recognition on grayscale pixels. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; IEEE: Toulouse, France, 2014; pp. 1380–1384. [Google Scholar]
  20. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 2016, 116, 1–20. [Google Scholar] [CrossRef]
  21. Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
  22. He, P.; Huang, W.; Qiao, Y.; Loy, C.; Tang, X. Reading scene text in deep convolutional sequences. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  23. Su, B.; Lu, S. Accurate recognition of words in scenes without character segmentation using recurrent neural network. Pattern Recognit. 2017, 63, 397–405. [Google Scholar] [CrossRef]
  24. Hu, W.; Cai, X.; Hou, J.; Yi, S.; Lin, Z. Gtc: Guided training of ctc towards efficient and accurate scene text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11005–11012. [Google Scholar]
  25. Liao, M.; Zhang, J.; Wan, Z.; Xie, F.; Liang, J.; Lyu, P.; Yao, C.; Bai, X. Scene text recognition from two-dimensional perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 1–27 January 2019; Volume 33, pp. 8714–8721. [Google Scholar]
  26. Xing, L.; Tian, Z.; Huang, W.; Scott, M.R. Convolutional character networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Toulouse, France, 2019; pp. 9126–9136. [Google Scholar]
  27. Cheng, Z.; Xu, Y.; Bai, F.; Niu, Y.; Pu, S.; Zhou, S. Aon: Towards arbitrarily-oriented text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Toulouse, France, 2018; pp. 5571–5579. [Google Scholar]
  28. Li, H.; Wang, P.; Shen, C.; Zhang, G. Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI conference on artificial intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8610–8617. [Google Scholar]
  29. Yang, M.; Guan, Y.; Liao, M.; He, X.; Bian, K.; Bai, S.; Yao, C.; Bai, X. Symmetry-constrained rectification network for scene text recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Toulouse, France, 2019; pp. 9147–9156. [Google Scholar]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  31. Bhunia, A.K.; Sain, A.; Kumar, A.; Ghose, S.; Chowdhury, P.N.; Song, Y.Z. Joint visual semantic reasoning: Multi-stage decoder for text recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14940–14949. [Google Scholar]
  32. Na, B.; Kim, Y.; Park, S. Multi-modal text recognition networks: Interactive enhancements between visual and semantic features. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 446–463. [Google Scholar]
  33. Da, C.; Wang, P.; Yao, C. Levenshtein OCR. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 322–338. [Google Scholar]
  34. Yang, X.; Qiao, Z.; Wei, J.; Zhou, Y.; Yuan, Y.; Ji, Z.; Yang, D.; Wang, W. Masked and Permuted Implicit Context Learning for Scene Text Recognition. arXiv 2023, arXiv:2305.16172. [Google Scholar]
  35. Zhang, B.; Xie, H.; Wang, Y.; Xu, J.; Zhang, Y. Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition. arXiv 2023, arXiv:2305.05140. [Google Scholar]
  36. Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7098–7107. [Google Scholar]
  37. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Synthetic data and artificial neural networks for natural scene text recognition. arXiv 2014, arXiv:1406.2227. [Google Scholar]
  38. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2315–2324. [Google Scholar]
  39. Mishra, A.; Alahari, K.; Jawahar, C. Scene text recognition using higher order language priors. In Proceedings of the BMVC-British Machine Vision Conference, Surrey, UK, 3–7 September 2012; BMVA: Surrey, UK, 2012. [Google Scholar]
  40. Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Toulouse, France, 2011; pp. 1457–1464. [Google Scholar]
  41. Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; De Las Heras, L.P. ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; IEEE: Toulouse, France, 2013; pp. 1484–1493. [Google Scholar]
  42. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; IEEE: Toulouse, France, 2015; pp. 1156–1160. [Google Scholar]
  43. Phan, T.Q.; Shivakumara, P.; Tian, S.; Tan, C.L. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision, Washington, DC, USA, 2–8 December 2013; pp. 569–576. [Google Scholar]
  44. Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl. 2014, 41, 8027–8048. [Google Scholar] [CrossRef]
  45. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Sydney, Australia, 2021; pp. 10347–10357. [Google Scholar]
  46. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 702–703. [Google Scholar]
  47. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701. [Google Scholar]
  48. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
Figure 1. Two-step architecture of the visual language model.
Figure 2. The model architecture of the proposed DST ("*" represents the learnable class embedding).
Figure 3. Transformer encoder with L layers.
Figure 4. Details of the masked language model and SVIM.
Table 1. Comparison with state-of-the-art methods.

| Methods | Year | Datasets | IC13 | SVT | IIIT5K | IC15 | SVTP | CUTE80 | Avg |
|---|---|---|---|---|---|---|---|---|---|
| TRBA | 2019 | MJ+ST | 93.6 | 87.5 | 87.9 | 77.6 | 79.2 | 74 | 84.6 |
| TextScanner | 2019 | MJ+ST | 92.9 | 90.1 | 93.9 | 79.4 | 84.3 | 83.3 | 88.5 |
| DAN | 2020 | MJ+ST | 93.9 | 89.2 | 94.3 | 74.5 | 80 | 84.4 | 87.2 |
| SRN | 2020 | MJ+ST | 95.5 | 91.5 | 94.8 | 82.7 | 85.1 | 87.8 | 90.4 |
| RobustScanner | 2020 | MJ+ST | 94.8 | 88.1 | 95.3 | 77.1 | 79.5 | 90.3 | 88.4 |
| SAM | 2021 | MJ+ST | 95.3 | 90.6 | 93.9 | 77.3 | 82.2 | 87.8 | 88.3 |
| ViTSTR | 2021 | MJ+ST | 93.2 | 87.7 | 88.4 | 78.5 | 81.8 | 81.3 | 85.6 |
| VisionLAN | 2021 | MJ+ST | 95.7 | 91.7 | 95.8 | 83.7 | 86 | 88.5 | 90.23 |
| PIMNet | 2021 | MJ+ST | 95.2 | 91.2 | 95.2 | 83.5 | 84.3 | 84.4 | 90.5 |
| MGP-STR | 2022 | MJ+ST | 95.7 | 93 | 95.6 | 83.6 | 89 | 88.5 | 91.3 |
| PARSeq | 2022 | MJ+ST | 95.7 | 92.4 | 96 | 83.1 | 88.7 | 90.6 | 91.4 |
| baseline | - | MJ+ST | 95.799 | 91.499 | 91.333 | 82.827 | 85.116 | 81.597 | 88.811 |
| DST | - | MJ+ST | 97.316 | 93.045 | 92.8 | 84.539 | 87.752 | 83.681 | 90.68 |
Table 2. Comparison of ablation experiments.

| Method | IC13 | SVT | IIIT5K | IC15 | SVTP | CUTE80 | Avg |
|---|---|---|---|---|---|---|---|
| baseline | 95.799 | 91.499 | 91.333 | 82.827 | 85.116 | 81.597 | 88.811 |
| +MLM | 96.266 | 92.382 | 92.26 | 83.821 | 86.822 | 81.944 | 89.804 |
| +MLM+SVIM | 97.316 | 93.045 | 92.8 | 84.539 | 87.752 | 83.681 | 90.68 |
Table 3. Comparison of recognition accuracy, model parameters, inference speed, and FLOPs.

| Methods | Year | IC13 | SVT | IIIT5K | IC15 | SVTP | CUTE80 | Avg | Params (M) | Speed (ms) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SRN | 2020 | 95.5 | 91.5 | 94.8 | 82.7 | 85.1 | 87.8 | 90.4 | 54.7 | 25.4 | 11.36 |
| VisionLAN | 2021 | 95.7 | 91.7 | 95.8 | 83.7 | 86 | 88.5 | 90.23 | 32.8 | 28 | - |
| ABINet | 2021 | 95.2 | 93.4 | 97 | 83.4 | 89.6 | 89.2 | 91.9 | 36.7 | 51.3 | 10.94 |
| MGP-STR | 2022 | 95.7 | 93 | 95.6 | 83.6 | 89 | 88.5 | 91.3 | 52.6 | 9.37 | 25.4 |
| PARSeq | 2022 | 95.7 | 92.4 | 96 | 83.1 | 88.7 | 90.6 | 91.4 | 23.8 | 11.7 | 3.25 |
| ours | - | 97.316 | 93.045 | 92.8 | 84.539 | 87.752 | 83.681 | 90.68 | 8.57 | 15.9 | 5.919 |
