TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

Scene text recognition (STR) is an important bridge between images and text and attracts abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress in this task, most existing works need an extra context modeling module to help the CNN capture global dependencies, compensating for its inductive bias and strengthening the relationships between text features. Recently, the transformer has been proposed as a promising network for global context modeling via the self-attention mechanism, but one of its main shortcomings when applied to recognition is efficiency. We propose a 1-D split to address this complexity challenge, and we replace the CNN with a transformer encoder to remove the need for a context modeling module. Furthermore, recent methods use a frozen initial embedding to guide the decoder when decoding features to text, which costs accuracy. We instead propose a learnable initial embedding, learned from the transformer encoder, that adapts to different input images. Putting these together, we introduce a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG), composed of three stages (transformation, feature extraction, and prediction). Extensive experiments show that our approach achieves state-of-the-art performance on text recognition benchmarks.


Introduction
STR, aiming to read the text in natural scenes, is an important and active research field in computer vision [26,53]. Text reading can obtain semantic information from images, playing a significant role in a variety of vision tasks, such as image retrieval, key information extraction, and document visual question answering.
Among the feature extraction modules of existing text recognizers, convolutional architectures remain dominant. For example, ASTER [39] uses ResNet [12], and SRN [47] uses an FPN [24] to aggregate hierarchical feature maps from ResNet50. Text carries linguistic information, and almost every character has a relationship with the others, so features with global contextual information allow more accurate character decoding. Unfortunately, the convolutional neural network (CNN) has an inductive bias toward locality by the design of its kernels and lacks the ability to model long-range dependencies; hence text recognizers rely on context modeling structures to gain better performance. It is common practice to use a bi-directional LSTM (BiLSTM) [14] to enhance context modeling, but such context modeling modules introduce additional complexity and operations. This raises a question: why not replace the CNN with another network that can model long-range dependencies within the feature extractor itself, without an additional context modeling module?
With the introduction of the transformer [42], the question has an answer. Recently, it has been proposed to regard an image as a sequence of patches and aggregate features with global context through self-attention mechanisms [9]. We therefore propose to use a pure transformer encoder as the feature extractor instead of a CNN. Thanks to the transformer's dynamic attention, global context, and better generalization, the encoder can provide global and robust features without an extra context modeling module. In this way, we simplify the four-stage STR framework (transformation, feature extraction, context modeling, and prediction) proposed by Baek et al. [2] to three stages by removing the need for a context modeling module. Our extensive experiments prove the effectiveness of the three-stage architecture: the additional context modeling module degrades performance rather than providing any gain, and the feature extractor indeed models long-range dependencies when the transformer encoder is used.
Despite the strong ability of the transformer, its high demand for memory and computation resources can make training and inference difficult. For example, the authors of the Vision Transformer (ViT) [9] used extensive computing resources to train their model (about 230 TPUv3-core-days for ViT-L/16), which are hard for most researchers to access. The main reason for the transformer's high complexity is the self-attention mechanism inside it, whose cost is quadratic in the sequence length; therefore, reducing the sequence length effectively reduces the complexity. With efficiency in mind, we do not simply use the square patch sizes adopted by ViT-like backbones in image classification, segmentation, and object detection [40,3,52,41]. Instead, we propose the 1-D split, which splits the picture into rectangle patches whose height equals that of the input image, as shown in Figure 1c. In this way, the image is converted into a sequence of patches (the 1-D split) that is shorter than with the 2-D split (where the patch height is smaller than the input image). This patch-size design requires fewer multiply-accumulate operations (MACs), which leads to faster training and inference with fewer resources.
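As a concrete sanity check, the sequence lengths produced by the two splits for a 32 × 100 rectified image can be computed directly (a sketch using patch sizes discussed later in the paper; since self-attention cost grows quadratically with sequence length, the 1-D split's shorter sequence is substantially cheaper):

```python
# Sequence lengths for the 1-D vs 2-D splits of a 32x100 (H x W) rectified
# image. Patch sizes follow the text: the 1-D split uses patch height equal
# to the image height (32 x 4), the 2-D split here uses square 4x4 patches.
def num_patches(H, W, h, w):
    """Number of non-overlapping h x w patches in an H x W image."""
    assert H % h == 0 and W % w == 0
    return (H // h) * (W // w)

seq_1d = num_patches(32, 100, 32, 4)   # 1-D split: patch height == image height
seq_2d = num_patches(32, 100, 4, 4)    # 2-D split
print(seq_1d, seq_2d)  # 25 200
```

With self-attention scaling as the square of the sequence length, the 1-D split's sequence of 25 is roughly 64 times cheaper in attention cost than the 2-D split's sequence of 200.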
Figure 1. The text image is split into rectangle patches whose height is the same as the height of the rectified image. Using the same patch dimension, the 1-D split is more efficient than the 2-D split.

The prediction stage is another important part of the text recognizer, which decodes the features into text. An attention-based sequence decoder is commonly used in previous works and has a hidden state embedding to guide the decoder. Recent methods [39,22,27] use a frozen zero embedding to initialize the hidden state, which remains the same for different input images and limits the accuracy of the decoder. To make the hidden state of the decoder adaptive to different inputs, we propose a learnable initial embedding, learned from the transformer encoder, that dynamically gathers information from images. The adaptive initial embedding can guide the decoding process toward better accuracy.
To sum up, this paper presents three main contributions:
1. We propose a novel three-stage architecture for text recognition, TRIG, namely TRansformer-based text recognizer with Initial embedding Guidance. TRIG leverages the transformer encoder to extract global context features without the additional context modeling module used in CNN-based text recognizers. Extensive experiments on several public scene text benchmarks demonstrate that the proposed framework achieves state-of-the-art (SOTA) performance.
2. A 1-D split is designed to divide the text image into a sequence of rectangle patches with efficiency in mind.
3. We propose a learnable initial embedding that dynamically learns information from the whole image, adapting to different input images and precisely guiding the decoding process.

Related Work
Most traditional scene text recognition methods [8,33,45,20,31] adopt a bottom-up approach, which first detects individual characters with a sliding window and then classifies them using hand-crafted features. With the development of deep learning, top-down methods were proposed. These approaches can be roughly divided into two categories by whether they apply a transformer, namely transformer-free methods and transformer-based methods.

Transformer-Free Methods
Before the proposal of the transformer, STR methods used only CNNs and recurrent neural networks (RNNs) to read text. CRNN [38] extracts feature sequences using a CNN and then encodes the sequence with an RNN; finally, Connectionist Temporal Classification (CTC) [10] decodes the sequence into the text result. By design, this method struggles with curved or rotated text. To deal with this, ASTER applies a spatial transformer network (STN) [17] together with a 1-D attention decoder. Without spatial transformation, [43,22] propose methods to handle irregular text with a 2-D CTC decoder or a 2-D attention decoder. Furthermore, segmentation-based methods [23] can also be used to read text, but they must be supervised with character-level annotations. SEED [34] uses semantic information, supervised by a pre-trained language model, to guide the attention decoder.

Transformer-Based Methods
The transformer, first applied to machine translation and natural language processing, is a type of neural network mainly based on the self-attention mechanism. Inspired by its NLP success, ViT applies a pure transformer to image classification and attains comparable results. Then, Data-efficient Image Transformers (DeiT) [40] achieve competitive accuracy with no external data. Unlike ViT and DeiT, the detection transformer (DETR) [5] uses both the encoder and decoder parts of the transformer. DETR is a new framework for end-to-end detectors, attaining accuracy and inference speed comparable with Faster R-CNN [35]. We summarize four ways to use a transformer in STR. (a) Master [27] uses the decoder of the transformer to predict the output sequence, which gives better training efficiency: in the training stage, a transformer decoder can predict all time steps simultaneously by constructing a triangular mask matrix. (b) A transformer can be used to translate from one language to another, so SATRN [21] and NRTR [37] adopt the encoder-decoder of the transformer to address the cross-modality between the image input and text output, where the image input is represented by features extracted with a shallow CNN. In addition, SATRN proposes two changes to the transformer encoder: an adaptive 2-D position encoding and convolutions added to the feed-forward layer. (c) SRN [47] not only adopts the transformer encoder to model context but also uses it to reason about semantics. (d) The transformer encoder works as a feature extractor that includes context modeling. Our work adopts this last approach, which differs from the other recent methods.

Methodology
This section describes our three-stage text recognition model, TRIG, in full detail. As shown in Figure 2, our approach TRIG consists of three stages: Transformation (TRA), Transformer Feature Extractor (TFE), and Attention Decoder (AD). TRA rectifies the input text with a thin-plate spline (TPS) [4]. TFE provides robust visual features. AD decodes the feature map into characters. First, we describe the TRA stage. Second, we show the details of the TFE stage. Then, the AD stage is presented. After that, we introduce the loss function. Finally, we analyze the efficiency of our method with different patch sizes.

Figure 2. The overview of the three-stage text recognizer, TRIG. The first stage is transformation (TRA), which is used to rectify text images. The second stage is a Transformer Feature Extractor (TFE), aiming to extract effective and robust features and implicitly model context. The third stage is an attention decoder (AD), which is used to decode the sequence of features into characters.

Transformation
Transformation is a stage that rectifies the input text images with a rectification module. This module uses a shallow CNN to predict several control points, and then a TPS transformation is applied to the diverse aspect ratios of text lines. In this way, perspective and curved text can be rectified. Note that the image is resized to 32 × 100 here.

Transformer Feature Extractor
The TFE is illustrated in Figure 3. In this stage, the transformer encoder is used to extract effective and robust features. First, the rectified image is split into patches. Unlike the square patches used in [9,40,3,52,41], the rectified image is split into rectangle patches of size h × w, where h equals the height of the rectified image. The rectified image X ∈ R^{H×W×C}, where H, W, and C are the height, width, and number of channels, is thus mapped to a sequence X_s ∈ R^{(HW/(hw))×(3hw)}. A trainable linear projection W_E ∈ R^{(3hw)×D} (embedding matrix) is then used to obtain the patch embeddings E ∈ R^{(HW/(hw))×D}, where D is the dimension of the patch embeddings. The projection is given by:

E = X_s W_E.

Figure 3. Initial embedding E_init is a trainable vector, appended to the sequence of patch embeddings, which goes through the transformer encoder blocks and is then used to guide the attention decoder.

Similar to the role of the class token in ViT, we introduce a trainable vector E_init called the initial embedding. To encode the position information of each patch embedding, we use standard learnable position embeddings, parameterized by a learnable positional embedding table (position i uses the i-th entry of the table). The position embeddings E_pos have the same dimension D as the patch embeddings. Finally, the input feature embeddings F_0 ∈ R^{(HW/(hw)+1)×D} are the sum of the position embeddings and the patch embeddings (with E_init prepended):

F_0 = [E_init; E] + E_pos.

Transformer encoder blocks are applied to the obtained input feature embeddings F_0. Each transformer encoder block consists of multi-head self-attention (MSA) and a multi-layer perceptron (MLP). Following the architecture of ViT, layer normalization (LN) is applied before MSA and MLP. The MLP contains two linear transformation layers with a GELU non-linearity.
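A minimal numpy sketch of the 1-D split and patch embedding described above (the flattening order, random initializations, and the embedding dimension here are illustrative assumptions, not the paper's actual code):

```python
import numpy as np

# Sketch of the 1-D split and linear patch embedding. Shapes follow the text:
# X is H x W x C, patches are h x w with h == H, and W_E maps each flattened
# patch to a D-dimensional embedding.
H, W, C, h, w, D = 32, 100, 3, 32, 4, 512
rng = np.random.default_rng(0)

X = rng.normal(size=(H, W, C))                 # rectified image
N = (H // h) * (W // w)                        # number of patches (25)

# Split into N rectangle patches and flatten each to C*h*w values.
X_s = X.reshape(H, W // w, w, C).transpose(1, 0, 2, 3).reshape(N, C * h * w)

W_E = rng.normal(size=(C * h * w, D))          # trainable projection (random init)
E = X_s @ W_E                                  # patch embeddings, shape (N, D)

E_init = rng.normal(size=(1, D))               # learnable initial embedding
E_pos = rng.normal(size=(N + 1, D))            # learnable position embeddings
F0 = np.concatenate([E_init, E], axis=0) + E_pos
print(F0.shape)  # (26, 512)
```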
The input and output dimensions are the same, and the dimension of the inner layer is four times the output dimension. The transformer encoder block can be represented by the following equations:

F'_l = MSA(LN(F_{l−1})) + F_{l−1},
F_l = MLP(LN(F'_l)) + F'_l,

where l denotes the index of the block and L is the index of the last block. The dimensions of F_l and the MLP inner layer are D and 4D, respectively.
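The pre-LN block above can be sketched in numpy as follows (single-head for brevity; parameter shapes, initializations, and the tanh-approximated GELU are illustrative assumptions):

```python
import numpy as np

# Minimal sketch of one pre-LN encoder block (MSA + MLP with GELU),
# single-head for brevity; parameter names are illustrative.
def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def block(F, Wq, Wk, Wv, Wo, W1, W2):
    Z = layer_norm(F)                          # LN before MSA
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    F = F + (A @ V) @ Wo                       # MSA with residual
    Z = layer_norm(F)                          # LN before MLP
    return F + gelu(Z @ W1) @ W2               # MLP (inner dim 4D) with residual

D, T = 64, 26
rng = np.random.default_rng(1)
p = lambda *s: rng.normal(size=s) * 0.02
F1 = block(rng.normal(size=(T, D)),
           p(D, D), p(D, D), p(D, D), p(D, D), p(D, 4 * D), p(4 * D, D))
print(F1.shape)  # (26, 64)
```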
To obtain better performance, we add a residual-add module, which uses skip edges to connect the MSA modules in adjacent blocks, following Realformer [13]. The multi-head process can be unified as:

ResidualMSA(Q, K, V) = softmax(QK^⊤ / √d_k + S_prev) V,

where the query matrix Q, key matrix K, and value matrix V are linearly projected from the block input, d_k is the dimension per head, and S_prev is the raw attention score matrix passed along the skip edge from the corresponding head of the previous block. After transformer feature extraction, the feature map is

F_L = [f_init, f_1, f_2, …, f_N],

where f denotes a feature embedding and N is the number of feature embeddings excluding f_init.
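A single-head sketch of the residual (skip) attention, under the assumption that the raw pre-softmax scores are what flows along the skip edge, as in Realformer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Residual (skip) attention in the style of Realformer: the raw attention
# scores of the previous block are added to the current block's scores
# before the softmax. Single-head sketch with illustrative shapes.
def residual_attention(Q, K, V, prev_scores=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if prev_scores is not None:
        scores = scores + prev_scores          # skip edge from previous MSA
    return softmax(scores) @ V, scores         # pass scores to the next block

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(26, 64)) for _ in range(3))
out1, s1 = residual_attention(Q, K, V)
out2, s2 = residual_attention(Q, K, V, prev_scores=s1)
print(out2.shape)  # (26, 64)
```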

Attention Decoder
The architecture is illustrated in Figure 3. We use an attention decoder to decode the sequence of the feature map: f_init is used to initialize the RNN decoder and [f_1, f_2, …, f_N] is used as the input of the decoder. First, we obtain an attention map from the feature map and the internal state of the RNN:

e_{t,i} = w^⊤ tanh(W_d s_{t−1} + V_d f_i + b),
α_{t,i} = exp(e_{t,i}) / Σ_{j=1}^{N} exp(e_{t,j}),

where b, w, W_d, and V_d are trainable parameters, and s_{t−1} is the hidden state of the recurrent cell within the decoder at time t. Specifically, s_0 is equal to f_init.
Then, we use the attention map to weight the feature map [f_1, f_2, …, f_N] element-wise and sum the weighted features to obtain a vector g_t called a glimpse:

g_t = Σ_{i=1}^{N} α_{t,i} f_i.

Next, the RNN is used to produce an output vector and a new state vector. The recurrent cell of the decoder is fed with:

(x_t, s_t) = RNN(s_{t−1}, [g_t, f(y_{t−1})]),

where [g_t, f(y_{t−1})] denotes the concatenation of g_t and the one-hot embedding of y_{t−1}.
Here, we use a GRU [7] as our recurrent unit. Finally, the probability of a given character can be expressed as:

p(y_t) = softmax(W_o x_t + b_o),

where W_o and b_o are trainable parameters of the output layer.
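One decoding step, combining the glimpse and a GRU cell, can be sketched in numpy as follows (the GRU gate layout, all initializations, and dimensions here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# One step of the attention decoder: compute the attention map and glimpse,
# then feed [g_t, one-hot(y_{t-1})] into a GRU cell whose state starts at
# f_init. Shapes and weights are illustrative.
rng = np.random.default_rng(3)
N, D, V = 25, 64, 97                           # features, hidden dim, classes

F = rng.normal(size=(N, D))                    # [f_1 .. f_N]
f_init = rng.normal(size=D)                    # initializes the hidden state

Wd, Vd = rng.normal(size=(D, D)), rng.normal(size=(D, D))
w, b = rng.normal(size=D), rng.normal(size=D)

def glimpse(s_prev):
    e = np.tanh(s_prev @ Wd + F @ Vd + b) @ w  # scores e_{t,i}, shape (N,)
    a = np.exp(e - e.max()); a /= a.sum()      # attention map alpha_t
    return a @ F                               # g_t = sum_i alpha_{t,i} f_i

def gru_cell(x, s_prev, Wz, Wr, Wh):
    zr = np.concatenate([x, s_prev])
    z = 1 / (1 + np.exp(-(zr @ Wz)))           # update gate
    r = 1 / (1 + np.exp(-(zr @ Wr)))           # reset gate
    h = np.tanh(np.concatenate([x, r * s_prev]) @ Wh)
    return (1 - z) * s_prev + z * h

Wz = rng.normal(size=(D + D + V, D)) * 0.1
Wr = rng.normal(size=(D + D + V, D)) * 0.1
Wh = rng.normal(size=(D + D + V, D)) * 0.1

y_prev = np.zeros(V)                           # one-hot of previous character
g = glimpse(f_init)                            # s_0 = f_init
s = gru_cell(np.concatenate([g, y_prev]), f_init, Wz, Wr, Wh)
print(s.shape)  # (64,)
```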

Training Loss
Here, we use the standard cross-entropy loss, which can be defined as:

L = − Σ_{t=1}^{T} log p(y_t | I),

where y_1, y_2, …, y_T is the ground-truth text represented as a character sequence and I is the input image.
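A small numerical sketch of this sequence loss (the probabilities are made up for illustration):

```python
import numpy as np

# Sequence cross-entropy as described above: the sum of negative
# log-probabilities of the ground-truth characters over T time steps.
def seq_ce(probs, target):
    """probs: (T, num_classes) softmax outputs; target: list of class ids."""
    return -sum(np.log(probs[t, c]) for t, c in enumerate(target))

# Toy distribution: each row puts 0.6 on the ground-truth class, 0.1 elsewhere.
probs = np.full((3, 5), 0.1)
probs[[0, 1, 2], [2, 0, 4]] = 0.6
loss = seq_ce(probs, [2, 0, 4])
print(round(float(loss), 4))  # 1.5325, i.e. -3 * log(0.6)
```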

Efficiency Analysis
For the TFE, we assume that the hidden dimension is D and the input sequence length is T. The complexity of self-attention is O(T²·D), so the shorter sequence produced by the 1-D split directly reduces the cost of the encoder. For the decoder, the complexity gap between the 1-D split and the 2-D split comes from the process of obtaining the attention map and the glimpse; with sequence length T, both are O(T) per step. Therefore, the shorter sequence has lower complexity. The MACs of the AD using the 1-D split are 0.925G, versus 5.521G using the 2-D split.
Based on the above analysis, we propose to use a 1-D split to increase efficiency.

Experiments
In this section, we demonstrate the effectiveness of our proposed method. First, we give a brief introduction to the datasets and the implementation details. Then, our method is compared with state-of-the-art methods on several public benchmark datasets. Next, we make some discussions on our method. Finally, we perform ablation studies to analyze the performance of different settings.

Dataset
In this paper, models are trained only on two public synthetic datasets, MJSynth (MJ) [16] and SynthText (ST) [11], without any additional synthetic dataset, real dataset, or data augmentation. Seven scene text benchmarks are chosen to evaluate our models.
MJ contains 9 million word box images, which is generated from a lexicon of 90K English words.
ST is a synthetic text dataset generated by the engine proposed in [11]. We obtain 7 million text lines from this dataset for training.
IIIT5K (IIIT) [30] contains scene texts and born-digital images collected from the web. It consists of 3000 images for evaluation.
Street View Text (SVT) [44] is collected from the Google Street View. It contains 647 images for evaluation, some of which are severely corrupted by noise and blur or have low resolution.
ICDAR 2003 (IC03) [28] has two different versions of the evaluation set, with 860 or 867 images. We use the 867-image version.
ICDAR 2013 (IC13) [19] contains 1015 images for evaluation.
ICDAR 2015 (IC15) [18] contains 2077 images captured by Google Glass, some of which are noisy, blurry, rotated, or of low resolution. Researchers have used two different versions for evaluation, with 1811 and 2077 images; we use both.
SVT-Perspective (SVTP) [33] contains 645 cropped images for evaluation. Many of the images contain perspective projections due to the prevalence of non-frontal viewpoints.
CUTE80 (CUTE) [36] contains 288 cropped images for evaluation. Many of these are curved text images.

Implementation Details
The proposed TRIG is implemented in the PyTorch framework and trained on two RTX 2080 Ti GPUs with 11 GB of memory each. We do not perform any type of pre-training. The decoder recognizes 97 character classes: digits, upper-case and lower-case letters, 32 ASCII punctuation marks, the end-of-sequence (eos) symbol, the padding symbol, and the unknown symbol. We adopt the AdaDelta optimizer with a decayed learning rate. Our model is trained on ST and MJ for 24 epochs with a batch size of 640; the learning rate is set to 1.0 initially and decayed to 0.1 and 0.01 at the 19th and 23rd epochs, respectively. Each batch is sampled 50% from ST and 50% from MJ. All images are resized to 64 × 256 during both training and testing. We use the IIIT dataset to select our best model. At inference, we resize images to the same size as in training, and we decode with beam search, maintaining the five candidates with the top accumulated scores at every step.
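The step-decay schedule described above can be sketched as follows (a reimplementation for illustration, not the training code itself):

```python
# Learning-rate schedule from the text: start at 1.0 and drop by a factor
# of 10 at the 19th and 23rd of 24 epochs.
def lr_at_epoch(epoch, base_lr=1.0, milestones=(19, 23), gamma=0.1):
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

print(lr_at_epoch(1), lr_at_epoch(19), round(lr_at_epoch(23), 2))  # 1.0 0.1 0.01
```

This is equivalent to PyTorch's `MultiStepLR` with `milestones=[19, 23]` and `gamma=0.1` when stepping per epoch.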

Comparison with State-of-the-Art
We compare our method with previous state-of-the-art methods on several benchmarks. The results are shown in Table 1. Even compared to reported results that use additional real or private data and data augmentation, we achieve satisfying performance. Compared with other methods trained only on MJ and ST, our method achieves the best results on four datasets: SVT, IC13, IC15, and SVTP. TRIG provides an accuracy gain of +1.5pp.

Discussion on Training Procedures
Our method TRIG needs long training epochs. We train TRIG and ASTER with the same learning rate and optimizer for 24 epochs; the whole training procedure takes a week. The learning rate, initially 1.0, drops by a factor of 10 after 19 and 23 epochs. As shown in Figure 4, the accuracy of TRIG on IIIT is lower than ASTER's in the earlier epochs and better than ASTER's after 14 epochs. In the end, the accuracy of TRIG on IIIT and its average accuracy on all datasets are 0.8pp and 2pp higher than ASTER's, respectively. We conclude that the transformer feature extractor needs longer training to reach its best accuracy, because the transformer lacks the built-in invariances of CNNs (i.e., shift, scale, and distortion invariance).

Discussion on Long-Range Dependencies
The decoder of the transformer uses triangular masks to guarantee that the prediction at time step t can only access the outputs of its previous steps. Taking inspiration from this, we use a mask to let the transformer encoder access only nearby feature embeddings, so that the encoder has a small receptive field. We empirically analyze the effect of long-range dependencies by comparing the two kinds of methods, with and without masks, as demonstrated in Table 2. The detail of the mask is shown in Figure 5. We observe a decrease in accuracy when using the mask with either the CTC decoder or the attention decoder. This indicates that long-range dependencies are tied to effectiveness and that the feature extractor of our method can capture them.

Table 2. Performance comparison with mask and context modeling module. There are two kinds of decoders, CTC and attention. All models are trained on the two synthetic datasets. The batch size is 896 and the number of training steps is 300,000. The number of blocks in TFE is 6 and skip attention is not used.
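The locality mask can be sketched as an additive band matrix applied to the attention scores before the softmax (the window size here is illustrative; the paper's exact mask shape may differ):

```python
import numpy as np

# Band mask limiting each position to embeddings within `radius` on either
# side, shrinking the encoder's receptive field as in the discussion above.
def band_mask(T, radius):
    i = np.arange(T)
    allowed = np.abs(i[:, None] - i[None, :]) <= radius
    return np.where(allowed, 0.0, -np.inf)     # additive mask for scores

M = band_mask(6, 1)
print(M[0])  # row 0: positions 0 and 1 allowed, the rest masked with -inf
```

Adding `M` to the score matrix before the softmax zeroes the attention weight of every position outside the band.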

Discussion on Extra Context Modeling Module
As effective context modeling modules are used in STR networks, we consider whether an extra context modeling module is needed when the transformer encoder extracts the features. We add a context modeling module (BiLSTM) after the transformer feature extractor. As shown in Table 2, the accuracy on almost all datasets is lower than for the method without BiLSTM (except +0.2pp on SVTP). We therefore conclude that it is not necessary to add a BiLSTM after the transformer feature extractor to model context, because the transformer feature extractor has an implicit ability to model context; an extra context modeling module may break the context modeling that the model has implicitly learned.

Discussion on Patch Size
Table 3 shows the average accuracy on seven scene text benchmarks with different patch sizes when the models are trained with the same settings. The number of blocks in TFE is 6 and the batch size is 192. The number of training steps for all models is 300,000 and skip attention is not used. The transformation stage is used for the 1-D split because this split loses the information in the height dimension; the remaining methods do not use a transformation stage, because their feature maps are 2-D and a 2-D attention decoder can take a glimpse of the character features. Among the 1-D splits, the patch size of 32 × 4 is better than the other patch sizes, so we adopt it as the patch size of TRIG. Furthermore, our method with the 1-D split (patch size 32 × 4) achieves better average accuracy than the 2-D splits (patch sizes 16 × 4, 8 × 4, and 4 × 4). Therefore, the design of the 1-D split of rectangle patches with the transformation stage works better than the other patch sizes with the 2-D attention decoder.
Discussion on Initial Embedding
To illustrate how the learnable initial embedding helps TRIG improve performance, we collect individual cases from the benchmark datasets to compare the predictions of TRIG with and without the learnable initial embedding. As shown in Figure 6, the prediction without initial embedding guidance may lose or mispredict the first character, while TRIG with the initial embedding predicts the right character. Furthermore, the initial embedding also benefits the decoding of all characters, e.g., correcting 'starducks' to 'starbucks'.

Discussion on Attention Visualization
Figure 7 shows the attention map from each embedding to the input images. We use Attention Rollout [1], following ViT: we average the attention weights of TRIG across all heads and recursively compute the attention matrix, which yields the pixel attributions for the embeddings in (b). The first row shows which part of the rectified picture is responsible for the initial embedding; the remaining rows show the attention maps of the feature embeddings [f_1, f_2, …, f_N]. The initial embedding is mainly relevant to the first character. For each feature embedding, some of the attended features come from adjacent embeddings and others come from distant embeddings. From Figure 7, we can roughly see how the transformer feature extractor extracts features and models long-range dependencies.

Discussion on Efficiency
To verify the efficiency of our model, we compare the MACs, parameters, GPU memory, and inference speed of our model using the 1-D split (rectangle patches of size 32 × 4), our model using the 2-D split (square patches of size 4 × 4), and ASTER. The results are shown in Table 4. The MACs of TRIG with the 1-D split and its training GPU memory cost (with a batch size of 32) are lower than those of the 2-D split.
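The Attention Rollout computation used for the visualization in Figure 7 can be sketched as follows (head count, depth, and the residual-identity correction follow the common formulation of Attention Rollout; treat the details as assumptions rather than the paper's exact code):

```python
import numpy as np

# Attention Rollout: average attention over heads, add the identity to
# account for residual connections, renormalize rows, and multiply the
# per-block matrices together. Shapes here are illustrative.
def attention_rollout(attentions):
    """attentions: list of (heads, T, T) attention matrices, one per block."""
    T = attentions[0].shape[-1]
    rollout = np.eye(T)
    for A in attentions:
        A = A.mean(axis=0)                      # average over heads
        A = A + np.eye(T)                       # residual paths
        A = A / A.sum(axis=-1, keepdims=True)   # renormalize rows
        rollout = A @ rollout                   # recursively compose
    return rollout

rng = np.random.default_rng(4)
atts = [np.abs(rng.normal(size=(8, 26, 26))) for _ in range(3)]
atts = [a / a.sum(-1, keepdims=True) for a in atts]  # row-stochastic weights
R = attention_rollout(atts)
print(R.shape)  # (26, 26)
```

Row i of `R` gives the attribution of output embedding i over the input patches; row 0 corresponds to the initial embedding.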

Ablation Study
In this section, we perform a series of experiments to evaluate the impact of the number of blocks, the number of heads, the embedding dimension, the initial embedding guidance, and skip attention on recognition performance. All models are trained from scratch on the two synthetic datasets. The results are reported on seven standard benchmarks and shown in Table 5. We make the following observations: (1) TRIG with 12 blocks is better than the model with six blocks, so performance can be improved by stacking more transformer encoder blocks. However, when the encoder goes deeper still, stacking blocks no longer increases performance (the average accuracy of 24 blocks decreases by 0.2pp); the accuracy may have reached a bottleneck, and stacking more blocks is expected to lead to a more challenging training procedure.
(2) Initial embedding brings gains in average accuracy regardless of whether skip attention is applied, showing the effectiveness of the initial embedding guidance. (3) Skip attention is important to accuracy: regardless of the depth of the feature extractor, adding skip attention improves performance. (4) With all other conditions the same, the average accuracy with 16 heads is better than with 8 heads. Likewise, the average accuracy with a 512-dimensional embedding is better than with 256 dimensions.

Conclusions
In this work, we propose a three-stage transformer-based text recognizer with initial embedding guidance, named TRIG. In contrast to existing STR networks, this method uses only a transformer feature extractor to extract robust features and does not need a context modeling module. A 1-D split is designed to divide text images. Besides, we propose a learnable initial embedding, learned from the transformer encoder, to guide the attention decoder. Extensive experiments demonstrate that our method sets a new state of the art on several benchmarks.
We also demonstrate that the longer training epochs and long-range dependencies are essential to TRIG.
We consider two promising directions for future work. First, a better transformer architecture more suitable for scene text recognition can be designed to extract more robust features; for example, a transformer could adopt a pyramid structure, as CNNs do, or some other structure. Second, we see potential in using a transformer in an end-to-end text spotting system.