ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

Attention-based encoder–decoder scene text recognition (STR) architectures have proven effective in recognizing text in the real world, thanks to their ability to learn an internal language model. Nevertheless, the cross-attention operation that is used to align visual and linguistic features during decoding is computationally expensive, especially in low-resource environments. To address this bottleneck, we propose a cross-attention-free STR framework that still learns a language model. The framework we propose is ViTSTR-Transducer, which draws inspiration from ViTSTR, a vision transformer (ViT)-based method designed for STR, and the recurrent neural network transducer (RNN-T), initially introduced for speech recognition. The experimental results show that our ViTSTR-Transducer models outperform the baseline attention-based models in terms of the required decoding floating point operations (FLOPs) and latency while achieving a comparable level of recognition accuracy. Compared with the baseline context-free ViTSTR models, our proposed models achieve superior recognition accuracy. Furthermore, compared with the recent state-of-the-art (SOTA) methods, our proposed models deliver competitive results.


Introduction
Scene text recognition (STR) is an optical character recognition task that recognizes text in the real world [1]. STR remains an unsolved problem because of many factors, such as complex backgrounds, noise interference, and irregular deformations [1]. With recent advances in deep neural networks, many sophisticated deep-learning-based STR methods [2–14] have emerged. Among them, transformer decoder-based and attention-based methods [4,6–14] have demonstrated their efficacy in the field of STR. The transformer decoder-based methods comprise two main components: a visual encoder and a linguistic decoder. The visual encoder is responsible for extracting visual features and their spatial relationships. The linguistic decoder, functioning as a language model, focuses on capturing character-level dependencies. The interactions between the visual and linguistic features are modeled via cross-attention mechanisms that allow the decoder to access pertinent segments of the encoder's representations to generate one-at-a-time predictions. Nevertheless, as the size of visual features and the number of predicted characters increase, the model complexity and memory requirements in the cross-attention layers also grow proportionally. Let $H$ and $W$ be the height and width of an input image, and let $H'$ and $W'$ be the height and width of the resulting feature maps, where $H' = H/S_h$ and $W' = W/S_w$ for stride factors $S_h$ and $S_w$, respectively. For $T_d$ target characters, the cross-attention complexity is $O(H'W'T_d D)$ for model dimension $D$. This computational cost diminishes the attractiveness of the transformer decoder-based methods for a long text-line recognition task or in a low-resource setting.
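To make the scaling concrete, the following minimal sketch evaluates the dominant cross-attention term for a given input size. The function name and arguments are illustrative only; the strides of 4 and 8 are an assumption chosen so that a 32 × 128 input yields the 8 × 16 feature maps used later in the experiments.

```python
def cross_attention_cost(img_h, img_w, stride_h, stride_w, num_chars, model_dim):
    """Return H' * W' * T_d * D, the dominant term of the cross-attention cost."""
    feat_h = img_h // stride_h  # H' = H / S_h
    feat_w = img_w // stride_w  # W' = W / S_w
    return feat_h * feat_w * num_chars * model_dim


# 32 x 128 input, assumed strides (4, 8), 25 characters, D = 384.
base = cross_attention_cost(32, 128, 4, 8, 25, 384)
# Doubling the image width doubles the number of visual features.
wide = cross_attention_cost(32, 256, 4, 8, 25, 384)
print(base, wide / base)
```

Under these assumptions, doubling the image width doubles the cost term, which is the proportional growth described above.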
RNN-T [15,16], a technique developed for speech recognition, is a language-aware method, but it does not employ any cross-attention mechanisms. Thus, compared with other language-aware methods, RNN-T exhibits faster inference. However, during training, RNN-T produces an output sequence that does not align with a target sequence, and the true alignment between an input sequence and a target sequence is not known in advance. Thus, RNN-T needs to identify all possible alignments during training, and this leads to a memory-intensive training process [17]. In addition, it was designed to process only one-dimensional (1D) visual feature sequences.
If a single alignment is known or imposed in advance, it is possible to remove the memory-intensive training bottleneck of handling all possible alignments. In this regard, ViTSTR [18] is a context-free encoder-only method that produces an output sequence aligning with a target sequence and, thus, does not require handling all possible alignments. By default, the encoder of ViTSTR is a two-dimensional (2D) vision transformer (ViT) feature extractor. However, ViTSTR cannot learn a language model that is beneficial in challenging scenarios like character occlusion. Thus, in comparison to the language-aware methods, ViTSTR exhibits lower recognition accuracy.
Inspired by ViTSTR and RNN-T, we propose a cross-attention-free STR framework called ViTSTR-Transducer, which learns a language model without any high-complexity cross-attention mechanisms. Like ViTSTR, our proposed framework uses a pre-trained 2D ViT backbone to generate a patch-level feature sequence that aligns with a target sequence. Our proposed framework surpasses ViTSTR through the incorporation of a decoder with low complexity (i.e., no cross-attention mechanism) similar to RNN-T, but it does not require handling all possible alignments during training.
We validated the proposed ViTSTR-Transducer framework using a standard STR pipeline: training on a collection of public synthetic datasets and subsequently finetuning on a collection of public labeled real datasets. The performance of our proposed models was evaluated on public benchmark datasets.
Our contributions can be summarized as follows:
1. We propose a cross-attention-free STR framework called ViTSTR-Transducer. Thus, our proposed framework has a constant decoding time with respect to the size of visual feature maps.
2. The analysis of inference time and accuracy indicates that our ViTSTR-Transducer models offer considerably lower latency while maintaining competitive recognition accuracy, compared with the baseline attention-based models. Compared with the baseline context-free ViTSTR models, our ViTSTR-Transducer models achieve superior recognition accuracy.
3. Compared with the state-of-the-art (SOTA) attention-based methods, our ViTSTR-Transducer models achieve competitive recognition accuracy.
4. The ablation results on the encoder's backbone show that a ViT-based backbone, via its self-attention layers, allows the rearrangement of feature order to align with that of a target sequence.

Related Work
Depending on the manner in which characters are generated and refined, STR methods can be classified as context-free, context-aware, and enhanced context-aware methods. While many recent context-aware and enhanced context-aware methods [4,6–14,19] have achieved SOTA recognition performance on benchmark datasets, they also entail high latency because of one-at-a-time decoding. These methods rely on high-complexity cross-attention mechanisms to align visual and linguistic features during decoding. In contrast, context-free methods [2,3,5,18,20] offer low latency because of parallel decoding but yield sub-optimal recognition performance.
Section 2.1 provides an overview of the context-free methods, followed by the context-aware methods in Section 2.2 and the enhanced context-aware methods in Section 2.3. Lastly, we discuss the recent advances in SOTA vision transformer-based STR methods.

Context-Free Methods
Context-free methods are characterized by their ability to generate subsequent outputs without relying on information from previous outputs. Introduced by Graves et al. [21], connectionist temporal classification (CTC) is a context-free algorithm that estimates a total probability between an input sequence and an output sequence by marginalizing over all possible alignments [22,23]. Many CTC-based methods [2,3,5,20] follow a similar framework that comprises an optional rectification module, a visual feature extractor, a sequence modeler, and a decoder (CTC). Vision transformer for STR (ViTSTR) by Atienza [18] can be viewed as a CTC-based method that uses a pre-trained ViT model and imposes a single alignment. A major drawback of the CTC-based methods and ViTSTR is their strong assumption of conditional independence. Consequently, the CTC-based methods and ViTSTR do not learn an internal language model [23] during training that is beneficial in challenging scenarios like character occlusion. Nevertheless, it is possible to integrate an external language model to provide guidance during CTC training [24].

Context-Aware Methods
Context-aware methods are characterized by their ability to generate subsequent outputs based on information from previous outputs. Attention-based methods are context-aware or language-aware. The attention-based methods [4,6,9] predict one character at a time by selecting relevant parts of features using the decoder's hidden state. This conditional decoding enables an attention-based decoder to learn a language model implicitly. However, while an attention-based method achieves higher accuracy, it also comes with increased latency [25]. In the case of perspective or curved texts, 2D attention-based methods [7,8] were proposed. However, attending to the entire visual feature maps during decoding increases the computation burden significantly.
Recurrent neural network transducer (RNN-T) by Graves [15,16], on the other hand, is an attention-free method but can acquire a language model. However, a major caveat of RNN-T is the intensive computation of a total probability for an input and output sequence pair by marginalizing over all possible alignments, as the true alignment is not known in advance [17]. Nevertheless, Ngo et al. [26] experimented with RNN-T-based offline handwritten text recognition for Japanese and Chinese scripts.

Enhanced Context-Aware Methods
The aforementioned context-aware methods utilize an autoregressive language model that strictly considers the context in a left-to-right manner. On the other hand, the enhanced context-aware methods improve the language modeling aspect of the context-aware methods by incorporating external language information, bidirectional context, or iterative refinement. Like the context-aware methods, the enhanced context-aware methods [10–14,19] still require cross-attention computations, the complexity of which is proportional to the size of visual feature maps and the number of decoded characters.

Vision Transformer-Based Methods
Inspired by the success of transformer networks [27] in natural language processing (NLP), Dosovitskiy et al. [28] introduced vision transformer (ViT) networks as a computation-friendly alternative to convolutional architectures. However, a major drawback of ViT is its reliance on large-scale training data due to the absence of inductive biases or priors. As a result, more data-efficient ViT architectures [29,30] have emerged. To capture multi-scale features, a hierarchical vision transformer network called Swin was proposed by Liu et al. [31], followed by an optimal searched architecture called S3 ViT network by Chen et al. [32]. The ViT architecture has rapidly gained popularity and has been widely embraced within STR methods. The ViT-based STR methods [18,33,34] have shown promising results.

Proposed Method
Building upon the inspiration from ViTSTR [18] and RNN-T [15,16], we introduce a cross-attention-free ViTSTR-Transducer framework for scene text recognition. This framework learns a language model without the need for cross-attention modules. Thus, ViTSTR-Transducer is a low-complexity and low-latency STR framework that does not sacrifice recognition accuracy.
We begin by describing the building blocks of the proposed framework by detailing ViTSTR in Section 3.1.1, followed by RNN-T in Section 3.1.2. Lastly, the details of the proposed ViTSTR-Transducer framework are provided in Section 3.1.3.

ViTSTR
ViTSTR [18] is a context-free encoder-only architecture that is based on a pre-trained ViT backbone, as shown in Figure 1a. ViTSTR takes an input image, $I$, and generates a flattened patch-level feature sequence, $F$, that is forced to align with a target sequence during training:
$$F = \text{ViT-Encoder}(I), \tag{1}$$
where ViT-Encoder is a pre-trained ViT encoder, followed by a flattening operator. The flattening operator converts a 2D array of feature vectors into a 1D structure. Let $H'$, $W'$, and $D$ be the height, width, and channel of the encoder's 2D feature maps, respectively. Then, $T = H'W'$. The predicted character distribution sequence, $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_T)$, $\hat{y}_i \in \mathbb{R}^C$, is expressed as
$$\hat{Y} = \text{SoftmaxClassifier}(F), \tag{2}$$
where SoftmaxClassifier is a simple feed-forward network, followed by softmax normalization along the class dimension, and $C$ is the number of class labels. A cross-entropy loss is calculated between the predicted character distribution, $\hat{Y}$, and the ground-truth text, $Y$. For example, in Figure 1a, $f_1$ and $f_6$ are trained to be the features of the letters k and t, respectively. During inference, ViTSTR employs a no-label token (i.e., a dash in Figure 1a) for indicating spaces or the end of decoding.
ViTSTR uniquely assumes that $F$ aligns with $Y$. We discovered that the message-passing self-attention layers present in a ViT model allow for the rearrangement of feature order, making this assumption feasible. For example, the feature $f_1$ in Figure 1a belongs to the patch on the top left of the input image and is assigned to predict the first character regardless of whether the patch contains the first character or not. Even if the first character appears at the bottom right corner of the input image, the feature $f_1$ is learned to represent the bottom right region through self-attention layers. This is because the feature $f_1$ is determined by highly relevant features that contribute to the overall minimization of the cross-entropy loss.
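The single imposed alignment can be illustrated with a small sketch of how a per-position training target and loss might be built. This is only an illustration in plain Python: the 36-character set, the `BLANK` index for the no-label token, and all function names are assumptions, not the implementation of ViTSTR.

```python
import math

CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"
BLANK = len(CHARSET)  # hypothetical index for the no-label token
NUM_CLASSES = len(CHARSET) + 1


def build_target(text, seq_len):
    """Assign the i-th character to position i and fill the rest with BLANK."""
    ids = [CHARSET.index(ch) for ch in text.lower()]
    if len(ids) > seq_len:
        raise ValueError("text is longer than the feature sequence")
    return ids + [BLANK] * (seq_len - len(ids))


def softmax(logits):
    """Softmax along the class dimension, shifted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logit_seq, target):
    """Mean negative log-likelihood of the target over all positions."""
    losses = [-math.log(softmax(logits)[t]) for logits, t in zip(logit_seq, target)]
    return sum(losses) / len(losses)


target = build_target("kit", 6)
# Untrained stand-in for the classifier output: uniform scores at every position.
uniform_logits = [[0.0] * NUM_CLASSES for _ in target]
loss = cross_entropy(uniform_logits, target)
```

Because each position has exactly one target label, the loss is an ordinary per-position cross-entropy and no marginalization over alignments is needed.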

RNN-T
RNN-T [15,16] is a special form of the context-aware methods, but it does not utilize any cross-attention mechanisms, as shown in Figure 1b. Alongside an encoder or a feature extractor, RNN-T employs a predictor and a joiner. The predictor network, functioning as an autoregressive (AR) left-to-right language model, solely relies on a context sequence. The joiner network is a simple feed-forward network that integrates visual and linguistic features from both the predictor and the encoder to generate character predictions. For an input image, $I$, and a context sequence, $Z$, as shown in Figure 1b, the outputs, $F = (f_1, \ldots, f_T)$, $f_i \in \mathbb{R}^D$, and $G = (g_1, \ldots, g_U)$, $g_i \in \mathbb{R}^D$, of the encoder and the predictor are expressed as
$$F = \text{Encoder}(I), \tag{3}$$
$$G = \text{Predictor}(Z), \tag{4}$$
where $T$ and $U$ are the lengths of a feature sequence and a context sequence, respectively. Encoder is usually a CNN-based 1D feature extractor. Predictor is usually an RNN language model. Since $F$ and $G$ do not align with a target sequence and the true alignment is not known, RNN-T employs the joiner network to aggregate $F$ and $G$ by identifying all possible alignments. Let $C$ be the number of class labels. The output, $H = (h_{1,1}, \ldots, h_{T,1}, \ldots, h_{T,U})$, $h_{i,j} \in \mathbb{R}^C$, of the joiner is given by
$$H = \text{Joiner}(F, G), \tag{5}$$
where Joiner is a simple feed-forward network that outputs a three-dimensional tensor whose element is represented as
$$h_{t,u} = \text{Joiner}(f_t, g_u). \tag{6}$$
For a given batch size, $B$, the space complexity of $H$ is $O(BTUC)$. By assuming that an alignment $a = (a_1, a_2, \ldots, a_K)$ is given and $K$ is the length of $a$, the probability $p(a)$ is given by
$$p(a) = \prod_{k=1}^{K} p_k(a_k), \tag{7}$$
where $p_k(a_k)$ is a label probability of $a$ at the $k$-th position. $p_k(a_k)$ is obtained from $H$ [17].
During training, instead of computing $p(a)$ over a single alignment, RNN-T computes a total probability, $p(Y)$, over all possible alignments by using dynamic programming, since there can be a large number of possible alignments. RNN-T is trained by minimizing the negative logarithmic total probability, $-\log p(Y)$.
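The dynamic-programming step can be sketched with the standard transducer forward recursion over a lattice of frames and labels. The sketch below is plain Python and assumes that the per-node log-probabilities of the blank symbol and of the next target label have already been read out of the joiner output; the function names and the symbol `L` for the number of target labels are illustrative.

```python
import math


def logsumexp(a, b):
    """Numerically stable log(exp(a) + exp(b)); handles -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    peak = max(a, b)
    return peak + math.log(math.exp(a - peak) + math.exp(b - peak))


def rnnt_log_prob(log_blank, log_label):
    """Total log p(Y) over all alignments via the forward recursion.

    log_blank[t][u]: log-probability of emitting blank at lattice node (t, u).
    log_label[t][u]: log-probability of emitting the (u+1)-th label at (t, u).
    Shapes: log_blank is T x (L + 1) and log_label is T x L for L labels.
    """
    num_frames = len(log_blank)
    num_labels = len(log_blank[0]) - 1
    alpha = [[-math.inf] * (num_labels + 1) for _ in range(num_frames)]
    alpha[0][0] = 0.0
    for t in range(num_frames):
        for u in range(num_labels + 1):
            if t > 0:  # reach (t, u) by emitting blank at (t - 1, u)
                alpha[t][u] = logsumexp(alpha[t][u], alpha[t - 1][u] + log_blank[t - 1][u])
            if u > 0:  # reach (t, u) by emitting a label at (t, u - 1)
                alpha[t][u] = logsumexp(alpha[t][u], alpha[t][u - 1] + log_label[t][u - 1])
    # Every alignment ends with a final blank at the last node.
    return alpha[num_frames - 1][num_labels] + log_blank[num_frames - 1][num_labels]


# Toy lattice: T = 3 frames, L = 2 labels, every transition has probability 0.5.
T, L = 3, 2
log_half = math.log(0.5)
toy_blank = [[log_half] * (L + 1) for _ in range(T)]
toy_label = [[log_half] * L for _ in range(T)]
total = math.exp(rnnt_log_prob(toy_blank, toy_label))
num_alignments = math.comb(T - 1 + L, L)  # distinct paths through the lattice
```

In this toy case every alignment has the same probability, so the total is simply the number of alignments times the probability of one path. The number of paths grows combinatorially with the sequence lengths, which is why the recursion, and the full lattice it needs in memory, is used instead of enumeration.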

ViTSTR-Transducer
Inspired by ViTSTR and RNN-T, ViTSTR-Transducer is a low-complexity autoregressive ViT-based STR framework without any cross-attention mechanisms. The overall ViTSTR-Transducer framework is illustrated in Figure 1c. ViTSTR-Transducer uses a ViTSTR-like encoder to generate a flattened patch-level feature sequence. The pre-trained ViT model takes an input image, $I$, and outputs 2D feature maps that are flattened to form a 1D feature sequence, $F = (f_1, \ldots, f_T)$. $F$ is obtained by Equation (1). As we argued in Section 3.1.1, a flattened patch-level feature sequence of ViTSTR aligns with a target sequence. For instance, in Figure 1c, $f_1$ and $f_6$ are visual features corresponding to the letters k and t, respectively.
To integrate linguistic information, we employ an autoregressive left-to-right language model (ARLM) that utilizes preceding characters to predict the subsequent character. The ARLM takes a context sequence, $Z$. The ARLM can be transformer decoders or recurrent neural networks. The ARLM returns a contextual vector sequence, $G = (g_1, \ldots, g_U)$, $g_i \in \mathbb{R}^D$:
$$G = \text{ARLM}(\text{Embedding}(Z)), \tag{8}$$
where Embedding is a character embedding layer and $D$ is the model dimension. $U$ is the length of a context sequence. In actual implementation, $F$ is truncated so that $G$ and $F$ share the same length (i.e., $U = T$).
To aggregate the visual and linguistic information, $G$ and $F$ are fused by weighted element-wise addition, followed by a softmax classifier that predicts a class distribution sequence, $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_U)$, $\hat{y}_i \in \mathbb{R}^C$, where $C$ is the number of class labels. $\hat{Y}$ is expressed using the Hadamard product, •; a linear projection layer, Linear; a sigmoid activation function, $\sigma$; and a weight vector, $\alpha \in \mathbb{R}^D$. SoftmaxClassifier is a simple feed-forward network, followed by softmax normalization along the class dimension. When computing $\hat{Y}$, $F$ and $G$ align with each other. This means that $f_1$ and $g_1$ in Figure 1c are jointly responsible for predicting the letter k. This strict alignment assumption between $F$ and $G$ is not made when computing $H$ in RNN-T, and this is why RNN-T requires handling all possible alignments to estimate $p(Y)$.
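One common way to realize such a weighted element-wise addition is a sigmoid gate. The sketch below assumes that the gate is computed from the concatenation of the two vectors and that the fused vector is a convex combination of them; this particular form, the parameter shapes, and all names are assumptions for illustration, since only the ingredients (Hadamard product, linear projection, sigmoid, and a weight in $\mathbb{R}^D$) are stated above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(f, g, weight, bias):
    """Fuse one visual vector f and one linguistic vector g (both length D).

    ASSUMPTION: alpha = sigmoid(Linear([f; g])) and the fused vector is
    alpha * f + (1 - alpha) * g, applied element-wise.
    weight has D rows of length 2D and bias has length D (hypothetical).
    """
    concat = f + g  # list concatenation, i.e., [f; g]
    alpha = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(weight, bias)
    ]
    return [a * fi + (1.0 - a) * gi for a, fi, gi in zip(alpha, f, g)]


# With zero weights and bias the gate is 0.5 everywhere, so the fusion averages.
dim = 2
zero_weight = [[0.0] * (2 * dim) for _ in range(dim)]
zero_bias = [0.0] * dim
fused = gated_fusion([1.0, 3.0], [3.0, 5.0], zero_weight, zero_bias)
```

Whatever the exact gate, the fusion touches only the aligned pair $f_i$ and $g_i$, so its cost per position depends on $D$ and not on the size of the visual feature maps.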
Since $F$ and $G$ align with a training target sequence, $Y$, $\hat{Y}$ also aligns with $Y$. Thus, the cross-entropy loss between a predicted class distribution sequence, $\hat{Y}$, and a training target sequence, $Y$, is given by
$$\text{Loss} = \text{CrossEntropy}(\hat{Y}, Y), \tag{10}$$
where CrossEntropy is a cross-entropy loss operator.
During inference, the context sequence, $Z$, is initialized with a start-of-sentence token, SOS, to compute $G$. Together with $F$, $G$ is used to estimate $\hat{Y}$, and a predicted character is sampled from $\hat{y}_U$, the last vector of $\hat{Y}$. The predicted character is appended to $Z$, and the process is repeated until an end-of-sentence token, EOS, is reached.
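The inference procedure can be sketched as a greedy decoding loop. In this sketch, `arlm` and `classify` are hypothetical callables standing in for the language model and for the fusion plus softmax classifier; greedy selection (taking the highest-scoring class) is an assumption, and the toy components in the usage lines exist only to make the example runnable.

```python
def greedy_decode(features, arlm, classify, sos_id, eos_id, max_len=25):
    """Decode one character per step without attending over all features.

    features: visual vectors f_1..f_T, computed once by the encoder.
    arlm(context): returns one contextual vector per context token.
    classify(f, g): returns class scores for an aligned (f, g) pair.
    """
    context = [sos_id]
    decoded = []
    for step in range(min(max_len, len(features))):
        g_seq = arlm(context)
        # Only the last position is needed: f_U is paired with g_U.
        scores = classify(features[step], g_seq[-1])
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        decoded.append(next_id)
        context.append(next_id)
    return decoded


# Toy components: 4 classes (0 = SOS, 3 = EOS); the "language model" returns
# zeros, so the prediction is driven by the visual feature alone.
toy_features = [[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
toy_arlm = lambda context: [[0.0] * 4 for _ in context]
toy_classify = lambda f, g: [a + b for a, b in zip(f, g)]
result = greedy_decode(toy_features, toy_arlm, toy_classify, sos_id=0, eos_id=3)
```

At every step the loop reads a single visual vector, which is what makes the per-step decoding cost independent of the feature-map size.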

Datasets
In this section, we describe the sources of our training, finetuning, and evaluation data. While the training data are synthetic, the finetuning and evaluation data are real labeled data.

Public Synthetic Datasets
Because it may be impractical or impossible to obtain a large amount of real data for training a text recognition system, synthetic labeled data were utilized for this purpose.
To expand the range of data available for training, three main synthetic datasets, namely MJSynth (MJ) [35], SynthText (ST) [36], and SynthAdd (SA) [8], were combined into a single training dataset.The total number of training samples used was 13.9 million images.

Experiments
In this section, we present the experiment setup and implementation details.

Experiment Setup
To validate our ViTSTR-Transducer framework, we set up three cases: (1) the baseline ViTSTR models [18] (see Figure 1a), (2) the baseline transformer decoder-based models [7,34], and (3) our proposed ViTSTR-Transducer models (see Figure 1c). The baseline ViTSTR models follow the architecture by Atienza [18]. The baseline transformer decoder-based models use a standard transformer decoder [27] with cross-attention layers, while our ViTSTR-Transducer models use a modified transformer decoder without cross-attention layers. All models use a DeiT-Small backbone [30] as a 2D feature extractor since it is a good trade-off between model size and performance. Instead of RNN-T, the baseline transformer decoder-based models were employed to evaluate the impact of cross-attention mechanisms, compared with our cross-attention-free ViTSTR-Transducer models.
The specifications of the DeiT-Small backbone used are provided in Table 1, while the specifications of the baseline transformer-based decoder are presented in Table 2. The DeiT-Small backbone takes an input image of 32 × 128 pixels. The resulting feature maps are 8 × 16 and, thus, $F$ is a sequence of 128 feature vectors (excluding the class token vector), which is long enough to predict English words. Each feature vector, $f_i$, is in $\mathbb{R}^D$, where $D = 384$. The prediction covers only case-insensitive alphanumeric characters. In addition, three special characters (SOS, EOS, and PADDING) are included. Hence, $C = 39$.
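The numbers in this setup can be checked with a few lines of arithmetic. The variable names are illustrative, and the 4 × 8 patch size is only what the stated input and feature-map sizes imply, not a figure quoted from the backbone specification.

```python
import string

img_h, img_w = 32, 128
feat_h, feat_w = 8, 16

# Patch size implied by the input and feature-map sizes (an inference).
patch_h, patch_w = img_h // feat_h, img_w // feat_w
seq_len = feat_h * feat_w  # length T of the flattened feature sequence

# Case-insensitive alphanumeric characters plus three special tokens.
charset = string.digits + string.ascii_lowercase
special_tokens = ["SOS", "EOS", "PADDING"]
num_classes = len(charset) + len(special_tokens)

print(patch_h, patch_w, seq_len, num_classes)
```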

Implementation Details
The models underwent a pre-training phase using synthetic datasets, which lasted for 15 full epochs. Within each epoch, the entire training data were used and a set of data augmentation techniques [10] was applied on each batch of 192 images. The training process utilized a cyclic learning rate schedule, with values ranging between $10^{-4}$ and $10^{-5}$, and gradient clipping with a threshold of 50 was applied.
After the pre-training phase, the models underwent a finetuning process using real labeled datasets. This process continued for 50 full epochs, and the same set of data augmentation techniques was applied. Similar to the pre-training phase, a cyclic learning rate schedule was used, with smaller values ranging between $10^{-5}$ and $10^{-6}$, and gradient clipping with a threshold of 50 was applied during the finetuning process.
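For readers unfamiliar with these two training controls, a minimal sketch follows. The triangular shape of the cycle, the cycle length, and the use of global-norm clipping are all assumptions made for illustration; the text above specifies only the learning rate bounds and the clipping threshold.

```python
import math


def triangular_lr(step, cycle_len, lr_min, lr_max):
    """Cyclic learning rate that rises linearly to lr_max and falls back."""
    position = (step % cycle_len) / cycle_len  # position in the cycle, [0, 1)
    triangle = 1.0 - abs(2.0 * position - 1.0)  # 0 -> 1 -> 0 over one cycle
    return lr_min + (lr_max - lr_min) * triangle


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


# Pre-training bounds from the text; the cycle length of 100 steps is hypothetical.
lr_start = triangular_lr(0, 100, 1e-5, 1e-4)
lr_peak = triangular_lr(50, 100, 1e-5, 1e-4)
clipped = clip_by_global_norm([60.0, 80.0], 50.0)
```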
Rather than training the models with a combination of synthetic and real datasets all at once, dividing the training phase from the finetuning phase enables us to distinguish between weaknesses in the data and weaknesses in the models.This approach is particularly useful if the models struggle to generalize on the public benchmark datasets, as it allows us to identify the root cause of the issue more effectively.This was highlighted in a study by Liu et al. [51].

Results
In Section 4.1, we evaluate the recognition accuracy and efficiency of our ViTSTR-Transducer models against the baseline models. In Section 4.2, we present the ablation analyses of the complexities of the encoder and decoder. Lastly, in Section 4.3, we compare the recognition accuracy of our ViTSTR-Transducer models with that of the SOTA attention-based methods.

Recognition Accuracy and Efficiency Comparison with the Baseline Methods
Table 3 presents the cumulative required FLOPs for an input image that is assumed to have the maximum number of 25 characters [18]. All models shared the same encoder's FLOPs as they shared the same DeiT-Small backbone. The table suggests that our proposed DeiT-S-ViTSTR-T model requires only 37% of the cumulative FLOPs needed by the baseline DeiT-S-Tr.Dec. model to generate all characters one at a time for a given input image.
In terms of maximum latency (i.e., at 25 characters), Figure 3 suggests that our proposed model requires approximately 70% of the latency needed by the baseline DeiT-S-Tr.Dec. model. As shown in the same figure, the DeiT-S-ViTSTR model exhibited a constant latency regardless of the number of decoded characters. This is due to its parallel decoding nature, although it results in lower recognition accuracy.
So far, we have discussed the efficiency in terms of FLOPs and latency of our proposed model in comparison to the baseline models. Now, we focus on the recognition accuracy comparison. Table 4 compares the recognition accuracy for both synthetic (S) and real (R) training data cases. When trained on the synthetic training data, as shown in Table 4a, our proposed model outperformed the baseline DeiT-S-ViTSTR model by achieving a total word recognition accuracy of 89.5% vs. 87.9%. However, the proposed model slightly underperformed compared with the baseline DeiT-S-Tr.Dec. model, which achieved a total accuracy of 91.1%. When trained on the real training data, as shown in Table 4b, our proposed model outperformed the baseline DeiT-S-ViTSTR model by achieving a total word recognition accuracy of 95.3% vs. 94.4%. Additionally, its performance was competitive with the baseline DeiT-S-Tr.Dec. model, which obtained a total accuracy of 95.9%.
In summary, our proposed method significantly reduces decoding latency while effectively maintaining a reasonably robust performance level, especially when trained on real training data.

Ablation Analyses of the Encoder and Decoder Complexities
The above experiments were conducted using the default DeiT-Small [30] as a 2D backbone and a transformer-based decoder with three transformer decoder units, as provided in Table 2. To further evaluate the robustness of the proposed method, which is based on the assumption that the order of the visual features, F, aligns with the order of the linguistic features, G, we performed an ablation study on the encoder complexity by considering DeiT-Tiny [30] and DeiT-Medium [30] as backbones. Additionally, we assessed the decoder complexity by experimenting with the transformer-based decoders with one and five transformer decoder units (denoted as DEC1 and DEC5, respectively).
As shown in Table 5, our proposed DeiT-*-ViTSTR-T (*: T, S, or M) models significantly outperformed the baseline DeiT-*-ViTSTR models regardless of the training data source and backbone. Compared with the baseline DeiT-*-Tr.Dec. models, our proposed models achieved comparable recognition accuracy. This highlights that the assumption that the order of the visual features, F, aligns with the order of the linguistic features, G, holds regardless of the backbone complexity. The table also indicates that increasing the backbone complexity results in a slight to moderate improvement in recognition accuracy. However, this improvement comes at the cost of a significant increase in model size, as demonstrated in Table 6. For instance, the DeiT-M-ViTSTR-T model requires twice as many parameters as our DeiT-S-ViTSTR-T model, yet the total accuracy improvements were only approximately 1.2% and 0.6% for the synthetic and real training data, respectively.
In addition, Table 5 also shows that as the encoder complexity increases (i.e., from DeiT-T to DeiT-S to DeiT-M), the performance gaps between our proposed DeiT-*-ViTSTR-T and the baseline DeiT-*-Tr.Dec. models become smaller. This highlights the diminishing role of the cross-attention layers in the decoder in aligning the visual and linguistic features. Instead, this alignment is learned and achieved by the self-attention layers on the encoder side.
As demonstrated in Table 5, the impact of decoder complexity on the recognition accuracy was marginal, regardless of the training data source.However, increasing the decoder complexity directly resulted in significant increases in model size and inference time, as indicated in Table 6.Unlike the encoder, which processes only once for a given input image, the decoder iteratively processes until the decoding process is complete.Consequently, any latency increase per decoding step has a magnifying impact on the overall decoding latency, as illustrated in the case of the DeiT-S-DEC5-ViTSTR-T model, which requires up to 164.1 ms.In summary, enhancing the encoder complexity can improve recognition accuracy but also results in a notable increase in model size.The impact of decoder complexity on recognition accuracy is marginal.However, increasing the decoder complexity can both enlarge the model size and significantly extend latency.

Recognition Accuracy Comparison with the SOTA Methods
In this section, we compare with the SOTA attention-based methods that employ cross-attention mechanisms to align visual and linguistic features during decoding. It should be highlighted that establishing fair comparisons with the SOTA methods is challenging due to the varying experimental conditions of each SOTA method. These conditions include factors such as backbone architecture, data augmentation, pre-training strategy, training epochs, stopping criteria, and more.
Irrespective of the training data source, our proposed models achieved competitive recognition accuracy with the recent SOTA attention-based methods, as illustrated in Table 7. For instance, when trained on the synthetic data in Table 7a, our cross-attention-free DeiT-S-ViTSTR-T model obtained a total accuracy of 89.5% vs. 90.3% and 90.5% achieved by the SATRN [7] and SRN [14] models, respectively. In addition, utilizing DeiT-Medium [30] as a backbone, the DeiT-M-ViTSTR-T model performed on par with the SOTA methods even without any cross-attention mechanisms. Similarly, when trained on the real training data in Table 7b, our proposed DeiT-S-ViTSTR-T model obtained a total accuracy of 95.3% vs. 95.1% and 95.7% achieved by the MAERec [52] and TRBA [10] models, respectively. Once again, the DeiT-M-ViTSTR-T model performed competitively with the SOTA methods. Despite its smaller size, the DeiT-T-ViTSTR-T model performed remarkably well, as shown in Table 7. Notably, when trained on the real training data, it was competitive with the DiG-ViT [34] models. This suggests that the DeiT-T-ViTSTR-T model strikes a good balance between model size and recognition accuracy, making it a suitable choice for resource-constrained environments.
In summary, despite the absence of cross-attention mechanisms during decoding, the accuracy of our proposed models, particularly with real labeled data, competes effectively with that of the SOTA attention-based methods that depend on cross-attention mechanisms during decoding.Furthermore, the above experimental results further validate our proposed method's primary assumption stating that the visual and linguistic features, denoted as F and G, align with the target character sequence.As a result, there is no need for cross-attention layers to align these two features.

Limitations and Future Work
We identify the following limitations of the proposed ViTSTR-Transducer framework.

• Because the decoder operates as a cross-attention-free autoregressive language model, it cannot effectively synthesize bidirectional linguistic information. This bidirectional context is valuable for handling occlusion cases and iterative refinement [10,11].
• In this study, ViT-based backbones have been shown to produce the visual features, F, that align with the linguistic features, G. However, further experiments are required to confirm whether this assumption is valid for a pure convolutional backbone or a hybrid convolutional transformer backbone.
Hence, our future work will address the first limitation by replacing the autoregressive language model with a masked language model that incorporates contextual information from both directions.Additionally, we will conduct experiments using a hybrid convolutional transformer architecture, instead of a vision transformer, to further validate our proposed ViTSTR-Transducer framework.

Conclusions
We present a cross-attention-free ViTSTR-Transducer framework that draws inspiration from a vision transformer for scene text recognition (ViTSTR) and the recurrent neural network transducer (RNN-T). Our proposed framework does not need any of the cross-attention mechanisms that are present in mainstream attention-based encoder-decoder architectures during decoding. Nevertheless, our proposed framework can still acquire a language model. Despite having lower computational demands and lower latency, our proposed ViTSTR-Transducer models achieve performance competitive with the recent SOTA attention-based STR methods. Our proposed models achieve a significant latency improvement over the baseline attention-based models while maintaining a comparable level of recognition accuracy.

Figure 1. (a) ViTSTR architecture. ViTSTR uses a pre-trained ViT model to generate a flattened patch-level feature sequence that aligns with a target sequence. (b) RNN-T architecture. RNN-T
We begin by evaluating the efficiency of the proposed ViTSTR-Transducer model (DeiT-S-ViTSTR-T) in comparison to the baseline methods (DeiT-S-ViTSTR and DeiT-S-Tr.Dec.). While the DeiT-S-ViTSTR model is encoder-only and context-free, our DeiT-S-ViTSTR-T and the baseline DeiT-S-Tr.Dec. models utilize an autoregressive left-to-right decoder. However, the decoder of the proposed model does not incorporate any cross-attention layers, the complexity of which is $O(H'W'T_d D)$.

Figure 2 compares the required FLOPs for each decoding step between our proposed DeiT-S-ViTSTR-T and the baseline DeiT-S-Tr.Dec. models. The figure illustrates the linear scaling of the required FLOPs for subsequent decoded characters. This linearly increasing pattern emerges because the autoregressive decoders of both our DeiT-S-ViTSTR-T and the baseline DeiT-S-Tr.Dec. models depend on previously decoded characters to predict the next one. At each decoding step, our DeiT-S-ViTSTR-T model required less than 44% of the FLOPs needed by the baseline DeiT-S-Tr.Dec. model. This improvement is attributed to the removal of cross-attention layers in the decoder of the proposed model. It should be noted that the baseline DeiT-S-ViTSTR model is not included in this figure because it is an encoder-only and context-free model that decodes all characters in parallel.

Table 2. Specifications of the transformer-based decoder.

Table 5. Ablation results of the encoder and decoder complexities. Bold: highest. ViTSTR-T: