Explainable Connectionist-Temporal-Classification-Based Scene Text Recognition

Connectionist temporal classification (CTC) is a favored decoder in scene text recognition (STR) for its simplicity and efficiency. However, most CTC-based methods operate on one-dimensional (1D) vector sequences, usually derived from a recurrent neural network (RNN) encoder. This results in the absence of an explainable 2D spatial relationship between the predicted characters and the corresponding image regions, which is essential for model explainability. On the other hand, 2D attention-based methods enhance recognition accuracy and offer character location information via cross-attention mechanisms that link predictions to image regions. However, these methods are more computationally intensive than the 1D CTC-based methods. To achieve both low latency and model explainability via character localization with a 1D CTC decoder, we propose a marginalization-based method that processes 2D feature maps and predicts a sequence of 2D joint probability distributions over the height and class dimensions. Based on the proposed method, we introduce an association map that aids in character localization and model prediction explanation. This map parallels the role of a cross-attention map in computationally intensive attention-based architectures. With the proposed method, we consider a ViT-CTC STR architecture that uses a 1D CTC decoder and a pretrained vision Transformer (ViT) as a 2D feature extractor. Our ViT-CTC models were trained on synthetic data and fine-tuned on real labeled sets. These models outperform the recent state-of-the-art (SOTA) CTC-based methods on benchmarks in terms of recognition accuracy. Compared with the baseline Transformer-decoder-based models, our ViT-CTC models offer a speed boost of up to 12 times regardless of the backbone, with a maximum 3.1% reduction in total word recognition accuracy.
In addition, both qualitative and quantitative assessments of character locations estimated from the association map align closely with those from the cross-attention map and ground-truth character-level bounding boxes.


Introduction
Scene text recognition (STR) identifies text in natural scenes and remains a vibrant research field due to challenging imaging conditions [1,2]. Current deep learning methods for STR typically comprise a visual feature extractor, a sequence modeler, and a decoder. The choice of decoder significantly impacts model recognition performance, latency, and explainability, given the same feature extractor and sequence modeler design. State-of-the-art (SOTA) methods can be categorized by how they decode visual features into characters: connectionist temporal classification (CTC), attention-based, and Transformer decoders [3][4][5].
A 2D attention-based or Transformer decoder, using 2D feature maps, excels in recognition accuracy and character localization through a cross-attention mechanism. Unlike Transformer-based object detectors, such as DETR [6] and V-DETR [7], which directly output object bounding boxes, a Transformer-based text recognizer outputs only characters. These characters can be localized via the decoder's cross-attention map. With enough inductive biases, including locality, the Transformer decoder attends only to the locations of the objects of interest [7]. Thus, the Transformer decoder generates a cross-attention map linking predicted characters to relevant image regions. This location information yields benefits such as model explainability [8][9][10][11][12] and text rectification [13]. Figure 1(2) exemplifies the overlaid cross-attention maps (summed across predicted characters) from a Transformer decoder, illustrating the alignment between character positions and attention weights. However, the attention-based decoder has high latency due to its intricate attention mechanism [3,14]. Conversely, the CTC decoder offers superior latency but often sacrifices recognition accuracy compared with the attention-based decoder [3,4,14]. The CTC decoder demands a 1D class probability distribution sequence as input, prompting the common use of a 1D feature extractor in existing CTC-based methods [14][15][16][17][18]. However, this approach hampers the ability to establish explainable 2D spatial relationships between the predicted characters and relevant image regions. The 2D-CTC [19] method emerged to handle 2D feature maps, extending the 1D CTC algorithm to process the height dimension. However, using 2D-CTC involves a trade-off, resulting in higher inference latency and training costs, particularly with larger 2D feature maps.
For explainable character localization using a 1D CTC decoder, we introduce a ViT-CTC STR architecture that enables a 1D CTC decoder to work with a pretrained vision Transformer (ViT) acting as a 2D feature extractor. To incorporate the 2D feature extractor, we propose a novel marginalization-based technique that predicts 2D joint probability distributions over the height and class dimensions. By marginalizing over the height dimension, we obtain a 1D class probability distribution sequence suited for a 1D CTC decoder.
Our proposed method also generates an association map, which serves for character localization and model prediction explanation. This map resembles the role of a cross-attention map in attention-based architectures but with significantly lower computational demand. Qualitative comparisons between the overlaid cross-attention and association maps are depicted in Figure 1(2),(3), respectively, showcasing their alignment. Moreover, unlike 2D-CTC [19], our method maintains consistent inference latency and training cost, regardless of the 2D feature map size. To quantitatively measure the alignment between character positions from the association map and the ground-truth character locations, we propose an alignment evaluation metric (AEM).
Our contributions can be summarized as follows:

1. We introduce a novel marginalization-based method that makes a 2D feature extractor compatible with a 1D CTC decoder. This method yields an association map that links predicted characters to relevant image regions, enabling character localization and improving prediction explainability.

2. We derive an alignment evaluation metric (AEM) that measures the alignment between character positions from the association map and the ground-truth character locations. This metric can also be applied to the cross-attention map.

3. Using our method, we experimented with the ViT-CTC architecture with various pretrained ViT backbones and a 1D CTC decoder. Our ViT-CTC models outperform the recent SOTA methods on public benchmark datasets.

4. Compared with a Transformer-decoder-based model, a ViT-CTC model offers a remarkable speed boost, surpassing the former by up to 12 times, regardless of the ViT backbone used. This speed gain comes with a maximum reduction in total word recognition accuracy of 3.1%. Hence, the ViT-CTC model is particularly attractive for low-latency, resource-constrained environments.

Related Work
In this section, we provide a brief review of common decoders in mainstream scene text recognition (STR) architectures. In addition, we describe the recent advances in vision Transformer (ViT) architectures and their adoption in STR, followed by model explanation through visualizations.

Scene Text Recognition
Scene text recognition is a variant of unsegmented sequence labeling tasks in which a 2D input stream of pixels is labeled with a sequence of characters. Other similar perceptual tasks include speech and gesture recognition [20].
Graves et al. [20] introduced the CTC algorithm, which maps a recurrent neural network (RNN) output sequence of a speech signal to a character sequence. CTC incorporates a blank token (ϵ) to handle multiple input-to-output alignments. Instead of predicting the probability of a single alignment, CTC estimates a total probability by marginalizing over all possible alignments. CTC gained popularity in text recognition, leading to numerous CTC-based STR methods [14][15][16][17][18]. These methods typically employ a common pipeline encompassing optional rectification, a 1D convolutional feature extractor, a recurrent sequence modeler, and a 1D CTC decoder. While most CTC-based methods were initially designed for the Latin script, Gunna et al. [21] and Hu et al. [4] extended the CTC-based recognition pipeline to Indian and Vietnamese scripts, respectively. However, a 1D CTC-based approach (using a 1D feature extractor) is unable to establish explainable 2D spatial relationships between predicted characters and relevant image areas.
To tackle this, 2D-CTC [19], an extension of the 1D CTC algorithm along the height dimension, handles 2D feature maps. However, it incurs increased inference latency and training cost that grow with the height of the feature maps. Moreover, there is a lack of standardized, optimized 2D-CTC implementations in prevalent deep learning frameworks.
In contrast to a 1D CTC decoder, an attention-based decoder accommodates both 1D and 2D feature extractors. One-dimensional attention-based methods [4,14,22,23] substitute a CTC decoder with an attention-based one to enhance recognition performance by capturing character dependencies. Recognizing limitations in accurately predicting characters within complex and curved text, 2D attention-based methods [9,24] emerged.
As Transformer networks [10] gained prominence, the Transformer decoder became the standard attention-based decoder, leading to Transformer-decoder-based methods [25,26]. Via cross-attention mechanisms, the attention-based decoder produces a cross-attention map associating each predicted character with relevant input image regions. The cross-attention map is widely used for visual explanations of model predictions [8][9][10][11][12]. Despite its superior performance, Baek et al. [14] showed that an attention-based decoder, using the same feature extractor, yields about three times higher latency than a CTC-based decoder.

Vision Transformer
Transformers [10] have established themselves in natural language processing (NLP). Vision Transformers (ViT) [27] extend this architecture to vision tasks by dividing images into patches and projecting them as tokens, similar to words in NLP. ViT is computationally efficient to train, but it lacks the inductive biases of convolutional networks; as a result, effective ViT models require substantial training data. Data-efficient image Transformers [28][29][30] were introduced to alleviate these data demands, achieving competitive outcomes against convolutional networks. ViT was swiftly integrated into existing STR setups as a 2D feature extractor and sequence modeler. ViT-based STR methods [1,5,31] were subsequently proposed, displaying SOTA performance, particularly when trained on real labeled data.

Visual Model Explanations
To help users understand model failures and discover biases in training data, transparent models are necessary [32]. Nevertheless, deep neural networks (DNNs) behave as black boxes, making them difficult to understand. According to Junkang and Joe [32], an explanation map is a map that highlights the relevant regions that contribute to a model's decision. The explanation can be obtained by using class activation mapping (CAM)-based or attention-based methods. Gradient-weighted class activation mapping (Grad-CAM) [33] is an example of a CAM-based method. Grad-CAM computes the gradients of a given class to produce a low-resolution localization map that highlights relevant image regions. Xu et al. [8] utilized an attention mechanism and visualized the attention map to show human intuition-like alignments between a model-generated caption and relevant image regions.

Proposed Method
In our study, ViT-CTC models leverage pretrained vision Transformers and a 1D CTC decoder. This allows our models to draw on extensive visual pretraining and exploit 2D spatial feature relationships via self-attention layers, all while retaining the low latency of a 1D CTC decoder. The introduced marginalization-based method also facilitates character localization and model prediction explanation through a novel association map that is absent in the existing 1D CTC-based methods.
In this section, we present the details of our proposed marginalization-based method in 2D class probability space. We begin by providing a concise overview of the 1D CTC algorithm and its assumptions in Section 2.1.1, followed by the detailed derivations of the proposed method in Section 2.1.2. We formulate the association map that relates each model prediction to relevant image regions in Section 2.1.3. Lastly, we derive an alignment evaluation metric (AEM) that measures the alignment between the character locations estimated using the association and cross-attention maps and the ground-truth character locations in Section 2.1.4.

Connectionist Temporal Classification (CTC)
CTC assigns a total probability to an output sequence (Y) given an input sequence (X) [20,34,35]. Instead of assigning a probability to the most likely alignment, CTC estimates a total probability by summing over all possible alignments between an input and output sequence. CTC introduces a blank or no-label token (ϵ) to allow the alignments and the input to have the same length. For any alignment, repeated characters are merged and blank tokens are removed to produce a final output sequence. For example, A_1 = (ϵ, c, ϵ, a, ϵ, t) and A_2 = (c, c, ϵ, a, ϵ, t) are two of the possible and valid alignments for the same word, cat. Mathematically, the total probability assigned by CTC is given by [20,34,35]

  p(Y|X) = Σ_{A ∈ S_{X,Y}} ∏_{t=1}^{W} p_t(a_t|X),   (1)

where p(Y|X) is the total probability of an (X, Y) pair, A = (a_1, ..., a_W) is an alignment, and S_{X,Y} = (A_1, ..., A_n) is the set of possible and valid alignments between X and Y. p_t(a_t|X) is a conditional probability on X at a prediction frame, t. Thus, at each timestep t, a learning algorithm must produce a valid probability distribution (i.e., a 1D vector) over characters.
In the context of text recognition, the width dimension is treated as time while the height dimension is often collapsed by convolution and pooling layers.
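As a concrete illustration of the marginalization in Equation-style form above, the sketch below (a toy Python example with hypothetical per-frame probabilities, not the dynamic-programming implementation used in practice) enumerates every alignment, applies the collapse rule, and sums the probabilities of the alignments that yield the target sequence:

```python
import itertools
import numpy as np

def collapse(alignment, blank=0):
    """CTC collapse rule: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for a in alignment:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return out

def ctc_total_probability(P, target, blank=0):
    """Brute-force p(Y|X): sum the product of per-frame probabilities
    over every alignment that collapses to the target sequence.
    P has shape (W, C); exponential in W, so for illustration only."""
    W, C = P.shape
    total = 0.0
    for A in itertools.product(range(C), repeat=W):
        if collapse(A, blank) == list(target):
            total += float(np.prod([P[t, a] for t, a in zip(range(W), A)]))
    return total
```

For example, with three frames over a two-class alphabet {blank, 'a'} and uniform per-frame probabilities, six of the eight possible alignments collapse to 'a', giving a total probability of 0.75.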
Since S_{X,Y} can be large, a naive implementation is computationally inefficient. This is mitigated by dynamic programming, merging two alignments that share the same output at the same t. Modern deep learning libraries have built-in, optimized, efficient, low-level implementations of CTC. During inference, a greedy decoding scheme selects the most likely output at each prediction frame independently to obtain the highest-probability alignment, A*, from which blank tokens are removed and duplicate characters are merged [34]. The greedy and parallel decoding nature allows CTC to achieve the low latency that is crucial in low-resource and real-time environments. A* is given by

  A* = (argmax_c p_1(c|X), ..., argmax_c p_W(c|X)).   (2)

The CTC algorithm makes the following assumptions [34]:

1. Conditional independence. The predicted characters are conditionally independent, meaning there are no dependencies between characters.

2. Monotonicity. When handling the subsequent feature vector, the current character can persist or the subsequent character must be processed.

3. Many to one. Multiple feature vectors can correspond to a single output character. This implies that the length of the feature vector sequence must be greater than or equal to the length of the target character sequence.
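The greedy decoding and collapse steps above can be sketched as follows (a minimal illustration assuming class index 0 is the blank token):

```python
import numpy as np

def ctc_greedy_decode(P, blank=0):
    """Greedy CTC decoding: take the argmax class at each frame to get
    the most likely alignment A*, then merge repeats and drop blanks."""
    best = P.argmax(axis=1)          # A*: one class index per frame
    out, prev = [], None
    for a in best:
        if a != prev and a != blank:
            out.append(int(a))
        prev = a
    return out
```

For instance, a frame-wise argmax sequence (ϵ, c, c, ϵ, a, ϵ, t) collapses to "cat". Because each frame is decoded independently, the whole sequence is obtained in a single parallel pass, which is the source of the CTC decoder's low latency.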

The Proposed Marginalization-Based Method
The concept of the proposed method is to handle 2D feature maps with a 1D CTC decoder without adding complexity.This is achieved by applying the marginalization rule in 2D class probability space.
Concretely, as shown in Figure 2, a ViT encoder takes an input image and produces 2D feature maps, represented by F = (F_{1,1}, ..., F_{H,W}), F_{i,j} ∈ R^D, where H, W, and D are the height, width, and embedding dimensions of the feature maps. F is directly fed to a linear layer to produce unnormalized 2D score distributions, S = (S_{1,1}, ..., S_{H,W}),

  S_{i,j} = LinearLayer(F_{i,j}),   (3)

where LinearLayer is a feedforward neural network. Each S_{i,j} ∈ R^C is an unnormalized vector and C is the number of class labels. A softmax normalization is applied to S along both the H and C dimensions to produce

  U = Softmax_{H,C}(S),   (4)

where Softmax_{H,C} is a softmax operator along the H and C dimensions. A cross-section along W is a valid 2D joint probability distribution over the H and C dimensions. A 3D graphical illustration of U is provided in Figure 3.
Next, U is marginalized over the H dimension to produce a sequence of valid 1D probability distributions over the C dimension, P = (P_1, ..., P_W), P_j ∈ R^C, that is required by a CTC decoder. P_j is given by

  P_j = Σ_{i=1}^{H} U_{i,j},   (5)

where each P_j is a normalized class probability distribution vector. In the case of a 1D feature extractor (i.e., H = 1), U is exactly P. The overall text recognition workflow with the proposed method is shown in Figure 2.
Beyond the CTC algorithm's assumptions, our proposed method assumes a horizontal or curved text line, excluding vertical orientation.
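The joint softmax over the height and class dimensions, followed by height marginalization, can be sketched in a few lines of NumPy (a standalone sketch on a hypothetical score tensor; the actual model applies these steps to ViT feature maps after the linear layer):

```python
import numpy as np

def marginalize_2d_scores(S):
    """Turn unnormalized scores S of shape (H, W, C) into a 1D class
    probability sequence P of shape (W, C) for a 1D CTC decoder.

    1) Joint softmax over the H and C dimensions: each width
       cross-section becomes a valid 2D joint distribution U[:, j, :].
    2) Marginalize over H: P_j = sum_i U[i, j, :]."""
    H, W, C = S.shape
    flat = S.transpose(1, 0, 2).reshape(W, H * C)       # (W, H*C)
    flat = flat - flat.max(axis=1, keepdims=True)       # numerical stability
    expf = np.exp(flat)
    U = (expf / expf.sum(axis=1, keepdims=True)).reshape(W, H, C)
    U = U.transpose(1, 0, 2)                            # back to (H, W, C)
    P = U.sum(axis=0)                                   # marginalize height
    return U, P
```

When H = 1, the joint softmax reduces to an ordinary class softmax, so P coincides with U, matching the 1D feature extractor case described above.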

Association Map (AM)
In the existing CTC-based methods, the height dimension is physically discarded by feature averaging or pooling layers. The proposed method preserves the height dimension, making a 2D feature extractor compatible with a CTC decoder.
Thanks to the proposed method, a cross-section along the W dimension of U forms a valid 2D joint probability distribution over the H and C dimensions, as shown in Figure 3. Based on U, we can derive a novel association map (AM) that links each predicted character to relevant image regions. This spatial connection serves two purposes: (1) explaining model predictions and (2) character localization.
The association map functions in the same way as the localization map of Grad-CAM [33], but without gradients, and the attention map [8], but without an attention mechanism.
Concretely, given the most likely alignment, A* = (a_1, ..., a_W), AM = (AM_{1,1}, ..., AM_{H,W}), AM_{i,j} ∈ {0, 1}, is expressed as

  AM_{i,j} = 1 if a_j ≠ ϵ and U_{i,j,ind(a_j)} ≥ α; otherwise AM_{i,j} = 0,   (6)

where j is a prediction timestep or frame, a_j is the CTC-predicted character at j, U_{i,j,ind(a_j)} is the probability of character a_j at timestep j and height i, ind() is a character-to-index mapping, α is a threshold between zero and one, and ϵ is the blank token required by a CTC decoder. A high α associates a predicted character, a_j, with the high-probability image regions. The resulting character regions are illustrated in Figure 1(3).
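The thresholding rule above can be sketched as follows (a minimal illustration with a hypothetical character-to-index mapping and the blank symbol written as 'ϵ'):

```python
import numpy as np

def association_map(U, best_alignment, ind, alpha, blank='ϵ'):
    """Binary association map AM of shape (H, W): AM[i, j] = 1 when the
    joint probability of the predicted character a_j at height i meets
    the threshold alpha; columns of blank predictions stay zero."""
    H, W, C = U.shape
    AM = np.zeros((H, W), dtype=int)
    for j, a_j in enumerate(best_alignment):
        if a_j == blank:
            continue  # blank frames contribute no character region
        AM[:, j] = (U[:, j, ind[a_j]] >= alpha).astype(int)
    return AM
```

Raising alpha keeps only the highest-probability heights for each predicted character, which tightens the character regions as described above.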

Alignment Evaluation Metric (AEM)
Model predictions are explicable through visualization of the association and cross-attention maps (Figure 1). We also quantitatively assess the alignment between the character positions in these maps and the ground-truth character locations. Given the absence of explicit character coordinate predictions by the association and cross-attention maps, the intersection-over-union (IoU) metric is unsuitable. Instead, we introduce an alignment metric suitable for both association and cross-attention maps.
Concretely, given character regions R k on the association map and GT k as the groundtruth bounding box (depicted in Figure 4), the alignment evaluation metric (AEM) for a predicted character, k, is given by The AEM for a given text of length, L, is given by In the case of the cross-attention map, we first sum the cross-attention map over all attention heads in the case of multi-headed attention mechanism and normalize for each predicted character, k, to obtain CA = (CA 1,1 , . . ., CA H ,W ) , CA i,j ∈ R|0 ≤ CA i,j ≤ 1. Examples of the resulting overlaid and normalized cross-attention map are given in Figure 5a.In contrast to the association map, the cross-attention map is more diffuse due to the decoder's need to compute continuous attention weights across the entire feature maps.We filter out regions with low attention weights below the threshold, β.The filtered, binary crossattention map in Figure 5b, CAF = (CAF 1,1 , . . ., CAF H ,W ) , CAF i,j ∈ {0, 1}, is given by where β is between zero and one.A high β associates a predicted character, k, with the high attention weight regions.With the CAF, AEM k and AEM TEXT are computed, according to the above equations.
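A minimal sketch of this evaluation, under the assumption (made by this sketch, since only the inputs R_k and GT_k are described above) that the per-character AEM is the fraction of the estimated region R_k lying inside GT_k, with both regions represented as binary masks:

```python
import numpy as np

def aem_char(R_k, GT_k):
    """Per-character AEM, assumed here to be |R_k ∩ GT_k| / |R_k| for
    binary masks R_k (estimated region) and GT_k (ground-truth box)."""
    area = R_k.sum()
    return float((R_k & GT_k).sum() / area) if area else 0.0

def aem_text(regions, gt_boxes):
    """AEM for a text of length L: the mean of the per-character AEMs."""
    return float(np.mean([aem_char(r, g) for r, g in zip(regions, gt_boxes)]))
```

The same functions apply unchanged to the filtered binary cross-attention map CAF, since it too is a per-character binary mask.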
The fine-tuning datasets comprise the training sets of SVT, IIIT, IC03, IC13, IC15, and the real labeled datasets above. The idea of introducing the fine-tuning datasets based on real labeled data is to identify whether our ViT-CTC models have any inherent weaknesses or if there are any blind spots in the training datasets [53]. The fine-tuning datasets comprise 2.4M labeled images. A few samples from the fine-tuning datasets are shown in Figure 6b.

Synthetic Character-Level Annotation Dataset
Character-level annotations are not available for the existing datasets. Thus, to quantitatively evaluate the character locations derived from the association and cross-attention maps, we used SynthTiger (https://github.com/clovaai/synthtiger, accessed on 1 August 2023) to synthetically generate a small dataset of 446 scene text images with character-level bounding boxes. A few samples of the generated images with character-level annotations are given in Figure 7.

Experiment Design
We experimented with different backbones, including three variants of DeiT-III [29] (DeiT-Small, DeiT-Medium, and DeiT-Base) and CaiT-Small [30]. Employing the DeiT-Small, DeiT-Medium, and DeiT-Base backbones allows us to assess the impact of backbone complexity on recognition performance. Furthermore, the inclusion of CaiT-Small enables us to compare the recognition performance of different ViT architectures.
The details of these four pretrained ViT backbones are shown in Table 1. For an input image of 224 × 224 pixels, the output feature maps are 14 × 14 × D, where D is the embedding dimension, which is provided in the same table for each ViT backbone. For each pretrained ViT backbone, we set up two ViT-CTC models, employing both the baseline feature averaging method (FA) [5] and the proposed marginalization method (M), presented in Section 2.1.2. In FA, the feature maps are arithmetically averaged along the height (vertical) dimension to produce a 1D feature sequence for a character classifier and a CTC decoder. As a result, FA does not provide character location information.
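For contrast with the marginalization method, the FA baseline can be sketched as follows (a hypothetical standalone sketch, with a random projection standing in for the trained character classifier):

```python
import numpy as np

def feature_averaging(F, weight, bias):
    """Baseline FA: average 2D feature maps F (H, W, D) over the height
    dimension, project to class scores, and apply a softmax over the
    class dimension only, yielding a (W, C) sequence for a CTC decoder.
    Unlike the marginalization method, the per-height information is
    discarded, so no association map can be formed."""
    f1d = F.mean(axis=0)                         # collapse height: (W, D)
    scores = f1d @ weight + bias                 # class scores: (W, C)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)      # softmax over C only
```

With a 14 × 14 ViT feature grid, FA yields 14 prediction frames, the same sequence length the marginalization method produces, but without any height-resolved probabilities.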
Similarly, for recognition performance and latency comparison purposes, we also set up Transformer-decoder-based models based on the same ViT backbones; the specifications of the Transformer decoder are provided in Table 2. It should be noted that only our ViT-CTC models using the proposed marginalization method and the Transformer-decoder-based models can offer character locations in addition to recognition. The estimated character locations are qualitatively and quantitatively evaluated against the ground-truth locations. In the case of a CTC decoder, the character set comprises 37 characters, encompassing case-insensitive letters, numbers, and a blank token denoted as ϵ. On the other hand, for a Transformer decoder, the character set consists of 39 characters, including case-insensitive letters, numbers, and three distinct special tokens (PADDING: zero padding; EOS: end of sentence; SOS: start of sentence). The input images were resized to 224 × 224 pixels.
The training strategy comprised two phases: (1) training on the synthetic datasets and (2) fine-tuning on the real datasets. These two phases of training allow us to identify models' weaknesses or training datasets' blind spots during evaluation [53]. The training process lasted for 50 iterations. During each iteration, 300,000 images were randomly selected, and a batch of 64 images was used for training without any data augmentation to ensure a fair comparison with the SOTA methods [4,14]. In addition, because the synthetically generated training images were already augmented during generation, additional data augmentation, such as [54,55], may affect the recognition accuracy negatively [45]. The total training is equivalent to around two epochs on all of the training data. The fine-tuning phase followed the same settings as before, but it lasted for only 30 iterations, which is approximately equivalent to three epochs over the entire fine-tuning dataset. Cyclic learning rate schedules between 10^-4 and 10^-5 and between 10^-5 and 10^-6 were used for the training and fine-tuning phases, respectively. For all the models, pretrained ViT weights [29,30] were used with a gradient clip of ten.

Results
In this section, we present the experimental outcomes and important analyses. To evaluate the performance of our ViT-CTC models using the proposed method (M), we begin by providing the ablation analyses of the encoder complexities and architectures in Section 3.1, followed by comparing their accuracy with the baseline and SOTA methods that do not provide character locations in Sections 3.2 and 3.3. In Section 3.4, we compare with the baseline Transformer-decoder-based models that provide character locations via the cross-attention map. Lastly, we provide the qualitative and quantitative evaluation of the character locations derived from the proposed method and the cross-attention map.

Ablation Analyses of the Encoder Complexities and Architectures
In this section, we present the ablation analyses concerning ViT-based feature extractor complexities, since the feature extractor is the main component in the proposed method. We utilize three variants of DeiT backbones (namely, DeiT-S, DeiT-M, and DeiT-B) and explore different encoder architectures employing a CaiT-S backbone.
Table 3 demonstrates that increasing the complexity of the ViT-based feature extractor, specifically transitioning from DeiT-S to DeiT-M and DeiT-B, results in higher total word recognition accuracy for both synthetic and real training data. However, these improvements are accompanied by larger model sizes and heightened computational demands, as indicated in Table 1. Table 3 also shows that, despite having a much smaller model size and computational demand, the CaiT-S model achieves a total recognition accuracy comparable to the DeiT-B model for both synthetic and real training data.

Recognition Accuracy Comparison with the Baseline Feature Averaging
In this section, we perform a comparison to assess the recognition accuracy of our ViT-CTC models using both the proposed method (M) and the baseline feature averaging (FA). Since FA does not yield character localization, the comparison in this section primarily centers on the recognition accuracy of the two methods.
As indicated in Tables 4a,b, there are minimal distinctions in terms of recognition accuracy between the two methods, regardless of the source of training data (i.e., real or synthetic). The findings can be distilled into three primary points. Firstly, the proposed method, while offering both model explainability and character location information, does not lead to any loss of recognition accuracy. Secondly, the utilization of a 2D feature extractor such as a ViT backbone improves the recognition accuracy of a CTC decoder, whereas the majority of CTC-based methods depend on a tailored 1D feature extractor. Thirdly, the utilization of real labeled data, albeit limited, results in a substantial recognition performance improvement compared with relying solely on synthetic training data.

Recognition Accuracy Comparison with the SOTA CTC-Based Methods
Similar to the preceding section, this section compares the recognition accuracy of our ViT-CTC models using our proposed method (M) with the SOTA CTC-based methods lacking character location information. Among the SOTA methods in Table 5, only the DiG-ViT [5] and GTC [4] models use real labeled data for training. The other models use solely synthetic data for training. The table suggests that integrating real labeled data can improve recognition accuracy on benchmark datasets. However, various factors, such as backbone architecture, training iterations, and data augmentation, also play a significant role in this improvement. Among these methods, only ViTSTR [1] and DiG-ViT employ a ViT backbone; the rest rely on convolutional backbones. DiG-ViT employs the feature averaging technique to convert 2D feature maps to 1D for a CTC decoder. GTC [4] uses an attention-based decoder to guide a CTC decoder. Focusing on the models trained only on synthetic data (S), Table 5a shows that our ViT-CTC models using the proposed method (M) outperform the SOTA CTC-based methods, such as TRBC (TPS-ResNet-BiLSTM-CTC) [14], in recognition accuracy (bold numbers in the table). This recognition accuracy improvement is attributed to the advanced feature extraction of pretrained ViT backbones. Meanwhile, when considering methods trained or fine-tuned on real labeled data (R), Table 5b shows that our ViT-CTC models slightly outperform the SOTA DiG-ViT models (bold numbers in the table). Thus, regardless of the training data source, our ViT-CTC models with the proposed method (M) consistently show superior or comparable performance to the SOTA CTC-based methods.

Recognition Accuracy and Efficiency Comparison with the Baseline Transformer-Decoder-Based Models
Earlier sections evaluated our proposed ViT-CTC models' recognition accuracy against the CTC-based methods that lack character localization. Now, we jointly compare recognition accuracy and latency with a Transformer-decoder-based architecture that can associate predicted characters with relevant image regions.
A CTC decoder is acknowledged for its faster inference but lower recognition accuracy compared with a Transformer decoder, which learns an implicit language model [3][4][5][14]. This section quantitatively assesses the trade-off between the two decoders in terms of both latency and recognition accuracy.
Tables 6a,b compare the recognition accuracy of our ViT-CTC models using our proposed method against the Transformer-decoder-based models. Regardless of the training data source, the Transformer-decoder-based models consistently achieved higher recognition accuracy on benchmark datasets due to their ability to capture character dependencies through implicit language modeling, which is absent in a CTC decoder.
However, this recognition accuracy advantage was offset by increased latency, as shown in Figure 8 and Table 7. The inference time of a Transformer decoder is directly tied to the number of decoded characters, while a CTC decoder maintains a constant inference time. Quantitatively, the inference speed of a CTC decoder surpasses that of a Transformer decoder by up to 12 times, making it more appealing in low-latency and low-resource scenarios. Considering both latency and recognition accuracy, Figure 9 summarizes the trade-off between a CTC decoder and a Transformer decoder using different ViT backbones. With the same ViT backbone, the CTC decoder significantly outperforms the Transformer decoder in terms of efficiency, with a speed advantage of up to 12 times. However, this speed gain is countered by a maximum reduction in overall word recognition accuracy of 3.1%.

Qualitative Evaluation of Association Map
Until now, we have examined our ViT-CTC models' recognition performance and efficiency in comparison with the CTC- and Transformer-decoder-based models. This section shifts focus to the significance of the association map, denoted as AM, which is a key output of our proposed method. The detailed derivation of the AM can be found in Section 2.1.3. Utilizing an AM enables the establishment of explainable 2D spatial relationships between the model's predictions and relevant image regions. This spatial link is crucial for understanding the model's predictions and localization. The AM generated by our proposed method corresponds to the cross-attention map formed by the cross-attention module within the Transformer decoder. This module selectively incorporates relevant features for adaptive character predictions. As α increases, the association maps retain high-probability regions while discarding those below α, as seen in Figure 10d. Compared with the Transformer decoder's cross-attention maps in Figure 10e, overall alignments are observed. These alignments validate the accuracy and reliability of the association maps from our proposed method, which does not rely on a computationally intensive cross-attention mechanism.

Quantitative Evaluation of Association Maps
In this section, we quantitatively evaluate our ViT-CTC models' association map and the Transformer-decoder-based models' cross-attention map. Employing Equation (6) for the association map and Equation (9) for the cross-attention map, we calculate alignment evaluation metrics (AEMs) using Equation (8). This was performed using different threshold values α and β, respectively, on the synthetic dataset with character-level annotations, as detailed in Section 2.2.3. To ensure fairness, only image samples correctly recognized by both our ViT-CTC and the Transformer-decoder-based models were included in the evaluation.
Figure 11 shows that the average alignment evaluation metric (AEM) of the cross-attention map remains stable across different β values, indicating good alignment accuracy with the ground-truth character locations. In contrast, the average AEM of the association map exhibits slight sensitivity to α, particularly at higher values. For α ≤ 0.95, the average AEM of the association map remains above 98%, signifying strong alignment between the estimated and ground-truth character locations. Thus, the association map is comparable to the cross-attention map in localizing the predicted characters, while the former has a significantly lower computational demand.
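The AEM itself is defined in Equation (8), which is not reproduced in this section. As a hedged illustration only, the sketch below scores alignment with a simple center-in-box rule: a predicted character counts as aligned when the center of its estimated region falls inside the ground-truth character box, and the score is the fraction of aligned characters. The function name and the box convention are our assumptions, not the paper's metric.

```python
def alignment_accuracy(est_boxes, gt_boxes):
    """Fraction of characters whose estimated-region center lies inside
    the ground-truth box. Boxes are (row_min, row_max, col_min, col_max);
    est_boxes[k] and gt_boxes[k] refer to the same predicted character k.
    A simplified stand-in for the paper's AEM (Equation (8))."""
    aligned = 0
    for est, gt in zip(est_boxes, gt_boxes):
        center_row = (est[0] + est[1]) / 2.0
        center_col = (est[2] + est[3]) / 2.0
        if gt[0] <= center_row <= gt[1] and gt[2] <= center_col <= gt[3]:
            aligned += 1
    return aligned / len(est_boxes) if est_boxes else 0.0
```

A center-based rule is tolerant of loose box extents, which is consistent with the observation above that alignment stays high over a wide range of thresholds.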
Figure 12 compares the estimated character locations from the association and cross-attention maps with the ground-truth bounding boxes in a few highly curved text images. Both methods' estimated character locations closely align with the ground-truth positions.

Limitations and Future Work
Since a CTC decoder performs a many-to-one mapping, the pretrained ViT backbone must produce 2D feature maps whose width is greater than or equal to the length of the text in an input image. For example, a ViT backbone that takes a 224 × 224-pixel input image and returns 14 × 14 feature maps can predict at most 14 characters. Moreover, due to its reliance on left-to-right alignments, a CTC decoder is unable to recognize vertical or highly oriented text images.
Furthermore, due to the sizable receptive field of 16 × 16 pixels in the pretrained ViT backbones employed in this research, the character locations they generate exhibit low resolution.
Thus, future experiments will consider other pretrained ViT or hybrid CNN-Transformer backbones that output dense feature maps, increasing the number of predictable characters and enhancing the resolution of the resulting association map. We will also explore two potential applications of the association maps. First, the association map can guide a Transformer decoder to counter attention drift in long textline images. Second, the estimated character locations can aid text rectification for highly curved text images.

Conclusions
In this paper, we propose a marginalization-based method that enables a 2D feature extractor to be paired with a 1D CTC decoder by predicting an output sequence of 2D joint probability distributions over the height and class dimensions. The height dimension is marginalized to suit a 1D CTC decoder. In addition, the proposed method yields an association map that can be used to determine character locations and explain model predictions.
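The marginalization step can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming, not the released implementation: it assumes logit maps S of shape (H, W, C) from the linear layer, and omits the ViT backbone and the CTC decoder itself.

```python
import numpy as np

def marginalized_ctc_input(S):
    """Turn 2D logit maps S of shape (H, W, C) into a 1D CTC input P
    of shape (W, C): a joint softmax over the height and class
    dimensions at each width step, then marginalization over height.
    Returns (U, P), with U stored as (W, H, C) for convenience."""
    H, W, C = S.shape
    # Joint softmax over (H, C) for each width step, with the usual
    # max-subtraction for numerical stability.
    S_flat = S.transpose(1, 0, 2).reshape(W, H * C)
    S_flat = S_flat - S_flat.max(axis=1, keepdims=True)
    expS = np.exp(S_flat)
    U = (expS / expS.sum(axis=1, keepdims=True)).reshape(W, H, C)
    # Marginalize over the height dimension: P_w[c] = sum_h U_w[h, c].
    P = U.sum(axis=1)
    return U, P
```

P is the per-timestep class distribution that a standard 1D CTC loss or decoder consumes, while U retains the height information from which the association map is built.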
The experimental results show that our ViT-CTC models outperform the recent CTC-based SOTA methods on the public benchmark datasets in terms of recognition accuracy. Compared with a Transformer-decoder-based model, a ViT-CTC model has a maximum reduction in total word recognition accuracy of 3.1%, regardless of the ViT backbone. However, a ViT-CTC model exhibits a substantial speed improvement, surpassing a Transformer-decoder-based model by up to 12 times. Both the qualitative and quantitative evaluations of the character locations estimated from the association map closely correspond with those estimated using the cross-attention map and the ground-truth character-level bounding boxes.

Figure 1. The cross-attention vs. the association maps. The first row consists of text images. The second and third rows consist of the cross-attention and association maps, respectively, which associate each predicted character with image regions. The last row consists of text transcriptions. The cross-attention map is obtained from a Transformer decoder, while the association map is obtained from a ViT-CTC model. Best viewed in color.

Figure 2. The proposed marginalization-based method: a 2D feature sequence, F = (F_{1,1}, ..., F_{H,W}), is produced by a 2D feature extractor such as a ViT backbone. F is fed to a linear layer to produce S = (S_{1,1}, ..., S_{H,W}), from which a softmax normalization is performed over both the H and C dimensions. Next, the normalized U = (U_{1,1}, ..., U_{H,W}) is marginalized over the H dimension to produce P = (P_1, ..., P_W), which is fed to a CTC decoder. D and C are the feature and class dimensions, respectively.

Figure 3. 3D graphical illustration of U for an input image. (a) Input image. (b) The computed U. At W = 1, the bright cells, corresponding to the character L, have a high probability. Best viewed in color.
Figure 4. The estimated character locations, R_k, from the association map. (a) Input image with ground-truth character bounding boxes, GT_k. (b) Estimated character regions. Best viewed in color.

Figure 5. (a) Input image with ground-truth character bounding boxes. (b) Association map.
The synthetic training set comprises 8.5M images drawn from 50% of MJSynth, 50% of SynthText, 100% of SynthAdd, and 10% of SynthTiger, with a mixing ratio of around 4:3:1.3:1. Different training sources are combined to increase the diversity of the training data. Some samples from the training datasets are shown in Figure 6a.

Figure 6. Sample training and fine-tuning images. (a) Sample images from the training datasets. (b) Sample images from the fine-tuning datasets.

Figure 7. Sample text images with character-level annotations.

Figure 8. Inference time comparison between our ViT-CTC models and the Transformer-decoder-based models on an RTX 2060 GPU. Trendlines are projected to the maximum number of characters (i.e., 25) [1]. Tr. Dec.: Transformer decoder. CTC-M: CTC decoder with the proposed method. CTC-FA: CTC decoder with feature averaging. Best viewed in color.

Figure 9. Maximum inference time vs. recognition accuracy comparisons between the ViT-CTC models using the proposed method and the Transformer-decoder-based models on an RTX 2060 GPU. Tr. Dec.: Transformer decoder. Best viewed in color.

Figure 10 displays the association maps corresponding to different α values for two examples where text from the top intrudes. Instead of '1932' and 'COLLEGE', the ground-truth words are 'ATHLETIC' and 'LONDON'. The ViT-CTC model accurately predicts both words. Examination of the association maps reveals that the model correctly links the predicted characters with the relevant lower regions containing 'ATHLETIC' and 'LONDON', as opposed to the upper regions with '1932' and 'COLLEGE'. Thus, association maps not only explain the model's predictions but also offer localization for those predictions.

Figure 10. Association maps for different values of α. The color bars show the image regions corresponding to the predicted characters. Best viewed in color.

Figure 11. The average AEMs of the association and cross-attention maps as a function of α and β, respectively. Best viewed in color.
(a) Input. (b) Association map. (c) Cross-attention map.
Figure 12. Illustrations of the estimated character locations from the association (α = 0.8) and cross-attention (β = 0.5) maps vs. the ground-truth character locations. Best viewed in color.

Table 1. Specifications of the pretrained ViT backbones.

Table 2. Specifications of the Transformer decoder.

Table 3. Word recognition accuracy (%) of the ablation results of the encoder complexities and architectures with the proposed method (M). FT: fine-tuning on real data. Bold: highest. (a) Methods trained on synthetic training data (S).

Table 4. Word recognition accuracy (%) comparison between the proposed method (M) and the baseline feature averaging (FA). FT: fine-tuning on real data. Bold: highest. (a) Methods trained on synthetic training data (S).

Table 5. Word recognition accuracy (%) comparison between the proposed method (M) and the SOTA CTC-based methods. FT: fine-tuning on real data. Size: parameters in millions. M: the proposed method. Bold: highest. (a) Methods trained on synthetic training data (S).

Table 6. Word recognition accuracy (%) comparison with the baseline Transformer-decoder-based models. FT: fine-tuning on real data. Size: parameters in millions. Tr. Dec.: Transformer decoder. M: the proposed method. Bold: highest. (a) Methods trained on synthetic training data (S).