Article

Single-Character-Based Embedding Feature Aggregation Using Cross-Attention for Scene Text Super-Resolution

by Meng Wang *, Qianqian Li and Haipeng Liu
School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(7), 2228; https://doi.org/10.3390/s25072228
Submission received: 27 December 2024 / Revised: 13 March 2025 / Accepted: 27 March 2025 / Published: 2 April 2025
(This article belongs to the Section Sensor Networks)

Abstract

In textual vision scenarios, super-resolution aims to enhance textual quality and readability to facilitate downstream tasks. However, the ambiguity of character regions in complex backgrounds remains challenging to mitigate, particularly the interference between tightly connected characters. In this paper, we propose single-character-based embedding feature aggregation using cross-attention for scene text super-resolution (SCE-STISR) to solve this problem. Firstly, a dynamic feature extraction mechanism adaptively captures shallow features by dynamically adjusting multi-scale feature weights based on spatial representations. During text–image interactions, a dual-level cross-attention mechanism comprehensively aggregates the cropped single-character features with the textual prior, while also aligning semantic sequences and visual features. Finally, an adaptive normalized color correction operation mitigates the color distortion caused by background interference. On the TextZoom benchmark, the text recognition accuracies with different recognizers are 53.6%, 60.9%, and 64.5%, improvements of 0.9–1.4% over the baseline TATT, with an optimal SSIM value of 0.7951 and a PSNR of 21.84. Additionally, our approach improves accuracy by 0.2–2.2% over existing baselines on five text recognition datasets, validating the effectiveness of the model.

1. Introduction

The task of scene text image super-resolution (STISR) in vision research endeavors to reconstruct high-resolution (HR) textual sequences from their low-resolution (LR) counterparts in scene images. Due to poor imaging conditions, scene text images often suffer from low resolution, significantly hindering text detail acquisition and subsequent tasks such as scene text recognition [1,2,3] and scene text detection [4,5]. Existing methods often leverage text-specific features, such as stroke and character structures, to enhance the clarity and discriminability of super-resolved text images [6,7,8]. This paper focuses on addressing limitations related to the dense distribution of characters and the inefficient representation of foreground sequences amidst complex backgrounds in real scenes.
With the advancement of deep learning, deep convolutional neural networks (DCNNs) have become fundamental to STISR due to their powerful nonlinear mapping capabilities and adaptability. TextSR utilizes a recognition loss to guide the training of generative adversarial networks (GANs). TPGSR [9] integrates semantic features from a text recognizer into its generative network. TATT [10] devised an interpreter and a structural consistency loss to handle text with spatial deformations. As shown in Figure 1a,b, existing methods fail to adequately address the dense character distribution and the adverse impact of complex backgrounds on character representation and reconstruction.
TSRN captures contextual information by integrating sequential residual blocks. TBSRN leverages attention modules to process sequential data, improving robustness for text regions in arbitrary orientations. Gestalt [11] focuses on modeling character stroke-level structures in text images. LEMMA [12] significantly improves the quality of text image recovery by combining local detail enhancement and multi-scale feature fusion. These schemes usually use traditional convolution operations. Their limited receptive fields restrict them to capturing local features such as character strokes and edges. However, as continuous data, the readability of text also depends on global information such as the relative position and overall arrangement of characters. Thus, an effective architecture integrating both local details and global semantics requires further exploration.
Recently, MNTSR [13] employed self-supervised end-to-end memory networks, PerMR [14] fused low-level strokes and high-level semantics from a text recognition network to enhance visual quality, and SGENet [15] incorporated textual semantics into a lightweight network design. As shown in Figure 1c, images generated by such methods often exhibit a color drift between the text area and the background area. These methods rely on semantic information or focus only on the text structure while ignoring the color and texture of the original image.
Aiming at the above bottlenecks, this paper proposes single-character-based embedding feature aggregation using cross-attention for scene text super-resolution tasks, with the following main contributions:
  • This study proposes a two-branch feature aggregation strategy, which integrates independently cropped single-character image features with corresponding character probability sequences. This approach ensures that high-level prior information focuses on individual character structures, effectively mitigating complex background interference. It also significantly reduces interference from neighboring densely distributed characters.
  • To leverage the complementary capabilities of convolutional kernels with varying receptive fields, an improved inception module is introduced in shallow layers for dynamic multi-scale feature extraction. By dynamically weighting scaled convolutional kernels, the global overview features and fine-grained features are adaptively adjusted for each input, thus enriching the feature expressions to comprehensively understand the salient vision content.
  • Leveraging adaptive normalization to learn cross-domain mapping relationships, a color correction operation adaptively adjusts the mean and standard deviation of the target image's pixels. This enhances super-resolution quality without altering the original image content. Experiments are performed on the public TextZoom dataset, and the results show the superiority of the proposed model over existing baselines. The average recognition accuracy on the test sets with CRNN, MORAN, and ASTER is improved by 1%, 1.5%, and 0.9%, respectively.

2. Related Works

2.1. Image Super-Resolution

Super-resolution (SR) tasks aim to generate HR images from LR inputs through prior spatial learning. Traditional SR algorithms include interpolation-based [16,17,18], frequency-domain-based [19,20], and learning-based [21] methods. With the rise of deep learning, Dong et al. [22] pioneered the use of convolutional layers to learn representations for SR reconstruction. Since then, SR networks have been progressively scaled and adapted, including VDSR [23], an SR architecture using 20 convolutional layers; ESPCN [24], which is built on a sub-pixel convolutional network; a recursively structured deep network [25]; and a deep residual network [26] with skip connections. SRGAN later introduced adversarial learning for SR reconstruction. Recently, diffusion models, after surpassing GANs, have been extended to different vision tasks, including SR. For instance, SRDiff [27], the first diffusion-based SR model for single-image reconstruction, effectively addresses spatial detail confusion and training instability to generate realistic results.

2.2. Scene Text Recognition for STISR

Scene text recognition (STR) involves automatically detecting and recognizing textual content in natural scenes. Deep learning has driven significant progress in STR. CRNN transforms text recognition into a sequence learning problem, modeling the position and shape of individual characters. MORAN enables text detection and recognition through a spatial transformation network and a recognition network. ASTER [28] introduces an adaptive rectification network and an attention-based sequence recognition model. TextSR, TPGSR, TATT, and others leverage these recognizers to extract text priors that guide SR networks. In this paper, we adopt the above three recognizers to capture semantic information and integrate it with image features for enhanced reconstruction.

2.3. Scene Text Super-Resolution

Unlike single-image super-resolution (SISR), STISR aims to enhance text image quality for improved readability and downstream recognition tasks. SISR approaches can in principle be applied to STISR; e.g., Dong et al. performed the STISR task by extending the SRCNN [29] baseline to text images, and PlugNet [30] employs a lightweight SR module to extract features from LR images. However, STISR generally requires special processing to maintain the structure and clarity of characters. Recently, more models specifically designed for STISR have been proposed.
Transformer-based methods (e.g., TATT, C3-STISR [31], and TATSR [32]) leverage the global modeling capabilities of the transformer to capture contextual information in text images, significantly improving restoration performance, which is particularly effective in complex backgrounds and diverse font scenarios. Methods based on diffusion models (e.g., TextDiff [33], RGDiffSR [34], TCDM [35], and PEAN [36]) generate high-quality, high-resolution text images by utilizing diffusion processes, further enhancing image details and visual realism. Semantic and prior-guided approaches (e.g., HiREN [37], PCAN [38], and DPMN [39]) optimize the super-resolution process by incorporating semantic information, text recognition modules, or prior knowledge, significantly improving the semantic restoration quality of text images. These methods demonstrate robust performance in handling low-resolution, blurry, or noise-affected text images.
Furthermore, to address the demands of practical applications, efficient and lightweight methods (e.g., ESTISR [40] and Pixel Adapter [41]) focus on designing computationally efficient super-resolution models suitable for mobile devices and real-time processing scenarios, providing practical technical support for deployment. Notably, the introduction of Real-CE [42] has established a comprehensive dataset and evaluation framework for Chinese–English bilingual scene text image super-resolution, significantly advancing research in multilingual text super-resolution.

3. The Proposed Network Architecture

The proposed SCE-STISR is illustrated in Figure 2. LR images $X_{LR}$ are first preprocessed: DIFE extracts shallow features $X_I$ that take both global and local representations into account, and a pre-trained TR predicts text probability sequences $t_p$. SCBD then clips $X_I$ into single-character features $X_S$ according to the predicted character positions. $X_I$, $X_S$, and $t_p$ are fed in parallel into DBFA to guide the aggregation of visual and semantic representations. Finally, a high-level prior $P_2 = f_{DBFA}(X_I, X_S, t_p)$ is passed to the reconstruction module, which includes a color correction layer for improving color consistency and accuracy and sequential residual blocks for SR image reconstruction.

3.1. Image Preprocessing

The image preprocessing procedure consists of three parts, namely dynamic inception feature extraction, single-character boundary detection, and the text recognizer. Through these steps, LR images $X_{LR}$ are fed as inputs, and shallow features $X_I$, character image features $X_S$, and text priors $t_p$ are extracted as outputs.

3.1.1. Dynamic Inception Feature Extraction

A single convolutional layer has a limited receptive field and extracts only single-scale features. However, text characters exhibit multi-scale properties, such as font, size, and style, making single convolutional layers inefficient for extracting rich prior features. The DIFE [43] applied in this paper is illustrated in Figure 3. Multi-scale convolution kernels capture structural features at all levels, from letter shapes to word layouts, adapting to diverse datasets and task requirements. A 1 × 1 convolutional kernel reduces or expands the feature channels, adjusting the parameter scale and network nonlinearity for an efficient architecture design.
$$X_i = f_c(k_{h \times h}, X_{LR}), \quad h = 1, 3, 5, 7, 9 \quad (1)$$
where $X_{LR} \in \mathbb{R}^{H \times W \times 3}$ is the LR text image, with $H$ and $W$ the height and width, respectively; $X_i \in \mathbb{R}^{H \times W \times C}$ denotes the shallow features extracted from the text image at each branch, with $i \in [0, 4]$ and $C$ the number of feature channels; $k_{h \times h}$ is the convolution kernel of size $h \times h$; and $f_c(\cdot)$ denotes the convolution function.
Since different inputs have different priorities in the feature space captured by multi-scale convolutional kernels, this paper introduces a dynamic feature-weighting mechanism that adaptively weights each convolutional kernel branch based on the input feature distribution. This strategy enhances the network's sensitivity to diverse features and improves its ability to process complex text structures, boosting overall performance and generalization. The computational process can be formulated as:
$$w_i = f_{sig} \circ f_{FC} \circ f_{GAP}(X_{LR}), \quad i \in [0, 4] \quad (2)$$
$$X_I = \sum_{i=0}^{4} w_i X_i \quad (3)$$
where $\circ$ denotes function composition and $f_{GAP}$ denotes global average pooling applied to the input $X_{LR}$. The fully connected layer $f_{FC}$ then learns the weight coefficients of the five convolution branches for each input sample. Finally, the weights in each dimension are constrained to [0, 1] by the sigmoid function $f_{sig}$ to obtain the normalized dynamic weights $w_i$. In addition, $X_I \in \mathbb{R}^{H \times W \times C}$ is the shallow feature output.
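To make the dynamic weighting concrete, the following PyTorch sketch implements Equations (1)–(3); the output channel count and the single fully connected layer are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DynamicInception(nn.Module):
    """Sketch of DIFE: five parallel convolutions (kernel sizes 1/3/5/7/9)
    fused with input-dependent weights from a GAP -> FC -> sigmoid branch."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (1, 3, 5, 7, 9)]
        )
        self.gap = nn.AdaptiveAvgPool2d(1)                # f_GAP
        self.fc = nn.Linear(in_ch, len(self.branches))    # f_FC
        self.sig = nn.Sigmoid()                           # f_sig

    def forward(self, x_lr):                              # x_lr: (B, 3, H, W)
        feats = torch.stack([conv(x_lr) for conv in self.branches], dim=1)  # X_i
        w = self.sig(self.fc(self.gap(x_lr).flatten(1)))                    # w_i: (B, 5)
        return (w.view(w.size(0), -1, 1, 1, 1) * feats).sum(dim=1)          # X_I

# x = torch.randn(2, 3, 16, 64)
# print(DynamicInception()(x).shape)  # torch.Size([2, 64, 16, 64])
```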

3.1.2. Single-Character Boundary Detection

Since LR images usually contain multiple characters, the relative position and order of arrangement between characters are critical. The transformer’s self-attention [44] can directly model the association between any two characters, thereby generating more accurate text position information. To effectively separate foreground text from the background and maximize textual prior information to guide semantic and visual reconstruction, this study adopts transformer-based SCBD [45] to predict the character positions and clip individual characters, forming an image block sequence.
SCBD is shown in Figure 4. The input $X_I$ is flattened as $X_P = f_f(X_I)$, where $X_P \in \mathbb{R}^{N \times (P^2 \times C)}$ is a sequence of two-dimensional (2D) image blocks, the resolution of each image block is $P \times P$, and the number of image blocks is $N = H \times W / P^2$. To help the model maintain position sensitivity while processing the sequence data, a positional encoding $E_{pos}$ is further added before the encoder predicts the character positions:
$$\{(PC_j, PW_j)\}_{j=0}^{M} = f_E(X_P + E_{pos}) \quad (4)$$
In Equation (4), $(PC_j, PW_j)$ refers to the center position and width of each predicted character, $j$ indexes the $j$th character, and $f_E(\cdot)$ refers to the cascaded transformer encoder.
To reduce background interference, focus on the text foreground, and weaken the influence between neighboring characters, we use the predicted positional information $(PC_j, PW_j)$ to crop out the corresponding single-character features:
$$X_S^j = f_{clip}\big(X_I, (PC_j, PW_j)\big) \quad (5)$$
where $f_{clip}(\cdot)$ denotes the clipping operation and $X_S^j$ indicates a single-character feature. Each $X_S^j$ is flattened as $\dot{X}_S^j = f_f(X_S^j)$, and $X_S = (\dot{X}_S^0, \dot{X}_S^1, \dot{X}_S^2, \ldots, \dot{X}_S^M)$ represents the concatenation of the flattened character features.
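As an illustration of the cropping in Equation (5), the sketch below slices character strips from the shallow feature map using predicted centers and widths; the normalized-coordinate convention and the fixed block width of 8 are assumptions made here for simplicity, not details stated in the paper.

```python
import torch
import torch.nn.functional as F

def clip_characters(x_i, centers, widths, block_w=8):
    """Sketch of f_clip: crop each predicted character strip from X_I along
    the width axis, pool it to a fixed block size, and flatten it."""
    B, C, H, W = x_i.shape
    chars = []
    for pc, pw in zip(centers, widths):                      # (PC_j, PW_j), normalized to [0, 1]
        left = max(int((pc - pw / 2) * W), 0)
        right = min(int((pc + pw / 2) * W), W)
        strip = x_i[:, :, :, left:max(right, left + 1)]      # X_S^j
        strip = F.adaptive_avg_pool2d(strip, (H, block_w))   # equal-sized character block
        chars.append(strip.flatten(2))                       # flattened block sequence
    return torch.cat(chars, dim=2)                           # concatenated X_S

# x_i = torch.randn(2, 64, 16, 64)
# x_s = clip_characters(x_i, centers=[0.2, 0.5, 0.8], widths=[0.2, 0.2, 0.2])
```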

3.1.3. Text Recognizer

The aim of the TR is to capture text probability sequences in LR images as prior information, thus guiding the model to reconstruct SR images with precise text semantics. In previous studies, methods such as CRNN, MORAN, and ASTER have shown excellent performance in text recognition. Therefore, this paper uses these pre-trained recognition models to obtain the text category coding vector $t_p^j$:
$$t_p^j = f_{TR}(X_S^j) \quad (6)$$
In this equation, $f_{TR}(\cdot)$ is the TR function and $t_p = (t_p^0, t_p^1, t_p^2, \ldots, t_p^M) \in \mathbb{R}^{M \times S}$ is the prior text probability sequence, where $M$ denotes the sequence length and $S$ is the number of categories in the reference text label set. In general, this set consists of the Arabic numerals 0–9, the 26 letters of the alphabet, and a blank character.

3.2. Dual-Branch Feature Aggregation

DBFA is a key component in the proposed architecture, as shown in Figure 5. Its purpose is to interpret the text prior $t_p$ with respect to the image features, thereby exerting precise semantic guidance on the relevant spatial locations in the image feature domain. During deep interactions between text priors and image features, character adhesion and background interference can lead to incorrect semantic guidance. To address these challenges, we aggregate individual character features and global features to supplement the missing background information.
$$\bar{T}_t = f_{LN}(f_{MSA}(t_p) + t_p) \quad (7)$$
$$T_t = f_{LN}(f_{FFN}(\bar{T}_t) + \bar{T}_t) \quad (8)$$
where $f_{LN}(\cdot)$, $f_{MSA}(\cdot)$, and $f_{FFN}(\cdot)$ refer to the layer normalization, multi-head self-attention (MSA), and feedforward network (FFN) functions, respectively. MSA computes global correlations between textual semantic elements and outputs contextually enhanced textual features $T_t$.
To achieve deep alignment between the text features $T_t$ and the image features, two-level multi-head cross-attention (MCA) is used. The text feature $T_t$ and the single-character feature $X_S$ serve as the first-level MCA inputs, with $T_t$ as the query and $X_S$ as the key and value, allowing each character to find the image feature that corresponds to it. The input tensors $X_S$ and $T_t$ are first split into $n$ sub-tensors along the channel dimension.
$$Q_i = T_t W_i^Q, \quad K_i = X_S W_i^K, \quad V_i = X_S W_i^V, \quad i = 1, 2, \ldots, n \quad (9)$$
where $W_i^Q$, $W_i^K$, and $W_i^V$ are the linear projection matrices of the $i$th attention head.
$$P_1^i = \mathrm{softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i \quad (10)$$
In this equation, $P_1^i$ is the attention output of each head and $d_k$ is the dimension of $K_i$, used to scale the attention. We process the results $P_1^i$ with a channel-wise concatenation operation $f_{con}(\cdot)$ and a linear projection function $f_w$, described as:
$$\bar{P}_1 = f_w\big(f_{con}(P_1^1, P_1^2, \ldots, P_1^n)\big) \quad (11)$$
The output $\bar{P}_1$ is passed to an FFN for feature refinement, yielding a high-level prior $P_1$ free of background interference.
The prior $P_1$ aggregates the character image blocks $X_S$ as keys and the text features $T_t$ as values, and its size is consistent with them, so it can be fed directly into the second-level MCA without dimensional mismatch. We use $P_1$ to connect semantic information with the global visual features: the flattened shallow features $X_I$ serve as the query, $P_1$ as the key, and $T_t$ as the value. Unlike $X_S$ in the first-level MCA, $X_I$ is not flattened per character, so its foreground and background are not explicitly separated; however, it contains global features, which effectively alleviates the visual quality degradation caused by missing background information.
$$\bar{P}_2 = f_{LN}\big(f_{MCA}(X_I, P_1, t_p) + X_I\big) \quad (12)$$
$$P_2 = f_{LN}(f_{FFN}(\bar{P}_2) + \bar{P}_2) \quad (13)$$
Each pixel in $X_I$ can thus query, via $P_1$, the text feature corresponding to it, producing a high-level prior $P_2$ with valid semantic mappings.
Since the cropped image is a sequence of image block features arranged per character, cluttered background interference is avoided during transformer aggregation, and the correlation computation reduces the interference of neighboring characters on the recovery of the current character. The global features $P_2$ can therefore assign semantic information to the spatial domain more accurately via $P_1$.
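A compact PyTorch sketch of this two-level cross-attention follows; it uses nn.MultiheadAttention as a stand-in for the MSA/MCA blocks, omits the first-level FFN refinement, and assumes all inputs share one embedding dimension, so it should be read as a structural outline rather than the exact DBFA implementation.

```python
import torch
import torch.nn as nn

class DualBranchAggregation(nn.Module):
    """Structural sketch of DBFA: self-attention over the text prior,
    first-level cross-attention onto single-character features, then
    second-level cross-attention queried by the global shallow features."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn1 = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
        self.ffn2 = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, t_p, x_s, x_i):
        # t_p: (B, M, dim) text prior; x_s: (B, M, dim) character features;
        # x_i: (B, H*W, dim) flattened shallow features.
        t = self.norms[0](self.msa(t_p, t_p, t_p)[0] + t_p)   # T_t (Eq. (7))
        t = self.norms[1](self.ffn1(t) + t)                   # Eq. (8)
        p1 = self.mca1(t, x_s, x_s)[0]                        # first-level MCA -> P_1
        p2 = self.norms[2](self.mca2(x_i, p1, t)[0] + x_i)    # second-level MCA: query X_I, key P_1
        return self.norms[3](self.ffn2(p2) + p2)              # high-level prior P_2
```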

3.3. Reconstructed Module

The reconstruction module in this study consists of a sequential residual block (SRB), a color correction block (CCB), and an upsampling layer. $P_2 + X_I$ is sent to the SRB for SR reconstruction; the resulting features are again passed through the SRB and summed element-wise with $X_I$ to ultimately obtain the SR reconstruction features $X_{SR}^C$. However, the SR image restored by upsampling $X_{SR}^C$ through the pixel-shuffle layer may suffer from color drift in either the text part or the background part. In view of this, this study applies adaptive normalization to adjust the feature representation of the image to match the statistical properties of the target domain, ensuring that the model generates images consistent with the target style.
In detail, the means and variances of the color values of the SR and LR images are first calculated using the $f_M(\cdot)$ and $f_V(\cdot)$ functions as the target statistics:
$$\mu_{X_I^C}, \mu_{X_{SR}^C} = f_M(X_{SR}^C, X_I^C) \quad (14)$$
$$\sigma_{X_I^C}, \sigma_{X_{SR}^C} = f_V(X_{SR}^C, X_I^C) \quad (15)$$
where $C$ denotes the RGB channel, and $\mu_{X_{SR}^C}, \mu_{X_I^C}$ and $\sigma_{X_{SR}^C}, \sigma_{X_I^C}$ denote the means and variances of $X_{SR}^C$ and $X_I^C$, respectively. Secondly, $X_{SR}^C$ is normalized to improve the model's accuracy:
$$X_{SR}^N = \frac{X_{SR}^C - \mu_{X_{SR}^C}}{\sigma_{X_{SR}^C}} \quad (16)$$
Subsequently, by applying $\mu_{X_I^C}$ and $\sigma_{X_I^C}$ to the normalized feature $X_{SR}^N$, the SR image is adjusted so that the mean and variance of its color values are consistent with those of the LR image, yielding the color-corrected reconstructed feature $\bar{X}_{SR}$:
$$\bar{X}_{SR} = X_{SR}^N \cdot \sigma_{X_I^C} + \mu_{X_I^C} \quad (17)$$
The reconstruction process in this study consists of five SRB and CCB modules connected in series, with a final upsampling layer to obtain the reconstructed high-resolution image features $X_{SR}$.
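A minimal sketch of the CCB's statistics matching (Equations (14)–(17)) is shown below; computing the statistics per channel over the spatial dimensions and adding an epsilon for numerical stability are implementation assumptions.

```python
import torch

def color_correct(x_sr, x_lr, eps=1e-6):
    """Sketch of the CCB: renormalize the per-channel statistics of the SR
    features to match those of the LR input."""
    mu_sr = x_sr.mean(dim=(2, 3), keepdim=True)      # mean of X_SR^C
    std_sr = x_sr.std(dim=(2, 3), keepdim=True)      # sigma of X_SR^C
    mu_lr = x_lr.mean(dim=(2, 3), keepdim=True)      # target mean from X_I^C
    std_lr = x_lr.std(dim=(2, 3), keepdim=True)      # target sigma from X_I^C
    x_norm = (x_sr - mu_sr) / (std_sr + eps)         # Equation (16)
    return x_norm * std_lr + mu_lr                   # Equation (17)

# sr = torch.rand(2, 3, 32, 128); lr = torch.rand(2, 3, 16, 64)
# print(color_correct(sr, lr).shape)  # torch.Size([2, 3, 32, 128])
```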

3.4. Loss Function

In this study, the total loss function consists of the pixel loss $L_{pix}$ and the text recognition prior loss $L_{tp}$.
$$L_{pix} = \| X_{SR} - X_{HR} \|_2 \quad (18)$$
The $L_{pix}$ loss encourages the generation of high-quality images from LR inputs by constraining the $L_2$ norm between the super-resolved and high-resolution images.
$$L_{tp} = \beta \| A_{SR} - A_{HR} \|_1 + \gamma \, KL(t_p, t_{HR}) \quad (19)$$
In Equation (19), $\beta$ and $\gamma$ denote hyperparameters, $A_{HR}$ is the attention distribution the model is expected to produce, and $A_{SR}$ is the attention distribution it actually produces. $A_{HR}, A_{SR} \in \mathbb{R}^{B \times S \times (H \times W)}$, where $B$ is the batch size and $S$ is the maximum text length; both are obtained when the multi-head attention mechanism computes the alignment between text and image features. $t_p$ and $t_{HR}$ denote the text probability sequences of the SR and HR images obtained by the text recognizer. $L_{tp}$ fine-tunes the text recognition branch by constraining the $L_1$ norm and the Kullback–Leibler divergence between the text priors recognized from the LR image and the real image. The total loss function $L$ is expressed as follows:
$$L = L_{pix} + \alpha L_{tp} \quad (20)$$
In Equation (20), $\alpha$ is a balancing parameter. During training, the loss function serves as the optimization objective, and the resulting error signal is back-propagated through the network to update all module parameters.
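The following sketch assembles the total objective of Equations (18)–(20); mean-squared error is used for the $L_2$ term, and the weights alpha, beta, and gamma are placeholders, since their actual values are not stated here.

```python
import torch
import torch.nn.functional as F

def total_loss(x_sr, x_hr, a_sr, a_hr, tp_sr, tp_hr, alpha=1.0, beta=1.0, gamma=1.0):
    """Sketch of L = L_pix + alpha * L_tp with L_tp = beta*L1(A) + gamma*KL(t_p)."""
    l_pix = F.mse_loss(x_sr, x_hr)                            # L_pix (L2 term)
    l_att = F.l1_loss(a_sr, a_hr)                             # attention alignment (L1)
    l_kl = F.kl_div(tp_sr.log_softmax(dim=-1),                # KL(t_p, t_HR)
                    tp_hr.softmax(dim=-1), reduction="batchmean")
    return l_pix + alpha * (beta * l_att + gamma * l_kl)      # Equation (20)
```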

4. Experimental Results and Discussion

In this study, DBFA, as the main cross-attention-based component, captures deep interactions between global and local feature semantics for visual reconstruction. In addition, DIFE dynamically adjusts the multi-scale feature weights according to the different structures, shapes, and distributions of the inputs, and the CCB adaptively mitigates the color distortion caused by background interference. After the experimental configuration is presented in Section 4.1, the effectiveness of these proposed components is verified by ablation experiments in Section 4.2. In Section 4.3, we further evaluate the model's performance on TextZoom through comparative tests and demonstrate its robustness on the STR datasets.

4.1. Dataset and Experimental Details

TextZoom: This STISR dataset is derived from two state-of-the-art SISR datasets, RealSR and SRRAW, captured by a multifocal camera in the field. These datasets are more realistic and challenging than synthetic data. TextZoom uses a field-of-view matching and transforming approach to initially align the images with different focal lengths and crop the text images with the same-sized text box to obtain textual images with different resolutions. Textual images with larger focal lengths serve as HR images, while those with smaller focal lengths serve as LR images. Smaller focal lengths result in blurrier image details. According to stratified random sampling based on scene types, 17,367 pairs of LR–HR training sets and 4373 pairs of test sets are included in TextZoom, with text annotations, border types, and raw focal lengths. The test set is divided into three subsets: 1619 samples for the easy subset, 1411 for the medium subset, and 1343 for the hard subset.
STR dataset: To evaluate effectiveness across different data distributions, five English STR datasets—ICDAR2013 [46], ICDAR2015 [47], SVT [48], SVTP [49], and CUTE80 [50]—are used. Specifically, ICDAR2013 contains 1015 test samples, while ICDAR2015 contains 2077 samples. The text in these images may appear in diverse scenes, exhibiting issues such as distortion, occlusion, and non-uniform illumination. SVT contains 350 test samples with significant scale variations and complex backgrounds. The text may be bent or distorted, and the varying lighting conditions increase recognition difficulty. SVTP contains 645 test samples. Although the data are synthetic, the text images are high-quality but lack the complexity of real scenes. CUTE80 contains 288 text images with characters arranged along curved paths, forming curved text lines. These images have high resolution and quality but no LR counterparts. In preprocessing, images with resolutions smaller than 16 × 64 are selected and degraded with Real-ESRGAN [51] to test the robustness of the model.
The STISR benchmarks are implemented using the PyTorch 1.13.1 framework, and all experiments are performed on a single RTX 4090 GPU. The number of MSA mechanisms in SCBD is set to four and the number of SRBs is set to five. The model is optimized using Adam [52] with an input batch size of 64, an input width of 64, and an input height of 16. The number of training epochs is set to 500 and the learning rate to 0.001, yielding outputs with a width of 128 and a height of 32. Three text recognizers, ASTER, CRNN, and MORAN, are applied to assess recognition accuracy. To assess the quality of image reconstruction, we adopt the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) [53] as reference metrics.
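For reference, a hedged sketch of the reported training setup is given below; `model`, `total_loss`, and `train_loader` are placeholders, and the loop omits validation, logging, and the text recognition branch.

```python
import torch

def train(model, total_loss, train_loader, epochs=500, lr=1e-3, device="cuda"):
    """Training loop matching the reported setup: Adam, lr = 0.001, 500 epochs,
    batch size 64, LR inputs of 16x64 and HR targets of 32x128."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for lr_img, hr_img in train_loader:      # lr_img: (64, 3, 16, 64), hr_img: (64, 3, 32, 128)
            sr_img = model(lr_img.to(device))
            loss = total_loss(sr_img, hr_img.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```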

4.2. Ablation Experiment

In this section, the effectiveness of the components DIFE, DBFA, and CCB are validated on TextZoom, with CRNN applied as a text recognizer.

4.2.1. The Role of Dual-Branch Feature Aggregation

DBFA is designed to accurately interpret the semantic information $t_p$ and its corresponding positions in the image features $X_I$, and to align the textual prior with the image features. It is compared with other textual prior interpreters: firstly, $t_p$ is fused with $X_I$ using a deconvolution block; secondly, $t_p$ and $X_I$ are aligned using the SFT layer [54], which obtains parameter pairs from a few textual prior conditions and then applies an affine transformation to adaptively fuse each intermediate feature map; finally, TPI is used to compute the correlation between the textual prior and the image features to guide the SR textual reconstruction. To verify the effectiveness of DBFA, local text feature aggregation (LTFA) is also tested separately as the textual prior interpreter.
Table 1 shows that the DBFA component achieves the highest recognition accuracy and average accuracy across the easy, medium, and hard subsets. Additionally, the PSNR and SSIM metrics are optimal, demonstrating excellent SR performance. As shown by $f_D$ (50.6%) and $f_{SFT}$ (49.2%), although textual priors guide the reconstruction, they are not accurately assigned to the corresponding positions in the image space, leading to the underutilization of prior information. Notably, TPI (52.8%) aligns image features with the text regions effectively but fails to distinguish the foreground and background clearly, and the attention mechanism does not sufficiently allocate weights to the text regions. Using only local textual feature integration (53.1%) ignores the background's influence and reduces the visual quality. In contrast, DBFA focuses on both the foreground and background, accurately aligning text priors with image features. This improves the recognition accuracy to 53.6%, PSNR to 21.84, and SSIM to 0.7997, validating its effective super-resolution reconstruction performance. The visual comparison in Figure 6 shows that the text image obtained by the proposed method has the best visual quality.

4.2.2. The Role of Dynamic Inception Feature Extraction

In this study, DIFE is mainly applied to multi-scale feature extraction, similar to the work in [32]. Small convolutional kernels (e.g., 1 × 1, 3 × 3) capture fine details such as edges and textures, making them suitable for extracting low-level features. Medium convolutional kernels (e.g., 5 × 5, 7 × 7) have a moderate receptive field, capturing both local and partial global information, and are suitable for extracting medium-level features such as object shapes or silhouettes. Large convolutional kernels (e.g., 9 × 9) have a large receptive field, making them suitable for extracting high-level features such as the overall shape of objects or scene context. The dynamic weighting strategy adaptively adjusts the proportion of multi-scale feature information extracted from different convolutional kernels according to different input distributions. In this experiment, we analyze the relationship between convolutional kernel sizes and the inclusion of the dynamic weighting strategy, evaluating their impact on feature extraction and recognition accuracy, as shown in Table 2.
In this table, the combination of multi-scale convolutional kernels demonstrates significant advantages, outperforming single-convolutional-layer feature extraction. Small convolution kernels ignore contextual information, which reduces accuracy, while large convolution kernels do not capture enough detail. The experimental results indicate that DIFE achieves the best recognition accuracy (53.6%) with the eighth set of convolution kernel sizes. The visualization in Figure 7 shows that the module highlights the character region, with the best retention of global and detailed features.

4.2.3. Validity of the CCB Module

To evaluate the effectiveness of the CCB in color correction, seven models—TPGSR, TATT, C3-STISR, MNTSR, LEMMA, PEAN, and SCE-STISR—were selected to compare results with and without the CCB module. Table 3 shows that all models improved their text recognition accuracy or SR quality, indicating that this color correction module effectively improves image reconstruction quality. Specifically, with the CCB, our model SCE-STISR improves the recognition accuracy on the medium-difficulty subset from 53.9% to 55.3% and the average recognition accuracy from 53.0% to 53.6%, while PSNR and SSIM improve from 21.43 and 0.7982 to 21.84 and 0.7997, respectively, demonstrating a significant performance improvement. Other models such as LEMMA and PEAN also show some improvement in PSNR and SSIM after using the CCB, although the improvement in recognition accuracy is smaller. We also provide qualitative comparisons between the baseline models and the SR image visualizations after adding the CCB. As shown in Figure 8, existing methods suffer from color defects, while our method constrains the images to maintain color consistency, resulting in higher image and visual quality.

4.2.4. Effectiveness and Efficiency of Different Components

As shown in Table 4, the impact of different module combinations on model performance is evaluated. The experiment shows that the model's recognition accuracy is optimal (53.6%) when the three modules—DBFA, DIFE, and CCB—are integrated simultaneously. Although adding these modules results in a slight decrease in inference speed (↓10%) and an increase in the number of parameters (↑66.7%), the recognition accuracy is significantly improved (↑1.5%). Given the compact input size of the STISR model, the existing architecture already excels in inference speed and parameter efficiency, so a moderate increase in complexity to improve SR performance is justified.

4.3. Comparison with State-of-the-Art Results

4.3.1. TextZoom Quantitative Research

To validate the model, the proposed SCE-STISR was compared with SISR methods (e.g., SRCNN, SRResNet, EDSR, RCAN, CARN, and HAN) and STISR models (e.g., TSRN, TBSRN, PCAN, TPGSR, Text Gestalt, TATT, C3-STISR, MNTSR, PerMR, TEAN, DPMN, LEMMA, PEAN, and TCDM).
Table 5 shows the recognition accuracy metrics based on the CRNN, MORAN, and ASTER recognizers on the TextZoom test set. Since SISR methods are designed for generalized SR, they do not account for the unique character structures or textual information in text images. The results show that the recognition accuracy of SISR methods is generally low. Most of the other STISR methods use a single convolution layer, which is unable to take into account both character details and textual context and treats the foreground and background equally. Compared with the STISR baseline TATT, we improve the average recognition accuracy of CRNN, MORAN, and ASTER by 1%, 1.4%, and 0.9%, respectively, demonstrating the effectiveness of the method. Although the TCDM and PEAN models demonstrate excellent performance in terms of average recognition accuracy for the CRNN, MORAN, and ASTER recognizers, the experimental details mentioned in the literature—such as batch size, graphics card performance, and other parameters—reveal that the characteristics of diffusion models result in high computational complexity, time-consuming training, and slow inference. This makes it challenging to run these models in real time on resource-constrained devices. Additionally, the multi-step iterative generation process of diffusion models imposes strict requirements on the quality and scale of training data, which inevitably increases training costs and the risk of data dependency. These factors collectively limit the deployment and widespread adoption of TCDM and PEAN in practical application scenarios.
The results for SSIM and PSNR, which are common evaluation metrics for SR, are shown in Table 6. By introducing the CCB (color correction block), our method constrains the super-resolved image to closely match the low-resolution image in the color channels, thereby ensuring the consistency of the output image with the input image in terms of statistical features. This design not only significantly improves the visual quality of the image but also enhances structural similarity, addressing the color drift and low reconstructed visual quality caused by the neglect of overall image quality in previous methods. Compared to existing methods, our model achieves the best PSNR on the hard subset (20.78) and the easy subset (24.99), as well as the best average PSNR (21.84). Additionally, it achieves the best SSIM on the medium subset (0.6955) and the hard subset (0.7859), as well as the best average SSIM (0.7951), which fully validates the effectiveness of the model in enhancing image quality. These results demonstrate that our approach has significant advantages in improving image structural similarity and color consistency. Although the character recognition accuracy is not yet optimal, our superior performance in terms of the PSNR and SSIM metrics proves the success of the model design in improving overall image quality. This provides higher-quality inputs for downstream text recognition tasks and offers important optimization directions for subsequent research.
To evaluate the statistical significance of the model outputs, five existing benchmarks were tested. Table 7 demonstrates the average accuracy, PSNR, and SSIM metrics for each method on the test set, along with their corresponding p-values. The results show that the proposed model outperforms the existing benchmarks in most of the metrics and exhibits more stable performance (smaller variance) over multiple training sessions. The significance test results (p-values) further confirm that our method has a statistically significant improvement compared to existing methods. Based on these quantitative results, the SCE-STISR model significantly outperforms other benchmarks in terms of generation quality.

4.3.2. TextZoom Qualitative Research

As shown in Figure 9, to further validate the model’s effectiveness, the visualization results of image reconstruction are presented under different configurations, including text lengths, backgrounds, colors, and text recognition results.
The observations show that all methods outperform bicubic interpolation, but there is a significant difference in visual quality between SISR and STISR. The text-specific model SCE-STISR achieves the best visualization results, which are closest to the HR images. As shown in the first and second columns of the figure, although most methods can accurately recognize the characters, issues such as blurred character structures and bent strokes may occur during text SR reconstruction. SCE-STISR accurately assigns semantic information to the corresponding image features, reconstructs the images correctly, and achieves satisfactory visual results. The model also performs well in dealing with densely connected characters and the influence of the background on foreground restoration, such as in 'recycled' and 'artificial', producing reconstruction results that are clear and closest to the high-resolution images, whereas the other methods show character misconstruction. In previous methods, the color of the characters or the background is easily inconsistent with the colors of the LR and HR images, resulting in large errors, as in 'bucket' and 'caramel', while our method effectively corrects the color drift and significantly improves the fidelity of the images.

4.3.3. Quantitative Research of Text Recognition Datasets

To further verify its generalization ability, the proposed model is tested on five STR datasets, ICDAR2013, ICDAR2015, SVT, SVTP, and CUTE80, using the parameters trained on TextZoom. These datasets are primarily derived from real-world scenes and contain text of varying lengths and complex backgrounds. LR images are selected to form the test set, and due to the high quality of most of the images, the Real-ESRGAN second-order degradation method is used to degrade the quality. As shown in Table 8, SCE-STISR aligns text priors with images more accurately, provides precise guidance, and achieves better results.

4.3.4. Research on Densely Connected Datasets

Since our model aims to mitigate the effects of complex background interference and dense text connectivity, this section compares the text recognition accuracies and image qualities of different methods on datasets featuring complex backgrounds and densely connected text. The experiments are conducted on 650 text images with complex backgrounds or densely connected characters, selected from TextZoom, ICDAR2013, ICDAR2015, SVT, SVTP, CUTE80, and COCO-Text, all with a resolution lower than 16 × 64. As shown in Table 9, our model achieves a recognition accuracy of 42.7% with the CRNN recognizer, outperforming MNTSR (38.9%), DPMN (35.4%), LEMMA (40.8%), and PEAN (39.5%). With the MORAN and ASTER recognizers, SCE-STISR achieves recognition accuracies of 54.1% and 57.3%, respectively, surpassing all other models. This performance is primarily attributed to the DBFA module, which effectively mitigates interference between complex backgrounds and adjacent characters, improves the accuracy of character separation and localization, and enhances image quality by optimizing the color distribution. These innovations enable SCE-STISR to exhibit enhanced robustness in handling densely connected text and complex backgrounds. The visualization results in Figure 10 further demonstrate the advantages of our model in recovering image details.

4.3.5. Robustness Test

This section evaluates the model's robustness to common image degradations. Experiments are conducted on 500 randomly selected samples from the STR datasets, with real scenarios simulated through two levels of degradation: first, a Gaussian blur kernel is applied (kernel widths of 1, 3, and 5, respectively), followed by additive Gaussian noise (variance values set to 10, 30, and 50). CRNN is used as the text recognizer to compare the baseline TATT with related methods. As shown in Table 10, the performance advantage becomes more pronounced as blur and noise intensity increase, particularly under extreme conditions (kernel width of 5, noise variance of 50), where the recognition accuracy improvements remain significant. These results show that the proposed model generalizes efficiently to out-of-distribution datasets while exhibiting stronger robustness to unknown degradation types.
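The two-step degradation used here can be sketched as follows; interpreting the reported noise values as noise levels on a 0–255 scale and relying on torchvision's default blur sigma are assumptions made for illustration.

```python
import torch
import torchvision.transforms.functional as TF

def degrade(img, kernel_width, noise_level):
    """Sketch of the robustness-test degradation: Gaussian blur (kernel widths
    1/3/5), then additive Gaussian noise (levels 10/30/50)."""
    blurred = TF.gaussian_blur(img, kernel_size=kernel_width) if kernel_width > 1 else img
    noise = torch.randn_like(blurred) * (noise_level / 255.0)   # img assumed in [0, 1]
    return (blurred + noise).clamp(0.0, 1.0)

# lr_img = torch.rand(1, 3, 16, 64)
# hard_case = degrade(lr_img, kernel_width=5, noise_level=50)
```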

4.3.6. Discussion

Although the proposed SCE-STISR architecture can efficiently generate English characters and numbers, a few limitations remain. Firstly, its recognition accuracy on the SVT dataset is only 15.1%. This dataset is characterized by drastic changes in sample scale, curved text morphology, and curved character arrangement, which pose a serious challenge to the recognition ability of the model. Our model may also focus too much on the extraction of local character features, thus ignoring the spatial relationships between characters to some extent. In addition, the model is highly dependent on the distribution of the training data. As shown in Figure 11, when encountering languages, characters, or symbols that are not present in the training data, the recognizer may misrecognize them, producing confusing results. Furthermore, the introduction of DBFA increases computational complexity and reduces inference speed. Although the feature extraction mechanism of the character recognizer differs somewhat from the design goal of our module, so that the recognition accuracy does not surpass some state-of-the-art algorithms, this does not diminish the innovation and effectiveness of our approach. Our core goal is to mitigate the image quality problems associated with background interference, character tightness, and color drift. Future work may focus on designing extended architectures that do not require textual prior guidance and exploring ways to balance reconstruction quality with inference speed.

5. Conclusions

In this study, we propose single-character embedding feature aggregation based on cross-attention for scene text super-resolution. To address the limitation of single convolution layers in capturing only single-scale features, we employ DIFE for multi-scale feature extraction and dynamically adjust the feature maps across different receptive fields based on input variations. In addition, we use DBFA, which separates the textual part from the background part at the single-character level and then uses the semantic prior to guide the image features. Finally, a CCB is implemented to alleviate the color drift problem in the reconstruction process by normalizing the color channels. Ablation experiments are conducted for DIFE, DBFA, and the CCB to validate the effectiveness of each model component. To further demonstrate the superiority of this study, we conducted qualitative and quantitative comparison experiments on TextZoom against 16 benchmarks, as well as quantitative evaluations on five STR datasets and densely connected text datasets. Without a significant increase in the number of model parameters, our approach achieves superior results in both text recognition accuracy and image quality assessment metrics.

Author Contributions

M.W., H.L. and Q.L. conceived and designed the experiments; Q.L. performed the experiments and analyzed the data; M.W. provided materials and analysis tools; M.W. and Q.L. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (62062048) and the Yunnan Provincial Science and Technology Plan Project (202201AT070113). This work is also supported by the Faculty of Information Engineering and Automation, Kunming University of Science and Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, B.; Chen, K.; Peng, S.-L.; Zhao, M. Adaptive Aggregate Stereo Matching Network with Depth Map Super-Resolution. Sensors 2022, 22, 4548. [Google Scholar] [CrossRef] [PubMed]
  2. Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304. [Google Scholar] [PubMed]
  3. Luo, C.; Jin, L.; Sun, Z. Moran: A multi-object rectified attention network for scene text recognition. Pattern Recognit. 2019, 90, 109–118. [Google Scholar]
  4. Sheng, F.; Chen, Z.; Mei, T.; Xu, B. A single-shot oriented scene text detector with learnable anchors. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1516–1521. [Google Scholar]
  5. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y.; Xue, X. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122. [Google Scholar]
  6. Wang, W.; Xie, E.; Sun, P.; Wang, W.; Tian, L.; Shen, C.; Luo, P. Textsr: Content-aware text super-resolution guided by recognition. arXiv 2019, arXiv:1909.07113. [Google Scholar]
  7. Wang, W.; Xie, E.; Liu, X.; Wang, W.; Liang, D.; Shen, C.; Bai, X. Scene text image super-resolution in the wild. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X 16, 2020. pp. 650–666. [Google Scholar]
  8. Chen, J.; Li, B.; Xue, X. Scene text telescope: Text-focused scene image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 12026–12035. [Google Scholar]
  9. Ma, J.; Guo, S.; Zhang, L. Text prior guided scene text image super-resolution. IEEE Trans. Image Process. 2023, 32, 1341–1353. [Google Scholar]
  10. Ma, J.; Liang, Z.; Zhang, L. A text attention network for spatial deformation robust scene text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5911–5920. [Google Scholar]
  11. Chen, J.; Yu, H.; Ma, J.; Li, B.; Xue, X. Text gestalt: Stroke-aware scene text image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 285–293. [Google Scholar]
  12. Guo, H.; Dai, T.; Meng, G.; Xia, S.-T. Towards robust scene text image super-resolution via explicit location enhancement. arXiv 2023, arXiv:2307.09749. [Google Scholar]
  13. Guo, K.; Zhu, X.; Schaefer, G.; Ding, R.; Fang, H. Self-supervised memory learning for scene text image super-resolution. Expert Syst. Appl. 2024, 258, 125247. [Google Scholar]
  14. Shi, Q.; Zhu, Y.; Liu, Y.; Ye, J.; Yang, D. Perceiving Multiple Representations for scene text image super-resolution guided by text recognizer. Eng. Appl. Artif. Intell. 2023, 124, 106551. [Google Scholar]
  15. TomyEnrique, L.; Du, X.; Liu, K.; Yuan, H.; Zhou, Z.; Jin, C. Efficient scene text image super-resolution with semantic guidance. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 3160–3164. [Google Scholar]
  16. Zhang, X.-G. A new kind of super-resolution reconstruction algorithm based on the ICM and the bilinear interpolation. In Proceedings of the 2008 International Seminar on Future BioMedical Information Engineering, Wuhan, China, 18–20 December 2008; pp. 183–186. [Google Scholar]
  17. Akhtar, P.; Azhar, F. A single image interpolation scheme for enhanced super resolution in bio-medical imaging. In Proceedings of the 2010 4th International Conference on Bioinformatics and Biomedical Engineering, Chengdu, China, 18–20 June 2010; pp. 1–5. [Google Scholar]
  18. Badran, Y.K.; Salama, G.I.; Mahmoud, T.A.; Mousa, A.; Moussa, A. Single Image Super Resolution Using Discrete Cosine Transform Driven Regression Tree. In Proceedings of the 2020 37th National Radio Science Conference (NRSC), Cairo, Egypt, 8–10 September 2020; pp. 128–136. [Google Scholar]
  19. Park, S.C.; Park, M.K.; Kang, M.G. Super-resolution image reconstruction: A technical overview. IEEE Signal Process. Mag. 2003, 20, 21–36. [Google Scholar]
  20. Faramarzi, A.; Ahmadyfard, A.; Khosravi, H. Adaptive image super-resolution algorithm based on fractional Fourier transform. Image Anal. Stereol. 2022, 41, 133–144. [Google Scholar]
  21. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution via sparse representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [PubMed]
  22. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar]
  23. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  24. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  25. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  26. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  27. Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 2022, 479, 47–59. [Google Scholar]
  28. Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Aster: An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048. [Google Scholar]
  29. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV 13, 2014. pp. 184–199. [Google Scholar]
  30. Mou, Y.; Tan, L.; Yang, H.; Chen, J.; Liu, L.; Yan, R.; Huang, Y. Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XV 16, 2020. pp. 158–174. [Google Scholar]
  31. Zhao, M.; Wang, M.; Bai, F.; Li, B.; Wang, J.; Zhou, S. C3-stisr: Scene text image super-resolution with triple clues. arXiv 2022, arXiv:2204.14044. [Google Scholar]
  32. Qin, R.; Wang, B. Scene text image super-resolution via content perceptual loss and criss-cross transformer blocks. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–10. [Google Scholar]
  33. Liu, B.; Yang, Z.; Wang, P.; Zhou, J.; Liu, Z.; Song, Z.; Liu, Y.; Xiong, Y. Textdiff: Mask-guided residual diffusion models for scene text image super-resolution. arXiv 2023, arXiv:2308.06743. [Google Scholar]
  34. Zhou, Y.; Gao, L.; Tang, Z.; Wei, B. Recognition-guided diffusion model for scene text image super-resolution. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 2940–2944. [Google Scholar]
  35. Noguchi, C.; Fukuda, S.; Yamanaka, M. Scene text image super-resolution based on text-conditional diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1485–1495. [Google Scholar]
  36. Zhao, Z.; Xue, H.; Fang, P.; Zhu, S. Pean: A diffusion-based prior-enhanced attention network for scene text image super-resolution. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 9769–9778. [Google Scholar]
  37. Zhao, M.; Xu, Y.; Li, B.; Wang, J.; Guan, J.; Zhou, S. HiREN: Towards higher supervision quality for better scene text image super-resolution. Neurocomputing 2025, 623, 129309. [Google Scholar]
  38. Zhao, C.; Feng, S.; Zhao, B.N.; Ding, Z.; Wu, J.; Shen, F.; Shen, H.T. Scene text image super-resolution via parallelly contextual attention network. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 2908–2917. [Google Scholar]
  39. Zhu, S.; Zhao, Z.; Fang, P.; Xue, H. Improving scene text image super-resolution via dual prior modulation network. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 3843–3851. [Google Scholar]
  40. Fu, M.; Man, X.; Xu, Y.; Shao, J. ESTISR: Adapting efficient scene text image super-resolution for real-scenes. arXiv 2023, arXiv:2306.02443. [Google Scholar]
  41. Zhang, W.; Deng, X.; Jia, B.; Yu, X.; Chen, Y.; Ma, J.; Ding, Q.; Zhang, X. Pixel adapter: A graph-based post-processing approach for scene text image super-resolution. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 2168–2179. [Google Scholar]
  42. Ma, J.; Liang, Z.; Xiang, W.; Yang, X.; Zhang, L. A benchmark for Chinese-English scene text image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 19452–19461. [Google Scholar]
  43. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  44. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4-9 December 2017. [Google Scholar]
  45. Li, X.; Zuo, W.; Loy, C.C. Learning generative structure prior for blind text image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10103–10113. [Google Scholar]
  46. Stamatopoulos, N.; Gatos, B.; Louloudis, G.; Pal, U.; Alaei, A. ICDAR 2013 handwriting segmentation contest. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1402–1406. [Google Scholar]
  47. Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
  48. Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464. [Google Scholar]
  49. Phan, T.Q.; Shivakumara, P.; Tian, S.; Tan, C.L. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 569–576. [Google Scholar]
  50. Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert Syst. Appl. 2014, 41, 8027–8048. [Google Scholar]
  51. Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 1905–1914. [Google Scholar]
  52. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  53. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  54. Wang, X.; Yu, K.; Dong, C.; Loy, C.C. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 606–615. [Google Scholar]
Figure 1. Super-resolution visualization, recognition results, and losses of different methods. (a) Text-dense regions cause semantic errors in character reconstruction. (b) Cluttered backgrounds hinder recognition and complicate super-resolution reconstruction. (c) Color drift during reconstruction degrades visual quality.
Figure 2. The overall pipeline of the proposed SCE-STISR. LR images are the input and SR images are the output. Preprocessing comprises the text recognizer (TR), single-character boundary detection (SCBD), and dynamic inception feature extraction (DIFE). Dual-branch feature aggregation (DBFA) then outputs aggregated features for prior-guided super-resolution reconstruction.
Figure 3. The architecture of DIFE. The red part indicates the multi-scale convolutional kernels used for shallow-feature extraction. The blue part indicates the computation of the kernel weights corresponding to each input sample.
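For readers who prefer code to diagrams, the following is a minimal PyTorch sketch of the dynamically weighted multi-scale extraction depicted in Figure 3. The branch kernel sizes, channel width, and the pooling-based gating layer are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a dynamically weighted multi-scale (Inception-style)
# feature extractor in the spirit of DIFE. Kernel sizes, channel widths,
# and the gating design are assumptions for illustration only.
import torch
import torch.nn as nn

class DynamicInceptionBlock(nn.Module):
    def __init__(self, in_ch=3, out_ch=64, kernel_sizes=(1, 5, 7, 9)):
        super().__init__()
        # One convolutional branch per kernel size (padding keeps spatial size).
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        )
        # Per-sample branch weights predicted from globally pooled statistics.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_ch, len(kernel_sizes)),
            nn.Softmax(dim=1),
        )

    def forward(self, x):
        w = self.gate(x)                                            # (B, K)
        feats = torch.stack([b(x) for b in self.branches], dim=1)   # (B, K, C, H, W)
        return (w[:, :, None, None, None] * feats).sum(dim=1)       # weighted sum

if __name__ == "__main__":
    lr = torch.randn(2, 3, 16, 64)                # TextZoom-style LR crop
    print(DynamicInceptionBlock()(lr).shape)      # torch.Size([2, 64, 16, 64])
```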
Figure 4. The SCBD architecture consists of two transformer encoders that take shallow features as input and output the center coordinate and width of each character.
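A minimal sketch of the boundary-regression idea in Figure 4 is given below, assuming a fixed number of character slots and a sigmoid-normalized (center, width) output; the actual SCBD layer sizes and query design are assumptions here.

```python
# Sketch of single-character boundary detection (SCBD): transformer encoding of
# shallow features, then per-character regression of a normalized center
# coordinate and width. Slot count and head design are illustrative assumptions.
import torch
import torch.nn as nn

class SCBDHead(nn.Module):
    def __init__(self, feat_dim=64, max_chars=26, num_layers=2, nhead=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.queries = nn.Parameter(torch.randn(max_chars, feat_dim))
        self.head = nn.Linear(feat_dim, 2)      # -> (center_x, width), normalized

    def forward(self, feats):                   # feats: (B, C, H, W) shallow features
        b, c, h, w = feats.shape
        tokens = self.encoder(feats.flatten(2).transpose(1, 2))   # (B, H*W, C)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # one query per slot
        attn = torch.softmax(q @ tokens.transpose(1, 2), dim=-1)  # (B, slots, H*W)
        char_feats = attn @ tokens                                # (B, slots, C)
        return torch.sigmoid(self.head(char_feats))               # (B, slots, 2)

if __name__ == "__main__":
    boxes = SCBDHead()(torch.randn(2, 64, 16, 64))
    print(boxes.shape)                          # torch.Size([2, 26, 2])
```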
Figure 5. The architecture of DBFA. The process runs sequentially from left to right and from top to bottom. The left half takes the textual prior as input; its output serves as an input to the cross-attention on the right, which guides global and local text–image feature aggregation.
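The cross-attention flow in Figure 5 can be sketched as follows, with the textual prior used as the query against global image features and pooled single-character features; the dimensions and the concatenation-based fusion are assumptions for illustration.

```python
# Sketch of dual-level cross-attention in the spirit of DBFA: the textual prior
# attends to global image features and to cropped single-character (local)
# features, and the two results are fused. Fusion by concat+linear is assumed.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim=64, nhead=4):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, nhead, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_prior, global_feats, local_feats):
        # text_prior: (B, L, D) recognizer prior; global_feats: (B, H*W, D);
        # local_feats: (B, N_chars, D) pooled single-character crops.
        g, _ = self.global_attn(text_prior, global_feats, global_feats)
        l, _ = self.local_attn(text_prior, local_feats, local_feats)
        return self.fuse(torch.cat([g, l], dim=-1))        # (B, L, D)

if __name__ == "__main__":
    out = DualCrossAttention()(torch.randn(2, 26, 64),
                               torch.randn(2, 16 * 64, 64),
                               torch.randn(2, 26, 64))
    print(out.shape)                                       # torch.Size([2, 26, 64])
```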
Figure 6. Visualization of super-resolution reconstructed images using different decoders.
Figure 7. Text image shallow-feature extraction map. Numbers 1–8 correspond to the eight sets of comparative experimental feature visualization results in Table 2, respectively.
Figure 8. Visualization of super-resolution results with and without CCB incorporation for different models.
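As a rough illustration of what a color correction step can do (Figure 8), the sketch below re-normalizes the per-channel statistics of the SR output to match those of the LR input; this is a generic stand-in and not necessarily the paper's CCB design.

```python
# Illustrative channel-wise color correction: the SR output's per-channel
# mean/std are matched to the LR input's statistics. Generic stand-in only,
# not the paper's CCB implementation.
import torch

def color_correct(sr, lr, eps=1e-6):
    # sr: (B, 3, H, W) super-resolved image in [0, 1]; lr: (B, 3, h, w) LR input.
    sr_mean, sr_std = sr.mean(dim=(2, 3), keepdim=True), sr.std(dim=(2, 3), keepdim=True)
    lr_mean, lr_std = lr.mean(dim=(2, 3), keepdim=True), lr.std(dim=(2, 3), keepdim=True)
    corrected = (sr - sr_mean) / (sr_std + eps) * lr_std + lr_mean
    return corrected.clamp(0.0, 1.0)

if __name__ == "__main__":
    sr = torch.rand(2, 3, 32, 128)
    lr = torch.rand(2, 3, 16, 64)
    print(color_correct(sr, lr).shape)          # torch.Size([2, 3, 32, 128])
```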
Figure 9. Visualization of reconstruction results on the TextZoom dataset.
Figure 10. The left side of the figure shows the SR results for text-dense connected images, and the right side shows the SR results for complex background text images.
Figure 11. Visualization of text recognition results for unknown characters in the text label character set.
Table 1. Comparison of several text–image feature aggregation modules, where f_D denotes the deconvolution operation, f_SFT is the SFT-layer fusion operation, and TPI is the TP interpreter in TATT. LTFA indicates that only local features are used in DBFA to guide cross-attention. ↑ indicates that higher is better, and boldface indicates the optimal result.

Fusion Strategy | Easy | Medium | Hard | avgAcc ↑ | PSNR ↑ | SSIM ↑
w/o DBFA | 51.2% | 41.9% | 31.7% | 41.6% | 21.02 | 0.7690
f_D | 61.8% | 52.1% | 37.9% | 50.6% | 21.10 | 0.7819
f_SFT | 60.3% | 50.4% | 36.9% | 49.2% | 20.87 | 0.7783
TPI | 62.9% | 53.5% | 39.8% | 52.8% | 21.52 | 0.7930
LTFA | 63.1% | 53.8% | 39.8% | 53.1% | 21.43 | 0.7954
DBFA | 63.5% | 55.3% | 39.9% | 53.6% | 21.84 | 0.7997
Table 2. Effect of different convolutional kernel sizes on recognition accuracy on the TextZoom dataset. Here, "dynamic" indicates that the dynamic weighting mechanism is applied.

No. | DIFE Parameter | Easy | Medium | Hard | avgAcc
1 | 9 × 9 | 62.8% | 53.6% | 38.7% | 52.6%
2 | 1 × 1, 1 × 1 + 5 × 5 | 62.4% | 52.1% | 38.6% | 52.5%
3 | 1 × 1, 1 × 1 + 7 × 7 | 63.2% | 53.7% | 38.9% | 52.7%
4 | 1 × 1, 1 × 1 + 9 × 9 | 63.4% | 53.9% | 39.1% | 52.9%
5 | 1 × 1, 1 × 1 + 3 × 3, 7 × 7 + 1 × 1 | 62.9% | 53.5% | 39.5% | 53.4%
6 | 1 × 1, 1 × 1 + 9 × 9, 5 × 5 + 1 × 1 | 63.6% | 54.6% | 39.7% | 53.2%
7 | 1 × 1, 1 × 1 + 5 × 5, 1 × 1 + 7 × 7, 1 × 1 + 9 × 9, 3 × 3 + 1 × 1 | 63.8% | 54.8% | 39.8% | 53.4%
8 | 1 × 1, 1 × 1 + 5 × 5, 1 × 1 + 7 × 7, 1 × 1 + 9 × 9, 3 × 3 + 1 × 1 (dynamic) | 63.5% | 55.3% | 39.9% | 53.6%
Table 3. Effect of adding the CCB to different methods on recognition accuracy on the TextZoom dataset. Here, √ indicates that the CCB is added, × indicates that it is not, and boldface indicates the optimal result.

Approach | CCB | Easy | Medium | Hard | avgAcc | PSNR | SSIM
TPGSR | × | 61.0% | 49.9% | 36.7% | 49.8% | 21.02 | 0.7690
TPGSR | √ | 62.1% | 51.6% | 36.7% | 50.4% | 21.32 | 0.7705
TATT | × | 62.6% | 53.4% | 39.8% | 52.6% | 21.52 | 0.7930
TATT | √ | 62.4% | 54.4% | 39.6% | 52.7% | 20.95 | 0.7951
C3-STISR | × | 65.2% | 53.6% | 39.8% | 53.7% | 21.51 | 0.7721
C3-STISR | √ | 65.1% | 54.0% | 39.6% | 53.8% | 21.37 | 0.7853
MNTSR | × | 64.3% | 54.5% | 38.7% | 53.3% | 21.53 | 0.7946
MNTSR | √ | 64.0% | 54.8% | 38.9% | 53.2% | 21.67 | 0.7964
SCE-STISR | × | 63.3% | 53.9% | 39.8% | 53.0% | 21.43 | 0.7982
SCE-STISR | √ | 63.5% | 55.3% | 39.9% | 53.6% | 21.84 | 0.7997
LEMMA | × | 67.1% | 58.8% | 40.6% | 56.3% | 21.43 | 0.7543
LEMMA | √ | 67.2% | 58.6% | 40.8% | 56.4% | 21.59 | 0.7623
PEAN | × | 68.9% | 60.2% | 45.9% | 59.0% | 21.57 | 0.7946
PEAN | √ | 68.8% | 60.3% | 46.0% | 59.1% | 21.78 | 0.8017
Table 4. Effect of combining different modules. TPI is used instead when the DBFA module is not included. CRNN is used to recognize the text in SR images. Frames per second (FPS) measures the speed of model inference, and Parameters measures model size. √ indicates that the module is included, and boldface indicates the optimal result.

DBFA | DIFE | CCB | Easy | Medium | Hard | avgAcc | FPS | Parameters
- | - | - | 62.26% | 52.73% | 39.09% | 52.1% | 443 | 0.5 M
√ | - | - | 62.94% | 52.73% | 39.46% | 52.4% | 405 | 0.5 M
√ | √ | - | 63.25% | 53.93% | 39.76% | 53.0% | 395 | 0.9 M
√ | √ | √ | 63.53% | 55.31% | 39.95% | 53.6% | 395 | 1.2 M
Table 5. Quantitative comparison of SCE-STISR and previous state-of-the-art methods using three recognizers: CRNN, MORAN, and ASTER. The higher the recognition accuracy, the better the text super-resolution. Boldface indicates the optimal result, underline (_) indicates the sub-optimal result, and Avg denotes the average recognition accuracy. All values are in %.

Method | CRNN Easy | CRNN Medium | CRNN Hard | CRNN Avg | MORAN Easy | MORAN Medium | MORAN Hard | MORAN Avg | ASTER Easy | ASTER Medium | ASTER Hard | ASTER Avg
Bicubic | 36.4 | 21.1 | 21.1 | 26.8 | 60.6 | 37.9 | 30.8 | 44.1 | 67.4 | 42.4 | 31.2 | 48.2
SRCNN | 41.1 | 22.3 | 22.0 | 29.2 | 63.9 | 40.0 | 29.4 | 45.6 | 70.6 | 44.0 | 31.5 | 50.0
SRResNet | 45.2 | 32.6 | 25.5 | 35.1 | 66.0 | 47.1 | 33.4 | 49.9 | 69.4 | 50.5 | 35.7 | 53.0
EDSR | 42.7 | 29.3 | 24.1 | 32.7 | 63.6 | 45.4 | 32.2 | 48.1 | 72.3 | 48.6 | 34.3 | 53.0
RCAN | 46.8 | 27.9 | 26.5 | 34.5 | 63.1 | 42.9 | 33.6 | 47.5 | 67.3 | 46.6 | 35.1 | 50.7
CARN | 40.7 | 27.4 | 24.3 | 31.4 | 58.8 | 42.3 | 31.1 | 45.0 | 62.3 | 44.7 | 31.5 | 47.1
HAN | 51.6 | 35.8 | 29.0 | 39.6 | 67.4 | 48.5 | 35.4 | 51.5 | 71.1 | 52.8 | 39.0 | 55.3
TSRN | 52.5 | 38.3 | 31.4 | 41.4 | 70.1 | 55.3 | 37.9 | 55.4 | 75.1 | 56.3 | 40.1 | 58.3
PCAN | 59.6 | 45.4 | 34.8 | 47.4 | 73.7 | 57.6 | 41.0 | 58.5 | 77.5 | 60.7 | 43.1 | 61.5
TBSRN | 59.6 | 47.1 | 35.3 | 48.1 | 74.1 | 57.0 | 40.8 | 58.4 | 75.7 | 59.9 | 41.6 | 60.1
Gestalt | 61.2 | 47.6 | 35.5 | 48.9 | 75.8 | 57.8 | 41.4 | 59.4 | 77.9 | 60.2 | 42.4 | 61.3
TPGSR | 63.1 | 52.0 | 38.6 | 51.8 | 74.9 | 60.5 | 44.1 | 60.5 | 78.9 | 62.7 | 44.5 | 62.8
TATT | 62.6 | 53.4 | 39.8 | 52.6 | 72.5 | 60.2 | 43.1 | 59.5 | 78.9 | 63.4 | 45.4 | 63.6
C3-STISR | 65.2 | 53.6 | 39.8 | 53.7 | 74.2 | 61.0 | 43.2 | 59.5 | 79.1 | 63.3 | 46.8 | 64.1
PerMR | 65.1 | 50.4 | 37.8 | 52.0 | 76.7 | 58.9 | 42.9 | 60.6 | 80.8 | 62.9 | 45.5 | 64.2
MNTSR | 64.3 | 54.5 | 38.7 | 53.3 | 76.7 | 61.2 | 44.9 | 61.9 | 79.5 | 64.6 | 45.8 | 64.4
TEAN | 63.7 | 52.5 | 38.1 | 52.2 | 76.8 | 60.8 | 43.4 | 61.4 | 80.4 | 64.5 | 45.6 | 64.6
DPMN | 64.3 | 54.1 | 39.2 | 53.3 | 73.2 | 61.4 | 43.8 | 60.4 | 79.2 | 64.0 | 45.0 | 63.8
TCDM | 67.3 | 57.3 | 42.7 | 55.7 | 77.6 | 62.9 | 45.9 | 62.2 | 81.3 | 65.1 | 50.1 | 65.5
PEAN | 68.9 | 60.2 | 45.9 | 59.0 | 79.4 | 67.0 | 49.1 | 66.1 | 84.5 | 71.4 | 52.9 | 70.6
LEMMA | 67.1 | 58.8 | 40.6 | 56.3 | 77.7 | 64.4 | 44.6 | 63.2 | 81.1 | 66.3 | 47.4 | 66.0
SCE-STISR | 63.5 | 55.3 | 39.9 | 53.6 | 73.9 | 59.5 | 44.7 | 60.9 | 80.9 | 63.4 | 45.8 | 64.5
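The recognition accuracies in Tables 5 and 8–10 are word-level accuracies. A typical TextZoom-style evaluation, assumed here since the exact filtering is not restated in this section, counts a prediction as correct if it matches the label after lowercasing and removing non-alphanumeric characters:

```python
# Sketch of word-level recognition accuracy under an assumed TextZoom-style
# normalization (lowercase, alphanumeric only).
import re

def word_accuracy(preds, labels):
    norm = lambda s: re.sub(r"[^0-9a-z]", "", s.lower())
    correct = sum(norm(p) == norm(g) for p, g in zip(preds, labels))
    return correct / max(len(labels), 1)

if __name__ == "__main__":
    print(word_accuracy(["Hello", "w0rld"], ["hello", "world"]))   # 0.5
```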
Table 6. Quantitative comparison of PSNR and SSIM between SCE-STISR and previous state-of-the-art methods. The higher the PSNR and SSIM, the better the quality of the super-resolved text image. Underline (_) indicates the sub-optimal result and boldface indicates the optimal result.

Method | PSNR Easy | PSNR Medium | PSNR Hard | PSNR Avg | SSIM Easy | SSIM Medium | SSIM Hard | SSIM Avg
Bicubic | 22.35 | 18.98 | 19.39 | 20.35 | 0.7884 | 0.6254 | 0.6592 | 0.6961
SRCNN | 23.48 | 19.06 | 19.34 | 20.78 | 0.8379 | 0.6323 | 0.6791 | 0.7227
SRResNet | 24.36 | 18.88 | 19.29 | 21.03 | 0.8681 | 0.6406 | 0.6911 | 0.7403
EDSR | 24.26 | 18.63 | 19.14 | 20.68 | 0.8633 | 0.6440 | 0.7108 | 0.7394
RCAN | 22.15 | 18.81 | 19.83 | 20.26 | 0.8525 | 0.6465 | 0.7227 | 0.7406
CARN | 22.70 | 19.15 | 20.02 | 20.62 | 0.8384 | 0.6412 | 0.7172 | 0.7323
HAN | 23.30 | 19.02 | 20.16 | 20.95 | 0.8691 | 0.6537 | 0.7387 | 0.7596
TSRN | 25.07 | 18.86 | 19.71 | 21.42 | 0.8897 | 0.6676 | 0.7302 | 0.7690
PCAN | 24.57 | 19.14 | 20.26 | 21.49 | 0.8830 | 0.6781 | 0.7475 | 0.7752
TBSRN | 23.46 | 19.17 | 19.68 | 20.91 | 0.8729 | 0.6455 | 0.7452 | 0.7603
Gestalt | 23.95 | 18.58 | 19.74 | 20.76 | 0.8611 | 0.6621 | 0.7520 | 0.7584
TPGSR | 23.73 | 18.68 | 20.06 | 20.97 | 0.8805 | 0.6738 | 0.7440 | 0.7719
TATT | 24.72 | 19.02 | 20.31 | 21.52 | 0.9006 | 0.6911 | 0.7703 | 0.7930
C3-STISR | 24.71 | 19.03 | 20.09 | 21.51 | 0.8545 | 0.6674 | 0.7639 | 0.7721
PerMR | 24.89 | 18.98 | 20.42 | 21.43 | 0.9102 | 0.6921 | 0.7658 | 0.7894
MNTSR | 24.93 | 19.28 | 20.38 | 21.50 | 0.9173 | 0.6860 | 0.7806 | 0.7946
TEAN | - | - | - | 21.70 | - | - | - | 0.7850
DPMN | 24.84 | 19.08 | 20.51 | 21.49 | 0.9013 | 0.6902 | 0.7695 | 0.7925
LEMMA | 24.67 | 19.21 | 20.37 | 21.43 | 0.8734 | 0.6783 | 0.5601 | 0.7543
PEAN | 24.89 | 19.46 | 20.41 | 21.75 | 0.9157 | 0.6901 | 0.7837 | 0.7946
SCE-STISR | 24.99 | 19.13 | 20.78 | 21.84 | 0.9038 | 0.6955 | 0.7859 | 0.7951
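PSNR and SSIM in Table 6 follow their standard definitions [53]. The sketch below shows a plain-NumPy PSNR and delegates SSIM to scikit-image, which is assumed to be available at version 0.19 or later for the channel_axis argument.

```python
# Minimal PSNR computation and an off-the-shelf SSIM call for reference.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def psnr(sr, hr, max_val=1.0):
    # Mean squared error over all pixels/channels, then the usual log formula.
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

if __name__ == "__main__":
    hr = np.random.rand(32, 128, 3)
    sr = np.clip(hr + 0.01 * np.random.randn(*hr.shape), 0, 1)
    print(psnr(sr, hr), ssim(sr, hr, channel_axis=2, data_range=1.0))
```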
Table 7. Statistical tests of different methods. We compute the mean accuracy, PSNR, and SSIM over five training runs for each model, together with the corresponding p-values. Each p-value is computed between our method and the corresponding competing method.

Metric | SCE-STISR | TPGSR | TATT | C3-STISR | MNTSR
PSNR (mean) | 21.84 | 20.95 | 21.51 | 21.51 | 21.52
PSNR (p-value) | - | 3.7 × 10⁻⁶ | 4.5 × 10⁻⁵ | 9.1 × 10⁻⁵ | 2.8 × 10⁻³
SSIM (mean) | 0.7951 | 0.7719 | 0.7940 | 0.7716 | 0.7941
SSIM (p-value) | - | 1.4 × 10⁻⁵ | 1.4 × 10⁻¹ | 2.3 × 10⁻⁵ | 6.7 × 10⁻¹
avgAcc (mean, %) | 53.6 | 51.9 | 52.6 | 53.7 | 53.4
avgAcc (p-value) | - | 1.5 × 10⁻⁴ | 2.8 × 10⁻⁴ | 1.6 × 10⁻² | 1.8 × 10⁻²
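The p-values in Table 7 compare per-run results of two methods; a paired t-test over the five runs is a plausible choice and is what the sketch below assumes (the test variant is not stated in this section). The per-run PSNR values in the example are hypothetical.

```python
# Sketch of a paired t-test over per-run metrics, as one plausible way to
# obtain the p-values reported in Table 7. The numbers below are hypothetical.
from scipy import stats

ours  = [21.86, 21.83, 21.84, 21.85, 21.82]   # hypothetical per-run PSNR (ours)
other = [21.52, 21.50, 21.53, 21.51, 21.54]   # hypothetical per-run PSNR (baseline)

t_stat, p_value = stats.ttest_rel(ours, other)
print(f"t = {t_stat:.3f}, p = {p_value:.2e}")
```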
Table 8. Quantitative results on five text recognition datasets: IC13, IC15, CUTE80, SVT, and SVTP. Here, CRNN is used to recognize the text in SR images. Boldface indicates the optimal result.

Method | IC13 | IC15 | CUTE80 | SVT | SVTP
Bicubic | 9.6% | 10.1% | 35.8% | 3.3% | 10.2%
SRResNet | 11.4% | 13.4% | 50.5% | 9.3% | 13.8%
TSRN | 15.6% | 18.6% | 66.9% | 10.0% | 16.4%
TBSRN | 17.7% | 21.3% | 75.0% | 12.2% | 17.4%
TPGSR | 22.7% | 24.2% | 72.6% | 13.7% | 16.5%
TATT | 27.6% | 28.6% | 74.7% | 14.2% | 25.9%
C3-STISR | 24.7% | 22.7% | 71.5% | 10.2% | 17.7%
SCE-STISR | 28.9% | 30.7% | 74.9% | 15.1% | 26.5%
Table 9. Comparison of text recognition accuracy of different methods on densely connected datasets. Boldface indicates the optimal result.

Method | CRNN | MORAN | ASTER
MNTSR | 38.9% | 49.3% | 52.0%
DPMN | 35.4% | 46.2% | 49.6%
LEMMA | 40.8% | 53.3% | 55.7%
PEAN | 39.5% | 52.4% | 54.8%
SCE-STISR | 42.7% | 54.1% | 57.3%
Table 10. Recognition accuracy on images degraded with Gaussian blur kernels of different kernel widths r and additive Gaussian noise of different levels σ. Boldface indicates the optimal result.

Kernel Width r | Method | σ = 10 | σ = 30 | σ = 50
r = 1 | TATT | 58.1% | 51.4% | 47.3%
r = 1 | Ours | 58.8% | 52.9% | 49.2%
r = 3 | TATT | 47.4% | 42.8% | 37.7%
r = 3 | Ours | 48.5% | 44.7% | 40.1%
r = 5 | TATT | 39.8% | 35.5% | 31.6%
r = 5 | Ours | 41.3% | 37.5% | 34.7%
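The degradation protocol behind Table 10 (Gaussian blur of width r followed by additive Gaussian noise of level σ) can be reproduced roughly as below; interpreting r as the blur-kernel radius and applying noise in the 0–255 range are assumptions made for this sketch.

```python
# Sketch of a blur + noise degradation pipeline for robustness tests like
# Table 10. OpenCV's GaussianBlur is used for convenience; the paper's exact
# kernel parameterization is assumed, not known.
import cv2
import numpy as np

def degrade(img, r=3, sigma=30):
    # img: uint8 HxWx3; r is interpreted here as the Gaussian kernel radius.
    ksize = 2 * r + 1
    blurred = cv2.GaussianBlur(img, (ksize, ksize), 0)
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(blurred.astype(np.float64) + noise, 0, 255).astype(np.uint8)

if __name__ == "__main__":
    lr = (np.random.rand(16, 64, 3) * 255).astype(np.uint8)
    print(degrade(lr, r=1, sigma=10).shape)   # (16, 64, 3)
```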