Cascaded Dual-Inpainting Network for Scene Text

Liu, Chunmei

doi:10.3390/app15147742

Open AccessArticle

Cascaded Dual-Inpainting Network for Scene Text

by

Chunmei Liu

School of Computer Science and Technology, Tongji University, Shanghai 201804, China

Appl. Sci. 2025, 15(14), 7742; https://doi.org/10.3390/app15147742

Submission received: 28 May 2025 / Revised: 3 July 2025 / Accepted: 8 July 2025 / Published: 10 July 2025

Download

Browse Figures

Versions Notes

Abstract

Scene text inpainting is a significant research challenge in visual text processing, with critical applications spanning incomplete traffic sign comprehension, degraded container-code recognition, occluded vehicle license plate processing, and other incomplete scene text processing systems. In this paper, a cascaded dual-inpainting network for scene text (CDINST) is proposed. The architecture integrates two scene text inpainting models to reconstruct the text foreground: the Structure Generation Module (SGM) and Structure Reconstruction Module (SRM). The SGM primarily performs preliminary foreground text reconstruction and extracts text structures. Building upon the SGM’s guidance, the SRM subsequently enhances the foreground structure reconstruction through structure-guided refinement. The experimental results demonstrate compelling performance on the benchmark dataset, showcasing both the effectiveness of the proposed dual-inpainting network and its accuracy in incomplete scene text recognition. The proposed network achieves an average recognition accuracy improvement of 11.94% compared to baseline methods for incomplete scene text recognition tasks.

Keywords:

scene text processing; incomplete scene text; scene text inpainting; scene text structure extraction; incomplete scene text recognition

1. Introduction

Scene text in images serves as a vital visual cue for information transmission. However, scene text often suffers from incompleteness due to complex backgrounds, occlusions, or corruption, which can significantly degrade the performance of optical character recognition (OCR) systems. Scene text inpainting, an emerging subfield in computer vision, aims to reconstruct incomplete text regions into legible and structurally complete forms. It bridges the gap between incomplete text inputs and high-accuracy text recognition systems. This technology plays an indispensable role in traffic sign comprehension for autonomous vehicles, the restoration of obscured container codes, the reconstruction of degraded text on packaging, the recognition of contaminated vehicle license plates, the reconstruction of damaged urban signboards, the acquisition of high-fidelity medical data, and other critical applications. It is crucial to accurately reconstruct incomplete textual information in these fields.

Despite recent advancements in generic image inpainting, incomplete scene text restoration under complex environmental conditions remains a formidable challenge. The reason is that unique text characteristics limit the effectiveness of generic inpainting methods. There are three principal challenges. First, scene text images exhibit textual diversity, including content variations, stylistic disparities, font variations, multilingual texts, arbitrary color schemes, scale fluctuations, and orientation randomness. Despite these variations, scene text remains recognizable through its inherent structures and stroke patterns. The same text content may have various styles. This inherent variability increases the difficulties in incomplete scene text inpainting. Second, corruption in incomplete scene text images manifests multifaceted uncertainties, such as unpredictable corruption localization, corruption scale instability, irregular geometric corruption, stochastic corruption ratios, and variable corruption patterns. These stochastic and uncertain corruptions seriously impair character recognition accuracy, since fragmented strokes are unpredictable. Third, texts exhibit dual characteristics of texture commonality and structural specificity, which make texts have similar global features and different local features. These structural and textural properties prove crucial for scene text detection and recognition systems, as they establish fundamental dependencies on these dual characteristics. Severe destruction of these structural and textural attributes undermines text recognition performance. Therefore, it is essential to develop specialized scene text inpainting methods.

In this paper, a cascaded dual-inpainting network is proposed to address the incomplete scene text inpainting. The proposed framework comprises two specialized text inpainting modules to enhance downstream recognition performance: (1) the Structure Generation Module (SGM) for text structure prediction, and (2) the Structure Reconstruction Module (SRM) for text structure reconstruction. The SGM generates text structure regions and predicts missing structural content through an encoder–decoder architecture. This module achieves two key objectives: constructing complete text structure layouts and recovering missing structural patterns. Guided by the SGM-generated text structure, the SRM further refines the reconstruction through structure-guided optimization of geometric consistency and perceptual quality. Within this framework, the SGM’s structural output directly guides the SRM to enhance reconstruction fidelity. Although the two models are architecturally cascaded in design, they maintain independent operational frameworks. Specifically, the SRM is explicitly conditioned on the SGM’s output during the inference phase.

The contributions are summarized as follows. The first contribution is the proposal of a novel cascaded dual-inpainting framework for scene text structure extraction and inpainting (CDINST), consisting of two specialized components: the SGM and the SRM. The second contribution is the proposal of the SRM, which enhances foreground structure reconstruction through structure-guided refinement and attention mechanisms. The SGM-generated text structures provide guidance to the SRM for precise structural reconstruction. The attention gate selectively attends to relevant contextual information from surrounding regions, thereby improving restoration fidelity. The third contribution involves a comprehensive experimental validation comparing the proposed method with state-of-the-art approaches. Quantitative and qualitative results demonstrate the performance of the proposed method in terms of recognition accuracy and image quality metrics.

The remainder of this paper is organized as follows. Section 2 reviews related work in image inpainting, scene text inpainting, and scene text recognition. Section 3 describes the proposed framework in detail, including the SGM, the SRM, and the procedures of training and inference. Section 4 presents comparative experimental results against state-of-the-art text inpainting methods and ablation studies. Finally, Section 5 summarizes the main work, together with discussions on some open issues and future research directions.

2. Related Work

2.1. Image Inpainting

Deep learning has achieved remarkable progress in image inpainting. Existing methods can be divided into three types: (1) convolutional neural network (CNN)-based approaches, (2) generative adversarial network (GAN)-based frameworks, and (3) diffusion-model-driven techniques.

The convolutional neural network (CNN) plays an important role in the image inpainting task. CNN-based image inpainting utilizes localized feature extraction to handle diverse missing regions adaptively. It can capture image information through a multi-scale design, adapting to various types of defect areas and playing an important role in the texture continuity and semantic consistency of the restored images. Ronneberger et al. [1] proposed the U-Net architecture with a symmetric encoder–decoder design, which integrated local details and global information through skip connections that provided an effective approach for contextual capturing and multi-scale feature support in image inpainting. Liu et al. [2] embedded attention mechanisms into the U-Net architecture to enhance the ability of context modeling. The inpainting performance of the missing area was significantly improved. Guo et al. [3] introduced ResNet into image inpainting. It effectively promoted feature integration and texture prediction and further improved the detail quality of restoration. Pathak et al. [4] introduced Context Encoders, which utilized a CNN to capture the appearance information and semantic structure features of pixels around the missing area. This was an unsupervised visual feature learning method, but it was prone to blurring issues. Yan et al. [5] proposed Shift-Net, which introduced a dedicated shift connection layer. This architecture could generate the missing part with a clear structure and fine texture, thereby effectively solving the blurring problem caused by previous contextual encoders. For the image inpainting task with irregular masks, Liu et al. [6] proposed a partial convolution method. This approach applied convolution operations exclusively to valid pixels, which avoided the negative impact of filled values in masked holes on feature extraction. However, partial convolution suffered from rigid mask-updating rules, which made it hard to adaptively learn diverse mask shapes. Yu et al. [7] proposed gated convolution, which employed a dynamic feature selection mechanism to adaptively adjust feature weights for corrupted regions. This approach significantly enhanced the model’s adaptability to free-form masks and its capability to capture semantic information. The proposed method improved the detail quality of image inpainting and the capacity for complex scenes. With the development of deep learning techniques and the growing demand for diverse scenarios in image inpainting tasks, the limitations of single-model approaches have become evident. CNN-based methods struggle to effectively capture global contextual information. It is a current trend shifting toward multi-module integration frameworks that can integrate the merits of different foundational models.

GAN-based image inpainting leverages generative adversarial networks (GANs) to generate realistic content within missing image regions. The GAN frameworks, initially proposed by Goodfellow et al. [8], operate through two neural networks in adversarial competition: a generator that creates synthetic content and a discriminator that distinguishes real from generated data. This adversarial training mechanism enables the system to progressively improve the authenticity of generated outputs. Iizuka et al. [9] applied dilated convolution, globally and locally context discriminators, to image inpainting. This approach restored missing image portions of arbitrary resolutions and shapes while maintaining global semantic consistency. This approach captured broader contextual information while preserving computational efficiency without increasing parameters or computational costs. However, it was difficult to effectively inpaint images with large damaged areas or complex backgrounds. Zeng et al. [10] developed a method that integrates an auxiliary encoder and contextual reconstruction loss, ensuring enhanced coherence between inpainted regions and the global image. The auxiliary encoder optimized feature representation in the refinement network to produce more natural restoration results. Yu et al. [11] introduced the Discrete Wavelet Transform (DWT) into the generator architecture, achieving an optimal balance between preserving high-frequency details and enhancing perceptual quality. This approach successfully resolved the inherent frequency conflict issues in conventional GAN-based image inpainting methods. Reed et al. [12] proposed a GAN-based framework that integrated textual features through adversarial loss and text-matching loss, effectively guiding the image generation consistent with text descriptions. This work established the theoretical foundation for subsequent multimodal generation research. Zhang et al. [13] proposed a text-guided dual-modal attention mechanism that combined with image contextual information to infer missing content, and it enhanced the generator’s capability through an auxiliary pathway and multi-loss functions. Wu et al. [14] proposed a dual-stage generator for coarse-to-fine image inpainting, which integrated a dual-attention module to capture both sentence-level and word-level textual features. This approach effectively resolved semantic misalignment and detail deficiency in text-guided image inpainting. Li et al. [15] proposed an image inpainting approach incorporating a Visual-Textual Multimodal Fusion Module (VTM-FM), which integrated textual and visual features through cross-modal attention mechanisms. This method significantly improved GAN performance in inpainting via contextual information, frequency-domain decomposition, and cross-modal fusion. GANs have achieved improvements in generation quality, detail restoration, and semantic coherence through adversarial training while offering an adaptable generation framework applicable to diverse tasks. However, GANs suffer from inherent limitations: training instability, high computational cost, inadequate semantic comprehension capabilities in multimodal contexts, etc.

Denoising Diffusion Probabilistic Models (DDPMs) perform image inpainting through two distinct stages: the forward diffusion process (gradual noise addition) and the reverse denoising process (iterative noise removal). Ho et al. [16] proposed a framework that enabled data generation through gradual noise addition and iterative noise removal. It had a stable training process and high-quality generative capabilities. Nichol et al. [17] proposed refinements for the DDPM, which included learning reverse process variances, optimizing the noise schedule, reducing gradient noise, improving the sampling speed, and reducing diffusion steps. These improvements enhanced the model’s ability to capture complex data distributions, reduced computational costs, and increased sampling efficiency. Saharia et al. [18] introduced Imagen, a text-to-image diffusion model built on the capabilities of large transformer language models for text understanding and hinged on the strength of diffusion models in high-fidelity image generation. This approach effectively mitigated detail degradation caused by high guidance weights through dynamic thresholding and static threshold adaptation while significantly enhancing both output diversity and visual fidelity in generated images. Wang et al. [19] proposed imagen editor, a cascaded diffusion architecture that progressively refined outputs. This approach enhanced performance in high-resolution image editing and image inpainting. Rombach et al. [20] proposed Stable Diffusion, which operated the diffusion process in the latent space to reduce computational complexity, optimize perceptual loss, and improve sampling methodologies. This approach enhanced the efficiency and quality of image inpainting. Manukyan et al. [21] proposed two key innovations: the Prompt-Aware Introverted Attention layer to alleviate the prompt neglect issues of background and nearby object dominance, and the Reweighting Attention Score Guidance strategy to prevent out-of-distribution shifts. Their training-free approach enhanced Stable Diffusion for text-guided high-resolution image inpainting. Xie et al. [22] proposed a mask refinement mechanism to train on masks of different shapes. The approach first predicted precise masks based on contextual cues and then performed inpainting with the refined masks. This paradigm significantly improved the model’s performance in complex masking scenarios. Additionally, contextual guidance enhanced the naturalness and coherence of inpainting results. Lugmayr et al. [23] introduced the cyclic merging of forward-background and reverse-damaged regions via iterative refinement, achieving coherent image inpainting with the source image. This iterative refinement achieved information fusion between missing and intact areas, producing restorations with superior visual coherence and structural consistency with the original images. Global structural information is critical for image inpainting. Mokady et al. [24] introduced null-text inversion optimization, utilizing base and high-weight text conditions to steer sampling paths for enhanced control. This approach achieved a harmonious balance between reconstruction precision and visual naturalness by enhancing the guidance impact and preserving structural integrity in background regions. DDPMs have achieved remarkable progress in image inpainting, expanding into domains such as multimodal restoration, contextual guidance, and global structural refinement. From DDPMs to recent global–local optimization approaches, the diverse guidance mechanisms have enhanced generation quality. However, DDPMs still face challenges, including a slow generation speed and high computational costs.

Transformer architectures have evolved from initial global modeling to multi-level modeling that integrates both local and global contexts in image inpainting. The efficient contextual modeling provided by self-attention mechanisms provides a robust theoretical foundation for image inpainting. With the development of shifted-window transformers, axial attention, and mask-aware transformers, these architectures progressively address computational complexity challenges while enhancing their modeling capabilities for corrupted regions. Ko et al. [25] proposed the Continuous-Mask-Aware Transformer and developed a novel mask update scheme for image inpainting. This improvement enhanced the scoring mechanism of the mask-aware transformer, significantly reducing errors. Liang et al. [26] proposed SwinIR, which is based on the Swin Transformer, for image restoration. By utilizing multiple cascaded Swin Transformer blocks to extract both shallow and deep features, this approach achieved high-quality image reconstruction. The integration of multi-level features effectively enhanced both detailed representations and the global consistency of restored images. Jeevan P. et al. [27] proposed a token-mixing network, WavePaint, based on the WaveMix architecture, which used a 2D discrete wavelet transform for spatial token-mixing. The presence of spatial token-mixing enabled faster receptive field expansion in the model. It had fewer parameters, faster training and inference speeds, and required no adversarial training. Although the transformer has made progress in image inpainting, it still has many problems, including sensitivity to input dimensions, high demand for computational resources, a long training time, and poor interpretability.

2.2. Scene Text Inpainting

Recent research has shifted focus from general image inpainting to specialized tasks, particularly incomplete scene text inpainting, which has emerged as a critical area of investigation. Xue et al. [28] proposed a specialized incomplete text inpainting network (INIT) employing a fully convolutional encoder–decoder architecture. Under dual constraints of reconstruction loss and semantic loss, the INIT framework demonstrated enhanced performance in both text reconstruction fidelity and recognition accuracy. Sun et al. [29] developed a two-stage incomplete text inpainting algorithm (TSINIT). The TSINIT framework adopted a dual-branch architecture comprising a text extraction module and a text reconstruction module, both of which employed encoder–decoder architectures to minimize structural reconstruction errors. Zhu et al. [30] proposed a global structure-guided diffusion model (GSDM) comprising two modules: a structure prediction module (SPM) and a reconstruction module (RM). The SPM, constructed in a U-Net architecture, generated text structures as segmentation masks. The RM took the predicted masks and corrupted images as inputs, employing a diffusion-based reconstruction module to reconstruct missing text regions. Chen et al. [31] proposed a dual-discriminator generative adversarial network (D2GAN) for restoring missing components in ancient Yi script characters. The D2GAN framework enhanced the deep convolutional generative adversarial network (DCGAN) by incorporating a Yi-character screening discriminator and establishing a refined Yi script generator, which effectively performed text inpainting on degraded ancient books. At present, scene text inpainting is primarily based on encoder–decoder architectures and has a single-model structure. To enhance the performance of scene text inpainting, it is critical to develop more diverse multi-level models.

2.3. Scene Text Recognition

In recent years, research on scene text recognition (STR) has focused on deep learning methods and achieved significant advancements. Fang et al. [32] proposed an autonomous, bidirectional, and iterative ABINet for scene text recognition. Shi et al. [33] introduced ASTER, an end-to-end neural network model that comprised a rectification network and a recognition network. Shi et al. [34] proposed a novel neural network architecture, which integrated feature extraction, sequence modeling, and transcription into a unified framework. Yue et al. [35] proposed the RobustScanner method, which includes a novel position enhancement branch, and dynamically fused its outputs with those of the decoder attention module for scene text recognition. Li, Hui et al. [36] proposed an easy-to-implement strong baseline named SAR for irregular scene text recognition using off-the-shelf neural network components and only word-level annotations. Du Yongkun et al. [37] proposed a Single Visual model for scene text recognition (SVTR) within the patch-wise image tokenization framework, which dispensed with the sequential modeling. Although significant advancements have been made in STR, it leads to poor performance when directly applied to incomplete scene texts (particularly those with damaged or missing character components). It is necessary to implement scene text inpainting as a preprocessing step before STR.

3. Methodology

A cascaded dual-inpainting network for scene text (CDINST) is proposed to reconstruct text structures and enhance recognition accuracy in downstream recognition tasks. The proposed framework consists of two cascaded modules: the SGM and the SRM, as depicted in Figure 1. The SGM generates the preliminary text structure. It is directly fed into the SRM as the structure-guided condition to refine text structure reconstruction. The two modules are independently designed, sequentially cascaded, and jointly trained.

3.1. Structure Generation Module (SGM)

When scene text is partially missing or corrupted, the structural coherence collapses. This leads to severe degradation in recognition accuracy caused by inconsistent patterns. It is is essential to preserve structural integrity for scene text recognition. The SGM reconstructs missing text regions and generates a complete pixel-level text structure, which serves as the structure-guided condition for the SRM.

Given a corrupted scene text image

I_{c o r} \in R^{h \times w \times c}

as input, the SGM learns to predict the intact text structure,

{\hat{S}}_{S G M} \in R^{h \times w}

, through training. The generation process

F_{S G M}

of the SGM is mathematically defined as follows:

{\hat{S}}_{S G M} = F_{S G M} (I_{c o r})

(1)

The model follows an encoder–decoder architecture [29] with residual blocks as the backbone, designed for image-to-image translation (input: 3-channel RGB, output: 1-channel grayscale). The encoder reduces spatial resolution through three stages: a residual block maintains input size, followed by two down-sampling residual blocks with convolution and ELU activation function that halve the resolution sequentially. The decoder employs two up-sampling residual blocks with transposed convolution and ELU activation function to restore original resolution, ending with a 1 × 1 convolution and Tanh activation for output. Residual connections with learnable shortcuts and ELU activations are used throughout to stabilize training, ensuring efficient feature propagation while maintaining spatial integrity across the network.

SGM loss consists of two components:

L_{1}

loss and binary cross-entropy (

L_{b c e}

). These components are formally defined in Equations (2)–(4):

L_{S G M} = α_{1} \cdot L_{1} + α_{2} \cdot L_{b c e}

(2)

L_{1} = ‖S - \hat{S_{S G M}}‖

(3)

L_{b c e} = - θ_{1} \sum S \cdot \log \hat{S_{S G M}} - θ_{2} \sum (1 - S) \cdot \log (1 - \hat{S_{S G M}})

(4)

where

S

denotes the ground truth text structure, and

\hat{S_{S G M}}

represents the predicted text structure.

α

-terms are loss-balancing coefficients.

3.2. Structure Reconstruction Module (SRM)

Text inherently possesses strong structural properties that remain independent of stroke thickness, color, or other visual attributes. The stroke information provides critical guidance for restoring damaged or incomplete characters. To leverage this, the SGM generates an initially repaired foreground character structure. Under the guidance of the SGM-generated text structures, the SRM restores the missing parts of characters to enhance scene text inpainting performance.

As illustrated in Figure 1, the pipeline of the SRM operates as follows: (1) the SGM first produces an initial text structure

{\hat{S}}_{S G M}

; (2) this output

{\hat{S}}_{S G M}

is then processed as the structure-guided condition of the SRM to enhance the text inpainting:

{\hat{S}}_{S R M} = F_{S G M} (I_{c o r}, {\hat{S}}_{S G M})

(5)

where

I_{c o r}

denotes the corrupted scene text image as input,

{\hat{S}}_{S G M}

serves as the generated structural guidance condition, and

F_{S G M}

represents the SRM’s nonlinear transformation function.

{\hat{S}}_{S R M} \in R^{h \times w}

is the final refined structure, which is improved in spatial coherence and structure preservation compared with the initial SGM output.

As shown in Figure 1, an Attention U-Net [38] is employed, featuring an encoder–decoder architecture for text structure prediction (input: 4-channel RGB + grayscale, output: 1-channel grayscale). The encoder consists of four down-sampling blocks, each comprising a MaxPool2d layer followed by dual convolutional blocks. The decoder consists of four up-sampling stages, each incorporating bilinear interpolation followed by channel concatenation (combining skip connection features with up-sampled features) and dual convolutional blocks, with attention gates integrated at each skip connection junction. In the Attention U-Net architecture, skip connections fuse low-level spatial features from the encoder with high-level semantic features from the decoder for effective detail restoration through hierarchical feature fusion. The attention gate mechanism demonstrates three functionalities: (1) enhancing focus on text regions to locate missing text areas; (2) concentrating on valid contextual information from surrounding regions; (3) suppressing irrelevant features interference while reducing background noise in complex scenes. During scene text inpainting, the SRM can focus more on the edge information of missing regions to better reconstruct the text structure. In deep decoding stages, the SRM may focus more on semantic information, while shallow layers prioritize detailed textures. The attention gate can dynamically adjust feature importance according to the current stage’s requirements.

SRM loss consists of four components:

L_{1}

loss, binary cross-entropy (

L_{b c e}

), style loss (

L_{s t y l e}

), and semantic loss [34,39] (

L_{s e m}

). These components are formally defined by Equations (6)–(10):

L_{S R M} = β_{1} \cdot L_{1} + β_{2} \cdot L_{b c e} + β_{3} \cdot L_{s t y l e} + β_{4} \cdot L_{s e m}

(6)

L_{1} = ‖S - \hat{S_{S R M}}‖

(7)

L_{b c e} = - θ_{1} \sum S \cdot \log \hat{S_{S R M}} - θ_{2} \sum (1 - S) \cdot \log (1 - \hat{S_{S R M}})

(8)

L_{s t y l e} = E [{‖G_{j}^{ϕ} (S) - G_{j}^{ϕ} (\hat{S_{S R M}})‖}_{1}]

(9)

L_{s e m} = E [\sum_{i} \frac{1}{N_{i}} {‖ϕ_{i} (S) - ϕ_{i} (\hat{S_{S R M}})‖}_{1}]

(10)

where

ϕ_{i}

denotes the feature map from the i-th perceptual layer,

N_{i}

is the number of elements in feature map obtained by the the i-th layer, and

G_{j}^{ϕ}

represents the Gram matrix [40].

β

-terms are loss balancing coefficients.

3.3. Training and Inference Procedure

A two-stage training strategy [29] is adopted in order to fully leverage the structural guidance of text components during the inpainting process. (1) Independent Pre-training: the SGM and the SRM are initially trained separately. (2) Joint Fine-tuning: after 10 epochs of independent pre-training, both modules are jointly optimized with a composite loss function by Equation (11):

L = γ_{1} \cdot L_{S G M} + γ_{2} \cdot L_{S R M}

(11)

where

γ

-terms are balancing weights. This phased approach ensures stable convergence: the SGM first learns preliminary text structure priors to provide structure guidance, while the SRM focuses on detailed refinement.

To standardize inputs, all images are resized to 64 × 256 pixels during inference, ensuring consistent spatial dimensions for batch processing. This preprocessing step eliminates scale variance and aligns with the model’s trained resolution.

4. Experiments and Discussion

Some comparative experiments were conducted to evaluate the effectiveness of the proposed framework against the state-of-the-art incomplete text inpainting methods. Meanwhile, ablation studies were performed to demonstrate the impact of individual components in the proposed method.

4.1. Datasets and Evaluation Metrics

Dataset: To evaluate the performance of the proposed network, experiments were conducted on the TII-ST database [30], a synthetic scene text database containing 80,000 training images and 20,000 test images. Each sample comprised a corrupted color image paired with its ground truth segmentation grayscale mask. All images were resized to

64 \times 256

pixels to maintain uniform input dimensions during evaluation. As this dataset was built for text image inpainting, it only contains English texts. It uses corrosion forms to simulate real-life corrosion, i.e., a convex hull, irregular region, and quick draw. It contains various fonts and colors, except for graphic styles.

Metrics: Since the output of the proposed CDINST model is a foreground grayscale image of restored text, conventional metrics like PSNR and SSIM are significantly affected by grayscale intensity variations. They prove inadequate for accurately evaluating the model’s actual restoration capability. To establish a more precise assessment of restoration performance, text recognition accuracy was adopted as the primary evaluation metric while still reporting PSNR and SSIM values as supplementary references. The text recognition accuracy metric consists of two components: word-level accuracy

A_{w}

and character-level accuracy

A_{c}

, defined as follows:

A_{w} = \frac{N_{c o r r e c t_r e c_w o r d s}}{N_{t o t a l_w o r d s}} \times 100 %

(12)

A_{c} = \frac{N_{c o r r e c t_r e c_c h a r s}}{N_{t o t a l_c h a r s}} \times 100 %

(13)

{R e c a l l}_{c} = \frac{N_{c o r r e c t_r e c_c h a r s}}{N_{c o r r e c t_c h a r s}} \times 100 %

(14)

where

N_{c o r r e c t_r e c_w o r d s}

is the number of correctly recognized words,

N_{t o t a l_w o r d s}

denotes the total words count,

N_{c o r r e c t_r e c_c h a r s}

indicates the correctly recognized characters,

N_{t o t a l_c h a r s}

represents the total character count, and

N_{c o r r e c t_c h a r s}

represents the total correct character count.

{R e c a l l}_{c}

denotes character-level recall.

Implementation setting: The proposed model was implemented using PyTorch 2.6 on an NVIDIA RTX 4080 GPU. The backbone network requires no pre-training on external datasets. During initialization, the Adam optimizer was employed with a base learning rate of

2 \times 10^{- 4}

. The loss balancing coefficients were set to

(α_{1}, α_{2}) = (1,10)

,

(β_{1}, β_{2}, β_{3}, β_{4}) = (10,10,10,10)

,

(γ_{1}, γ_{2}) = (1,1)

.

4.2. Comparison with State-of-the-Art Approaches

Comparative experiments were conducted on the TII-STsyn20k test set [30], benchmarking the proposed CDINST model against two state-of-the-art incomplete text inpainting methods: TWINST [29] and GSDM-SPM [30], which specifically focus on reconstructing text foregrounds.

Table 1, Table 2 and Table 3 present comparative evaluations of text recognition accuracy (

A_{w}

and

A_{c}

) and image quality metrics (PSNR and SSIM) between existing methods and the proposed CDINST model on the test set. The inpainting effectiveness was benchmarked using multiple text recognition algorithms: CRNN [34], ASTER [33], ABINet [32], RobustScanner [35], SAR [36], and SVTR [37]. The experimental results demonstrate that CDINST achieves an average improvement of +11.94% in

A_{w}

, +4.79% in

A_{c}

, and +5.69% in character-level recognition recall over state-of-the-art methods, along with average PSNR gains of +0.075 dB PSNR and SSIM average improvements of +0.0742. Given the comparative results, CDINST can significantly enhance the inpainting capability for incomplete scene text.

Several examples are presented to demonstrate the performance of the proposed method. Figure 2 illustrates scene text inpainting and recognition results, highlighting the capability of the proposed method to reconstruct the binary foreground of incomplete scene text. On the test set, it was demonstrated that comparative text inpainting results were achieved across different methods (TWINST, GSDM-SPM, and the proposed module CDINST). The CRNN recognition model was employed to evaluate the performance of these methods. Figure 2a–d, respectively, show corrupted scene text images; ground truth structure images; inpainting results generated by TWINST, GSDM-SPM, and the proposed CDINST method; and corresponding recognition results. The comparative experiments demonstrate that the proposed method achieves more accurate text structures than TWINST and GSDM-SPM. This confirms the merit of the proposed model, which can improve the performance of scene text inpainting.

4.3. Ablation Experiments

Ablation studies were conducted to systematically evaluate the impact of individual components within the proposed framework. To ensure evaluation consistency, all models were trained on the TII-ST80k training set and tested on the TII-ST20k test set. To quantify component contributions, the word-level recognition accuracy metric (

A_{w}

) was adopted, combined with PSNR and SSIM. Performance comparisons of framework components are presented in Table 4 and Table 5.

loss

L_{1}

and

L_{b c e}

in CDINST-SRM: L1 loss

L_{1}

balances accuracy and clarity in the generated results by directly minimizing pixel-level discrepancies. The binary cross-entropy loss

L_{b c e}

ensures global structural consistency between the inpainted structure and the ground truth structure through direct constraints on absolute pixel space differences. As foundational components in image inpainting, these two losses jointly serve as baselines for evaluating the performance of individual components in the proposed model.

sem-loss

L_{s e m}

in CDINST-SRM: As shown in Table 4, when

L_{s e m}

is removed, the model’s scene text reconstruction capability is reduced. The word-level recognition accuracy drops by 2.56%. The semantic loss quantifies the discrepancy between the generated output and the ground truth through deep semantic feature comparisons. This loss function promotes semantic space alignment between the generated results and the ground truth instead of focusing solely on pixel-level similarity, thus improving the semantic coherence of reconstructed scene text.

style-loss

L_{s t y l e}

in CDINST-SRM: The style loss

L_{s t y l e}

critically governs texture preservation in the proposed framework. Its removal degrades word-level recognition accuracy by 1.32% (Table 4), primarily due to misaligned feature statistics between inpainted regions and the ground truth. By enforcing Gram matrix consistency across convolutional layers,

L_{s t y l e}

ensures the coherent integration of color, texture, and structural patterns.

SRM in CDINST: The pivotal component in CDINST performs both local detail restoration and global structural reconstruction in character-missing regions. As evidenced in Table 5, the ablation of the SRM results in performance deterioration, manifesting as an 18.81% decrease in character recognition accuracy.

SGM in CDINST: The SGM guides the SRM to improve reconstruction fidelity. Removing the SGM from the framework diminishes structural guidance for text restoration, thereby degrading the system’s text inpainting and recognition performance. As shown in Table 5, the SGM’s lack of structural guidance reduces word-level recognition accuracy by 1.46%.

Computational complexity: To evaluate the performance of the proposed models, Table 6 presents their computational performance, including parameter count, GFLOPs, and inference time.

4.4. Experiments on Real-World Scene Text Images

To validate the effectiveness of the proposed method in practical scenarios, experiments were performed on real-world scene text images in two ways:

Controlled degradation: A subset of TII-ST real images [30] (originally collected from the ICDAR2015 benchmark) was randomly selected and subjected to structural degradation patterns. Figure 3 shows representative inpainting results generated by the proposed method, along with corresponding recognition results.

Wild degradation: The proposed method was further evaluated on naturally degraded scene text images obtained from open-source web repositories. The reconstruction and recognition results for these challenging scenarios are demonstrated in Figure 4.

These experiments demonstrate that the proposed method can effectively reconstruct incomplete scene text across diverse real-world scenarios while maintaining compatibility with mainstream text recognition architectures.

5. Conclusions

In this study, a cascaded dual-inpainting network is proposed for scene text restoration, which consists of two specialized modules designed to enhance downstream text recognition: the SGM and the SRM. The SGM initially extracts structural representations and performs preliminary restoration of degraded text regions. Subsequently, the SRM refines the foreground structure reconstruction through structure-guided refinement based on the guidance generated by the SGM. The experimental results demonstrate the framework’s effectiveness in reconstructing degraded scene text across various degradation patterns. The proposed network achieves an improvement in recognition accuracy compared to baseline methods when applied to incomplete scene text recognition tasks.

In future research, studies will focus on improving the scene text inpainting performance, which mainly involves two aspects. One is to develop a comprehensive scene text dataset that covers different languages, various complex styles, and real-world text images. The other is to design inpainting strategies integrated with more powerful models, such as transformers, GANs, and DDPMs.

Funding

This research is funded by the Innovation Program of Shanghai Municipal Education Commission under grant 202101070007E00098 and the National Natural Science Foundation of China under grant 624723130.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. TII-ST datasets can be found here: https://github.com/blackprotoss/GSDM (accessed on 18 January 2025) [30].

Conflicts of Interest

The author declares no conflicts of interest.

References

Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
Liu, H.; Jiang, B.; Xiao, Y.; Yang, C. Coherent semantic attention for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4170–4179. [Google Scholar]
Guo, Z.; Chen, Z.; Yu, T.; Chen, J.; Liu, S. Progressive image inpainting with full-resolution residual network. In Proceedings of the 27th Acm International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2496–2504. [Google Scholar]
Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
Yan, Z.; Li, X.; Li, M.; Zuo, W.; Shan, S. Shift-net: Image inpainting via deep feature rearrangement. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 1–17. [Google Scholar]
Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4471–4480. [Google Scholar]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial nets. arXiv 2014, arXiv:1406.2661. [Google Scholar]
Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. (ToG) 2017, 36, 1–14. [Google Scholar] [CrossRef]
Zeng, Y.; Lin, Z.; Lu, H.; Patel, V.M. Cr-fill: Generative image inpainting with auxiliary contextual reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14164–14173. [Google Scholar]
Yu, Y.; Zhan, F.; Lu, S.; Pan, J.; Ma, F.; Xie, X.; Miao, C. Wavefill: A wavelet-based generation network for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 14114–14123. [Google Scholar]
Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 1060–1069. [Google Scholar]
Zhang, L.; Chen, Q.; Hu, B.; Jiang, S. Text-guided neural image inpainting. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1302–1310. [Google Scholar]
Wu, X.; Xie, Y.; Zeng, J.; Yang, Z.; Yu, Y.; Li, Q.; Liu, W. Adversarial learning with mask reconstruction for text-guided image inpainting. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 3464–3472. [Google Scholar]
Li, A.; Zhao, L.; Zuo, Z.; Wang, Z.; Xing, W.; Lu, D. MIGT: Multi-modal image inpainting guided with text. Neurocomputing 2023, 520, 376–385. [Google Scholar] [CrossRef]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 18–24 July2021; pp. 8162–8171. [Google Scholar]
Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Lopes, R.G.; Ayan, B.K.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
Wang, S.; Saharia, C.; Montgomery, C.; Pont-Tuset, J.; Noy, S.; Pellegrini, S.; Chan, W. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18359–18369. [Google Scholar]
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
Manukyan, H.; Sargsyan, A.; Atanyan, B.; Wang, Z.; Navasardyan, S.; Shi, H. HD-Painter: High-resolution and prompt-faithful text-guided image inpainting with diffusion models. arXiv 2023, arXiv:2312.14091. [Google Scholar]
Xie, S.; Zhang, Z.; Lin, Z.; Hinz, T.; Zhang, K. Smartbrush: Text and shape guided object inpainting with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22428–22437. [Google Scholar]
Lugmayr, A.; Danelljan, M.; Romero, A.; Yu, F.; Timofte, R.; Van Gool, L. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11461–11471. [Google Scholar]
Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6038–6047. [Google Scholar]
Ko, K.; Kim, C.S. Continuously masked transformer for image inpainting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 13169–13178. [Google Scholar]
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
Jeevan, P.; Kumar, D.S.; Sethi, A. WavePaint: Resource-Efficient Token-Mixer for Self-Supervised Inpainting. arXiv 2023, arXiv:2307.00407. [Google Scholar]
Xue, F.; Zhang, J.; Sun, J.; Yin, J.; Zou, L.; Li, J. INIT: Inpainting Network for Incomplete Text. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS2022), Austin, TX, USA, 27 May–1 June 2022. [Google Scholar]
Sun, J.; Xue, F.; Li, J.; Zhu, L.; Zhang, H.; Zhang, J. TSINIT: A Two-stage Inpainting Network for Incomplete Network. IEEE Trans. Multim. 2023, 25, 5166–5177. [Google Scholar] [CrossRef]
Zhu, S.; Fang, P.; Zhu, C.; Zhao, Z.; Xu, Q.; Xue, H. Text image inpainting via global structure-guided diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7775–7783. [Google Scholar]
Chen, S.-X.; Zhu, S.-Y.; Xiong, H.-L.; Zhao, F.-J.; Wang, D.-W.; Liu, Y. A Method of Inpainting Ancient Yi Characters Based on Dual Discriminator Generative Adversarial Networks. Acta Autom. 2022, 48, 12. [Google Scholar]
Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 7094–7103. [Google Scholar]
Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Aster: An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048. [Google Scholar] [CrossRef] [PubMed]
Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
Yue, X.; Kuang, Z.; Lin, C.; Sun, H.; Zhang, W. RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
Li, H.; Wang, P.; Shen, C.; Zhang, G. Show, Attend and Read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8610–8617. [Google Scholar]
Du, Y.; Chen, Z.; Jia, C.; Yin, X.; Zheng, T.; Li, C.; Du, Y.; Jiang, Y.-G. SVTR: Scene Text Recognition with a Single Visual Model. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence(IJCAI-22), Vienna, Austria, 23–29 July 2022; pp. 884–890. [Google Scholar]
Oktay, O.; Schlemper, J.; Le Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
Wang, Y.; Chen, Y.; Tao, X.; Jia, J. VCNet: A Robust Approach to Blind Image Inpainting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 752–768. [Google Scholar]
Gatys, L.A.; Ecker, A.S.; Bethge, M. A neural algorithm of artistic style. J. Vis. 2016, 16, 326. [Google Scholar] [CrossRef]

Figure 1. General framework of the cascaded dual-inpainting network for scene text.

Figure 2. Comparative scene text inpainting results. Red characters indicate recognition errors. (a) Corrupted images; (b) ground truth structure; (c) TWINST inpainting results; (d) GSDM-SPM inpainting results; (e) CDINST inpainting results.

Figure 3. Qualitative results of scene text inpainting and recognition on the ICDAR 2015 dataset. (a) Corrupted images; (b) ground truth text; (c) reconstruction results achieved using CDINST; (d) CRNN-based recognition results achieved using CDINST.

Figure 4. Qualitative results of scene text inpainting and recognition results on naturally degraded real-world images. (a) Corrupted images; (b) ground truth text; (c) reconstruction results achieved using CDINST; (d) CRNN-based recognition results achieved using CDINST.

Table 1. Word-level recognition accuracy (

A_{w}