A Three-Stage Generative Adversarial Image Inpainting Framework for Broken-Stroke Restoration in Historical Rubbings: A Case Study on Oracle Bone Rubbings

Shen, Wenhan; Xu, Yubo; Zhang, Chaoqing; Yan, Juan; Wang, Shibin

doi:10.3390/app16094306

by Wenhan Shen ¹,², Yubo Xu ², Chaoqing Zhang ³, Juan Yan ² and Shibin Wang ²,*

¹ College of Software, Henan Normal University, Xinxiang 453000, China
² School of Computer and Information Engineering, Henan Normal University, Xinxiang 453007, China
³ Oracle Bone Intelligent Computing Laboratory, Henan Normal University, Xinxiang 453000, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4306; https://doi.org/10.3390/app16094306

Submission received: 8 March 2026 / Revised: 1 April 2026 / Accepted: 26 April 2026 / Published: 28 April 2026

Abstract

Historical rubbing images often suffer from stroke breakage, material loss, uneven background interference, and age-related degradation, which make broken-stroke restoration in masked regions difficult. This challenge is particularly severe for oracle bone rubbings, where sparse strokes and damaged radicals require both structural continuity and local texture realism. To address this problem, we propose a three-stage generative adversarial image inpainting framework and evaluate it on oracle bone rubbing images as a focused case study. Stage I employs an LBP-guided coarse completion network to recover local binary texture priors in missing regions. Stage II introduces spatial-attention refinement and a dual-discriminator strategy to improve stroke continuity and local realism. Stage III uses a Swin-based refinement network to model long-range dependencies and enhance global consistency. A composite optimization objective combining reconstruction, weighted hole, perceptual, style, total-variation, and adversarial terms is used to coordinate the three stages. Experiments on oracle bone rubbing images with masking ratios from 10% to 40% show that the proposed framework produces visually coherent restorations and competitive quantitative results, reaching up to 35.18 dB PSNR and 0.9906 SSIM under the 10–20% masking setting. Because oracle bone glyph morphology is highly specialized, the current validation is intentionally restricted to this domain rather than overstating cross-domain generalization. The results show that the proposed framework can support digital conservation and recognition-oriented analysis of damaged oracle bone rubbing images.

Keywords: image inpainting; image restoration; generative adversarial networks; oracle bone rubbings; local binary patterns; Swin Transformer

1. Introduction

Historical document and inscription images frequently contain missing strokes, fractured contours, low-contrast textures, and irregular background interference introduced by aging, rubbing, scanning, and storage. For digital humanities applications, these degradations do not merely reduce visual quality; they also weaken downstream tasks such as character recognition, epigraphic analysis, and archival preservation. Oracle bone rubbings are a representative and challenging case because their strokes are thin, structurally sparse, and highly sensitive to local breakage. A restoration model must therefore recover both global structural coherence and local stroke realism rather than only fill missing pixels.

Image inpainting provides a natural formulation for this problem. Early exemplar- and diffusion-based methods can preserve local appearance in simple scenes, but they often struggle when large missing regions destroy structural continuity or when the surrounding context is highly nonuniform [1,2]. Deep learning has substantially advanced image inpainting by learning semantic priors from data. Context-encoding, gated-convolution, structure-aware, and attention-based models improve the plausibility of filled regions, while recurrent or coarse-to-fine designs can better infer large holes and damaged boundaries [3,4,5,6,7,8,9,10,11]. Recent surveys also show that modern inpainting systems increasingly rely on the coordinated use of reconstruction, attention, adversarial learning, and global-context modeling rather than a single completion operator [12,13]. Nevertheless, single-stage models often oversmooth thin strokes and may produce artifacts when the damaged region occupies semantically important image areas.

Recent transformer-based methods further improve long-range dependency modeling, which is crucial for restoration tasks that require distant stroke fragments to be reconnected consistently. Representative examples include incremental transformer enhancement for inpainting [14], lightweight stripe-window transformer designs [15], the Continuously Masked Transformer (CMT) [16], the reduced-information-loss transformer framework [17], and the HINT architecture with mask-aware encoding [18]. Survey work in this direction further confirms that transformer-based restoration is particularly effective when structural cues must be aggregated over large spatial ranges [13]. These methods demonstrate the importance of global context modeling, but transformer-only solutions may still underutilize fine local structural priors that are especially valuable for historical rubbing images with sparse binary strokes.

For oracle bone rubbings, the difficulty is even more specific. In addition to ordinary image corruption, the image foreground is composed of ancient inscription strokes whose continuity directly affects readability. Existing oracle-related studies have explored recognition, representation, denoising, and restoration from different perspectives, including oracle character modeling, cascade restoration, unsupervised structure-texture separation, coarse-to-fine generation, and attentive denoising [19,20,21,22,23,24,25]. Related restoration studies on historical documents and murals likewise indicate that cultural-heritage images benefit from structure-aware and texture-aware restoration strategies [26,27]. However, a generally applicable restoration framework that simultaneously leverages local binary structural cues, attention-guided refinement, and long-range transformer reasoning remains insufficiently developed.

Motivated by these observations, this paper formulates oracle bone rubbing restoration as a general structural and texture-aware image inpainting problem and proposes a three-stage generative adversarial framework. The rationale for using generative adversarial learning is that historical image restoration requires not only low reconstruction error but also perceptual realism around repaired strokes and damaged boundaries. Pixel-wise losses alone tend to generate oversmoothed results, whereas adversarial supervision can better preserve visual sharpness and realism. At the same time, adversarial learning is complemented here by LBP priors, spatial attention, and transformer-based global refinement so that the framework can balance local texture recovery with long-range structural consistency.

The main contributions of this paper are summarized as follows:

(1) We reformulate oracle bone rubbing restoration as a structural and texture-aware image inpainting problem and propose a three-stage generative adversarial framework specialized for oracle bone rubbing completion while retaining design ideas that may be informative for other degraded inscription images.
(2) We design a staged restoration pipeline in which an LBP-guided coarse completion module supplies local structural priors, a spatial-attention refinement module improves damaged-region feature aggregation, and a Swin-based refinement network enhances long-range consistency.
(3) We introduce a dual-discriminator supervision strategy that evaluates restored results from complementary global and local perspectives, thereby reducing artifacts and improving the realism of repaired stroke regions.
(4) We provide qualitative, quantitative, and ablation-based evaluations on oracle bone rubbing images, and we further report model size and inference cost to support a more transparent experimental discussion.

2. Related Work

2.1. CNN- and GAN-Based Image Inpainting

Classical image inpainting methods rely on non-local or exemplar-based propagation, which can work well for small defects but have limited capability when semantic structure is missing [1]. With the development of deep learning, encoder–decoder models and GAN-based inpainting methods became dominant because they can learn richer structural priors from large collections of intact images. Context Encoders [2] introduced one of the earliest deep generative formulations for inpainting. Subsequent methods improved structural reasoning and region consistency by using contextual attention, gated convolution, pyramid context encoding, structure-aware flow, recurrent feature reasoning, auxiliary reconstruction, and dual-generation strategies [3,4,5,6,7,8,9]. More recent CNN/GAN methods have adopted LBP-guided feature learning, coarse-to-fine restoration, local–global refinement, and multi-stage supervision to better handle large irregular masks [10,11,28]. These advances show that adversarial learning remains effective for improving perceptual quality, but the restoration of thin, discontinuous strokes is still difficult when only generic image priors are used. Broader reviews have also noted that balancing structural fidelity and texture realism remains a central challenge in deep inpainting [12].

2.2. Transformer-Based Inpainting and Image Restoration

Transformers have recently become an important tool for image inpainting because self-attention can capture long-range dependencies across visible and missing regions. Dong et al. [14] introduced transformer structure enhancement for inpainting with masking positional encoding. Ko and Kim [16] proposed the CMT model, which updates a continuous mask representation during restoration and improves token reliability in damaged regions. Liu et al. [15] explored lightweight stripe-window transformer reasoning, Liu et al. [17] reduced information loss in transformer-based pluralistic image completion, and Chen et al. [18] further improved mask-aware downsampling and attention design for high-quality inpainting. Very recent work has also explored adaptive priors for stronger multi-scale restoration [29]. Review studies have summarized these trends and highlighted the continuing importance of combining global transformer reasoning with explicit local-detail constraints [13]. These studies indicate that transformer-based inpainting is especially effective when large missing regions require global reasoning, but such models still benefit from explicit local-texture guidance.

2.3. Oracle Bone and Historical Inscription Image Analysis

Research on oracle bone images has focused on recognition, representation learning, restoration, and denoising. OraclePoints [21] and hierarchical oracle character representations [19] demonstrate that oracle characters benefit from multi-scale geometric modeling. Sundial-GAN [22] explores cascade restoration for deciphering oracle bone inscriptions; Wang et al. [23] propose a coarse-to-fine generative model specialized for oracle bone inpainting; and recent denoising/restoration studies further emphasize the value of attentive oracle-specific priors [24,25]. In recognition-oriented settings, structure–texture separation has also been shown to improve oracle character understanding [20]. More broadly, cultural-heritage restoration studies on degraded murals and historical documents similarly show that structure-aware and texture-aware strategies are necessary for visually convincing recovery [26,27]. Compared with generic natural-image inpainting, oracle bone rubbings pose a distinctive challenge: incomplete radicals and fractured, thin strokes must be reconstructed without destroying the historical appearance of the rubbing background. This motivates the present work, which integrates local binary texture priors, attention-guided refinement, and transformer-based global reasoning within a unified restoration framework.

3. Method

3.1. Problem Formulation

Let $I_{gt} \in \mathbb{R}^{H \times W \times C}$ denote the intact rubbing image and let $M \in \{0, 1\}^{H \times W}$ denote the binary mask, where $M(p) = 1$ indicates a missing pixel at location $p$. The damaged input image is defined as

$$I_{d} = I_{gt} \odot (1 - M), \tag{1}$$

where $\odot$ denotes element-wise multiplication. The restoration objective is to learn a mapping

$$\hat{I} = G(I_{d}, M), \tag{2}$$

such that the restored image $\hat{I}$ is structurally consistent with $I_{gt}$ and visually realistic in both missing and visible regions. Unlike generic natural-image inpainting, the target images in our case contain sparse and semantically meaningful inscription strokes; accordingly, the restoration process must preserve broken-stroke topology, local binary texture, and the overall appearance of the rubbing background.
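As a concrete illustration of Equation (1), the following NumPy sketch builds a damaged input from an intact image and a binary mask. The function name `apply_mask`, the channel-last layout, and the toy array sizes are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def apply_mask(i_gt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Return I_d = I_gt * (1 - M), where M == 1 marks a missing pixel.

    i_gt: float array of shape (H, W, C); mask: array of shape (H, W) in {0, 1}.
    """
    if i_gt.shape[:2] != mask.shape:
        raise ValueError("mask must match the spatial size of the image")
    keep = 1.0 - mask.astype(i_gt.dtype)   # 1 on visible pixels, 0 inside holes
    return i_gt * keep[..., None]          # broadcast the mask over channels

# Toy example (hypothetical data): a 4x4 single-channel image with a 2x2 hole.
i_gt = np.ones((4, 4, 1), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
i_d = apply_mask(i_gt, mask)
```

In this toy case the four masked pixels are zeroed and the remaining twelve keep their original values.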

3.2. Overall Architecture and Inter-Stage Workflow

Figure 1 summarizes the proposed pipeline. The method is organized as three consecutive stages so that coarse structure recovery, region-focused refinement, and long-range consistency modeling are handled separately rather than by a single monolithic network. Given the damaged input image $I_{d}$ and the binary mask $M$, Stage I first extracts LBP-guided structural cues and outputs a coarse restoration $I^{(1)}$ together with an LBP-aware feature prior $F_{lbp}$. Stage II then takes $I^{(1)}$, $F_{lbp}$, and $M$ as inputs, applies spatial-attention refinement, and produces an improved intermediate restoration $I^{(2)}$ and refined features $F_{sar}$. Finally, Stage III uses a Swin-based encoder–decoder to refine $I^{(2)}$ and generate the final result $\hat{I}$.

This sequential design is important for oracle bone rubbings because the restoration task involves three coupled requirements: recovering thin broken strokes, suppressing local artifacts inside damaged regions, and preserving global glyph continuity over relatively long spatial distances. The architecture figure is therefore presented as an activity-style workflow diagram: each block denotes a processing module, each arrow denotes data or feature transfer, and each stage output is an explicit artifact consumed by the next stage.
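The inter-stage data flow described in this subsection can be sketched as plain function composition. The names `run_three_stage` and `RestorationTrace` are hypothetical, the stages are stand-in stubs rather than neural networks, and passing the Stage II features to Stage III is an assumption, since the text only states that Stage III refines the Stage II image.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class RestorationTrace:
    """Intermediate artifacts passed between stages (names are illustrative)."""
    coarse: Any    # I^(1): Stage I coarse restoration
    refined: Any   # I^(2): Stage II intermediate restoration
    final: Any     # final restored image

def run_three_stage(
    i_d: Any,
    mask: Any,
    stage1: Callable[[Any, Any], Tuple[Any, Any]],
    stage2: Callable[[Any, Any, Any], Tuple[Any, Any]],
    stage3: Callable[[Any, Any], Any],
) -> RestorationTrace:
    coarse, f_lbp = stage1(i_d, mask)             # Stage I: coarse image + LBP-aware prior
    refined, f_sar = stage2(coarse, f_lbp, mask)  # Stage II: spatial-attention refinement
    final = stage3(refined, f_sar)                # Stage III: assumed to also receive F_sar
    return RestorationTrace(coarse, refined, final)

# Stub stages that only record the call order; real stages would be networks.
call_order = []

def stub_stage1(i_d, mask):
    call_order.append("stage1")
    return "I1", "F_lbp"

def stub_stage2(coarse, f_lbp, mask):
    call_order.append("stage2")
    return "I2", "F_sar"

def stub_stage3(refined, f_sar):
    call_order.append("stage3")
    return "I_hat"

trace = run_three_stage("I_d", "M", stub_stage1, stub_stage2, stub_stage3)
```

Keeping every intermediate output in a small record mirrors the idea that each stage output is an explicit artifact that can be inspected or supervised separately.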

3.3. Three-Stage Restoration Pipeline

3.3.1. Stage I: LBP-Guided Coarse Completion

The first stage is the LBP-Feature Learning Network (LBP-FL Net). Its main role is to recover coarse structure in a representation that is sensitive to local binary texture and boundary transitions. Let $L(\cdot)$ denote the LBP extraction operator. From the damaged image $I_{d}$ we obtain an incomplete LBP map $L_{d} = L(I_{d})$, which is concatenated with the mask and forwarded to generator $G_{1}$. The generator follows a U-Net-style encoder–decoder with skip connections so that low-level edge information can be preserved while missing regions are inferred using broader contextual evidence.

The output of Stage I serves two purposes. First, it provides a coarse structural restoration that already approximates the topology of missing strokes. Second, it produces an LBP-aware feature prior that guides the subsequent refinement stage toward historically plausible local textures. The accompanying discriminator $D_{1}$ follows the PatchGAN principle and evaluates patch-level realism, which is appropriate for rubbing images where local stroke segments and boundaries carry important perceptual cues.
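To make the LBP extraction step more tangible, the sketch below computes the basic 8-neighbour, 3x3 local binary pattern of a grayscale image and stacks it with the mask as a two-channel generator input. The paper does not state which LBP variant, border handling, or channel layout is used, so the function `lbp_3x3`, the edge-replicated border, and the channel-first stacking are all assumptions made for illustration.

```python
import numpy as np

def lbp_3x3(gray: np.ndarray) -> np.ndarray:
    """Basic 8-neighbour LBP code (0..255) for each pixel of a 2-D image.

    A neighbour contributes a 1-bit when its value is >= the centre value.
    Borders use edge replication, which is an illustrative choice.
    """
    if gray.ndim != 2:
        raise ValueError("expected a 2-D grayscale image")
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    # Neighbour offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h, w), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        codes |= (neighbour >= gray).astype(np.uint8) * np.uint8(1 << bit)
    return codes

# Toy damaged image (hypothetical data): one bright pixel on a dark background.
i_d = np.zeros((3, 3), dtype=np.uint8)
i_d[1, 1] = 9
mask = np.zeros((3, 3), dtype=np.uint8)

l_d = lbp_3x3(i_d)                         # incomplete LBP map L_d = L(I_d)
g1_input = np.stack([l_d, mask], axis=0)   # LBP map concatenated with the mask
```

The bright centre pixel receives code 0 because no neighbour reaches its value, while every background pixel receives code 255.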

3.3.2. Stage II: Spatial-Attention Refinement with Dual Discriminators

The second stage is the spatial-attention refinement network (SAR Net). While Stage I supplies a useful coarse prior, local stroke fragments and boundaries can still be discontinuous. Stage II, therefore, fuses the broken input image, the Stage I output, and the mask, and injects a spatial-attention module into the decoder to emphasize correlations between known regions and missing regions. The attention mechanism adaptively highlights features that are relevant to damaged stroke segments and suppresses background interference introduced by rubbing noise and uneven illumination.

A dual-discriminator strategy is employed to stabilize adversarial learning and to enforce complementary restoration constraints. The global discriminator evaluates the coherence of the entire restored image conditioned on the mask, whereas the local discriminator focuses on patch-level realism around the repaired regions. This design is motivated by the observation that historical rubbing restoration requires two types of fidelity simultaneously: global consistency of character structure and local realism of individual stroke fragments. The detailed dual-discriminator architecture is illustrated in Figure 2.
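The text does not specify how the local discriminator receives the repaired region. One common option, shown here purely as an assumption, is to crop the bounding box of the mask (optionally with a margin) and pass that crop to the local discriminator, while the global discriminator sees the full image. The helper names `mask_bounding_box` and `local_view` are hypothetical.

```python
import numpy as np

def mask_bounding_box(mask: np.ndarray, margin: int = 0):
    """Return (top, bottom, left, right) of the masked area, clipped to the image."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask contains no missing pixels")
    h, w = mask.shape
    top = max(int(ys.min()) - margin, 0)
    bottom = min(int(ys.max()) + 1 + margin, h)
    left = max(int(xs.min()) - margin, 0)
    right = min(int(xs.max()) + 1 + margin, w)
    return top, bottom, left, right

def local_view(image: np.ndarray, mask: np.ndarray, margin: int = 0) -> np.ndarray:
    """Crop the region a local discriminator could inspect."""
    top, bottom, left, right = mask_bounding_box(mask, margin)
    return image[top:bottom, left:right]

# Toy example (hypothetical data): an 8x8 image with a 3x3 hole.
image = np.arange(64, dtype=np.float32).reshape(8, 8, 1)
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:6] = 1
box = mask_bounding_box(mask, margin=1)
patch = local_view(image, mask, margin=1)
```

A small margin lets the local view include the boundary between restored and visible pixels, which is where seams tend to appear.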

3.3.3. Stage III: Swin-Based Global Refinement

The third stage further improves the intermediate result using a customized Swin-based refinement network. Compared with conventional convolutional refinement, the hierarchical Swin Transformer is better suited to aggregating long-range evidence from dispersed visible stroke fragments. This is particularly helpful when the damaged region interrupts a radical or when the visible supporting strokes are separated by a relatively large spatial distance.

The encoder progressively tokenizes and merges patches to obtain multi-scale contextual features, while the decoder upsamples the features and fuses them with encoder-side skip information. Through shifted-window self-attention and hierarchical representation learning, this stage reduces artifacts left by earlier stages and improves the global continuity of restored strokes. The detailed Swin-based refinement network is shown in Figure 3, and the final output of this stage is the restored image $\hat{I}$.
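The shifted-window mechanism mentioned above can be illustrated without any attention computation. The sketch below shows only how a feature map is split into non-overlapping windows and how a cyclic shift changes which positions share a window. The window size, the half-window shift, and the function names are assumptions; the attention masking that Swin applies to wrapped-around positions is omitted.

```python
import numpy as np

def window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Split an (H, W, C) map into (num_windows, window, window, C) tiles."""
    h, w, c = x.shape
    if h % window or w % window:
        raise ValueError("H and W must be divisible by the window size")
    x = x.reshape(h // window, window, w // window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, c)

def shifted_window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Cyclically shift by half a window, then partition as usual."""
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

# Toy 4x4 map whose values equal their flat index, so tiles are easy to read.
x = np.arange(16).reshape(4, 4, 1)
regular = window_partition(x, window=2)
shifted = shifted_window_partition(x, window=2)
```

After the shift, the first window groups positions that belonged to four different windows in the regular partition, which is how information crosses window borders between consecutive layers.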

3.4. Objective Functions and Optimization

The optimization strategy is organized stage by stage so that each module is trained according to its functional role. Let $\hat{I}^{(2)}$ denote the Stage II output and $\hat{I}$ denote the final restored image. We first define a weighted reconstruction term that separates errors in visible and missing regions:

$$L_{valid} = \frac{\lVert (1 - M) \odot (\hat{I} - I_{gt}) \rVert_{1}}{\sum (1 - M)}, \tag{3}$$

$$L_{hole} = \frac{\lVert M \odot (\hat{I} - I_{gt}) \rVert_{1}}{\sum M}, \tag{4}$$

$$L_{pwr} = L_{valid} + \lambda_{h} L_{hole}, \tag{5}$$

where $\lambda_{h}$ controls the relative importance of masked-region reconstruction.
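Equations (3) to (5) translate almost directly into array code. The following NumPy sketch is one possible reading, assuming a channel-last image and a mask that is broadcast over channels. The function name, the guard against an empty region, and the value used for the hole weight in the example are assumptions, since the source does not report that weight.

```python
import numpy as np

def weighted_reconstruction(i_hat, i_gt, mask, lambda_h):
    """Return (L_pwr, L_valid, L_hole) for (H, W, C) images and an (H, W) mask."""
    m = mask.astype(np.float64)[..., None]            # 1 inside holes
    diff = np.abs(i_hat.astype(np.float64) - i_gt.astype(np.float64))
    n_valid = max(float((1.0 - m).sum()), 1.0)        # guard: avoid division by zero
    n_hole = max(float(m.sum()), 1.0)
    l_valid = float(((1.0 - m) * diff).sum()) / n_valid   # Equation (3)
    l_hole = float((m * diff).sum()) / n_hole             # Equation (4)
    return l_valid + lambda_h * l_hole, l_valid, l_hole   # Equation (5)

# Toy example (hypothetical data): error 0.5 inside a 2x2 hole, 0.1 elsewhere.
i_gt = np.zeros((4, 4, 1))
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
i_hat = np.where(mask[..., None] == 1, 0.5, 0.1)
l_pwr, l_valid, l_hole = weighted_reconstruction(i_hat, i_gt, mask, lambda_h=6.0)
```

Normalising each term by its own pixel count keeps the hole term from being swamped by the much larger visible area when the masking ratio is small.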

3.4.1. Stage I Loss

Stage I focuses on coarse structure inference in the LBP domain. Its reconstruction loss is written as

$$L_{rec}^{(1)} = \lVert \hat{L} - L_{gt} \rVert_{2}^{2}, \tag{6}$$

where $\hat{L}$ is the predicted LBP map and $L_{gt} = L(I_{gt})$ is the ground-truth LBP map. The corresponding adversarial supervision is

$$L_{adv}^{(1)} = E[\log D_{1}(L_{gt})] + E[\log(1 - D_{1}(\hat{L}))], \tag{7}$$

and the Stage I total loss is

$$L^{(1)} = \lambda_{g} L_{rec}^{(1)} + \lambda_{a} L_{adv}^{(1)} + L_{pwr} + L_{tv}, \tag{8}$$

where $L_{tv}$ is the total-variation regularizer used to suppress isolated noisy fluctuations.
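Two of the Stage I terms are simple enough to sketch directly. The squared-error term follows Equation (6); the total-variation term is written in its common anisotropic form (sum of absolute differences between neighbouring pixels), which is an assumption because the source does not give the exact definition it uses. Function names are illustrative.

```python
import numpy as np

def lbp_reconstruction_loss(l_hat: np.ndarray, l_gt: np.ndarray) -> float:
    """Squared L2 distance between predicted and ground-truth LBP maps."""
    diff = l_hat.astype(np.float64) - l_gt.astype(np.float64)
    return float(np.sum(diff ** 2))

def total_variation(img: np.ndarray) -> float:
    """Anisotropic total variation of a 2-D map (assumed form)."""
    img = img.astype(np.float64)
    vertical = np.abs(img[1:, :] - img[:-1, :]).sum()
    horizontal = np.abs(img[:, 1:] - img[:, :-1]).sum()
    return float(vertical + horizontal)

# Toy examples (hypothetical data).
flat = np.full((4, 4), 3.0)      # constant image: no variation
step = np.zeros((4, 4))
step[:, 2:] = 1.0                # one vertical edge, crossed once per row
tv_flat = total_variation(flat)
tv_step = total_variation(step)
rec = lbp_reconstruction_loss(np.ones((2, 2)), np.zeros((2, 2)))
```

A clean edge contributes only once per row, whereas isolated noisy pixels contribute on every side, which is why this kind of penalty discourages speckle more than it discourages strokes.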

3.4.2. Stage II Loss

Stage II emphasizes structural refinement and local realism. We therefore use a multi-scale feature consistency term

$$L_{ms} = \sum_{h \in H} \lVert \Phi_{h}(\hat{I}^{(2)}) - \Phi_{h}(I_{gt}) \rVert_{2}^{2}, \tag{9}$$

where $\Phi_{h}(\cdot)$ denotes the feature map extracted at layer $h$ and $H$ is the set of selected layers. To preserve both semantic similarity and texture statistics, we further introduce perceptual and style losses:

$$L_{per} = \sum_{i} \lVert \Psi_{i}(\hat{I}^{(2)}) - \Psi_{i}(I_{gt}) \rVert_{1}, \tag{10}$$

$$L_{style} = \sum_{i} \lVert \Gamma(\Psi_{i}(\hat{I}^{(2)})) - \Gamma(\Psi_{i}(I_{gt})) \rVert_{1}, \tag{11}$$

where $\Psi_{i}(\cdot)$ denotes the $i$-th perceptual feature extractor and $\Gamma(\cdot)$ denotes the Gram-matrix operator. The style term is especially useful for suppressing local checkerboard-like artifacts in repaired regions [30,31].
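The Gram-matrix operator in Equation (11) can be sketched as follows. Feature maps are represented by plain arrays of shape (C, H, W) standing in for the outputs of the perceptual extractors, which the source does not specify. The normalisation by C·H·W is a common convention adopted here as an assumption, and the function names are illustrative.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w).astype(np.float64)
    return flat @ flat.T / (c * h * w)      # assumed normalisation

def style_loss(feats_pred, feats_gt) -> float:
    """Sum of L1 distances between Gram matrices over the selected levels."""
    return float(sum(
        np.abs(gram_matrix(p) - gram_matrix(g)).sum()
        for p, g in zip(feats_pred, feats_gt)
    ))

# Toy features (hypothetical data): one level with 2 channels of size 2x2.
ones = np.ones((2, 2, 2))
zeros = np.zeros((2, 2, 2))
g = gram_matrix(ones)
same = style_loss([ones], [ones])
different = style_loss([zeros], [ones])
```

Because the Gram matrix discards spatial positions and keeps only channel co-activation statistics, it compares texture character rather than exact pixel placement.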

The dual-discriminator adversarial term is defined as

$$L_{adv}^{dual} = \sum_{k \in \{g, l\}} \left( E[\log D_{k}(I_{gt}, M)] + E[\log(1 - D_{k}(\hat{I}^{(2)}, M))] \right), \tag{12}$$

where $D_{g}$ and $D_{l}$ denote the global and local discriminators, respectively. Using $L_{rec} = \lVert \hat{I}^{(2)} - I_{gt} \rVert_{2}^{2}$, the Stage II objective becomes

$$L^{(2)} = \lambda_{m} L_{ms} + \lambda_{r} L_{rec} + \lambda_{a1} L_{adv}^{dual} + L_{per} + \lambda_{s} L_{style} + L_{pwr} + L_{tv}, \tag{13}$$

where the coefficients are set to $\lambda_{m} = 10$, $\lambda_{r} = 0.01$, and $\lambda_{a1} = 0.2$. The relatively large $\lambda_{m}$ emphasizes structural reconnection of broken strokes, the small $\lambda_{r}$ avoids excessive smoothing, and the moderate adversarial weight improves local realism without destabilizing training.
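The way Equations (12) and (13) combine can be sketched with scalar stand-ins for each term. The three weights below are the values reported in the text; the style weight is not reported, so it is left as a required argument and the value used in the example is purely hypothetical. Discriminator outputs are represented by single probabilities rather than mini-batch averages, which is a simplification.

```python
import math

LAMBDA_M = 10.0    # multi-scale feature consistency weight (from the text)
LAMBDA_R = 0.01    # reconstruction weight (from the text)
LAMBDA_A1 = 0.2    # dual-adversarial weight (from the text)

def dual_adversarial_value(d_real: dict, d_fake: dict) -> float:
    """Equation (12) for one sample: sum over global ('g') and local ('l').

    d_real[k] and d_fake[k] are discriminator probabilities in (0, 1).
    """
    return sum(
        math.log(d_real[k]) + math.log(1.0 - d_fake[k]) for k in ("g", "l")
    )

def stage2_objective(terms: dict, lambda_s: float) -> float:
    """Weighted sum of Equation (13); `terms` holds precomputed scalar losses."""
    return (
        LAMBDA_M * terms["ms"]
        + LAMBDA_R * terms["rec"]
        + LAMBDA_A1 * terms["adv_dual"]
        + terms["per"]
        + lambda_s * terms["style"]
        + terms["pwr"]
        + terms["tv"]
    )

# Toy values (hypothetical): undecided discriminators and unit-valued terms.
adv = dual_adversarial_value({"g": 0.5, "l": 0.5}, {"g": 0.5, "l": 0.5})
unit_terms = {k: 1.0 for k in ("ms", "rec", "adv_dual", "per", "style", "pwr", "tv")}
total = stage2_objective(unit_terms, lambda_s=1.0)
```

With unit-valued terms the weighted sum makes the relative emphasis visible: the multi-scale term contributes a thousand times more than the plain reconstruction term.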

3.4.3. Stage III Loss and Training Schedule

The third stage mainly performs detail refinement and global consistency correction. Its objective is therefore lighter than that of Stage II:

$$L^{(3)} = L_{pwr} + \lambda_{a1} L_{adv}^{dual} + L_{tv} + \lambda_{r} L_{rec}. \tag{14}$$

During training, the generator and discriminators are updated alternately. Here, $E[\cdot]$ denotes the empirical expectation over mini-batch samples, $\sum$ denotes summation, and $\odot$ denotes element-wise multiplication. In this manuscript, $M$ is the binary mask, $I_{gt}$ is the intact target image, $\hat{L}$ is the predicted LBP map, and $L_{gt}$ is the ground-truth LBP map. In addition, $\Phi_{h}$ denotes the feature map extracted from layer $h$, $\Psi_{i}$ denotes the perceptual feature extractor at level $i$, and $\Gamma$ denotes the Gram-matrix operator.

4. Experiment

4.1. Dataset and Experimental Protocol

The experiments were conducted on an oracle bone inscription rubbing dataset collected by the Key Laboratory of Oracle Bone Inscriptions Information Processing, Henan Provincial Department of Education. The dataset used in this study contains 20,000 oracle bone rubbing images and was divided into training, validation, and test subsets in an 8:1:1 ratio, yielding 16,000 training images, 2000 validation images, and 2000 test images. Here, the term intact image refers to a relatively complete oracle bone rubbing sample retained after preprocessing and quality screening; it serves as the reference image before synthetic masking and does not imply that the archaeological carrier itself is physically perfect.

To avoid data leakage, these intact reference images were first split at the image level and only then converted into damaged samples through random-mask operations inside each subset. As a result, different masked variants generated from the same intact image were never distributed across different subsets. Three masking-rate ranges, namely 10–20%, 20–30%, and 30–40%, were used to simulate increasingly severe damage conditions. The damaged images served as network inputs and the corresponding intact images were used as ground truth.
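The leakage-avoiding order of operations (split first, mask afterwards) can be sketched with the standard library alone. The seed, the function names, and the choice to draw one masking ratio per range for every image are assumptions for illustration; the source states only the 8:1:1 image-level split and the three masking-rate ranges.

```python
import random

MASK_RANGES = [(0.10, 0.20), (0.20, 0.30), (0.30, 0.40)]

def split_ids(image_ids, ratios=(8, 1, 1), seed=0):
    """Shuffle intact-image IDs once and split them at the image level."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

def masked_variants(subset_ids, rng):
    """Create (image_id, mask_ratio) pairs inside one subset only."""
    return [
        (image_id, rng.uniform(low, high))
        for image_id in subset_ids
        for low, high in MASK_RANGES
    ]

splits = split_ids(range(20000))
rng = random.Random(1)
variants = {name: masked_variants(ids, rng) for name, ids in splits.items()}
```

Because variants are generated per subset after the split, every masked copy of a given intact image stays in the same subset as its source.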

4.2. Qualitative Comparison

Figure 4 and Figure 5 compare representative restoration results under moderate and relatively severe masking settings. The blue, pink, and white pseudo-colored regions visible in some comparison panels are not manual annotations or visualization overlays. Rather, they are abnormal restoration artifacts produced by several baseline methods when the damaged regions are not correctly reconstructed under challenging masking conditions. We retain these outputs because they reflect the actual behavior of the compared methods and therefore provide a faithful qualitative comparison. Qualitative assessment is based mainly on four criteria: stroke continuity across the masked region, radical completeness, boundary smoothness between restored and visible areas, and the absence of obvious artifacts in the filled region. Under these criteria, the proposed method recovers missing regions more faithfully than the compared models, especially in terms of stroke continuity and boundary smoothness. This advantage is most visible in thin radicals and partially missing vertical or oblique stroke segments, where continuity must be inferred from sparse contextual evidence.

Among the compared methods, PEN exhibits the most severe structural collapse under large missing regions, while RFR improves the rough outline but still struggles to maintain coherent stroke endings and local shape integrity. LBPLSA benefits from LBP-guided features but may still generate local artifacts when the damaged region intersects semantically important radicals. By contrast, the proposed framework better preserves thin-stroke topology and produces smoother transitions between restored and visible regions. Under heavier corruption, the advantage of the staged design becomes more apparent because the LBP prior, spatial-attention refinement, and Swin-based global reasoning complement one another.

4.3. Quantitative Comparison

For quantitative evaluation, we report PSNR, SSIM, and $L_{1}$ distance. The comparison includes PEN, RFR, LBPLSA, LG, CFGM, and SWT-CNN as representative baselines covering context-based, recurrent, LBP-enhanced, local–global, coarse-to-fine, and transformer-assisted inpainting paradigms. The metric directions are indicated explicitly in the table headers so that the relative advantage of each method can be read directly from the corresponding table.
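For reference, PSNR and a mean absolute (L1) distance can be computed as in the sketch below. The peak value of 255 and the per-pixel averaging of the L1 distance are assumptions, since the source does not state its normalisation; SSIM is omitted because a faithful implementation is considerably longer.

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (higher is better)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")                 # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

def mean_l1(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute difference per pixel (lower is better)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(np.abs(diff)))

# Toy example (hypothetical data): every pixel is off by one grey level.
target = np.full((8, 8), 100, dtype=np.uint8)
pred = np.full((8, 8), 101, dtype=np.uint8)
score_psnr = psnr(pred, target)
score_l1 = mean_l1(pred, target)
```

Converting to floating point before subtracting matters for 8-bit inputs, because unsigned integer subtraction would otherwise wrap around.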

Table 1 reports the results under the lightest masking setting (10–20%). In this case, the proposed method achieves the best values on all three metrics, indicating that the three-stage design can effectively recover local detail when the amount of missing information is still moderate.

Table 2 shows the comparison under the 20–30% masking range. The proposed method yields the highest PSNR and remains very close to the strongest baselines in SSIM and $L_{1}$, which suggests that the staged refinement process remains effective as the damage level increases.

Table 3 summarizes the most difficult setting (30–40% masking). Although pixel-level restoration becomes harder in this regime, the proposed method still achieves the best SSIM among the reported models, indicating that it preserves structural consistency well even under severe corruption.

Taken together, the three tables show that the proposed framework performs best overall under the 10–20% masking setting and remains competitive as the masking ratio increases. These trends support the motivation of the staged design: local structural priors are especially helpful at moderate corruption levels, while global transformer refinement becomes increasingly important as missing regions enlarge.

4.4. Ablation Study

Ablation experiments were conducted under the 20–30% masking setting to evaluate the contribution of the main design components. The comparison includes a baseline without the complete refinement pipeline (denoted “NO”), the CN variant, and the full DP+Swin model.

Figure 6 shows that the CN variant already improves structural coherence relative to the simpler baseline, while the full DP+Swin model further improves local detail, stroke continuity, and the transition quality near damaged boundaries. This observation is consistent with the role of the full framework: CN provides stronger intermediate restoration, and the combination of dual-discriminator supervision with the Swin refinement stage reduces artifacts and yields more stable final reconstructions.

Table 4 confirms that the complete DP+Swin configuration yields the strongest overall performance among the compared variants. The PSNR gain over the NO baseline demonstrates that the later refinement stages contribute substantially to reconstruction quality. The improved SSIM and reduced $L_{1}$ value further indicate that the complete model offers a better balance between structural similarity and local fidelity.

4.5. Computational Cost and Limitation Analysis

Table 5 compares parameter count, inference time, and MAE. The proposed method has the largest parameter count and the highest inference time among the listed models, which is expected because the third stage introduces an additional transformer-based refinement module.

This increase in complexity is the cost of improved restoration quality. In practical digital-humanities scenarios, however, the reported inference time remains acceptable for offline restoration and archival processing.

Table 6 reports the software/hardware configuration and the training cost under the 30–40% masking setting. The implementation was executed on a Linux workstation using Python 3.10, PyTorch 1.13.1 with CUDA 11.7, torchvision 0.14.1, and torchaudio 0.13.1. Training was performed on a single NVIDIA GeForce RTX 3090 GPU with 23.69 GB VRAM, using a batch size of 1 and full-precision training (mixed precision disabled). Under this setting, the final training schedule is configured for 500 epochs. Assuming the same iteration schedule as the earlier 350-epoch configuration, this corresponds to approximately 1,000,000 iterations and an estimated wall-clock training time of about 292,062.83 s (approximately 81.13 h). The peak allocated GPU memory was 10.00 GB, while the peak reserved memory reached 14.79 GB. These statistics clarify the practical computational budget required to reproduce the proposed GAN-based framework.
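The unit conversion behind the reported training time can be checked with a few lines of arithmetic. The per-iteration figure below is derived from the approximate iteration count given in the text, so it should be read as a rough estimate rather than a measured value.

```python
TOTAL_SECONDS = 292_062.83      # estimated wall-clock training time (from the text)
ITERATIONS = 1_000_000          # approximate iteration count (from the text)

hours = TOTAL_SECONDS / 3600.0
seconds_per_iteration = TOTAL_SECONDS / ITERATIONS   # derived, approximate

print(f"{hours:.2f} h, about {seconds_per_iteration:.3f} s per iteration")
```

The result agrees with the stated 81.13 h and corresponds to roughly 0.29 s per iteration at batch size 1.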

Despite its advantages, the method still becomes less reliable under very high mask ratios or when semantically critical radicals are heavily damaged in the central image region. In such cases, even transformer-based long-range reasoning may be insufficient because the available contextual evidence is too sparse. Future work may incorporate radical-level priors, stronger semantic supervision, or diffusion-style post-refinement to further improve restoration robustness under extreme corruption.

5. Conclusions

This paper presents a three-stage generative adversarial framework for structural and texture-aware image inpainting and validates it on oracle bone rubbing images. The framework combines LBP-guided coarse completion, spatial-attention refinement with dual-discriminator supervision, and Swin-based global refinement. By organizing restoration into explicit stages and passing intermediate artifacts between them, the proposed method improves both local stroke realism and global structural consistency.

Experimental results demonstrate that the proposed framework performs favorably against several representative baselines in both qualitative and quantitative comparisons. The ablation study further verifies that the staged design and the combination of dual-discriminator supervision with transformer-based refinement contribute meaningfully to the final performance. Because oracle bone glyph morphology is highly specialized, the present work should be understood as a domain-focused restoration study rather than as a demonstrated cross-domain completion model. The three-stage design may still provide useful ideas for other sparse historical inscription data, but such transfer requires dedicated future validation.

In future work, the model can be extended with richer semantic priors, better uncertainty modeling for severe damage, and more complete reproducibility reporting of optimizer settings and multi-seed statistics. These directions will help make the framework more robust for digital conservation and recognition-oriented analysis.

Author Contributions

Conceptualization, W.S. and S.W.; methodology, W.S.; software, W.S.; validation, W.S., Y.X. and C.Z.; formal analysis, W.S.; investigation, W.S., Y.X., C.Z. and J.Y.; resources, S.W.; data curation, W.S.; writing—original draft preparation, W.S.; writing—review and editing, S.W.; supervision, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank the related institutions and collaborators for their support of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Activity-style architecture of the proposed model. Stage I is the LBP-FL Net, Stage II is the SAR Net, and Stage III is the Swin Refinement Net. Dashed arrows denote skip connections. Rather than showing specific oracle characters, the figure highlights the intermediate artifacts passed from one stage to the next: coarse restoration, attention-refined restoration, and final globally consistent restoration.

Figure 2. Dual-discriminator architecture used in the refinement stage. The global branch operates on image–mask pairs and constrains overall structural consistency, whereas the local branch focuses on patch-level stroke realism in restored regions.

Figure 3. Customized Swin-based refinement network. The network adopts a hierarchical encoder–decoder design with tokenization, progressive patch merging, multi-scale Swin blocks, and symmetric upsampling for final restoration refinement.

Figure 4. Qualitative comparison under a masking rate of 20% to 30%. Columns from left to right: input, PEN, RFR, LBPLSA, LG, SWT-CNN, and the proposed method. Some baseline methods produce abnormal pseudo-colored artifacts in damaged regions, indicating unstable restoration under moderate corruption. The bottom strip provides zoom-ins highlighting stroke continuity in restored regions.

Figure 5. Qualitative comparison under a masking rate of 30% to 40%. Columns from left to right: input, PEN, RFR, LBPLSA, LG, CFGM, SWT-CNN, and the proposed method. Under heavier corruption, several baseline methods exhibit more obvious pseudo-colored restoration artifacts, revealing degraded robustness when restoring sparse inscription strokes. The bottom strip provides zoom-ins highlighting stroke continuity in restored regions.

Figure 6. Ablation comparison at a masking rate of 20% to 30%: (a) ground truth, (b) masked input, (c) LBPLSA baseline, (d) CN, and (e) the full model (DP+Swin).

Table 1. Comparison of models with irregular masks at a masking rate of 10% to 20%.

Method	PSNR ↑	SSIM ↑	L1 ↓
PEN	11.46	0.7445	0.0890
RFR	12.55	0.7812	0.0823
LBPLSA	30.89	0.7812	0.0050
LG	33.42	0.9898	0.0521
SWT-CNN	28.03	0.9819	0.0104
Ours	35.18	0.9906	0.0030
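For readers who wish to reproduce the PSNR and L1 columns, the sketch below uses the standard definitions of the two metrics. It assumes intensities normalized to [0, 1] (so the peak value is 1.0) and averages over the whole image; the function names are hypothetical, and whether the paper evaluates on the full image or only on masked regions is not stated here. SSIM is omitted because it requires a windowed implementation.

```python
import math
from typing import List

Image = List[List[float]]  # grayscale image with values in [0, 1]


def l1_error(pred: Image, target: Image) -> float:
    """Mean absolute difference over all pixels (lower is better)."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def psnr(pred: Image, target: Image, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    sq = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    mse = sum(sq) / len(sq)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_from_psnr(psnr_db: float, max_val: float = 1.0) -> float:
    """Invert the PSNR definition to recover the mean squared error."""
    return max_val ** 2 / (10.0 ** (psnr_db / 10.0))


target = [[0.0, 1.0], [1.0, 0.0]]
pred = [[0.0, 1.0], [1.0, 0.1]]  # one pixel is off by 0.1
example_l1 = l1_error(pred, target)
example_psnr = psnr(pred, target)
reported_mse = mse_from_psnr(35.18)  # MSE implied by the best PSNR in Table 1
```

Under the [0, 1] assumption, the reported 35.18 dB corresponds to a mean squared error of roughly 3 × 10⁻⁴.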

Table 2. Comparison of models with irregular masks at a masking rate of 20% to 30%.

Method	PSNR ↑	SSIM ↑	L1 ↓
PEN	8.67	0.6377	0.1628
RFR	14.67	0.8057	0.0567
LBPLSA	25.67	0.9719	0.0091
LG	27.01	0.9781	0.0066
CFGM	29.61	0.9826	0.0058
SWT-CNN	26.02	0.9610	0.0169
Ours	29.90	0.9807	0.0059

Table 3. Comparison of models with irregular masks at a masking rate of 30% to 40%.

Method	PSNR ↑	SSIM ↑	L1 ↓
PEN	9.29	0.5979	0.1427
RFR	15.94	0.8802	0.0441
LBPLSA	22.74	0.9232	0.0169
LG	29.46	0.9818	0.0053
CFGM	23.43	0.9567	0.0143
SWT-CNN	20.40	0.9055	0.0499
Ours	25.68	0.9829	0.0064
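Tables 1 to 3 group results by masking rate. A minimal sketch of how a binary mask could be assigned to one of these groups is given below. The helper names are hypothetical, and treating each range as a half-open interval (lower bound included, upper bound excluded) is an assumption, since the boundary handling is not specified here.

```python
from typing import List, Optional, Sequence

Mask = List[List[int]]  # 1 = missing pixel, 0 = known pixel


def mask_ratio(mask: Mask) -> float:
    """Fraction of pixels marked as missing."""
    total = sum(len(row) for row in mask)
    missing = sum(1 for row in mask for m in row if m)
    return missing / total


def ratio_bucket(ratio: float,
                 edges_pct: Sequence[int] = (10, 20, 30, 40)) -> Optional[str]:
    """Return the masking-rate group for a ratio, or None if it falls outside all groups."""
    pct = 100.0 * ratio
    for low, high in zip(edges_pct, edges_pct[1:]):
        if low <= pct < high:  # assumed half-open interval [low, high)
            return f"{low}-{high}%"
    return None


# 4 of 16 pixels are missing, so the ratio is 0.25.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
example_ratio = mask_ratio(example_mask)
example_bucket = ratio_bucket(example_ratio)
```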

Table 4. Quantitative ablation results on the oracle dataset.

Metric	NO (LBPLSA baseline)	CN	DP+Swin (full model)
PSNR ↑	25.67	26.59	29.90
SSIM ↑	0.9719	0.8841	0.9857
L1 ↓	0.0091	0.0126	0.0059

Table 5. Model size, average inference time, and MAE under a masking rate of 20% to 30%.

Method	Parameters (M)	Inference Time (s/Image)	MAE
LBPLSA	114.87	0.0908	0.0060
CFGM	120.82	0.0915	0.0043
Ours	194.07	0.1164	0.0190
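The quantities in Table 5 can be obtained with simple bookkeeping, sketched below in framework-independent form. The helpers `params_in_millions` and `average_time_per_item` are hypothetical names, and the toy layer and toy workload are placeholders rather than parts of the proposed model. In PyTorch, the parameter count is usually taken as `sum(p.numel() for p in model.parameters())`, and GPU timing normally calls `torch.cuda.synchronize()` before reading the clock; whether the authors measured it this way is an assumption.

```python
import math
import time
from typing import Callable, Iterable, Sequence, Tuple


def params_in_millions(shapes: Iterable[Tuple[int, ...]]) -> float:
    """Total number of scalar parameters, in millions, given tensor shapes."""
    return sum(math.prod(shape) for shape in shapes) / 1e6


def average_time_per_item(fn: Callable, items: Sequence, warmup: int = 1) -> float:
    """Mean wall-clock seconds per call of fn, after untimed warm-up calls."""
    for item in items[:warmup]:
        fn(item)
    start = time.perf_counter()
    for item in items:
        fn(item)
    return (time.perf_counter() - start) / len(items)


# Toy layer: a 3x3 convolution with 3 input and 64 output channels, plus its bias.
toy_params_m = params_in_millions([(64, 3, 3, 3), (64,)])
# Toy workload standing in for one forward pass per image.
toy_time = average_time_per_item(lambda x: sum(x), [list(range(1000))] * 20)

# Relative overhead of the proposed model versus LBPLSA, using the values in Table 5.
param_ratio = 194.07 / 114.87
time_ratio = 0.1164 / 0.0908
```

With the values in Table 5, the proposed model has about 1.69 times the parameters of LBPLSA and about 1.28 times its inference time per image.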

Table 6. Software/hardware requirements and training cost of the proposed method under the 30–40% masking setting.

Item	Value
Operating system/framework	Linux, Python 3.10, PyTorch 1.13.1 (CUDA 11.7)
Additional libraries	torchvision 0.14.1, torchaudio 0.13.1
GPU	NVIDIA GeForce RTX 3090
VRAM	23.69 GB
Batch size	1
Precision	FP32 (mixed precision disabled)
Training epochs	500
Total iterations	≈1,000,000
Total training time	≈292,062.83 s (≈81.13 h)
Peak allocated memory	10.00 GB
Peak reserved memory	14.79 GB
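The training-cost figures in Table 6 can be cross-checked with a few lines of arithmetic, shown below. The inputs are taken from the table. The derived values are approximate because the iteration count is itself reported as approximate, and reading the iterations per epoch as the approximate number of training samples per epoch assumes that one iteration processes one batch of size 1.

```python
# Values reported in Table 6.
total_seconds = 292_062.83
total_iterations = 1_000_000   # reported as approximate
epochs = 500
vram_gb = 23.69
peak_allocated_gb = 10.00
peak_reserved_gb = 14.79

# Derived quantities.
hours = total_seconds / 3600.0                      # total training time in hours
sec_per_iter = total_seconds / total_iterations     # average seconds per iteration
iters_per_epoch = total_iterations / epochs         # iterations in one epoch
reserved_fraction = peak_reserved_gb / vram_gb      # share of VRAM reserved at peak
allocated_over_reserved = peak_allocated_gb / peak_reserved_gb

print(f"{hours:.2f} h, {sec_per_iter:.3f} s/iter, {iters_per_epoch:.0f} iters/epoch")
print(f"reserved {reserved_fraction:.1%} of VRAM; "
      f"allocated/reserved = {allocated_over_reserved:.1%}")
```

The reported 81.13 h agrees with 292,062.83 s, and the totals imply roughly 0.29 s per training iteration and about 2000 iterations per epoch.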
