1. Introduction
Long-term monitoring of structural conditions and spatial dynamics is essential for protecting World Cultural Heritage sites [1,2,3,4,5,6]. To be effective, such monitoring requires imagery that captures fine spatial detail while remaining consistent over time, allowing slow structural changes and environmental impacts to be identified. Ground-based techniques such as terrestrial laser scanning and UAV photogrammetry can provide highly detailed information, but their data are typically collected during limited field campaigns and therefore cannot support regular, long-term observation across entire heritage landscapes.
High-resolution satellite imagery helps address this limitation by enabling repeated, large-area observations over extended periods. In practice, however, high-quality satellite images are not always available. Sensor characteristics, atmospheric conditions, and revisit intervals often result in missing high-resolution data for certain years or key periods [7,8,9,10,11,12]. These gaps make it difficult to construct reliable time series for long-term analysis. Super-resolution reconstruction offers a practical way to mitigate these limitations by enhancing lower-resolution satellite imagery. By producing more complete and consistent image records, this approach facilitates the analysis of long-term change and supports landscape-scale heritage monitoring.
With the advancement of deep learning techniques, data-driven super-resolution reconstruction methods have shown great potential in the field of remote sensing [13,14,15,16,17]. For instance, Lei et al. proposed a single-image super-resolution algorithm named local–global combined networks (LGCNet) for the public UC Merced remote sensing dataset [18]. Mei et al. developed a three-dimensional full convolutional neural network (3D-FCNN) for super-resolution reconstruction and conducted experiments on four benchmark datasets from two well-known hyperspectral sensors, namely the hyperspectral digital imagery collection experiment (HYDICE) and reflective optics system imaging spectrometer (ROSIS) sensors [19]. Wang et al. introduced a multiscale fast Fourier transform (FFT)-based attention network (MSFFTAN) to achieve accurate remote sensing image super-resolution reconstruction [20].
However, most existing super-resolution methods rest on an idealized assumption: that the low-resolution (LR) and high-resolution (HR) images used for training are fully consistent in their spatiotemporal distribution. In practical cultural heritage monitoring, factors such as satellite sensor updates, changes in data acquisition plans, and cloud occlusion make it extremely difficult to obtain perfectly matched HR–LR image pairs of the same area across multiple time periods [21,22,23]. Moreover, models are often trained on data from one or several years but applied to another, unseen year for inference, leading to a distribution shift between the training and testing data.
This temporal inconsistency between training and testing data poses a critical challenge: the mapping relationships learned by the model during training may become overly dependent on image characteristics specific to certain years, such as particular illumination conditions [24,25,26]. When reconstructing images from other years, systematic differences in radiometric properties and textural features—caused by variations in imaging conditions, surface cover changes, and sensor characteristics across different years—can significantly degrade the model’s generalization performance [27,28,29]. In remote sensing image analysis, such temporal distribution shifts are particularly pronounced due to the dynamic nature of land cover and environmental changes [30,31].
Although recent years have seen progress in adaptive methods in remote sensing, most studies have focused on spatial or sensor domain adaptation [32,33,34,35,36]. For example, Liu et al. designed a self-attentive pyramid structure to capture interspectral self-similarity in the spatial domain, thereby increasing the receptive range of attention and improving the feature representation [37]. Zheng et al. designed a neighboring spectral attention module to explicitly constrain the reconstructed hyperspectral image (HSI) to maintain the correlation among neighboring spectral bands [38]. Zhang et al. proposed a Multi-Scale Feature Mapping Network (MSFMNet) based on cascaded residual learning to adaptively learn prior information of HSIs and spatial–spectral characteristics among different spectral segments [39]. Nevertheless, research addressing feature distribution shifts caused by temporal evolution remains relatively limited [40]. Traditional data augmentation methods can improve model robustness to geometric and radiometric variations to some extent, but because they mainly increase data diversity through geometric transformations and color adjustments, they struggle to fundamentally model the intrinsic distribution shifts induced by temporal changes [41,42].
An important characteristic of the landscape in the study area is that it contains both changing and invariant elements over time. While natural features such as vegetation and water bodies vary with seasons and years, core elements—such as major architectural structures and road networks—typically exhibit high temporal stability. This observation offers an important insight: if a super-resolution model can be guided to focus more on these temporally stable structural features rather than components sensitive to transient appearance changes, and learn distribution shifts—such as those caused by sensor and illumination variations—from these stable features, it may be possible to enhance the model’s generalization capability when applied to data from new years.
Based on the above analysis, this paper proposes a training strategy incorporating temporal invariance constraints, aiming to enhance the adaptability of super-resolution models to temporal distribution shifts. Unlike traditional data augmentation methods that primarily aim to increase sample quantity, the core idea of this framework is to construct cross-year masked samples by automatically identifying stable ground object regions in multi-temporal images. These are used to build auxiliary training samples, thereby introducing temporal consistency constraints as a form of regularization. This encourages the model to learn feature representations that are insensitive to temporal variations, thus improving its robustness in cross-year reconstruction. Furthermore, building upon the classic Residual Dense Network (RDN) architecture, we integrate channel and spatial attention mechanisms to further enhance feature reuse and reconstruction performance [43,44,45,46].
To validate the effectiveness of the proposed method, we conducted experiments using the Bin County Cave Temple (BCCT) Heritage Site in Shaanxi Province as the study area. Images from 2013, 2014, and 2017 were used as the training set to perform super-resolution reconstruction on images from 2019. Experimental results show that the proposed training strategy leads to significant performance improvements across multiple super-resolution models in reconstructing images from new years. Combined with the improved RDN structure (RDN_2_M), the reconstruction results achieved better performance in both objective evaluation metrics and visual authenticity, providing a feasible technical solution to the problem of data gaps in heritage site monitoring.
2. Materials and Methods
2.1. Study Area and Data
This study selects the buffer zone of the Bin County Cave Temple (BCCT) Heritage Site in Shaanxi Province as the research area (Figure 1). As a key component of the UNESCO World Heritage site “Silk Roads: the Routes Network of Chang’an-Tianshan Corridor”, the BCCT possesses significant cultural value, with its grottoes, stone carvings, and surrounding historical landscape requiring careful conservation. The area’s surface landscape comprises both relatively stable artificial structures, such as the main grotto and auxiliary halls, and dynamic natural elements, such as vegetation cover, which vary with seasons and years. This coexistence of “stable” and “changing” features makes the region an ideal testbed for validating the performance of cross-year super-resolution reconstruction models. Accurate monitoring of the heritage structures and surrounding environmental changes using high-resolution remote sensing imagery is of great importance for preventive conservation.
All remote sensing images used in this study were obtained from the historical archive of the Google Earth platform [1,47,48]. A systematic inventory of image availability from 2011 to 2020 was conducted for the study area (Table 1). The results indicate that, due to factors such as cloud cover and limitations in data acquisition planning, complete cloud-free images with a spatial resolution better than 0.5 m are available only for four years: 2013, 2014, 2017, and 2019.
Based on this data availability, we designed the following data partitioning scheme to simulate and address the core challenge in heritage monitoring—missing data across years:
Training Set: Images from 2013, 2014, and 2017 were selected. For each year, high-resolution images with a spatial resolution of approximately 0.5 m were available, along with corresponding low-resolution (4-m) images generated by downsampling. These constitute the LR–HR image pairs used for model training.
Test Set: Images from 2019 were reserved as an independent test set. This year is completely separate from the training set in the temporal sequence and is used to evaluate the model’s generalization capability and reconstruction performance on a completely unseen year.
This “leave-one-year-out” validation strategy effectively simulates a common real-world scenario: leveraging sporadically acquired high-quality data to train a model, with the goal of reconstructing high-quality images for a year with missing data (e.g., 2019). This approach enhances the practical relevance of the study.
During the preprocessing stage, all selected images underwent rigorous geometric precision correction and radiometric normalization to ensure comparability across different years and minimize systematic errors caused by differences in imaging conditions.
The geographical location of the buffer zone of the BCCT heritage site and the corresponding high-resolution images for the respective years are shown in Figure 1.
2.2. Data Processing and Methodology
2.2.1. Data Preprocessing
The data preprocessing pipeline in this study is designed to ensure spatial and radiometric consistency across multi-temporal images and to construct high-quality paired samples for subsequent deep learning model training. The entire workflow consists of three key steps: spatial resampling and cropping, radiometric normalization, and training-test set partitioning.
2.2.2. Spatial Resampling and Cropping
First, based on the vector boundaries of the BCCT heritage site, historical images from four years—2013, 2014, 2017, and 2019—were downloaded from the Google Earth platform. The initial spatial resolution of the images was approximately 0.5 m. To construct a precisely paired high- and low-resolution dataset, all images were uniformly resampled to standard spatial resolutions of 0.5 m and 4 m using bilinear interpolation. This step aimed to eliminate minor resolution discrepancies in the original data and provide the model with strictly spatially registered training sample pairs.
Subsequently, all images were cropped based on the bounding rectangle of the study area vector to ensure complete spatial overlap across different years. This step establishes the geographic foundation for subsequent cross-year consistency analysis and model performance evaluation.
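As an illustration of this resampling step, the following sketch uses the rasterio library with bilinear interpolation; the file name and the assumption that each year's scene is stored as a single GeoTIFF are ours rather than the paper's:

```python
import rasterio
from rasterio.enums import Resampling

def resample_to(path, target_res):
    """Read a GeoTIFF and resample it to target_res metres per pixel
    using bilinear interpolation."""
    with rasterio.open(path) as src:
        scale = src.res[0] / target_res      # e.g. 0.5 m -> 4 m gives 0.125
        out_h = round(src.height * scale)
        out_w = round(src.width * scale)
        return src.read(
            out_shape=(src.count, out_h, out_w),
            resampling=Resampling.bilinear,
        )

hr_scene = resample_to("bcct_2013.tif", 0.5)   # 0.5 m high-resolution image
lr_scene = resample_to("bcct_2013.tif", 4.0)   # 4 m low-resolution counterpart
```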
2.2.3. Radiometric Normalization
Due to differences in solar elevation angle, atmospheric conditions, and sensor performance during image acquisition, significant radiometric variations exist among images from different years. To mitigate these effects while preserving spectral discriminability among different land cover types, min–max normalization was independently applied to each image. This process linearly stretches the pixel values of each image to the [0, 1] range, calculated as follows:

$$I_{\mathrm{norm}} = \frac{I - I_{\min}}{I_{\max} - I_{\min}}$$

where $I$ represents the original image, and $I_{\max}$ and $I_{\min}$ denote the maximum and minimum pixel values of the image, respectively. This per-scene normalization strategy effectively suppresses systematic biases caused by inter-annual variations in illumination and sensor characteristics, while maximizing the retention of texture and contrast information within each image.
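A minimal NumPy sketch of this per-scene normalization (the function name and the zero-division guard are our additions):

```python
import numpy as np

def minmax_normalize(image):
    """Linearly stretch the pixel values of a single scene to [0, 1]."""
    i_min, i_max = image.min(), image.max()
    # Guard against constant images to avoid division by zero.
    if i_max == i_min:
        return np.zeros_like(image, dtype=np.float32)
    return ((image - i_min) / (i_max - i_min)).astype(np.float32)

# Each year's scene is normalized independently (per-scene statistics).
hr_norm = minmax_normalize(hr_scene)
lr_norm = minmax_normalize(lr_scene)
```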
2.2.4. Patch Generation and Dataset Partitioning
To meet the input requirements of deep learning models, a sliding window approach was used to extract small image patches from the processed full-scene images. Specifically, a fixed ground area of 256 × 256 m was used to define the patches for both the high- and low-resolution images. A sliding window with a stride of 128 m was then applied synchronously across both image sets to generate the co-registered training samples. This stride setting ensures partial overlap between samples, which not only increases the sample size but also helps the model learn richer spatial contextual information.
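A sketch of the patch generation follows; `hr_norm` and `lr_norm` are assumed to be the normalized scenes in (H, W, C) layout, and the pixel sizes follow directly from the two resolutions (256 m / 0.5 m = 512 px, 256 m / 4 m = 64 px):

```python
import numpy as np

def extract_patches(image, patch_px, stride_px):
    """Collect sliding-window patches from an (H, W, C) array."""
    patches = []
    h, w = image.shape[:2]
    for top in range(0, h - patch_px + 1, stride_px):
        for left in range(0, w - patch_px + 1, stride_px):
            patches.append(image[top:top + patch_px, left:left + patch_px])
    return patches

# 256 m windows with a 128 m stride, expressed in pixels at each resolution:
hr_patches = extract_patches(hr_norm, patch_px=512, stride_px=256)  # 0.5 m
lr_patches = extract_patches(lr_norm, patch_px=64, stride_px=32)    # 4 m
```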
The core objective of this study is to address a highly practical challenge: how to use multi-temporal data to train a super-resolution model capable of reconstructing high-quality images for a completely new, unseen year. To this end, a strict “leave-one-year-out” validation strategy was designed for dataset partitioning:
Training Set: Includes all image patches from the years 2013, 2014, and 2017. The samples for each year consist of strictly registered 4-m low-resolution images and their corresponding 0.5-m high-resolution counterparts.
Test Set: Only image patches from the year 2019 are used as an independent test set.
This partitioning strategy fundamentally differs from the commonly used approach of randomly splitting mixed multi-year data into training and test sets. Our method ensures that the test set (2019) is completely independent of the training set in the temporal dimension, thereby enabling a realistic and unbiased evaluation of the model’s generalization and reconstruction performance when faced with data from an unknown year. This approach effectively prevents performance overestimation that may arise from the model having been exposed to similar features of the test year during training, making the experimental results more aligned with real-world heritage monitoring scenarios.
2.3. Cross-Year Masked Sample Generation Algorithm
To enhance the generalization capability of super-resolution models in cross-year applications and mitigate performance degradation caused by temporal distribution shifts, we designed a cross-year masked sample generation algorithm based on local structural similarity (Structural Similarity Index Measure, SSIM) [49]. The core idea of this algorithm is as follows: although remote sensing images from different years may exhibit significant differences in overall radiometric characteristics and certain land cover types, the core elements of cultural heritage sites—such as major buildings and roads—typically maintain high temporal stability in their spatial structures. By automatically identifying these stable regions and constructing cross-year “masked sample pairs,” we introduce explicit temporal invariance constraints into model training. By guiding the model to learn from stable structural features that are invariant to temporal changes, it becomes robust to distribution shifts (e.g., from sensor or lighting variations across years), thereby enhancing its generalization to new yearly data.
2.3.1. Stable Region Identification
The algorithm takes a pair of precisely co-registered HR images from two different years, $I_A^{HR}$ and $I_B^{HR}$, as input. To quantitatively assess the temporal stability of local regions, a sliding window strategy is employed to compute SSIM between the image pair. This index comprehensively compares the luminance, contrast, and structural information of local regions, making it sensitive to the structural stability of ground objects. Specifically, we set a sliding window size of 8 × 8 pixels and traverse the entire image with a stride of 4 pixels. For each window position, the SSIM value is calculated, resulting in a full-image SSIM response map. This response map is then binarized using a threshold $\tau$ to generate a stable region mask $M$:

$$M(x, y) = \begin{cases} 1, & \mathrm{SSIM}(x, y) \geq \tau \\ 0, & \text{otherwise} \end{cases}$$

In this study, based on preliminary experiments on a validation set, $\tau$ was set to 0.5. This threshold effectively selects regions with high structural consistency while tolerating noise caused by minor misregistration or radiometric differences.
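The following is a self-contained sketch of this stable-region identification, assuming single-band (e.g., luminance) images already normalized to [0, 1]; overlapping windows are simply OR-ed into the mask, one of several reasonable implementation choices the paper does not specify:

```python
import numpy as np

def ssim_window(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM between two equally sized grayscale windows."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def stable_region_mask(img_a, img_b, win=8, stride=4, tau=0.5):
    """Binary mask of temporally stable regions between two co-registered
    grayscale images, following the 8x8 window / stride-4 scheme."""
    h, w = img_a.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            a = img_a[top:top + win, left:left + win]
            b = img_b[top:top + win, left:left + win]
            if ssim_window(a, b) >= tau:
                mask[top:top + win, left:left + win] = 1
    return mask
```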
2.3.2. Masked Sample Pair Generation
After obtaining the stable region mask $M$, instead of directly using all marked regions, we introduce an area ratio threshold $\theta$ as a quality control step. Only when the total area of stable regions exceeds $\theta$ of the entire image area is the image pair used to generate masked samples.
For eligible image pairs, we perform the following operations to construct new training samples:
High-resolution sample: Apply the mask $M$ to the high-resolution image of year A, $I_A^{HR}$, to obtain an image patch containing only stable regions, $I_A^{HR} \odot M$.
Low-resolution sample: Apply the same mask to the low-resolution (LR) image of year B, $I_B^{LR}$, to obtain the corresponding low-resolution input $I_B^{LR} \odot M$.
Thus, we construct a cross-year masked sample pair $(I_B^{LR} \odot M,\ I_A^{HR} \odot M)$.
This sample pair conveys a critical training signal: although the low-resolution input is from year B and its overall image characteristics may differ from year A, the model should be guided to reconstruct structural details similar to the high-resolution reference of year A in these masked stable structure regions. Similarly, to further enhance temporal invariance constraints and fully utilize the data, we construct a second masked sample pair in the opposite direction using the same principle: $(I_A^{LR} \odot M,\ I_B^{HR} \odot M)$. This pair requires the model to reconstruct stable regions consistent with the high-resolution reference of year B from the low-resolution input of year A.
Through this bidirectional sample pair construction, the model is provided with more comprehensive cross-year mapping relationships, which help guide the model to understand that for the same stable structures in geographic space, regardless of which year the low-resolution input comes from, the high-resolution reconstruction results should converge to consistent essential features. This significantly enhances the model’s robustness to temporal variations.
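Reusing `stable_region_mask` from the sketch above, the bidirectional pair construction might look as follows; the area-ratio value `theta=0.3` and the nearest-neighbour downsampling of the mask onto the LR grid are our assumptions, while the ×8 scale factor follows from the 0.5 m and 4 m resolutions:

```python
def masked_pairs(hr_a, hr_b, lr_a, lr_b, scale=8, theta=0.3, tau=0.5):
    """Build the bidirectional cross-year masked sample pairs
    (LR_B * M, HR_A * M) and (LR_A * M, HR_B * M).
    Assumes HR dimensions are divisible by `scale`."""
    m_hr = stable_region_mask(hr_a, hr_b, tau=tau)
    if m_hr.mean() <= theta:            # area-ratio quality control gate
        return []
    m_lr = m_hr[::scale, ::scale]       # nearest-neighbour mask on the LR grid
    return [
        (lr_b * m_lr, hr_a * m_hr),     # year-B input -> year-A reference
        (lr_a * m_lr, hr_b * m_hr),     # year-A input -> year-B reference
    ]
```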
2.3.3. Mixed Training Strategy
During the model training phase, these newly generated cross-year masked samples are merged with traditional paired samples from the same year to form the final training set.
This mixed training strategy introduces a powerful regularization constraint for model optimization. The model must not only minimize the reconstruction error (e.g., MSE or SSIM loss) on same-year samples but also learn to produce consistent high-resolution outputs for the same stable objects imaged in different years. This mechanism helps guide the model’s feature representation space to align with time-invariant, robust structural information, fundamentally enhancing its ability to handle temporal distribution shifts. As shown in Figure 2, the samples generated by this algorithm effectively focus on the core stable objects of the heritage site, providing high-quality cross-temporal consistency priors for the model.
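In code, the mixing step can be as simple as concatenating two datasets; `same_year_pairs` and `cross_year_pairs` are hypothetical PyTorch `Dataset` objects wrapping the conventional and masked samples, and the batch size of 25 follows the training configuration reported below:

```python
from torch.utils.data import ConcatDataset, DataLoader

# Merge conventional same-year pairs with cross-year masked pairs.
mixed_train_set = ConcatDataset([same_year_pairs, cross_year_pairs])
train_loader = DataLoader(mixed_train_set, batch_size=25, shuffle=True)
```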
2.4. Improved RDN Model
The RDN network is a representative model in the field of image super-resolution. By combining dense connections with residual learning, it effectively alleviates the gradient vanishing problem in deep networks and promotes the reuse of features across different layers. The traditional RDN model primarily relies on stacking numerous residual dense blocks to build a deep network and employs both local and global residual connections to learn high-frequency information [43]. However, when processing complex remote sensing images—such as those containing abundant detailed textures, object boundaries, and heterogeneous land surfaces—the traditional model’s uniform treatment of all channels and spatial locations may lead to insufficient perception of critical features.
To address the above issues, this paper improves upon the traditional RDN and proposes an enhanced RDN model (RDN_2) that incorporates dual attention mechanisms and multi-level feature fusion, along with other architectural refinements. Our enhancements mainly focus on the following two aspects:
(1) Integration of Dual Attention Mechanisms for Adaptive Feature Refinement:
We sequentially integrate a channel attention module and a spatial attention module at the end of the residual dense block stack [45,46]. The channel attention module aggregates channel-wise information from the feature maps by simultaneously utilizing global average pooling (GAP) and global max pooling (GMP). The outputs are then fed into a shared multilayer perceptron to generate channel-wise weights. This dual-path “GAP + GMP” design captures the importance of different channels more comprehensively, enabling the model to enhance responses to task-critical features while suppressing irrelevant or redundant ones. The spatial attention module performs average pooling and max pooling along the channel dimension, concatenates the results, and processes them through a convolutional layer to generate a spatial weight map. This allows the model to autonomously learn which spatial locations in the image contain more critical detail information—such as building edges and road contours—and enhance them accordingly. A sketch of these two modules is given below.
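The following is a minimal, CBAM-style PyTorch sketch; the reduction ratio of 16 and the 7 × 7 spatial kernel are common defaults assumed here, not values reported by the paper:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """GAP and GMP descriptors pass through a shared MLP; their sum
    is squashed into per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))      # GAP branch
                          + self.mlp(x.amax(dim=(2, 3))))   # GMP branch
        return x * w.view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Channel-wise average and max maps are concatenated and convolved
    into a single spatial weight map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```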
(2) Multi-Level Feature Fusion via the Refined Feature Pathway:
The processed features, now refined by the dual attention modules, are then combined with the shallow features extracted at the beginning of the network via the existing global residual connection (long skip connection) [44]. This fusion strategy leverages the advantages of both feature types: the shallow features, rich in low-frequency structural information (e.g., edges and contours), are preserved through the skip connection to maintain structural fidelity. Simultaneously, the deep features, which have been adaptively weighted by the attention mechanisms to emphasize critical channels and spatial details, provide enhanced high-frequency and semantic information for texture reconstruction. Thus, the model’s key innovation lies in using attention to refine the deep feature branch before fusion, ensuring that the combined feature map is optimally focused on the most salient information for accurate super-resolution.
The structure of the improved model is illustrated in Figure 3.
2.5. Experimental Setup and Evaluation Metrics
To comprehensively validate the effectiveness of the proposed cross-year training strategy and improved model, we designed rigorous comparative experiments. This section details the baseline models, training configurations, loss functions, and evaluation metrics employed.
2.5.1. Baseline Models
We selected three representative deep learning models with diverse architectures from the super-resolution domain as baselines to ensure the generalizability of our conclusions:
ESPCN: Achieves efficient reconstruction through a feed-forward structure, with its core innovation being the sub-pixel convolution layer at the terminal stage. This layer assembles the high-resolution output directly from low-resolution feature maps, ensuring high computational efficiency [50].
DRCN: Introduces a deep recursive convolutional network that expands the receptive field through parameter-sharing recursive structures, thereby enhancing the model’s feature representation capacity without significantly increasing the number of parameters [51].
EDSR: Constructs a powerful deep residual network by removing redundant modules (e.g., batch normalization layers) and expanding model width, maintaining leading performance on multiple public benchmarks.
Additionally, to fairly evaluate the effectiveness of the improved RDN model (RDN_2), we compared it directly with the traditional RDN. The improved RDN incorporates both channel and spatial attention mechanisms, aiming to improve its feature utilization efficiency [44].
2.5.2. Training Strategy and Configuration
To ensure a fair comparison of the proposed cross-year masked sample strategy, each model was trained under two distinct modes:
Baseline Mode: Models were trained using only traditional paired samples from the same year, i.e., pairs $(I_t^{LR}, I_t^{HR})$ in which both images were acquired in the same year $t$.
Cross-year Mode: In addition to the baseline training set, cross-year masked sample pairs generated by the algorithm described in the Methodology section were incorporated. This mode aims to verify whether introducing temporal invariance constraints can effectively enhance model generalization.
A composite loss function combining pixel-level fidelity and structural similarity was used to supervise the training process:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}}(I^{HR}, \hat{I}^{HR}) + \lambda \, \mathcal{L}_{\mathrm{SSIM}}(I^{HR}, \hat{I}^{HR})$$

where $I^{HR}$ and $\hat{I}^{HR}$ denote the real high-resolution image and the model’s predicted output, respectively, and $\lambda$ balances the two terms. The MSE loss $\mathcal{L}_{\mathrm{MSE}}$ ensures pixel-level accuracy, while the SSIM loss $\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(I^{HR}, \hat{I}^{HR})$ focuses on preserving structural integrity. This combined loss guides the model to produce visually realistic reconstruction results while maintaining numerical precision.
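A compact sketch of this composite loss; the SSIM term here uses global image statistics for brevity (a full implementation would average SSIM over local Gaussian windows), and the weight `lam=0.1` is an assumed value:

```python
import torch
import torch.nn.functional as F

def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM computed from global image statistics."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(unbiased=False), y.var(unbiased=False)
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def composite_loss(pred, target, lam=0.1):
    """Pixel-fidelity MSE term plus a weighted (1 - SSIM) structural term."""
    return F.mse_loss(pred, target) + lam * (1.0 - ssim_global(pred, target))
```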
For optimization and training details, we employed the Adam optimizer with default momentum parameters. The batch size was set to 25. To optimize training dynamics, we implemented a dynamic learning rate scheduling strategy: the learning rate was halved if the validation loss did not decrease for three consecutive epochs, with a lower bound of 1 × 10⁻⁶. An early stopping mechanism was also applied: training was terminated if no improvement in validation loss was observed for ten consecutive epochs, preventing overfitting and improving training efficiency.
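These training dynamics map directly onto standard PyTorch utilities; `model`, the data loaders, the helper functions, the epoch budget, and the initial learning rate below are placeholders rather than reported values:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial lr assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3, min_lr=1e-6)

best_loss, stall = float("inf"), 0
for epoch in range(200):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper
    scheduler.step(val_loss)       # halves the lr after 3 stagnant epochs
    if val_loss < best_loss:
        best_loss, stall = val_loss, 0
    else:
        stall += 1
        if stall >= 10:            # early stopping after 10 stagnant epochs
            break
```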
To objectively quantify reconstruction quality from multiple perspectives, we adopted three widely recognized metrics:
Peak Signal-to-Noise Ratio (PSNR): Computed based on mean squared error (MSE), emphasizing pixel-level absolute error. It serves as a fundamental metric for evaluating the fidelity of reconstructed images.
Structural Similarity Index Measure (SSIM): Comprehensively assesses image quality from three dimensions—luminance, contrast, and structure—aligning more closely with human visual perception and evaluating the preservation of structural information.
Mean Squared Error (MSE): Directly calculates the average squared error of all pixels, reflecting the overall reconstruction error level.
By combining PSNR, SSIM, and MSE, we can comprehensively and objectively assess the model’s reconstruction performance.
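For reference, all three metrics are available in scikit-image; a sketch of a combined evaluation helper (the function name is ours) follows:

```python
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

def evaluate_reconstruction(pred, target, data_range=1.0):
    """PSNR / SSIM / MSE between a reconstruction and its reference,
    both float arrays in [0, 1] with channels last."""
    return {
        "PSNR": peak_signal_noise_ratio(target, pred, data_range=data_range),
        "SSIM": structural_similarity(target, pred, data_range=data_range,
                                      channel_axis=-1),
        "MSE": mean_squared_error(target, pred),
    }
```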
4. Discussion
The primary value of this methodology lies in its ability to generate a more consistent and reliable series of high-resolution images for heritage monitoring, effectively filling data gaps in years where only lower-resolution imagery is available. This mitigates a practical challenge frequently faced by site managers and conservation scientists. By providing reconstructed images with improved structural fidelity across different years, the method supports more robust time-series comparison. This facilitates several practical applications: more reliable detection of changes in architectural structures or the surrounding landscape, clearer tracking of long-term trends such as surface erosion or vegetation growth, and the compilation of a more continuous data record for evaluating the outcomes of past conservation work. Ultimately, by mitigating data interruptions and enhancing comparability across years, this approach aids in moving from a reactive to a more informed and proactive mode of heritage stewardship.
However, this study has certain limitations that point to directions for future research. First, our validation was conducted on a single heritage site using imagery from a single platform (Google Earth), with low-resolution data simulated via downsampling. While the “leave-one-year-out” protocol effectively tests temporal generalization, it does not fully encompass the complexities of real-world, long-term monitoring scenarios involving heterogeneous, multi-sensor, and multi-mission data streams (e.g., combining Landsat, Sentinel-2, and commercial satellite archives). Future work must therefore validate the proposed strategy on such authentic, multi-source data pairs to assess its robustness to more pronounced spectral and geometric variations. Second, our current method primarily relies on local structural similarity to identify stable regions, which may be affected by misregistration in images captured by different sensors or at different times. Future work could explore incorporating advanced vision tasks such as semantic segmentation to more precisely define temporal invariance. Additionally, adaptively determining optimal similarity thresholds for different regions remains an open question worthy of further investigation.
From a broader perspective, this study offers a new direction for super-resolution in remote sensing: shifting from continuously modifying model architectures to pursue optimal performance on idealized datasets, toward developing diverse training strategies that enhance robustness in real-world scenarios. This problem-oriented approach is particularly relevant for application domains that depend on long-term, consistent Earth observation data series, including but not limited to cultural heritage conservation, agricultural monitoring, urban planning, and disaster response. In these fields, reliable data continuity is often as critical as absolute spatial resolution. We believe that the integration of such application-aware algorithmic strategies with advanced deep learning models will enhance the practical value of remote sensing technology, contributing to more effective monitoring and sustainable management of vital resources and heritage assets.
5. Conclusions
This study addresses a common yet under-explored challenge in heritage site remote sensing super-resolution: the frequent unavailability of fully paired LR and HR samples from the same time period due to factors such as sensor limitations and cloud cover. Moreover, when training and test data originate from different years, temporal distribution shifts can significantly degrade model generalization. Unlike previous studies that primarily focus on network architecture improvements or conventional data augmentation, we propose a novel approach—introducing masked samples of invariant regions across different years—to explicitly guide the model in learning stable features that remain unchanged over time, while mitigating the impact of sensor and illumination variations. Compared to designing complex new network architectures, our method offers distinct practical advantages: as a training framework, it can be easily integrated into various existing super-resolution models without modifying their base structures or incurring significant computational overhead during inference.
Furthermore, this study enhances the traditional RDN architecture by integrating dual attention mechanisms into its powerful backbone. Experimental results demonstrate that the proposed training strategy combined with the enhanced RDN model (RDN_2_M) achieves higher reconstruction accuracy and superior visual quality compared to traditional deep learning models and training approaches. This improvement is of significant value for long-term monitoring applications such as cultural heritage site management.
The proposed training strategy and enhanced model together provide a practical solution for generating high-resolution imagery in years with missing data. By bridging these temporal gaps and improving image consistency, our approach directly supports the long-term monitoring needs of heritage site managers. It enables more reliable analysis of spatial and structural changes over time, thereby contributing to evidence-based conservation planning and sustainable management. This methodology is not limited to cultural heritage but is applicable to any domain requiring continuous, high-resolution Earth observation.