1. Introduction
Long-term monitoring of structural conditions and spatial dynamics is essential for protecting World Cultural Heritage sites [1,2,3,4,5,6]. To be effective, such monitoring requires imagery that captures fine spatial detail while remaining consistent over time, allowing slow structural changes and environmental impacts to be identified. Ground-based techniques such as terrestrial laser scanning and UAV photogrammetry can provide highly detailed information, but their data are typically collected during limited field campaigns and therefore cannot support regular, long-term observation across entire heritage landscapes.
High-resolution satellite imagery helps address this limitation by enabling repeated, large-area observations over extended periods. In practice, however, high-quality satellite images are not always available. Sensor characteristics, atmospheric conditions, and revisit intervals often result in missing high-resolution data for certain years or key periods [7,8,9,10,11,12]. These gaps make it difficult to construct reliable time series for long-term analysis. Super-resolution reconstruction offers a practical way to mitigate these limitations by enhancing lower-resolution satellite imagery. By producing more complete and consistent image records, this approach facilitates the analysis of long-term change and supports landscape-scale heritage monitoring.
With the advancement of deep learning techniques, data-driven super-resolution reconstruction methods have shown great potential in the field of remote sensing [13,14,15,16,17]. For instance, Lei et al. proposed a single-image super-resolution algorithm named local–global combined networks (LGCNet) for the public UC Merced remote sensing dataset [18]. Mei et al. developed a three-dimensional full convolutional neural network (3D-FCNN) for super-resolution reconstruction and conducted experiments on four benchmark datasets from two well-known hyperspectral sensors, namely the hyperspectral digital imagery collection experiment (HYDICE) and reflective optics system imaging spectrometer (ROSIS) sensors [19]. Wang et al. introduced a multiscale fast Fourier transform (FFT)-based attention network (MSFFTAN) to achieve accurate remote sensing image super-resolution reconstruction [20].
However, most existing super-resolution methods rest on an idealized assumption: that the low-resolution (LR) and high-resolution (HR) images used for training are fully consistent in their spatiotemporal distribution. In practical cultural heritage monitoring, factors such as satellite sensor updates, changes in data acquisition plans, and cloud occlusion make it extremely difficult to obtain perfectly matched HR–LR image pairs of the same area across multiple time periods [21,22,23]. Moreover, models are often trained on data from one or several years but applied to another, unseen year for inference, leading to a distribution shift between the training and testing data.
This temporal inconsistency between training and testing data poses a critical challenge: the mapping relationships learned by the model during training may become overly dependent on image characteristics specific to certain years, such as particular illumination conditions [24,25,26]. When reconstructing images from other years, systematic differences in radiometric properties and textural features—caused by variations in imaging conditions, surface cover changes, and sensor characteristics across different years—can significantly degrade the model’s generalization performance [27,28,29]. In remote sensing image analysis, such temporal distribution shifts are particularly pronounced due to the dynamic nature of land cover and environmental changes [30,31].
Although recent years have seen progress in adaptive methods in remote sensing, most studies have focused on spatial or sensor domain adaptation [32,33,34,35,36]. For example, Liu et al. designed a self-attentive pyramid structure to capture interspectral self-similarity in the spatial domain, thereby increasing the receptive range of attention and improving the feature representation [37]. Zheng et al. designed a neighboring spectral attention module to explicitly constrain the reconstructed hyperspectral image (HSI) to maintain the correlation among neighboring spectral bands [38]. Zhang et al. proposed a Multi-Scale Feature Mapping Network (MSFMNet) based on cascaded residual learning to adaptively learn prior information of HSIs and spatial–spectral characteristics among different spectral segments [39]. Nevertheless, research addressing feature distribution shifts caused by temporal evolution remains relatively limited [40]. Traditional data augmentation methods can improve model robustness to geometric and radiometric variations to some extent, but because they mainly increase data diversity through geometric transformations and color adjustments, they struggle to fundamentally model the intrinsic distribution shifts induced by temporal changes [41,42].
An important characteristic of the landscape in the study area is that it contains both changing and invariant elements over time. While natural features such as vegetation and water bodies vary with seasons and years, core elements—such as major architectural structures and road networks—typically exhibit high temporal stability. This observation offers an important insight: if a super-resolution model can be guided to focus more on these temporally stable structural features rather than components sensitive to transient appearance changes, and learn distribution shifts—such as those caused by sensor and illumination variations—from these stable features, it may be possible to enhance the model’s generalization capability when applied to data from new years.
Based on the above analysis, this paper proposes a training strategy incorporating temporal invariance constraints, aiming to enhance the adaptability of super-resolution models to temporal distribution shifts. Unlike traditional data augmentation methods that primarily aim to increase sample quantity, the core idea of this framework is to construct cross-year masked samples by automatically identifying stable ground object regions in multi-temporal images. These are used to build auxiliary training samples, thereby introducing temporal consistency constraints as a form of regularization. This encourages the model to learn feature representations that are insensitive to temporal variations, thus improving its robustness in cross-year reconstruction. Furthermore, building upon the classic Residual Dense Network (RDN) architecture, we integrate channel and spatial attention mechanisms to further enhance feature reuse and reconstruction performance [43,44,45,46].
To validate the effectiveness of the proposed method, we conducted experiments using the Bin County Cave Temple (BCCT) Heritage Site in Shaanxi Province as the study area. Images from 2013, 2014, and 2017 were used as the training set to perform super-resolution reconstruction on images from 2019. Experimental results show that the proposed training strategy leads to significant performance improvements across multiple super-resolution models in reconstructing images from new years. Combined with the improved RDN structure (RDN_2_M), the reconstruction results achieved better performance in both objective evaluation metrics and visual authenticity, providing a feasible technical solution to the problem of data gaps in heritage site monitoring.
2. Materials and Methods
2.1. Study Area and Data
This study selects the buffer zone of the Bin County Cave Temple (BCCT) Heritage Site in Shaanxi Province as the research area (Figure 1). As a key component of the UNESCO World Heritage site “Silk Roads: the Routes Network of Chang’an-Tianshan Corridor”, the BCCT possesses significant cultural value, with its grottoes, stone carvings, and surrounding historical landscape requiring careful conservation. The area’s surface landscape comprises both relatively stable artificial structures, such as the main grotto and auxiliary halls, and dynamic natural elements, such as vegetation cover, which vary with seasons and years. This coexistence of “stable” and “changing” features makes the region an ideal testbed for validating the performance of cross-year super-resolution reconstruction models. Accurate monitoring of the heritage structures and surrounding environmental changes using high-resolution remote sensing imagery is of great importance for preventive conservation.
All remote sensing images used in this study were obtained from the historical archive of the Google Earth platform [1,47,48]. A systematic inventory of image availability from 2011 to 2020 was conducted for the study area (Table 1). The results indicate that, due to factors such as cloud cover and limitations in data acquisition planning, complete cloud-free images with a spatial resolution better than 0.5 m are available only for four years: 2013, 2014, 2017, and 2019.
Based on this data availability, we designed the following data partitioning scheme to simulate and address the core challenge in heritage monitoring—missing data across years:
Training Set: Images from 2013, 2014, and 2017 were selected. For each year, high-resolution images with a spatial resolution of approximately 0.5 m were available, along with corresponding low-resolution (4-m) images generated by downsampling. These constitute the LR–HR image pairs used for model training.
Test Set: Images from 2019 were reserved as an independent test set. This year is completely separate from the training set in the temporal sequence and is used to evaluate the model’s generalization capability and reconstruction performance on a completely unseen year.
This “leave-one-year-out” validation strategy effectively simulates a common real-world scenario: leveraging sporadically acquired high-quality data to train a model, with the goal of reconstructing high-quality images for a year with missing data (e.g., 2019). This approach enhances the practical relevance of the study.
During the preprocessing stage, all selected images underwent rigorous geometric precision correction and radiometric normalization to ensure comparability across different years and minimize systematic errors caused by differences in imaging conditions.
The geographical location of the buffer zone of the BCCT heritage site and the corresponding high-resolution images for the respective years are shown in Figure 1.
2.2. Data Processing and Methodology
2.2.1. Data Preprocessing
The data preprocessing pipeline in this study is designed to ensure spatial and radiometric consistency across multi-temporal images and to construct high-quality paired samples for subsequent deep learning model training. The entire workflow consists of three key steps: spatial resampling and cropping, radiometric normalization, and training-test set partitioning.
2.2.2. Spatial Resampling and Cropping
First, based on the vector boundaries of the BCCT heritage site, historical images from four years—2013, 2014, 2017, and 2019—were downloaded from the Google Earth platform. The initial spatial resolution of the images was approximately 0.5 m. To construct a precisely paired high- and low-resolution dataset, all images were uniformly resampled to standard spatial resolutions of 0.5 m and 4 m using bilinear interpolation. This step aimed to eliminate minor resolution discrepancies in the original data and provide the model with strictly spatially registered training sample pairs.
Subsequently, all images were cropped based on the bounding rectangle of the study area vector to ensure complete spatial overlap across different years. This step establishes the geographic foundation for subsequent cross-year consistency analysis and model performance evaluation.
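As an illustration of this resampling step, the following sketch uses the rasterio library with bilinear interpolation; the file name and the assumption that each year's scene is stored as a single GeoTIFF are ours rather than the paper's:

```python
import rasterio
from rasterio.enums import Resampling

def resample_to(path, target_res):
    """Read a GeoTIFF and resample it to target_res metres per pixel
    using bilinear interpolation."""
    with rasterio.open(path) as src:
        scale = src.res[0] / target_res      # e.g. 0.5 m -> 4 m gives 0.125
        out_h = round(src.height * scale)
        out_w = round(src.width * scale)
        return src.read(
            out_shape=(src.count, out_h, out_w),
            resampling=Resampling.bilinear,
        )

hr_scene = resample_to("bcct_2013.tif", 0.5)   # 0.5 m high-resolution image
lr_scene = resample_to("bcct_2013.tif", 4.0)   # 4 m low-resolution counterpart
```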
2.2.3. Radiometric Normalization
Due to differences in solar elevation angle, atmospheric conditions, and sensor performance during image acquisition, significant radiometric variations exist among images from different years. To mitigate these effects while preserving spectral discriminability among different land cover types, min–max normalization was independently applied to each image. This process linearly stretches the pixel values of each image to the [0, 1] range, calculated as follows:

$$I_{\mathrm{norm}} = \frac{I - I_{\min}}{I_{\max} - I_{\min}}$$

where $I$ represents the original image, and $I_{\max}$ and $I_{\min}$ denote the maximum and minimum pixel values of the image, respectively. This per-scene normalization strategy effectively suppresses systematic biases caused by inter-annual variations in illumination and sensor characteristics, while maximizing the retention of texture and contrast information within each image.
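A minimal NumPy sketch of this per-scene normalization (the function name and the zero-division guard are our additions):

```python
import numpy as np

def minmax_normalize(image):
    """Linearly stretch the pixel values of a single scene to [0, 1]."""
    i_min, i_max = image.min(), image.max()
    # Guard against constant images to avoid division by zero.
    if i_max == i_min:
        return np.zeros_like(image, dtype=np.float32)
    return ((image - i_min) / (i_max - i_min)).astype(np.float32)

# Each year's scene is normalized independently (per-scene statistics).
hr_norm = minmax_normalize(hr_scene)
lr_norm = minmax_normalize(lr_scene)
```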
2.2.4. Patch Generation and Dataset Partitioning
To meet the input requirements of deep learning models, a sliding window approach was used to extract small image patches from the processed full-scene images. Specifically, a fixed ground area of 256 × 256 m was used to define the patches for both the high- and low-resolution images. A sliding window with a stride of 128 m was then applied synchronously across both image sets to generate the co-registered training samples. This stride setting ensures partial overlap between samples, which not only increases the sample size but also helps the model learn richer spatial contextual information.
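A sketch of the patch generation follows; `hr_norm` and `lr_norm` are assumed to be the normalized scenes in (H, W, C) layout, and the pixel sizes follow directly from the two resolutions (256 m / 0.5 m = 512 px, 256 m / 4 m = 64 px):

```python
import numpy as np

def extract_patches(image, patch_px, stride_px):
    """Collect sliding-window patches from an (H, W, C) array."""
    patches = []
    h, w = image.shape[:2]
    for top in range(0, h - patch_px + 1, stride_px):
        for left in range(0, w - patch_px + 1, stride_px):
            patches.append(image[top:top + patch_px, left:left + patch_px])
    return patches

# 256 m windows with a 128 m stride, expressed in pixels at each resolution:
hr_patches = extract_patches(hr_norm, patch_px=512, stride_px=256)  # 0.5 m
lr_patches = extract_patches(lr_norm, patch_px=64, stride_px=32)    # 4 m
```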
The core objective of this study is to address a highly practical challenge: how to use multi-temporal data to train a super-resolution model capable of reconstructing high-quality images for a completely new, unseen year. To this end, a strict “leave-one-year-out” validation strategy was designed for dataset partitioning:
Training Set: Includes all image patches from the years 2013, 2014, and 2017. The samples for each year consist of strictly registered 4-m low-resolution images and their corresponding 0.5-m high-resolution counterparts.
Test Set: Only image patches from the year 2019 are used as an independent test set.
This partitioning strategy fundamentally differs from the commonly used approach of randomly splitting mixed multi-year data into training and test sets. Our method ensures that the test set (2019) is completely independent of the training set in the temporal dimension, thereby enabling a realistic and unbiased evaluation of the model’s generalization and reconstruction performance when faced with data from an unknown year. This approach effectively prevents performance overestimation that may arise from the model having been exposed to similar features of the test year during training, making the experimental results more aligned with real-world heritage monitoring scenarios.
2.3. Cross-Year Masked Sample Generation Algorithm
To enhance the generalization capability of super-resolution models in cross-year applications and mitigate performance degradation caused by temporal distribution shifts, we designed a cross-year masked sample generation algorithm based on local structural similarity (Structural Similarity Index Measure, SSIM) [49]. The core idea of this algorithm is as follows: although remote sensing images from different years may exhibit significant differences in overall radiometric characteristics and certain land cover types, the core elements of cultural heritage sites—such as major buildings and roads—typically maintain high temporal stability in their spatial structures. By automatically identifying these stable regions and constructing cross-year “masked sample pairs,” we introduce explicit temporal invariance constraints into model training. By guiding the model to learn from stable structural features that are invariant to temporal changes, it becomes robust to distribution shifts (e.g., from sensor or lighting variations across years), thereby enhancing its generalization to new yearly data.
2.3.1. Stable Region Identification
The algorithm takes a pair of precisely co-registered HR images from two different years, $I_A^{HR}$ and $I_B^{HR}$, as input. To quantitatively assess the temporal stability of local regions, a sliding window strategy is employed to compute SSIM between the image pair. This index comprehensively compares the luminance, contrast, and structural information of local regions, making it sensitive to the structural stability of ground objects. Specifically, we set a sliding window size of 8 × 8 pixels and traverse the entire image with a stride of 4 pixels. For each window position, the SSIM value is calculated, resulting in a full-image SSIM response map. This response map is then binarized using a threshold $\tau$ to generate a stable region mask $M$:

$$M(x, y) = \begin{cases} 1, & \mathrm{SSIM}(x, y) \geq \tau \\ 0, & \text{otherwise} \end{cases}$$

In this study, based on preliminary experiments on a validation set, $\tau$ was set to 0.5. This threshold effectively selects regions with high structural consistency while tolerating noise caused by minor misregistration or radiometric differences.
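The following is a self-contained sketch of this stable-region identification, assuming single-band (e.g., luminance) images already normalized to [0, 1]; overlapping windows are simply OR-ed into the mask, one of several reasonable implementation choices the paper does not specify:

```python
import numpy as np

def ssim_window(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM between two equally sized grayscale windows."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def stable_region_mask(img_a, img_b, win=8, stride=4, tau=0.5):
    """Binary mask of temporally stable regions between two co-registered
    grayscale images, following the 8x8 window / stride-4 scheme."""
    h, w = img_a.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for top in range(0, h - win + 1, stride):
        for left in range(0, w - win + 1, stride):
            a = img_a[top:top + win, left:left + win]
            b = img_b[top:top + win, left:left + win]
            if ssim_window(a, b) >= tau:
                mask[top:top + win, left:left + win] = 1
    return mask
```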
2.3.2. Masked Sample Pair Generation
After obtaining the stable region mask $M$, instead of directly using all marked regions, we introduce an area ratio threshold $\theta$ as a quality control step. Only when the total area of stable regions exceeds $\theta$ of the entire image area is the image pair used to generate masked samples.
For eligible image pairs, we perform the following operations to construct new training samples:
High-resolution sample: Apply the mask $M$ to the high-resolution image of year A, $I_A^{HR}$, to obtain an image patch containing only stable regions, $I_A^{HR} \odot M$.
Low-resolution sample: Apply the same mask to the low-resolution (LR) image of year B, $I_B^{LR}$, to obtain the corresponding low-resolution input $I_B^{LR} \odot M$.
Thus, we construct a cross-year masked sample pair $(I_B^{LR} \odot M,\ I_A^{HR} \odot M)$.
This sample pair conveys a critical training signal: although the low-resolution input is from year B and its overall image characteristics may differ from year A, the model should be guided to reconstruct structural details similar to the high-resolution reference of year A in these masked stable structure regions. Similarly, to further enhance temporal invariance constraints and fully utilize the data, we construct a second masked sample pair in the opposite direction using the same principle: $(I_A^{LR} \odot M,\ I_B^{HR} \odot M)$. This pair requires the model to reconstruct stable regions consistent with the high-resolution reference of year B from the low-resolution input of year A.
Through this bidirectional sample pair construction, the model is provided with more comprehensive cross-year mapping relationships, which help guide the model to understand that for the same stable structures in geographic space, regardless of which year the low-resolution input comes from, the high-resolution reconstruction results should converge to consistent essential features. This significantly enhances the model’s robustness to temporal variations.
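Reusing `stable_region_mask` from the sketch above, the bidirectional pair construction might look as follows; the area-ratio value `theta=0.3` and the nearest-neighbour downsampling of the mask onto the LR grid are our assumptions, while the ×8 scale factor follows from the 0.5 m and 4 m resolutions:

```python
def masked_pairs(hr_a, hr_b, lr_a, lr_b, scale=8, theta=0.3, tau=0.5):
    """Build the bidirectional cross-year masked sample pairs
    (LR_B * M, HR_A * M) and (LR_A * M, HR_B * M).
    Assumes HR dimensions are divisible by `scale`."""
    m_hr = stable_region_mask(hr_a, hr_b, tau=tau)
    if m_hr.mean() <= theta:            # area-ratio quality control gate
        return []
    m_lr = m_hr[::scale, ::scale]       # nearest-neighbour mask on the LR grid
    return [
        (lr_b * m_lr, hr_a * m_hr),     # year-B input -> year-A reference
        (lr_a * m_lr, hr_b * m_hr),     # year-A input -> year-B reference
    ]
```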
2.3.3. Mixed Training Strategy
During the model training phase, these newly generated cross-year masked samples are merged with traditional paired samples from the same year to form the final training set.
This mixed training strategy introduces a powerful regularization constraint for model optimization. The model must not only minimize the reconstruction error (e.g., MSE or SSIM loss) on same-year samples but also learn to produce consistent high-resolution outputs for the same stable objects imaged in different years. This mechanism helps guide the model’s feature representation space to align with time-invariant, robust structural information, fundamentally enhancing its ability to handle temporal distribution shifts. As shown in Figure 2, the samples generated by this algorithm effectively focus on the core stable objects of the heritage site, providing high-quality cross-temporal consistency priors for the model.
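In code, the mixing step can be as simple as concatenating two datasets; `same_year_pairs` and `cross_year_pairs` are hypothetical PyTorch `Dataset` objects wrapping the conventional and masked samples, and the batch size of 25 follows the training configuration reported below:

```python
from torch.utils.data import ConcatDataset, DataLoader

# Merge conventional same-year pairs with cross-year masked pairs.
mixed_train_set = ConcatDataset([same_year_pairs, cross_year_pairs])
train_loader = DataLoader(mixed_train_set, batch_size=25, shuffle=True)
```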
2.4. Improved RDN Model
The RDN network is a representative model in the field of image super-resolution. By combining dense connections with residual learning, it effectively alleviates the gradient vanishing problem in deep networks and promotes the reuse of features across different layers. The traditional RDN model primarily relies on stacking numerous residual dense blocks to build a deep network and employs both local and global residual connections to learn high-frequency information [43]. However, when processing complex remote sensing images—such as those containing abundant detailed textures, object boundaries, and heterogeneous land surfaces—the traditional model’s uniform treatment of all channels and spatial locations may lead to insufficient perception of critical features.
To address the above issues, this paper improves upon the traditional RDN and proposes an enhanced RDN model (RDN_2) that incorporates dual attention mechanisms and multi-level feature fusion, along with other architectural refinements. Our enhancements mainly focus on the following two aspects:
(1) Integration of Dual Attention Mechanisms for Adaptive Feature Refinement:
We sequentially integrate a channel attention module and a spatial attention module at the end of the residual dense block stack [45,46]. The channel attention module aggregates channel-wise information from the feature maps by simultaneously utilizing global average pooling (GAP) and global max pooling (GMP). The outputs are then fed into a shared multilayer perceptron to generate channel-wise weights. This dual-path “GAP + GMP” design captures the importance of different channels more comprehensively, enabling the model to enhance responses to task-critical features while suppressing irrelevant or redundant ones. The spatial attention module performs average pooling and max pooling along the channel dimension, concatenates the results, and processes them through a convolutional layer to generate a spatial weight map. This allows the model to autonomously learn which spatial locations in the image contain more critical detail information—such as building edges and road contours—and enhance them accordingly. A sketch of these two modules is given below.
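The following is a minimal, CBAM-style PyTorch sketch; the reduction ratio of 16 and the 7 × 7 spatial kernel are common defaults assumed here, not values reported by the paper:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """GAP and GMP descriptors pass through a shared MLP; their sum
    is squashed into per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))      # GAP branch
                          + self.mlp(x.amax(dim=(2, 3))))   # GMP branch
        return x * w.view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Channel-wise average and max maps are concatenated and convolved
    into a single spatial weight map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))
```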
(2) Multi-Level Feature Fusion via the Refined Feature Pathway:
The processed features, now refined by the dual attention modules, are then combined with the shallow features extracted at the beginning of the network via the existing global residual connection (long skip connection) [44]. This fusion strategy leverages the advantages of both feature types: the shallow features, rich in low-frequency structural information (e.g., edges and contours), are preserved through the skip connection to maintain structural fidelity. Simultaneously, the deep features, which have been adaptively weighted by the attention mechanisms to emphasize critical channels and spatial details, provide enhanced high-frequency and semantic information for texture reconstruction. Thus, the model’s key innovation lies in using attention to refine the deep feature branch before fusion, ensuring that the combined feature map is optimally focused on the most salient information for accurate super-resolution.
The structure of the improved model is illustrated in Figure 3.
2.5. Experimental Setup and Evaluation Metrics
To comprehensively validate the effectiveness of the proposed cross-year training strategy and improved model, we designed rigorous comparative experiments. This section details the baseline models, training configurations, loss functions, and evaluation metrics employed.
2.5.1. Baseline Models
We selected three representative deep learning models with diverse architectures from the super-resolution domain as baselines to ensure the generalizability of our conclusions:
ESPCN: Achieves efficient reconstruction through a feed-forward structure, with its core innovation being the sub-pixel convolution layer at the terminal stage. This layer assembles the high-resolution output directly from low-resolution feature maps, ensuring high computational efficiency [50].
DRCN: Introduces a deep recursive convolutional network that expands the receptive field through parameter-sharing recursive structures, thereby enhancing the model’s feature representation capacity without significantly increasing the number of parameters [51].
EDSR: Constructs a powerful deep residual network by removing redundant modules (e.g., batch normalization layers) and expanding model width, maintaining leading performance on multiple public benchmarks.
Additionally, to fairly evaluate the effectiveness of the improved RDN model (RDN_2), we compared it directly with the traditional RDN. The improved RDN incorporates both channel and spatial attention mechanisms, aiming to improve its feature utilization efficiency [44].
2.5.2. Training Strategy and Configuration
To ensure a fair comparison of the proposed cross-year masked sample strategy, each model was trained under two distinct modes:
Baseline Mode: Models were trained using only traditional paired samples from the same year, i.e., pairs $(I_t^{LR}, I_t^{HR})$ in which both images were acquired in the same year $t$.
Cross-year Mode: In addition to the baseline training set, cross-year masked sample pairs generated by the algorithm described in the Methodology section were incorporated. This mode aims to verify whether introducing temporal invariance constraints can effectively enhance model generalization.
A composite loss function combining pixel-level fidelity and structural similarity was used to supervise the training process:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}}(I^{HR}, \hat{I}^{HR}) + \lambda \, \mathcal{L}_{\mathrm{SSIM}}(I^{HR}, \hat{I}^{HR})$$

where $I^{HR}$ and $\hat{I}^{HR}$ denote the real high-resolution image and the model’s predicted output, respectively, and $\lambda$ balances the two terms. The MSE loss $\mathcal{L}_{\mathrm{MSE}}$ ensures pixel-level accuracy, while the SSIM loss $\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(I^{HR}, \hat{I}^{HR})$ focuses on preserving structural integrity. This combined loss guides the model to produce visually realistic reconstruction results while maintaining numerical precision.
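A compact sketch of this composite loss; the SSIM term here uses global image statistics for brevity (a full implementation would average SSIM over local Gaussian windows), and the weight `lam=0.1` is an assumed value:

```python
import torch
import torch.nn.functional as F

def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM computed from global image statistics."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(unbiased=False), y.var(unbiased=False)
    cxy = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cxy + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def composite_loss(pred, target, lam=0.1):
    """Pixel-fidelity MSE term plus a weighted (1 - SSIM) structural term."""
    return F.mse_loss(pred, target) + lam * (1.0 - ssim_global(pred, target))
```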
For optimization and training details, we employed the Adam optimizer with default momentum parameters. The batch size was set to 25. To optimize training dynamics, we implemented a dynamic learning rate scheduling strategy: the learning rate was halved if the validation loss did not decrease for three consecutive epochs, with a lower bound of 1 × 10⁻⁶. An early stopping mechanism was also applied: training was terminated if no improvement in validation loss was observed for ten consecutive epochs, preventing overfitting and improving training efficiency.
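These training dynamics map directly onto standard PyTorch utilities; `model`, the data loaders, the helper functions, the epoch budget, and the initial learning rate below are placeholders rather than reported values:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial lr assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3, min_lr=1e-6)

best_loss, stall = float("inf"), 0
for epoch in range(200):
    train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
    val_loss = evaluate(model, val_loader)            # hypothetical helper
    scheduler.step(val_loss)       # halves the lr after 3 stagnant epochs
    if val_loss < best_loss:
        best_loss, stall = val_loss, 0
    else:
        stall += 1
        if stall >= 10:            # early stopping after 10 stagnant epochs
            break
```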
To objectively quantify reconstruction quality from multiple perspectives, we adopted three widely recognized metrics:
Peak Signal-to-Noise Ratio (PSNR): Computed based on mean squared error (MSE), emphasizing pixel-level absolute error. It serves as a fundamental metric for evaluating the fidelity of reconstructed images.
Structural Similarity Index Measure (SSIM): Comprehensively assesses image quality from three dimensions—luminance, contrast, and structure—aligning more closely with human visual perception and evaluating the preservation of structural information.
Mean Squared Error (MSE): Directly calculates the average squared error of all pixels, reflecting the overall reconstruction error level.
By combining PSNR, SSIM, and MSE, we can comprehensively and objectively assess the model’s reconstruction performance.
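For reference, all three metrics are available in scikit-image; a sketch of a combined evaluation helper (the function name is ours) follows:

```python
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

def evaluate_reconstruction(pred, target, data_range=1.0):
    """PSNR / SSIM / MSE between a reconstruction and its reference,
    both float arrays in [0, 1] with channels last."""
    return {
        "PSNR": peak_signal_noise_ratio(target, pred, data_range=data_range),
        "SSIM": structural_similarity(target, pred, data_range=data_range,
                                      channel_axis=-1),
        "MSE": mean_squared_error(target, pred),
    }
```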
4. Discussion
The primary value of this methodology lies in its ability to generate a more consistent and reliable series of high-resolution images for heritage monitoring, effectively filling data gaps in years where only lower-resolution imagery is available. This mitigates a practical challenge frequently faced by site managers and conservation scientists. By providing reconstructed images with improved structural fidelity across different years, the method supports more robust time-series comparison. This facilitates several practical applications: more reliable detection of changes in architectural structures or the surrounding landscape, clearer tracking of long-term trends such as surface erosion or vegetation growth, and the compilation of a more continuous data record for evaluating the outcomes of past conservation work. Ultimately, by mitigating data interruptions and enhancing comparability across years, this approach aids in moving from a reactive to a more informed and proactive mode of heritage stewardship.
However, this study has certain limitations that point to directions for future research. First, our validation was conducted on a single heritage site using imagery from a single platform (Google Earth), with low-resolution data simulated via downsampling. While the “leave-one-year-out” protocol effectively tests temporal generalization, it does not fully encompass the complexities of real-world, long-term monitoring scenarios involving heterogeneous, multi-sensor, and multi-mission data streams (e.g., combining Landsat, Sentinel-2, and commercial satellite archives). Future work must therefore validate the proposed strategy on such authentic, multi-source data pairs to assess its robustness to more pronounced spectral and geometric variations. Second, our current method primarily relies on local structural similarity to identify stable regions, which may be affected by misregistration in images captured by different sensors or at different times. Future work could explore incorporating advanced vision tasks such as semantic segmentation to more precisely define temporal invariance. Additionally, adaptively determining optimal similarity thresholds for different regions remains an open question worthy of further investigation.
From a broader perspective, this study offers a new direction for super-resolution in remote sensing: shifting from continuously modifying model architectures to pursue optimal performance on idealized datasets, toward developing diverse training strategies that enhance robustness in real-world scenarios. This problem-oriented approach is particularly relevant for application domains that depend on long-term, consistent Earth observation data series, including but not limited to cultural heritage conservation, agricultural monitoring, urban planning, and disaster response. In these fields, reliable data continuity is often as critical as absolute spatial resolution. We believe that the integration of such application-aware algorithmic strategies with advanced deep learning models will enhance the practical value of remote sensing technology, contributing to more effective monitoring and sustainable management of vital resources and heritage assets.
5. Conclusions
This study addresses a common yet under-explored challenge in heritage site remote sensing super-resolution: the frequent unavailability of fully paired LR and HR samples from the same time period due to factors such as sensor limitations and cloud cover. Moreover, when training and test data originate from different years, temporal distribution shifts can significantly degrade model generalization. Unlike previous studies that primarily focus on network architecture improvements or conventional data augmentation, we propose a novel approach—introducing masked samples of invariant regions across different years—to explicitly guide the model in learning stable features that remain unchanged over time, while mitigating the impact of sensor and illumination variations. Compared to designing complex new network architectures, our method offers distinct practical advantages: as a training framework, it can be easily integrated into various existing super-resolution models without modifying their base structures or incurring significant computational overhead during inference.
Furthermore, this study enhances the traditional RDN architecture by integrating dual attention mechanisms into its powerful backbone. Experimental results demonstrate that the proposed training strategy combined with the enhanced RDN model (RDN_2_M) achieves higher reconstruction accuracy and superior visual quality compared to traditional deep learning models and training approaches. This improvement is of significant value for long-term monitoring applications such as cultural heritage site management.
The proposed training strategy and enhanced model together provide a practical solution for generating high-resolution imagery in years with missing data. By bridging these temporal gaps and improving image consistency, our approach directly supports the long-term monitoring needs of heritage site managers. It enables more reliable analysis of spatial and structural changes over time, thereby contributing to evidence-based conservation planning and sustainable management. This methodology is not limited to cultural heritage but is applicable to any domain requiring continuous, high-resolution Earth observation.