1. Introduction
The normalized difference vegetation index (NDVI) is a critical metric that effectively reflects vegetation growth status and is closely correlated with aboveground net primary productivity [
1]. It is widely employed in diverse applications, such as vegetation phenology extraction [
2], environmental monitoring [
3], and agricultural management [
4]. Furthermore, the NDVI plays a pivotal role in detecting anomalous events within tropical rainforests, including deforestation [
5], flooding [
6], and droughts [
7]. However, deriving continuous NDVI time series data from optical remote sensing data entails significant challenges. Persistent cloud cover [
8] and atmospheric interference [
9] frequently lead to data gaps, severely compromising the continuity and reliability of satellite observations, particularly over tropical regions.
To solve the discontinuity problem of NDVI time series derived from optical remote sensing data, researchers have proposed a variety of NDVI time series smoothing and reconstruction methods. In accordance with the principles of the methods and the characteristics of the data source, these methods can be divided into two categories [
10]: temporal-based methods and frequency-based methods. Temporal-based methods involve the smoothing and gap-filling of time series data primarily through the use of asymmetric Gaussian function fitting [
11], weighted least squares linear regression [
12] or Savitzky–Golay (SG) filters [
13]. The frequency-based methods often involve the use of Fourier functions [
14], the harmonic analysis of time series [
15] and wavelet transform [
16]. Contemporary smoothing and fitting methods are capable of attenuating noise in time series to a certain extent. Nonetheless, traditional techniques for reconstructing the NDVI are characterized by certain limitations, including the loss of detail regarding vegetation changes [
17], suboptimal performance during critical phenological stages [
15], and overparameterization [
18].
In response to the shortcomings of these traditional methods, researchers have developed machine learning and deep learning-based approaches for reconstructing long-term NDVI series. Song et al. [
19] proposed a CNN-based MODIS-to-Landsat image fusion framework, which improved the spatial details of low-resolution MODIS images through end-to-end mapping to generate high-resolution Landsat NDVI images. However, owing to inherent information loss, the framework failed to restore severely flooded areas. Yu et al. [
20] developed LSTM- and RNN-based networks to fuse MODIS-Sentinel-2 NDVI time series for monitoring short-term vegetation dynamics, but accumulated errors in prolonged series compromised interannual cyclical modeling. Benzenati et al. [
21] pioneered the integration of Transformer architectures into remote sensing reconstruction, achieving spatiotemporal joint modeling via multihead attention mechanisms to generate high-resolution NDVI images in partially cloud-covered regions. Nevertheless, the method’s heavy reliance on cloud-free training data limits its adaptability to dynamic environments. Qin et al. [
22] introduced the ReCoff framework, which combines multiscale feature fusion with physics-driven residual constraints (e.g., temperature-NDVI correlations) to maintain spatial consistency under extreme cloud cover. However, cross-platform heterogeneity reduced the fusion efficiency, and the method exhibited sensitivity to high-latitude snow interference.
Although deep learning-based methods have improved spatiotemporal NDVI reconstruction relative to traditional methods, they still cannot overcome the inherent limitations of optical remote sensing. The core limitation is that optical sensors cannot penetrate clouds to obtain valid data. This defect leads directly to two serious problems: first, frequent cloud cover causes large data losses that disrupt the continuity of long-term time series; second, models become unstable because of their heavy reliance on high-quality cloud-free imagery, especially in extreme environments such as tropical rainforests and monsoon regions, where data loss rates can reach 80%. In summary, relying solely on optical data cannot meet the needs of vegetation monitoring in key areas and important time periods, and research must therefore turn to multisource data fusion.
Synthetic aperture radar (SAR) [
23] is a remote sensing technology that can penetrate clouds and collect data under all weather conditions and at all times of the day. It overcomes optical limitations and has unique advantages in obtaining long-term surface data series in cloudy and rainy areas. In recent years, many studies have proposed ways to solve the problem of optical data reconstruction under cloud interference by utilizing the microwave observation advantages of SAR. He et al. [
24] used conditional generative adversarial networks (cGANs) to fuse multitemporal SAR and optical image data but were limited by the physical mechanism differences between SAR and optical data, which resulted in blurred time-series transitions. To improve the ability to model data continuity and solve the data gap caused by clouds, Zhao et al. [
25] proposed the MCNN-Seq model by combining convolutional neural networks (CNNs) and sequence-to-sequence learning and introduced an attention mechanism to optimize the decoding process. This significantly improved the NDVI reconstruction accuracy under short-term cloud interference; however, this method still has the problem of performance degradation in scenarios where long-term series data are missing. Li et al. [
26] used the Transformer architecture to capture global spatiotemporal dependencies by fusing SAR data and optical data to reconstruct vegetation NDVI time series in cloudy areas but faced the challenge of high computational cost. Mao et al. [
27] turned to lightweight linear regression (ST-MLR) to achieve faster fusion of SAR and optical data and obtain single-band NDVI long-term data, which improved efficiency but sacrificed the ability to represent complex nonlinear vegetation dynamics. The Bidirectional Recursive Observation System (BRIOS) recently proposed by Chen et al. [
28] alleviates error accumulation by modelling the temporal bidirectional dependence between SAR and optical data, thereby achieving high-quality NDVI data reconstruction. However, its adaptability to cross-resolution data and sensitivity to cloud cover remain key bottlenecks. The methods of fusing SAR data and optical data integrate multisource remote sensing data to reconstruct NDVI time series, leveraging their complementary advantages. SAR data provide cloud-penetrating capabilities, whereas optical data offer spectral information that is critical for vegetation monitoring. However, some problems still exist in the SAR and optical data fusion methods, including the contamination of optical spectral information by coherent speckle noise during the feature fusion stage and the difference in inherent feature information between SAR and optical data.
Given these limitations of SAR-optical fusion methods, researchers have in recent years proposed deriving optical data directly from SAR data to obtain NDVI time series. Guo et al. [
29] proposed an edge-preserving convolutional generative adversarial network (EPCGAN), which improves the edge detail accuracy of SAR-to-optical images through gradient branch constraints, but it is limited to single-channel SAR input and cannot utilize the complementary information of multichannel data. To overcome the bottleneck of missing time-series data, Garioud et al. [
30] developed the SenRVM model, which uses a recurrent neural network (RNN) to derive long-term NDVI data from SAR time series in cloud-covered areas. However, overreliance on the quality of the NDVI input leads to overfitting under high missing-data rates. Yang et al. [
31] designed an ICGAN model for multiscale feature fusion, which enhances the visual consistency and classification availability of SAR-NDVI conversion through chromatic aberration constraints, but the full-scale calculation mechanism causes high time consumption and terrain artefacts. Recently, Hu et al. [
32] combined dual-phase SAR conversion with the generation of a differenced normalized burn ratio (dNBR) to assess fire damage, but the approach was limited by severely imbalanced training data (false detections of nonfire surface changes). SAR-optical conversion methods have gradually overcome the limitations of single modalities and expanded application scenarios through technological evolution. However, the above methods for deriving optical data from SAR data are highly dependent on paired datasets. In cloudy and rainy areas, it is difficult to obtain enough paired data to adequately train such methods.
In recent years, Zhu et al. [
33] proposed Cycle-Consistent Adversarial Networks (CycleGAN) for image conversion without paired data. CycleGAN is currently widely used in many fields. Sandfort et al. [
34] used CycleGAN to derive noncontrast images from contrast CT images to enhance CT image data in the medical field and reduce the workload and cost of manual segmentation in CT imaging. Engin et al. [
35] used CycleGAN to derive corresponding haze-free images from hazy images, effectively improving single-image dehazing, and Deng et al. [
36] used CycleGAN to derive pedestrian images in different fields to enhance the effect of the pedestrian reidentification (re-ID) model. Quan et al. [
37] used an improved CycleGAN to derive fully sampled MRI images from undersampled MRI images, thereby achieving high-quality MRI reconstruction. CycleGAN has also been gradually applied to the field of remote sensing image processing. Efremova et al. [
38] proposed a framework combining CycleGAN and autoencoder to derive optical data from SAR data, thereby predicting soil moisture content (SMC) through machine learning. Suárez et al. [
39] used CycleGAN to derive near-infrared (NIR) images from RGB images and combined the derived NIR images with true RGB images to calculate the NDVI values. In summary, CycleGAN demonstrates good cross-domain data conversion capabilities in the field of remote sensing, especially for unpaired data sources.
However, directly applying the standard CycleGAN to SAR-to-optical translation faces specific limitations. First, CycleGAN relies heavily on CNNs, which excel at extracting local features but struggle to capture long-range global dependencies. This often leads to geometric distortions in complex landscapes where spatial context is crucial. Second, without structural constraints, the generator may synthesize realistic-looking textures that fail to align with actual ground objects, a phenomenon known as 'hallucination.' To address the challenges of unsupervised translation, several representative advanced models have been proposed. DualGAN [
40] introduced a dual-learning mechanism to improve training stability and data efficiency using a closed-loop architecture similar to natural language translation. GP-UNIT [
41] leveraged generative priors from pre-trained networks to capture rich content correspondences, aiming to bridge drastic visual discrepancies in complex domain mappings. DCLGAN [
42] incorporated contrastive learning to maximize mutual information between input and output patches, effectively mitigating the mode collapse issue often found in unpaired translation. While these models represent state-of-the-art advancements in efficiency, complex mapping, and stability, respectively, they still exhibit shortcomings in maintaining fine-grained structural fidelity and suppressing speckle noise when applied to the highly heterogeneous SAR-to-NDVI translation task. Therefore, it is necessary to improve the CycleGAN architecture to simultaneously capture spatiotemporal dependencies and enforce structural consistency. Based on CycleGAN, this study developed a method to derive temporally continuous NDVI time series from SAR images without paired data. The main contributions of this study are summarized as follows: (1) We propose an improved SA-CycleGAN, which integrates a novel spatiotemporal attention generator capable of capturing long-range dependencies in SAR sequences, thereby effectively distinguishing complex land cover patterns. (2) We introduce a structural similarity (SSIM) loss function to suppress inherent speckle noise in SAR data while preserving high-frequency structural details in NDVI images. (3) We demonstrate that the proposed method can acquire temporally continuous NDVI data in cloudy areas without requiring strictly paired training data, achieving superior accuracy compared to current mainstream unsupervised models.
2. Materials and Methods
2.1. Study Areas
In this study, we selected four sites (Zhangbei, Xishuangbanna, AGRO, HARV) with different land cover types to test the proposed algorithm. The latitude and longitude of the center points for these sites are listed in
Table 1. Each site covers a 5.12 km × 5.12 km square area centered on this point.
The Zhangbei site, which is located in northern China, represents a temperate grassland ecosystem. It is characterized predominantly by temperate grassland vegetation. This makes it a representative area for research on grazing dynamics and grassland ecosystem processes. The Xishuangbanna site, located in Yunnan Province, China, represents a tropical rainforest ecosystem. This region is characterized by high biodiversity and dense tropical vegetation. Consequently, it plays a critical role in studies related to tropical ecosystems, species richness, and the impacts of climate change. The AGRO site is situated in Iowa, USA, and is primarily agricultural. It features extensive cultivation of crops, including corn, soybeans, and wheat. This site is well suited for investigations into farmland ecosystem functioning and crop phenology. The HARV site, located in Massachusetts, USA, is dominated by mixed temperate deciduous forest. This environment is suitable for examining forest ecosystem dynamics and biodiversity changes.
2.2. Data
The Sentinel-1 SAR and Sentinel-2 multispectral instrument (MSI) imagery provided by the European Space Agency’s (ESA) Copernicus Program was used in this study. Sentinel-1 consists of two satellites, Sentinel-1A and Sentinel-1B. They provide day-and-night C-band SAR imaging. With both satellites in operation, they achieve a six-day revisit time at the equator. In this study, SAR images acquired in interferometric wide-swath (IW) mode with a spatial resolution of approximately 10 m were used. Sentinel-2 also consists of two satellites, Sentinel-2A and Sentinel-2B. Each satellite carries an MSI. The MSI captures images in 13 spectral bands, covering the visible (VIS), near-infrared (NIR), and shortwave infrared (SWIR) regions. Spatial resolution varies by spectral band: the four VIS/NIR bands (blue, green, red, and NIR) have a resolution of 10 m; the six red-edge and SWIR bands have a resolution of 20 m; and the three atmospheric correction bands have a resolution of 60 m. When both satellites are in orbit, Sentinel-2 has a revisit time of five days at the equator. In this study, L2A products (atmospherically corrected bottom-of-atmosphere surface reflectance) with a spatial resolution of approximately 10 m were used.
For each site, Sentinel-1 SAR images covering the period 2017–2021 and Sentinel-2 MSI images covering 2019–2023 were collected for our study. Given the extensive temporal coverage and data volume involved across multiple sites, the Google Earth Engine (GEE) platform was utilized for efficient data download. The number of Sentinel-1 SAR and Sentinel-2 MSI images for each site is detailed in
Table 2.
2.3. Overall Procedure
In this study, an enhanced CycleGAN (denoted by SA-CycleGAN) is developed to derive temporally continuous NDVI time series from SAR images without paired data. The SA-CycleGAN was trained using Sentinel-1 SAR and Sentinel-2 MSI images that were not strictly paired in space and time. The overall workflow is illustrated in
Figure 1. It comprises four steps. Step 1: Data preprocessing. For the Sentinel-1 SAR images, speckle filtering and boundary noise removal were performed within GEE to mitigate noise effects. Three bands were prepared for each SAR image: the VV and VH backscatter coefficients and the normalized difference backscatter index (NDBI), calculated in decibels (dB). For Sentinel-2 MSI images, preprocessing involved selecting images with less than 20% cloud cover on the basis of the ‘CLOUDY_PIXEL_PERCENTAGE’ attribute. Three bands were prepared: surface reflectance in the red (B4) and near-infrared (B8) bands and NDVI calculated from the surface reflectance. Step 2: Training dataset construction. After preprocessing, aligned Sentinel-1 SAR and Sentinel-2 MSI images were exported. These exports covered regions of interest (ROIs) defined for the four study sites. To create the final dataset, the exported images were cropped into nonoverlapping tiles of 256 × 256 pixels. All tiles were visually inspected, and those containing large areas of no data, significant cloud cover, or noticeable mosaic artefacts were discarded. The resulting dataset was then split into training and test datasets at a 4:1 ratio. The acquisition date and corresponding file path for each available image tile within the specified time range for each ROI were also exported in JSON format. Step 3: Model training. The Sentinel-1 SAR and Sentinel-2 MSI training datasets, along with time-series information, serve as inputs to the SA-CycleGAN. The model is iteratively trained, passing training datasets through its two modified generators and discriminators to refine its parameters and achieve an optimal configuration. Step 4: Model testing and validation. The SAR image test datasets were fed into the optimized SA-CycleGAN model to derive the NDVI values. The accuracy of the NDVI values was then verified by comparison with the NDVI values calculated from the Sentinel-2 MSI images.
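To make Step 1 concrete, the following is a minimal sketch (not the authors' exact script) of how the three-band SAR and MSI inputs could be prepared with the GEE Python API. The ROI coordinates are hypothetical, and the NDBI formula (VH − VV)/(VH + VV) is an assumption, since the paper does not state its exact definition.

```python
# Illustrative sketch of Step 1 with the GEE Python API (assumptions noted above).
import ee

ee.Initialize()

roi = ee.Geometry.Rectangle([114.68, 41.25, 114.74, 41.30])  # hypothetical ROI

def prep_s1(img):
    """Stack VV, VH backscatter (dB) and a normalized difference backscatter index."""
    vv = img.select('VV')
    vh = img.select('VH')
    ndbi = vh.subtract(vv).divide(vh.add(vv)).rename('NDBI')  # assumed NDBI definition
    return ee.Image.cat([vv, vh, ndbi]).clip(roi)

def prep_s2(img):
    """Stack red (B4), NIR (B8) reflectance and the derived NDVI."""
    ndvi = img.normalizedDifference(['B8', 'B4']).rename('NDVI')
    return img.select(['B4', 'B8']).addBands(ndvi).clip(roi)

s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(roi)
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .map(prep_s1))

s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
      .filterBounds(roi)
      .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))  # <20% cloud cover
      .map(prep_s2))
```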
2.4. SA-CycleGAN
CycleGAN consists of two generators and two discriminators. By combining adversarial loss with cycle consistency loss, it achieves mutual conversion between two image domains without paired training data. In this study, an enhanced CycleGAN (SA-CycleGAN) was developed. The structure of the SA-CycleGAN is shown in
Figure 2. To enhance the performance of the generator, we propose a novel spatial–temporal attention architecture that integrates 3D convolutions with self-attention and cross-attention mechanisms. This design enables the model to simultaneously capture global spatial patterns, fine-grained local details, and critical temporal dynamics from the input data. For the discriminator, we retain the PatchGAN architecture from the CycleGAN, which effectively balances computational efficiency with the ability to assess local image realism. Furthermore, we augment the CycleGAN loss function with a structural similarity (SSIM) loss term to explicitly improve the structural fidelity of the derived results. The following sections provide a detailed description of the spatial–temporal attention generator, the PatchGAN discriminator, and the composite loss function employed for network training.
2.4.1. Spatiotemporal Attention Generator
To address the complexity of time-series data and enhance cross-modal translation between SAR and optical images, we propose a novel spatiotemporal attention generator. This architecture substantially redesigns the traditional encoder–decoder framework by integrating three key components: a spatiotemporal encoder, an intermediate stage comprising residual blocks and a self-attention mechanism, and an attention decoder. The architecture of the spatiotemporal attention generator is shown in
Figure 3a.
The workflow of the spatiotemporal attention generator begins with a spatiotemporal encoder, which processes an input image sequence of T frames (e.g., T = 3). The spatiotemporal encoder comprises two 3D convolutional layers followed by a downsampling stage of three 2D convolutional layers. The first 3D convolutional layer uses a 3 × 3 × 3 convolution kernel to capture local spatiotemporal features and short-term pixel dynamics between adjacent frames, and the second 3D convolutional layer uses convolution kernels with a temporal dimension equal to the sequence length (e.g., T × 3 × 3) to fuse and compress temporal information along the temporal axis. The output is a single, time-aware 2D feature map that encapsulates the dynamics of the input sequence. This time-aware feature map then enters the downsampling stage: an initial 7 × 7 2D convolutional layer performs large-scale feature extraction, followed by two 3 × 3 2D convolutional layers with stride 2, which progressively downsample the spatial dimensions of the feature map while increasing its channel depth. The feature maps generated at each scale by the encoder are saved via skip connections for later use in the decoder.
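A minimal PyTorch sketch of this encoder is given below. The channel widths, activation, and normalization choices are illustrative assumptions; only the layer types and kernel shapes follow the description above.

```python
# Minimal sketch of the spatiotemporal encoder (assumed widths/activations).
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    def __init__(self, in_ch=3, base_ch=64, seq_len=3):
        super().__init__()
        # 3x3x3 kernel: local spatiotemporal features between adjacent frames
        self.conv3d_local = nn.Conv3d(in_ch, base_ch, kernel_size=3, padding=1)
        # T x 3 x 3 kernel: fuse and compress the temporal axis into one feature map
        self.conv3d_fuse = nn.Conv3d(base_ch, base_ch,
                                     kernel_size=(seq_len, 3, 3),
                                     padding=(0, 1, 1))
        # 2D downsampling path: 7x7 large-scale extraction, then two stride-2 convs
        self.down = nn.ModuleList([
            nn.Conv2d(base_ch, base_ch, 7, stride=1, padding=3),
            nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1),
            nn.Conv2d(base_ch * 2, base_ch * 4, 3, stride=2, padding=1),
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        x = self.act(self.conv3d_local(x))
        x = self.act(self.conv3d_fuse(x)).squeeze(2)   # -> (B, C', H, W), time-aware
        skips = []
        for conv in self.down:
            x = self.act(conv(x))
            skips.append(x)        # saved via skip connections for the decoder
        return x, skips

# e.g., a batch of two 3-frame SAR sequences with 3 bands at 256 x 256
feats, skips = SpatioTemporalEncoder()(torch.randn(2, 3, 3, 256, 256))
```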
In the middle of the generator, the deeply encoded features undergo complex feature transformations using nine residual blocks. The structure of the residual block is shown in
Figure 3b. Following these residual blocks, a self-attention mechanism is applied.
Figure 3c shows the structure of the self-attention mechanism. This module operates on a single feature map, computing the query (
Q), key (
K), and value (
V) from the same source. Its function is to capture long-range spatial dependencies within the feature map, ensuring global consistency by allowing different spatial locations to interact and optimize each other’s representations before the decoding phase begins.
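The sketch below illustrates one common way to implement such a self-attention block on a 2D feature map (Figure 3c); the 1 × 1 projections and the learnable residual scale are standard choices and are assumptions rather than details taken from the paper.

```python
# Illustrative self-attention block: Q, K, and V all come from the same feature map.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual scale

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)    # (B, HW, C/8)
        k = self.k(x).flatten(2)                    # (B, C/8, HW)
        v = self.v(x).flatten(2)                    # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)         # (B, HW, HW): long-range spatial deps
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                 # global context added to local features

out = SelfAttention2d(256)(torch.randn(1, 256, 64, 64))
```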
The final module of the generator is the attention decoder, which consists of upsampling layers and criss-cross attention mechanisms that reconstruct the image. The decoder innovatively incorporates a criss-cross attention mechanism between its two upsampling layers. This strategy aims to bridge the modality gap by establishing a direct interaction between the Sentinel-2 MSI image generation process and the encoded Sentinel-1 SAR image features, aligning the temporal sequence information captured from the Sentinel-1 SAR image with the spatial features synthesized for the Sentinel-2 MSI image. The structure of the criss-cross attention mechanism is shown in
Figure 3d. Unlike the self-attention mechanism, the query (
Q) vector of the criss-cross attention mechanism is generated from the upsampled feature maps of the decoder, and the key (
K) and value (
V) vectors come from the feature maps of the corresponding encoder layer in the other sequence. Since the encoder features are the product of the upstream temporal encoder, the
K and
V vectors inherently contain temporal sequence information, such as the temporal variation in backscatter patterns. Formula (1) is then used to compute the dot-product similarity between the decoder’s Q values (generated from the optical feature maps) and the encoder’s time-aware K values (derived from the SAR sequence) to generate an attention map. Crucially, the softmax operation in the attention mechanism acts as a soft-masking filter: it assigns lower weights to incoherent patterns (such as SAR speckle noise) and higher weights to consistent structural features that align with the semantic context of the optical decoding process. This effectively suppresses the propagation of SAR noise into the generated optical domain.
The attention map weights the importance of different temporal patterns from the Sentinel-1 SAR image relative to the feature currently being synthesized in the Sentinel-2 MSI image decoder. The attention map is then applied to the V vector to generate a context vector, which is integrated into the decoder’s feature maps. This attention decoding, performed at multiple semantic scales, aims to leverage the relevant spatiotemporal context from the source domain to guide the translation process.
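The sketch below illustrates this decoder-side attention in simplified form: queries come from the upsampled decoder features, and keys/values come from the time-aware encoder features. For brevity it uses full cross-attention rather than the memory-efficient criss-cross (row/column) formulation, so it should be read as an illustration of the idea rather than the exact module.

```python
# Simplified cross-attention: Q from the decoder, K/V from the time-aware encoder.
import torch
import torch.nn as nn

class CrossAttention2d(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)   # from decoder (optical synthesis path)
        self.k = nn.Conv2d(ch, ch // 8, 1)   # from encoder (time-aware SAR features)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, dec_feat, enc_feat):               # both: (B, C, H, W)
        b, c, h, w = dec_feat.shape
        q = self.q(dec_feat).flatten(2).transpose(1, 2)  # (B, HW, C/8)
        k = self.k(enc_feat).flatten(2)                  # (B, C/8, HW)
        v = self.v(enc_feat).flatten(2)                  # (B, C, HW)
        # softmax acts as a soft mask: incoherent (speckle-like) patterns get low weight
        attn = torch.softmax(q @ k, dim=-1)
        ctx = (v @ attn.transpose(1, 2)).view(b, c, h, w)  # context from the SAR sequence
        return self.gamma * ctx + dec_feat                 # injected into decoder features

fused = CrossAttention2d(256)(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```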
2.4.2. PatchGAN Discriminators
The discriminator structure of the CycleGAN is based on PatchGAN, whose core idea is to assess realism over local image patches rather than the whole image, thereby improving the quality of high-frequency details in generated images. The SA-CycleGAN in this study retains the discriminator structure of the CycleGAN; that is, the PatchGAN discriminator is used to distinguish the NDVI images derived from Sentinel-1 SAR images from the NDVI images calculated from the Sentinel-2 MSI images. The 256 × 256 × 3 image output by the generator is first processed by a multilevel convolution module, which stacks four convolution layers with instance normalization and LeakyReLU activation (CILR) and gradually downsamples the input to a feature map of 64 × 64 × 512. The stride of each convolution layer is 2, the kernel size is 4 × 4, and reflection padding is used to avoid edge artifacts.
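A minimal sketch of such a PatchGAN discriminator with CILR stacking is shown below; the channel widths and the final one-channel patch head follow common CycleGAN defaults and are assumptions rather than the paper's exact configuration.

```python
# Minimal PatchGAN sketch with CILR stacking (Conv -> InstanceNorm -> LeakyReLU,
# 4x4 kernels, stride 2, reflection padding). Widths are assumed defaults.
import torch
import torch.nn as nn

def cilr(in_ch, out_ch, stride=2, norm=True):
    layers = [nn.ReflectionPad2d(1),
              nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.features = nn.Sequential(
            cilr(in_ch, 64, norm=False),   # first block typically skips normalization
            cilr(64, 128),
            cilr(128, 256),
            cilr(256, 512),
        )
        # 1-channel map: each output pixel scores the realism of one local patch
        self.head = nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1)

    def forward(self, x):                  # x: (B, 3, 256, 256) generator output
        return self.head(self.features(x))

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))  # patch-level realism map
```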
2.4.3. Loss Functions
In the CycleGAN, the loss function consists mainly of adversarial loss and cycle consistency loss, which work together to ensure that the mapping is reversible and that the output remains consistent with the input domain. To improve the structural accuracy and detail of the generated images, a structural similarity index (SSIM) loss is added to the loss function.
We adopt the adversarial loss function proposed in the least squares generative adversarial network (LSGAN) [
43], and its objective function can be expressed as

$$\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\left[\left(D_Y(y) - 1\right)^2\right] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[D_Y(G(x))^2\right] \quad (2)$$

where $y$ represents images from domain $Y$ (Sentinel-2 MSI images) and $x$ represents images from domain $X$ (Sentinel-1 SAR images). $D_Y(y)$ represents the score assigned by the discriminator $D_Y$ to real images from $Y$. $G$ represents the generator that derives Sentinel-2 MSI images from Sentinel-1 SAR images, and $G(x)$ represents the images derived by the generator from $x$ that are intended to be identically distributed to $Y$. $D_Y(G(x))$ represents the score assigned by the discriminator to the derived images.
Compared with the binary cross-entropy loss of the traditional GAN, the adversarial loss results in better training stability and a higher quality of generated results, thus alleviating the problem of gradient disappearance. The traditional CycleGAN includes the cycle consistency loss to constrain the reversibility of mappings between the two domains:
$$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\left\lVert F(G(x)) - x \right\rVert_1\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\left[\left\lVert G(F(y)) - y \right\rVert_1\right] \quad (3)$$

where $F$ represents the generator that derives Sentinel-1 SAR images from Sentinel-2 MSI images, and $F(y)$ represents the images derived by the generator from $y$ that are intended to be identically distributed to $X$. The meanings of $G$, $x$ and $y$ are the same as those in Formula (2).
To preserve the structural and textural similarity of the synthesized images, this study incorporates the SSIM loss to supplement the adversarial loss and cycle consistency loss:

$$\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(x, y) \quad (4)$$

where the SSIM is derived from luminance, contrast, and structure:

$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)} \quad (5)$$

where $x$ corresponds to the NDVI values calculated from the Sentinel-2 MSI image and $y$ corresponds to the NDVI values derived from the Sentinel-1 SAR image by the SA-CycleGAN; $\mu_x$ and $\mu_y$ are the local means of $x$ and $y$, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the local variances of $x$ and $y$, respectively; $\sigma_{xy}$ is the local covariance of $x$ and $y$; and $C_1$ and $C_2$ are stability constants introduced to prevent the denominator from approaching 0, set to $0.01^2$ and $0.03^2$, respectively. A higher SSIM value, corresponding to a lower loss, indicates greater similarity between the NDVI values derived from the Sentinel-1 SAR image and the NDVI values calculated from the Sentinel-2 MSI image.
The total loss function can be written as follows:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X) + \lambda_1 \mathcal{L}_{\mathrm{cyc}}(G, F) + \lambda_2 \mathcal{L}_{\mathrm{SSIM}} \quad (6)$$

where $\lambda_1$ and $\lambda_2$ control the weights of the cycle consistency loss and the SSIM loss, respectively. In this study, $\lambda_1$ is set to 10 and $\lambda_2$ is set to 0.5 to balance texture and structure preservation.
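The sketch below assembles the generator-side objective of Formula (6) from these terms. The SSIM here is computed from global (whole-image) statistics for brevity, whereas the paper uses local windows, so it is an illustrative approximation rather than the exact implementation.

```python
# Hedged sketch of the generator objective: LSGAN adversarial term, L1 cycle
# consistency, and a (global-statistics) SSIM term.
import torch

def lsgan_g_loss(d_fake):                  # generator wants D(G(x)) close to 1
    return torch.mean((d_fake - 1) ** 2)

def cycle_loss(x, x_rec, y, y_rec):        # L1 reconstruction in both directions
    return torch.mean(torch.abs(x_rec - x)) + torch.mean(torch.abs(y_rec - y))

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def total_g_loss(d_fake_y, d_fake_x, x, x_rec, y, y_rec, ndvi_real, ndvi_fake,
                 lam_cyc=10.0, lam_ssim=0.5):
    adv = lsgan_g_loss(d_fake_y) + lsgan_g_loss(d_fake_x)
    cyc = cycle_loss(x, x_rec, y, y_rec)
    ssim_term = 1.0 - ssim(ndvi_real, ndvi_fake)
    return adv + lam_cyc * cyc + lam_ssim * ssim_term

# toy call with dummy tensors
t = lambda: torch.rand(1, 1, 64, 64)
print(total_g_loss(t(), t(), t(), t(), t(), t(), t(), t()))
```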
2.5. Training Dataset
After initial preprocessing, the core task was to construct a time-series dataset suitable for model training. The Sentinel-1 SAR and Sentinel-2 MSI images covering the four sites were exported from the Google Earth Engine.
A key component of this phase was the integration of temporal metadata, which were exported along with the images as JSON files. These files record the acquisition date and corresponding file path of each available image within the specified time range for each site. Furthermore, to prepare suitable inputs for the SA-CycleGAN model, the Sentinel-1 SAR and Sentinel-2 MSI images were systematically cropped into nonoverlapping tiles of 256 × 256 pixels. The cropped tiles with land cover according to the GLC_FCS10 dataset [
44] for the four selected sites are shown in
Figure 4. Rigorous quality control was then performed. Before sequence generation, each tile was visually inspected, and those with significant issues (such as large areas of missing data, significant cloud cover or cloud shadows, or obvious processing artifacts) were removed. This screening process ensures that the final dataset contains only high-quality Sentinel-1 SAR and Sentinel-2 MSI images. The selected tiles were divided into training and testing datasets at a standard 4:1 ratio. In
Figure 4, the red border area of each site represents the area of the test dataset, and the remaining area is the training dataset. The blue points in the testing dataset represent the points selected for the temporal consistency analysis described in
Section 3.2. The numbers of tiles for the training and testing datasets obtained after screening for each site are shown in
Table 3.
During the training phase, the JSON file is parsed during data loading to obtain the time-series information for each geographic location and to build a chronological list of all available image tiles. A sliding window with a predefined sequence length (three consecutive images in this study) is then applied to this list. The generator and discriminator are updated alternately. In each iteration, the parameters of the generator are first fixed and the discriminator is updated by minimizing its loss; the discriminator parameters are then fixed, and the generator is updated by minimizing the combination of adversarial loss, cycle consistency loss, and SSIM loss. To prevent the discriminator from converging too quickly or the generator from performing poorly, we set a specific update frequency and use a history buffer to stabilize training. The learning rate is initially set to 0.0002 and gradually decreases over time. The Adam optimizer [
45] is used, and its first-moment decay parameter ($\beta_1$) is set to 0.5 on the basis of experiments to ensure a balance between details, structure, and overall realism. Training was performed on an RTX 4090 graphics card with 24 GB of video memory, and the number of iterations was set to 100 epochs. When the adversarial loss, the cycle consistency loss, and the SSIM loss all stabilize or decrease slowly, the model is considered to have reached its optimal state.
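The following schematic shows the alternating update pattern described above, using tiny stand-in networks so that the snippet runs end to end; the real training uses the SA-CycleGAN generators and discriminators, both mapping directions, the history buffer, and the full loss of Formula (6).

```python
# Schematic of the alternating generator/discriminator updates (stand-in networks).
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, 3, padding=1)            # stand-in generator (SAR -> NDVI image)
D = nn.Conv2d(3, 1, 4, stride=2, padding=1)  # stand-in PatchGAN-style discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

sar = torch.randn(2, 3, 64, 64)              # dummy SAR tiles
msi = torch.randn(2, 3, 64, 64)              # dummy (unpaired) MSI tiles

for step in range(3):
    # 1) update the discriminator with the generator frozen
    opt_d.zero_grad()
    fake = G(sar).detach()                   # detach: no gradient flows to the generator
    d_loss = ((D(msi) - 1) ** 2).mean() + (D(fake) ** 2).mean()   # LSGAN D loss
    d_loss.backward()
    opt_d.step()

    # 2) update the generator with the discriminator frozen
    opt_g.zero_grad()
    fake = G(sar)
    g_loss = ((D(fake) - 1) ** 2).mean()     # + cycle and SSIM terms in the real model
    g_loss.backward()
    opt_g.step()
```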
2.6. Performance Evaluation
To evaluate the effectiveness of the SA-CycleGAN model, we conducted comparative experiments against three established unsupervised image-to-image translation algorithms: DualGAN, GP-UNIT, and DCLGAN. The evaluation comprised two primary components: spatial consistency and temporal consistency. To assess spatial consistency, we selected Sentinel-1 SAR and Sentinel-2 MSI images over the four study areas. The selected Sentinel-2 MSI images were required to have less than 5% cloud cover. We then aligned the NDVI values derived from the Sentinel-1 SAR images by the SA-CycleGAN with the NDVI values calculated from the corresponding Sentinel-2 MSI images. The spatial distribution of the NDVI values derived from the SA-CycleGAN was visualized for intuitive comparison. Furthermore, we plotted a probability kernel density plot of the NDVI values derived from the SA-CycleGAN and the NDVI values calculated from the corresponding Sentinel-2 MSI images to verify spatial consistency. To evaluate temporal consistency, we focused on the NDVI values calculated from Sentinel-2 MSI images between 2019 and 2021. We compared the NDVI time series derived by the SA-CycleGAN under these conditions with the outputs of the three other models. All the models were trained and tested using the same dataset configuration. The results were further analyzed by meteorological season: spring (March to May), summer (June to August), autumn (September to November), and winter (December to February). We calculated the seasonal means of the quantitative metrics and generated radar plots to visually compare model performance across seasons. Finally, a series of ablation studies were performed to verify the contributions of the key components in our proposed model. We compared the performance of the full SA-CycleGAN against three variants: the original CycleGAN, CycleGAN with only the spatial–temporal attention module (Attention-CycleGAN), and CycleGAN with only the SSIM loss function (SSIM-CycleGAN). Furthermore, a sensitivity analysis was performed for different input image sequence lengths (T) to determine the optimal sequence length for the model.
To quantitatively evaluate the model performance, four metrics were calculated: the root mean square error (RMSE), the coefficient of determination ($R^2$), the peak signal-to-noise ratio (PSNR), and the SSIM. The RMSE measures the deviation between the predicted and actual values; a lower RMSE indicates smaller prediction errors. It is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ denotes the NDVI values calculated from the Sentinel-2 MSI images and $\hat{y}_i$ is the NDVI value derived from the SA-CycleGAN. $R^2$ evaluates how well the model’s outputs fit the true data; a higher $R^2$ value signifies greater consistency between the predicted and actual values. It is calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$

where $\bar{y}$ is the average of the NDVI values calculated from the Sentinel-2 MSI images. The PSNR assesses the quality of the derived NDVI images; higher PSNR values indicate better image quality. It is expressed as follows:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}}\right)$$

where $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (e.g., 1 for normalized NDVI) and MSE is the mean squared error between the Sentinel-2 MSI and derived images. The SSIM measures the structural similarity between the NDVI image derived by the SA-CycleGAN and the NDVI image calculated from Sentinel-2 MSI images, where higher values indicate better preservation of structural details. It is defined as consistent with Formula (5).
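The four metrics can be computed as in the sketch below; the arrays are dummy data, and the SSIM is evaluated from global statistics for brevity rather than over local windows as in Formula (5).

```python
# Sketch of the four evaluation metrics on NDVI arrays (dummy data).
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def psnr(y_true, y_pred, max_val=1.0):      # max_val = 1 for normalized NDVI
    mse = np.mean((y_true - y_pred) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

s2_ndvi = np.random.rand(256, 256)           # NDVI from Sentinel-2 MSI (reference)
sar_ndvi = np.random.rand(256, 256)          # NDVI derived by the SA-CycleGAN
print(rmse(s2_ndvi, sar_ndvi), r2(s2_ndvi, sar_ndvi),
      psnr(s2_ndvi, sar_ndvi), ssim_global(s2_ndvi, sar_ndvi))
```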
4. Discussion
Our results demonstrate that the SA-CycleGAN significantly outperforms state-of-the-art unsupervised models (DualGAN, GP-UNIT, and DCLGAN) in deriving high-fidelity NDVI from Sentinel-1 SAR images. This superiority is particularly evident in spatially heterogeneous regions, such as the Zhangbei agricultural site. As observed in
Figure 5 and the kernel density distributions in
Figure 9a, the SA-CycleGAN successfully reconstructs the bimodal distribution of NDVI values, effectively distinguishing between vegetated surfaces and bare soil. Traditional CNN-based models like DualGAN and GP-UNIT rely on local convolutions, which tend to average spatially adjacent features, leading to “mode collapse” where distinct spectral peaks are merged. In contrast, our integration of a spatiotemporal attention mechanism allows the model to capture global context and long-range dependencies, preserving sharp field boundaries and preventing the oversmoothing of fine-grained textures observed in comparison models. Similarly, in the dense tropical rainforest of Xishuangbanna (
Figure 6) and the mixed forest of HARV (
Figure 8), the SA-CycleGAN maintains structural integrity and accurate vegetation density gradients where other models fail, exhibiting severe artifacts or spectral aliasing. These findings confirm that the synergistic effect of the spatiotemporal attention module and the SSIM loss function is crucial for handling complex land cover patterns and preserving edge details in the absence of paired training data.
A critical advantage of the SA-CycleGAN is its ability to generate temporally continuous and phenologically accurate NDVI time series, acting as a robust gap-filling tool for cloudy regions. The temporal consistency analysis (
Figure 10) reveals that our model effectively filters out the noise and instability inherent in optical time series caused by residual cloud contamination. For instance, at the Zhangbei and AGRO sites, the derived NDVI trajectories closely match the upper envelope of the Sentinel-2 observations, accurately characterizing growth cycles despite fluctuations in the ground-truth data. This “robust filter” capability is further highlighted at the Xishuangbanna site (
Figure 10b), where frequent cloud cover renders optical data unreliable; the SA-CycleGAN provides a stable, high-NDVI baseline that reflects the true phenological state of the rainforest. Quantitative seasonal analysis (
Figure 11) further underscores this robustness, showing that the SA-CycleGAN maintains superior metrics (RMSE, PSNR, SSIM, and R²) across all seasons, even mitigating snow-related scattering noise during the winter months, where the performance of the other models degrades significantly.
Despite its superior performance, the SA-CycleGAN exhibits some limitations in specific scenarios. In regions with extremely high heterogeneity and subtle NDVI gradients, such as the 1E subregion of Zhangbei (
Figure 5c) and the 1D subregion of HARV (
Figure 8d), the model occasionally overestimates or underestimates vegetation density compared to Sentinel-2 data. This suggests that while the attention mechanism improves global dependency modelling, extremely fine-scale texture variations in mixed forests or sparse vegetation might still be challenging to fully resolve without higher-resolution SAR input. Additionally, although the model shows resilience to winter snow cover, performance metrics do decrease slightly compared to summer seasons (
Figure 11), indicating that snow scattering mechanisms introduce complex non-linearities that are harder to translate. Future work will focus on integrating physical scattering models or multi-frequency SAR data to better handle these extreme conditions and further improve fine-scale texture preservation.