1. Introduction
The normalized difference vegetation index (NDVI) is a critical metric that effectively reflects vegetation growth status and is closely correlated with aboveground net primary productivity [
1]. It is widely employed in diverse applications, such as vegetation phenology extraction [
2], environmental monitoring [
3], and agricultural management [
4]. Furthermore, the NDVI plays a pivotal role in detecting anomalous events within tropical rainforests, including deforestation [
5], flooding [
6], and droughts [
7]. However, deriving continuous NDVI time series data from optical remote sensing data entails significant challenges. Persistent cloud cover [
8] and atmospheric interference [
9] frequently lead to data gaps, severely compromising the continuity and reliability of satellite observations, particularly over tropical regions.
To solve the discontinuity problem of NDVI time series derived from optical remote sensing data, researchers have proposed a variety of NDVI time series smoothing and reconstruction methods. In accordance with the principles of the methods and the characteristics of the data source, these methods can be divided into two categories [
10]: temporal-based methods and frequency-based methods. Temporal-based methods involve the smoothing and gap-filling of time series data primarily through the use of asymmetric Gaussian function fitting [
11], weighted least squares linear regression [
12] or Savitzky–Golay (SG) filters [
13]. The frequency-based methods often involve the use of Fourier functions [
14], the harmonic analysis of time series [
15] and wavelet transform [
16]. Contemporary smoothing and fitting methods are capable of attenuating noise in time series to a certain extent. Nonetheless, traditional techniques for reconstructing the NDVI are characterized by certain limitations, including the loss of detail regarding vegetation changes [
17], suboptimal performance during critical phenological stages [
15], and overparameterization [
18].
In response to the shortcomings of these traditional methods, researchers have developed machine learning and deep learning-based approaches for reconstructing long-term NDVI series. Song et al. [
19] proposed a CNN-based MODIS-to-Landsat image fusion framework, which improved the spatial details of low-resolution MODIS images through end-to-end mapping to generate high-resolution Landsat NDVI images. However, owing to inherent information loss, the framework failed to restore severely flooded areas. Yu et al. [
20] developed LSTM- and RNN-based networks to fuse MODIS-Sentinel-2 NDVI time series for monitoring short-term vegetation dynamics, but accumulated errors in prolonged series compromised interannual cyclical modeling. Benzenati et al. [
21] pioneered the integration of Transformer architectures into remote sensing reconstruction, achieving spatiotemporal joint modeling via multihead attention mechanisms to generate high-resolution NDVI images in partially cloud-covered regions. Nevertheless, the method’s heavy reliance on cloud-free training data limits its adaptability to dynamic environments. Qin et al. [
22] introduced the ReCoff framework, which combines multiscale feature fusion with physics-driven residual constraints (e.g., temperature-NDVI correlations) to maintain spatial consistency under extreme cloud cover. However, cross-platform heterogeneity reduced the fusion efficiency, and the method exhibited sensitivity to high-latitude snow interference.
Although deep learning-based methods have improved spatiotemporal NDVI reconstruction relative to traditional methods, they still cannot overcome the inherent limitations of optical remote sensing. The core limitation is that optical sensors cannot penetrate clouds to obtain valid data. This defect leads directly to two serious problems: first, frequent cloud cover causes large data losses that disrupt the continuity of long-term time series; second, models become unstable because of their heavy reliance on high-quality cloud-free imagery, especially in extreme environments such as tropical rainforests and monsoon regions, where data loss rates can reach 80%. In summary, relying solely on optical data cannot meet the needs of vegetation monitoring in key areas and important time periods, and research must therefore turn to multisource data fusion.
Synthetic aperture radar (SAR) [
23] is a remote sensing technology that can penetrate clouds and collect data under all weather conditions and at all times of the day. It overcomes optical limitations and has unique advantages in obtaining long-term surface data series in cloudy and rainy areas. In recent years, many studies have proposed ways to solve the problem of optical data reconstruction under cloud interference by utilizing the microwave observation advantages of SAR. He et al. [
24] used conditional generative adversarial networks (cGANs) to fuse multitemporal SAR and optical image data but were limited by the physical mechanism differences between SAR and optical data, which resulted in blurred time-series transitions. To improve the ability to model data continuity and solve the data gap caused by clouds, Zhao et al. [
25] proposed the MCNN-Seq model by combining convolutional neural networks (CNNs) and sequence-to-sequence learning and introduced an attention mechanism to optimize the decoding process. This significantly improved the NDVI reconstruction accuracy under short-term cloud interference; however, this method still has the problem of performance degradation in scenarios where long-term series data are missing. Li et al. [
26] used the Transformer architecture to capture global spatiotemporal dependencies by fusing SAR data and optical data to reconstruct vegetation NDVI time series in cloudy areas but faced the challenge of high computational cost. Mao et al. [
27] turned to lightweight linear regression (ST-MLR) to achieve faster fusion of SAR and optical data and obtain single-band NDVI long-term data, which improved efficiency but sacrificed the ability to represent complex nonlinear vegetation dynamics. The Bidirectional Recursive Observation System (BRIOS) recently proposed by Chen et al. [
28] alleviates error accumulation by modelling the temporal bidirectional dependence between SAR and optical data, thereby achieving high-quality NDVI data reconstruction. However, its adaptability to cross-resolution data and sensitivity to cloud cover remain key bottlenecks. The methods of fusing SAR data and optical data integrate multisource remote sensing data to reconstruct NDVI time series, leveraging their complementary advantages. SAR data provide cloud-penetrating capabilities, whereas optical data offer spectral information that is critical for vegetation monitoring. However, some problems still exist in the SAR and optical data fusion methods, including the contamination of optical spectral information by coherent speckle noise during the feature fusion stage and the difference in inherent feature information between SAR and optical data.
Given these limitations of SAR-optical fusion methods, researchers have in recent years proposed deriving optical data directly from SAR data to obtain NDVI time series. Guo et al. [
29] proposed an edge-preserving convolutional generative adversarial network (EPCGAN), which improves the edge detail accuracy of SAR-to-optical images through gradient branch constraints, but it is limited to single-channel SAR input and cannot utilize the complementary information of multichannel data. To overcome the bottleneck of missing time-series data, Garioud et al. [
30] developed the SenRVM model, which uses a recurrent neural network (RNN) to derive long-term NDVI data from SAR time series in cloud-covered areas. However, overreliance on the quality of the NDVI input leads to overfitting under high missing-data rates. Yang et al. [
31] designed an ICGAN model for multiscale feature fusion, which enhances the visual consistency and classification availability of SAR-NDVI conversion through chromatic aberration constraints, but the full-scale calculation mechanism causes high time consumption and terrain artefacts. Recently, Hu et al. [
32] combined dual-phase SAR conversion with the generation of a differenced normalized burn ratio (dNBR) to assess fire damage, but the approach was limited by severely imbalanced training data (false detections of nonfire surface changes). SAR-optical conversion methods have gradually overcome the limitations of single modalities and expanded application scenarios through technological evolution. However, the above methods for deriving optical data from SAR data are highly dependent on paired datasets. In cloudy and rainy areas, it is difficult to obtain enough paired data to adequately train such methods.
In recent years, Zhu et al. [
33] proposed Cycle-Consistent Adversarial Networks (CycleGAN) for image conversion without paired data. CycleGAN is currently widely used in many fields. Sandfort et al. [
34] used CycleGAN to derive noncontrast images from contrast CT images to enhance CT image data in the medical field and reduce the workload and cost of manual segmentation in CT imaging. Engin et al. [
35] used CycleGAN to derive corresponding haze-free images from hazy images, effectively improving single-image dehazing, and Deng et al. [
36] used CycleGAN to derive pedestrian images in different fields to enhance the effect of the pedestrian reidentification (re-ID) model. Quan et al. [
37] used an improved CycleGAN to derive fully sampled MRI images from undersampled MRI images, thereby achieving high-quality MRI reconstruction. CycleGAN has also been gradually applied to the field of remote sensing image processing. Efremova et al. [
38] proposed a framework combining CycleGAN and autoencoder to derive optical data from SAR data, thereby predicting soil moisture content (SMC) through machine learning. Suárez et al. [
39] used CycleGAN to derive near-infrared (NIR) images from RGB images and combined the derived NIR images with true RGB images to calculate the NDVI values. In summary, CycleGAN demonstrates good cross-domain data conversion capabilities in the field of remote sensing, especially for unpaired data sources.
However, directly applying the standard CycleGAN to SAR-to-optical translation faces specific limitations. First, CycleGAN relies heavily on CNNs, which excel at extracting local features but struggle to capture long-range global dependencies. This often leads to geometric distortions in complex landscapes where spatial context is crucial. Second, without structural constraints, the generator may synthesize realistic-looking textures that fail to align with actual ground objects, a phenomenon known as 'hallucination.' To address the challenges of unsupervised translation, several representative advanced models have been proposed. DualGAN [
40] introduced a dual-learning mechanism to improve training stability and data efficiency using a closed-loop architecture similar to natural language translation. GP-UNIT [
41] leveraged generative priors from pre-trained networks to capture rich content correspondences, aiming to bridge drastic visual discrepancies in complex domain mappings. DCLGAN [
42] incorporated contrastive learning to maximize mutual information between input and output patches, effectively mitigating the mode collapse issue often found in unpaired translation. While these models represent state-of-the-art advancements in efficiency, complex mapping, and stability, respectively, they still exhibit shortcomings in maintaining fine-grained structural fidelity and suppressing speckle noise when applied to the highly heterogeneous SAR-to-NDVI translation task. Therefore, it is necessary to improve the CycleGAN architecture to simultaneously capture spatiotemporal dependencies and enforce structural consistency. Based on CycleGAN, this study developed a method to derive temporally continuous NDVI time series from SAR images without paired data. The main contributions of this study are summarized as follows: (1) We propose an improved SA-CycleGAN, which integrates a novel spatiotemporal attention generator capable of capturing long-range dependencies in SAR sequences, thereby effectively distinguishing complex land cover patterns. (2) We introduce a structural similarity (SSIM) loss function to suppress inherent speckle noise in SAR data while preserving high-frequency structural details in NDVI images. (3) We demonstrate that the proposed method can acquire temporally continuous NDVI data in cloudy areas without requiring strictly paired training data, achieving superior accuracy compared to current mainstream unsupervised models.
2. Materials and Methods
2.1. Study Areas
In this study, we selected four sites (Zhangbei, Xishuangbanna, AGRO, HARV) with different land cover types to test the proposed algorithm. The latitude and longitude of the center points for these sites are listed in
Table 1. Each site covers a 5.12 km × 5.12 km square area centered on this point.
The Zhangbei site, which is located in northern China, represents a temperate grassland ecosystem. It is characterized predominantly by temperate grassland vegetation. This makes it a representative area for research on grazing dynamics and grassland ecosystem processes. The Xishuangbanna site, located in Yunnan Province, China, represents a tropical rainforest ecosystem. This region is characterized by high biodiversity and dense tropical vegetation. Consequently, it plays a critical role in studies related to tropical ecosystems, species richness, and the impacts of climate change. The AGRO site is situated in Iowa, USA, and is primarily agricultural. It features extensive cultivation of crops, including corn, soybeans, and wheat. This site is well suited for investigations into farmland ecosystem functioning and crop phenology. The HARV site, located in Massachusetts, USA, is dominated by mixed temperate deciduous forest. This environment is suitable for examining forest ecosystem dynamics and biodiversity changes.
2.2. Data
The Sentinel-1 SAR and Sentinel-2 multispectral instrument (MSI) imagery provided by the European Space Agency’s (ESA) Copernicus Program was used in this study. Sentinel-1 consists of two satellites, Sentinel-1A and Sentinel-1B. They provide day-and-night C-band SAR imaging. With both satellites in operation, they achieve a six-day revisit time at the equator. In this study, SAR images acquired in interferometric wide-swath (IW) mode with a spatial resolution of approximately 10 m were used. Sentinel-2 also consists of two satellites, Sentinel-2A and Sentinel-2B. Each satellite carries an MSI. The MSI captures images in 13 spectral bands, covering the visible (VIS), near-infrared (NIR), and shortwave infrared (SWIR) regions. Spatial resolution varies by spectral band: the four VIS/NIR bands (blue, green, red, and NIR) have a resolution of 10 m; the six red-edge and SWIR bands have a resolution of 20 m; and the three atmospheric correction bands have a resolution of 60 m. When both satellites are in orbit, Sentinel-2 has a revisit time of five days at the equator. In this study, L2A products (atmospherically corrected bottom-of-atmosphere surface reflectance) with a spatial resolution of approximately 10 m were used.
For each site, Sentinel-1 SAR images covering the period 2017–2021 and Sentinel-2 MSI images covering 2019–2023 were collected for our study. Given the extensive temporal coverage and data volume involved across multiple sites, the Google Earth Engine (GEE) platform was utilized for efficient data download. The number of Sentinel-1 SAR and Sentinel-2 MSI images for each site is detailed in
Table 2.
2.3. Overall Procedure
In this study, an enhanced CycleGAN (denoted by SA-CycleGAN) is developed to derive temporally continuous NDVI time series from SAR images without paired data. The SA-CycleGAN was trained using Sentinel-1 SAR and Sentinel-2 MSI images that were not strictly paired in space and time. The overall workflow is illustrated in
Figure 1. It comprises four steps. Step 1: Data preprocessing. For the Sentinel-1 SAR images, speckle filtering and boundary noise removal were performed within GEE to mitigate noise effects. Three bands were prepared for each SAR image: the VV and VH backscatter coefficients and the normalized difference backscatter index (NDBI), calculated in decibels (dB). For Sentinel-2 MSI images, preprocessing involved selecting images with less than 20% cloud cover on the basis of the ‘CLOUDY_PIXEL_PERCENTAGE’ attribute. Three bands were prepared: surface reflectance in the red (B4) and near-infrared (B8) bands and NDVI calculated from the surface reflectance. Step 2: Training dataset construction. After preprocessing, aligned Sentinel-1 SAR and Sentinel-2 MSI images were exported. These exports covered regions of interest (ROIs) defined for the four study sites. To create the final dataset, the exported images were cropped into nonoverlapping tiles of 256 × 256 pixels. All tiles were visually inspected, and those containing large areas of no data, significant cloud cover, or noticeable mosaic artefacts were discarded. The resulting dataset was then split into training and test datasets at a 4:1 ratio. The acquisition date and corresponding file path for each available image tile within the specified time range for each ROI were also exported in JSON format. Step 3: Model training. The Sentinel-1 SAR and Sentinel-2 MSI training datasets, along with time-series information, serve as inputs to the SA-CycleGAN. The model is iteratively trained, passing training datasets through its two modified generators and discriminators to refine its parameters and achieve an optimal configuration. Step 4: Model testing and validation. The SAR image test datasets were fed into the optimized SA-CycleGAN model to derive the NDVI values. The accuracy of the NDVI values was then verified by comparison with the NDVI values calculated from the Sentinel-2 MSI images.
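To make Step 1 concrete, the following is a minimal sketch (not the authors' exact script) of how the three-band SAR and MSI inputs could be prepared with the GEE Python API. The ROI coordinates are hypothetical, and the NDBI formula (VH − VV)/(VH + VV) is an assumption, since the paper does not state its exact definition.

```python
# Illustrative sketch of Step 1 with the GEE Python API (assumptions noted above).
import ee

ee.Initialize()

roi = ee.Geometry.Rectangle([114.68, 41.25, 114.74, 41.30])  # hypothetical ROI

def prep_s1(img):
    """Stack VV, VH backscatter (dB) and a normalized difference backscatter index."""
    vv = img.select('VV')
    vh = img.select('VH')
    ndbi = vh.subtract(vv).divide(vh.add(vv)).rename('NDBI')  # assumed NDBI definition
    return ee.Image.cat([vv, vh, ndbi]).clip(roi)

def prep_s2(img):
    """Stack red (B4), NIR (B8) reflectance and the derived NDVI."""
    ndvi = img.normalizedDifference(['B8', 'B4']).rename('NDVI')
    return img.select(['B4', 'B8']).addBands(ndvi).clip(roi)

s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(roi)
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .map(prep_s1))

s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
      .filterBounds(roi)
      .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))  # <20% cloud cover
      .map(prep_s2))
```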
2.4. SA-CycleGAN
CycleGAN consists of two generators and two discriminators. By combining adversarial loss with cycle consistency loss, it achieves mutual conversion between two image domains without paired training data. In this study, an enhanced CycleGAN (SA-CycleGAN) was developed. The structure of the SA-CycleGAN is shown in
Figure 2. To enhance the performance of the generator, we propose a novel spatial–temporal attention architecture that integrates 3D convolutions with self-attention and cross-attention mechanisms. This design enables the model to simultaneously capture global spatial patterns, fine-grained local details, and critical temporal dynamics from the input data. For the discriminator, we retain the PatchGAN architecture from the CycleGAN, which effectively balances computational efficiency with the ability to assess local image realism. Furthermore, we augment the CycleGAN loss function with a structural similarity (SSIM) loss term to explicitly improve the structural fidelity of the derived results. The following sections provide a detailed description of the spatial–temporal attention generator, the PatchGAN discriminator, and the composite loss function employed for network training.
2.4.1. Spatiotemporal Attention Generator
To address the complexity of time-series data and enhance cross-modal translation between SAR and optical images, we propose a novel spatiotemporal attention generator. This architecture substantially redesigns the traditional encoder–decoder framework by integrating three key components: a spatiotemporal encoder, an intermediate stage comprising residual blocks and a self-attention mechanism, and an attention decoder. The architecture of the spatiotemporal attention generator is shown in
Figure 3a.
The workflow of the spatiotemporal attention generator begins with a spatiotemporal encoder, which processes an input image sequence of T frames (e.g., T = 3). The spatiotemporal encoder comprises two 3D convolutional layers followed by a downsampling stage of three 2D convolutional layers. The first 3D convolutional layer uses a 3 × 3 × 3 convolution kernel to capture local spatiotemporal features and short-term pixel dynamics between adjacent frames, and the second 3D convolutional layer uses convolution kernels with a temporal dimension equal to the sequence length (e.g., T × 3 × 3) to fuse and compress temporal information along the temporal axis. The output is a single, time-aware 2D feature map that encapsulates the dynamics of the input sequence. This time-aware feature map then enters the downsampling stage: an initial 7 × 7 2D convolutional layer performs large-scale feature extraction, followed by two 3 × 3 2D convolutional layers with stride 2, which progressively downsample the spatial dimensions of the feature map while increasing its channel depth. The feature maps generated at each scale by the encoder are saved via skip connections for later use in the decoder.
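A minimal PyTorch sketch of this encoder is given below. The channel widths, activation, and normalization choices are illustrative assumptions; only the layer types and kernel shapes follow the description above.

```python
# Minimal sketch of the spatiotemporal encoder (assumed widths/activations).
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    def __init__(self, in_ch=3, base_ch=64, seq_len=3):
        super().__init__()
        # 3x3x3 kernel: local spatiotemporal features between adjacent frames
        self.conv3d_local = nn.Conv3d(in_ch, base_ch, kernel_size=3, padding=1)
        # T x 3 x 3 kernel: fuse and compress the temporal axis into one feature map
        self.conv3d_fuse = nn.Conv3d(base_ch, base_ch,
                                     kernel_size=(seq_len, 3, 3),
                                     padding=(0, 1, 1))
        # 2D downsampling path: 7x7 large-scale extraction, then two stride-2 convs
        self.down = nn.ModuleList([
            nn.Conv2d(base_ch, base_ch, 7, stride=1, padding=3),
            nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1),
            nn.Conv2d(base_ch * 2, base_ch * 4, 3, stride=2, padding=1),
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        x = self.act(self.conv3d_local(x))
        x = self.act(self.conv3d_fuse(x)).squeeze(2)   # -> (B, C', H, W), time-aware
        skips = []
        for conv in self.down:
            x = self.act(conv(x))
            skips.append(x)        # saved via skip connections for the decoder
        return x, skips

# e.g., a batch of two 3-frame SAR sequences with 3 bands at 256 x 256
feats, skips = SpatioTemporalEncoder()(torch.randn(2, 3, 3, 256, 256))
```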
In the middle of the generator, the deeply encoded features undergo complex feature transformations using nine residual blocks. The structure of the residual block is shown in
Figure 3b. Following these residual blocks, a self-attention mechanism is applied.
Figure 3c shows the structure of the self-attention mechanism. This module operates on a single feature map, computing the query (
Q), key (
K), and value (
V) from the same source. Its function is to capture long-range spatial dependencies within the feature map, ensuring global consistency by allowing different spatial locations to interact and optimize each other’s representations before the decoding phase begins.
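The sketch below illustrates one common way to implement such a self-attention block on a 2D feature map (Figure 3c); the 1 × 1 projections and the learnable residual scale are standard choices and are assumptions rather than details taken from the paper.

```python
# Illustrative self-attention block: Q, K, and V all come from the same feature map.
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual scale

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)    # (B, HW, C/8)
        k = self.k(x).flatten(2)                    # (B, C/8, HW)
        v = self.v(x).flatten(2)                    # (B, C, HW)
        attn = torch.softmax(q @ k, dim=-1)         # (B, HW, HW): long-range spatial deps
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                 # global context added to local features

out = SelfAttention2d(256)(torch.randn(1, 256, 64, 64))
```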
The final module of the generator is the attention decoder, which consists of upsampling layers and criss-cross attention mechanisms that reconstruct the image. The decoder innovatively incorporates a criss-cross attention mechanism between its two upsampling layers. This strategy aims to bridge the modality gap by establishing a direct interaction between the Sentinel-2 MSI image generation process and the encoded Sentinel-1 SAR image features, aligning the temporal sequence information captured from the Sentinel-1 SAR image with the spatial features synthesized for the Sentinel-2 MSI image. The structure of the criss-cross attention mechanism is shown in
Figure 3d. Unlike the self-attention mechanism, the query (
Q) vector of the criss-cross attention mechanism is generated from the upsampled feature maps of the decoder, and the key (
K) and value (
V) vectors come from the feature maps of the corresponding encoder layer in the other sequence. Since the encoder features are the product of the upstream temporal encoder, the
K and
V vectors inherently contain temporal sequence information, such as the temporal variation in backscatter patterns. Formula (1) is then used to compute the dot-product similarity between the decoder’s Q values (generated from the optical feature maps) and the encoder’s time-aware K values (derived from the SAR sequence) to generate an attention map. Crucially, the softmax operation in the attention mechanism acts as a soft-masking filter: it assigns lower weights to incoherent patterns (such as SAR speckle noise) and higher weights to consistent structural features that align with the semantic context of the optical decoding process. This effectively suppresses the propagation of SAR noise into the generated optical domain.
The attention map weights the importance of different temporal patterns from the Sentinel-1 SAR image relative to the feature currently being synthesized in the Sentinel-2 MSI image decoder. The attention map is then applied to the V vector to generate a context vector, which is integrated into the decoder’s feature maps. This attention decoding, performed at multiple semantic scales, aims to leverage the relevant spatiotemporal context from the source domain to guide the translation process.
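The sketch below illustrates this decoder-side attention in simplified form: queries come from the upsampled decoder features, and keys/values come from the time-aware encoder features. For brevity it uses full cross-attention rather than the memory-efficient criss-cross (row/column) formulation, so it should be read as an illustration of the idea rather than the exact module.

```python
# Simplified cross-attention: Q from the decoder, K/V from the time-aware encoder.
import torch
import torch.nn as nn

class CrossAttention2d(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)   # from decoder (optical synthesis path)
        self.k = nn.Conv2d(ch, ch // 8, 1)   # from encoder (time-aware SAR features)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, dec_feat, enc_feat):               # both: (B, C, H, W)
        b, c, h, w = dec_feat.shape
        q = self.q(dec_feat).flatten(2).transpose(1, 2)  # (B, HW, C/8)
        k = self.k(enc_feat).flatten(2)                  # (B, C/8, HW)
        v = self.v(enc_feat).flatten(2)                  # (B, C, HW)
        # softmax acts as a soft mask: incoherent (speckle-like) patterns get low weight
        attn = torch.softmax(q @ k, dim=-1)
        ctx = (v @ attn.transpose(1, 2)).view(b, c, h, w)  # context from the SAR sequence
        return self.gamma * ctx + dec_feat                 # injected into decoder features

fused = CrossAttention2d(256)(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```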
2.4.2. PatchGAN Discriminators
The discriminator structure of the CycleGAN is based on PatchGAN, whose core idea is to assess realism over local image patches rather than the whole image, thereby improving the quality of high-frequency details in generated images. The SA-CycleGAN in this study retains the discriminator structure of the CycleGAN; that is, the PatchGAN discriminator is used to distinguish the NDVI images derived from Sentinel-1 SAR images from the NDVI images calculated from the Sentinel-2 MSI images. The 256 × 256 × 3 image output by the generator is first processed by a multilevel convolution module, which stacks four convolution layers with instance normalization and LeakyReLU activation (CILR) and gradually downsamples the input to a feature map of 64 × 64 × 512. The stride of each convolution layer is 2, the kernel size is 4 × 4, and reflection padding is used to avoid edge artifacts.
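A minimal sketch of such a PatchGAN discriminator with CILR stacking is shown below; the channel widths and the final one-channel patch head follow common CycleGAN defaults and are assumptions rather than the paper's exact configuration.

```python
# Minimal PatchGAN sketch with CILR stacking (Conv -> InstanceNorm -> LeakyReLU,
# 4x4 kernels, stride 2, reflection padding). Widths are assumed defaults.
import torch
import torch.nn as nn

def cilr(in_ch, out_ch, stride=2, norm=True):
    layers = [nn.ReflectionPad2d(1),
              nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.features = nn.Sequential(
            cilr(in_ch, 64, norm=False),   # first block typically skips normalization
            cilr(64, 128),
            cilr(128, 256),
            cilr(256, 512),
        )
        # 1-channel map: each output pixel scores the realism of one local patch
        self.head = nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1)

    def forward(self, x):                  # x: (B, 3, 256, 256) generator output
        return self.head(self.features(x))

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))  # patch-level realism map
```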
2.4.3. Loss Functions
In the CycleGAN, the loss function consists mainly of adversarial loss and cycle consistency loss, which work together to ensure that the mapping is reversible and that the output remains consistent with the input domain. To improve the structural accuracy and detail of the generated images, a structural similarity index (SSIM) loss is added to the loss function.
We adopt the adversarial loss function proposed in the least squares generative adversarial network (LSGAN) [
43], and its objective function can be expressed as

$$\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\left[\left(D_Y(y) - 1\right)^2\right] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[D_Y(G(x))^2\right] \quad (2)$$

where $y$ represents images from domain $Y$ (Sentinel-2 MSI images) and $x$ represents images from domain $X$ (Sentinel-1 SAR images). $D_Y(y)$ represents the score assigned by the discriminator $D_Y$ to real images from $Y$. $G$ represents the generator that derives Sentinel-2 MSI images from Sentinel-1 SAR images, and $G(x)$ represents the images derived by the generator from $x$ that are intended to be identically distributed to $Y$. $D_Y(G(x))$ represents the score assigned by the discriminator to the derived images.
Compared with the binary cross-entropy loss of the traditional GAN, the adversarial loss results in better training stability and a higher quality of generated results, thus alleviating the problem of gradient disappearance. The traditional CycleGAN includes the cycle consistency loss to constrain the reversibility of mappings between the two domains:
$$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\left\lVert F(G(x)) - x \right\rVert_1\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\left[\left\lVert G(F(y)) - y \right\rVert_1\right] \quad (3)$$

where $F$ represents the generator that derives Sentinel-1 SAR images from Sentinel-2 MSI images, and $F(y)$ represents the images derived by the generator from $y$ that are intended to be identically distributed to $X$. The meanings of $G$, $x$ and $y$ are the same as those in Formula (2).
To preserve the structural and textural similarity of the synthesized images, this study incorporates the SSIM loss to supplement the adversarial loss and cycle consistency loss:

$$\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(x, y) \quad (4)$$

where the SSIM is derived from luminance, contrast, and structure:

$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)} \quad (5)$$

where $x$ corresponds to the NDVI values calculated from the Sentinel-2 MSI image and $y$ corresponds to the NDVI values derived from the Sentinel-1 SAR image by the SA-CycleGAN; $\mu_x$ and $\mu_y$ are the local means of $x$ and $y$, respectively; $\sigma_x^2$ and $\sigma_y^2$ are the local variances of $x$ and $y$, respectively; $\sigma_{xy}$ is the local covariance of $x$ and $y$; and $C_1$ and $C_2$ are stability constants introduced to prevent the denominator from approaching 0, set to $0.01^2$ and $0.03^2$, respectively. A higher SSIM value, corresponding to a lower loss, indicates greater similarity between the NDVI values derived from the Sentinel-1 SAR image and the NDVI values calculated from the Sentinel-2 MSI image.
The total loss function can be written as follows:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X) + \lambda_1 \mathcal{L}_{\mathrm{cyc}}(G, F) + \lambda_2 \mathcal{L}_{\mathrm{SSIM}} \quad (6)$$

where $\lambda_1$ and $\lambda_2$ control the weights of the cycle consistency loss and the SSIM loss, respectively. In this study, $\lambda_1$ is set to 10 and $\lambda_2$ is set to 0.5 to balance texture and structure preservation.
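The sketch below assembles the generator-side objective of Formula (6) from these terms. The SSIM here is computed from global (whole-image) statistics for brevity, whereas the paper uses local windows, so it is an illustrative approximation rather than the exact implementation.

```python
# Hedged sketch of the generator objective: LSGAN adversarial term, L1 cycle
# consistency, and a (global-statistics) SSIM term.
import torch

def lsgan_g_loss(d_fake):                  # generator wants D(G(x)) close to 1
    return torch.mean((d_fake - 1) ** 2)

def cycle_loss(x, x_rec, y, y_rec):        # L1 reconstruction in both directions
    return torch.mean(torch.abs(x_rec - x)) + torch.mean(torch.abs(y_rec - y))

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def total_g_loss(d_fake_y, d_fake_x, x, x_rec, y, y_rec, ndvi_real, ndvi_fake,
                 lam_cyc=10.0, lam_ssim=0.5):
    adv = lsgan_g_loss(d_fake_y) + lsgan_g_loss(d_fake_x)
    cyc = cycle_loss(x, x_rec, y, y_rec)
    ssim_term = 1.0 - ssim(ndvi_real, ndvi_fake)
    return adv + lam_cyc * cyc + lam_ssim * ssim_term

# toy call with dummy tensors
t = lambda: torch.rand(1, 1, 64, 64)
print(total_g_loss(t(), t(), t(), t(), t(), t(), t(), t()))
```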
2.5. Training Dataset
After initial preprocessing, the core task was to construct a time-series dataset suitable for model training. The Sentinel-1 SAR and Sentinel-2 MSI images covering the four sites were exported from the Google Earth Engine.
A key component of this phase was the integration of temporal metadata, which were exported along with the images as JSON files. These files record the acquisition date and corresponding file path of each available image within the specified time range for each site. Furthermore, to prepare suitable inputs for the SA-CycleGAN model, the Sentinel-1 SAR and Sentinel-2 MSI images were systematically cropped into nonoverlapping tiles of 256 × 256 pixels. The cropped tiles with land cover according to the GLC_FCS10 dataset [
44] for the four selected sites are shown in
Figure 4. Rigorous quality control was then performed. Before sequence generation, each tile was visually inspected, and those with significant issues (such as large areas of missing data, significant cloud cover or cloud shadows, or obvious processing artifacts) were removed. This screening process ensures that the final dataset contains only high-quality Sentinel-1 SAR and Sentinel-2 MSI images. The selected tiles were divided into training and testing datasets at a standard 4:1 ratio. In
Figure 4, the red border area of each site represents the area of the test dataset, and the remaining area is the training dataset. The blue points in the testing dataset represent the points selected for the temporal consistency analysis described in
Section 3.2. The numbers of tiles for the training and testing datasets obtained after screening for each site are shown in
Table 3.
During the training phase, the JSON file is parsed during data loading to obtain the time-series information for each geographic location and to build a chronological list of all available image tiles. A sliding window with a predefined sequence length (three consecutive images in this study) is then applied to this list. The generator and discriminator are updated alternately. In each iteration, the parameters of the generator are first fixed and the discriminator is updated by minimizing its loss; the discriminator parameters are then fixed, and the generator is updated by minimizing the combination of adversarial loss, cycle consistency loss, and SSIM loss. To prevent the discriminator from converging too quickly or the generator from performing poorly, we set a specific update frequency and use a history buffer to stabilize training. The learning rate is initially set to 0.0002 and gradually decreases over time. The Adam optimizer [
45] is used, and its first-moment decay parameter ($\beta_1$) is set to 0.5 on the basis of experiments to ensure a balance between details, structure, and overall realism. Training was performed on an RTX 4090 graphics card with 24 GB of video memory, and the number of iterations was set to 100 epochs. When the adversarial loss, the cycle consistency loss, and the SSIM loss all stabilize or decrease slowly, the model is considered to have reached its optimal state.
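The following schematic shows the alternating update pattern described above, using tiny stand-in networks so that the snippet runs end to end; the real training uses the SA-CycleGAN generators and discriminators, both mapping directions, the history buffer, and the full loss of Formula (6).

```python
# Schematic of the alternating generator/discriminator updates (stand-in networks).
import torch
import torch.nn as nn

G = nn.Conv2d(3, 3, 3, padding=1)            # stand-in generator (SAR -> NDVI image)
D = nn.Conv2d(3, 1, 4, stride=2, padding=1)  # stand-in PatchGAN-style discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

sar = torch.randn(2, 3, 64, 64)              # dummy SAR tiles
msi = torch.randn(2, 3, 64, 64)              # dummy (unpaired) MSI tiles

for step in range(3):
    # 1) update the discriminator with the generator frozen
    opt_d.zero_grad()
    fake = G(sar).detach()                   # detach: no gradient flows to the generator
    d_loss = ((D(msi) - 1) ** 2).mean() + (D(fake) ** 2).mean()   # LSGAN D loss
    d_loss.backward()
    opt_d.step()

    # 2) update the generator with the discriminator frozen
    opt_g.zero_grad()
    fake = G(sar)
    g_loss = ((D(fake) - 1) ** 2).mean()     # + cycle and SSIM terms in the real model
    g_loss.backward()
    opt_g.step()
```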
2.6. Performance Evaluation
To evaluate the effectiveness of the SA-CycleGAN model, we conducted comparative experiments against three established unsupervised image-to-image translation algorithms: DualGAN, GP-UNIT, and DCLGAN. The evaluation comprised two primary components: spatial consistency and temporal consistency. To assess spatial consistency, we selected Sentinel-1 SAR and Sentinel-2 MSI images over the four study areas. The selected Sentinel-2 MSI images were required to have less than 5% cloud cover. We then aligned the NDVI values derived from the Sentinel-1 SAR images by the SA-CycleGAN with the NDVI values calculated from the corresponding Sentinel-2 MSI images. The spatial distribution of the NDVI values derived from the SA-CycleGAN was visualized for intuitive comparison. Furthermore, we plotted a probability kernel density plot of the NDVI values derived from the SA-CycleGAN and the NDVI values calculated from the corresponding Sentinel-2 MSI images to verify spatial consistency. To evaluate temporal consistency, we focused on the NDVI values calculated from Sentinel-2 MSI images between 2019 and 2021. We compared the NDVI time series derived by the SA-CycleGAN under these conditions with the outputs of the three other models. All the models were trained and tested using the same dataset configuration. The results were further analyzed by meteorological season: spring (March to May), summer (June to August), autumn (September to November), and winter (December to February). We calculated the seasonal means of the quantitative metrics and generated radar plots to visually compare model performance across seasons. Finally, a series of ablation studies were performed to verify the contributions of the key components in our proposed model. We compared the performance of the full SA-CycleGAN against three variants: the original CycleGAN, CycleGAN with only the spatial–temporal attention module (Attention-CycleGAN), and CycleGAN with only the SSIM loss function (SSIM-CycleGAN). Furthermore, a sensitivity analysis was performed for different input image sequence lengths (T) to determine the optimal sequence length for the model.
To quantitatively evaluate the model performance, four metrics were calculated: the root mean square error (RMSE), the coefficient of determination ($R^2$), the peak signal-to-noise ratio (PSNR), and the SSIM. The RMSE measures the deviation between the predicted and actual values; a lower RMSE indicates smaller prediction errors. It is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ denotes the NDVI values calculated from the Sentinel-2 MSI images and $\hat{y}_i$ is the NDVI value derived from the SA-CycleGAN. $R^2$ evaluates how well the model’s outputs fit the true data; a higher $R^2$ value signifies greater consistency between the predicted and actual values. It is calculated as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$

where $\bar{y}$ is the average of the NDVI values calculated from the Sentinel-2 MSI images. The PSNR assesses the quality of the derived NDVI images; higher PSNR values indicate better image quality. It is expressed as follows:

$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_I^{2}}{\mathrm{MSE}}\right)$$

where $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (e.g., 1 for normalized NDVI) and MSE is the mean squared error between the Sentinel-2 MSI and derived images. The SSIM measures the structural similarity between the NDVI image derived by the SA-CycleGAN and the NDVI image calculated from Sentinel-2 MSI images, where higher values indicate better preservation of structural details. It is defined as consistent with Formula (5).
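The four metrics can be computed as in the sketch below; the arrays are dummy data, and the SSIM is evaluated from global statistics for brevity rather than over local windows as in Formula (5).

```python
# Sketch of the four evaluation metrics on NDVI arrays (dummy data).
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def psnr(y_true, y_pred, max_val=1.0):      # max_val = 1 for normalized NDVI
    mse = np.mean((y_true - y_pred) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

s2_ndvi = np.random.rand(256, 256)           # NDVI from Sentinel-2 MSI (reference)
sar_ndvi = np.random.rand(256, 256)          # NDVI derived by the SA-CycleGAN
print(rmse(s2_ndvi, sar_ndvi), r2(s2_ndvi, sar_ndvi),
      psnr(s2_ndvi, sar_ndvi), ssim_global(s2_ndvi, sar_ndvi))
```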
4. Discussion
Our results demonstrate that the SA-CycleGAN significantly outperforms state-of-the-art unsupervised models (DualGAN, GP-UNIT, and DCLGAN) in deriving high-fidelity NDVI from Sentinel-1 SAR images. This superiority is particularly evident in spatially heterogeneous regions, such as the Zhangbei agricultural site. As observed in
Figure 5 and the kernel density distributions in
Figure 9a, the SA-CycleGAN successfully reconstructs the bimodal distribution of NDVI values, effectively distinguishing between vegetated surfaces and bare soil. Traditional CNN-based models like DualGAN and GP-UNIT rely on local convolutions, which tend to average spatially adjacent features, leading to “mode collapse” where distinct spectral peaks are merged. In contrast, our integration of a spatiotemporal attention mechanism allows the model to capture global context and long-range dependencies, preserving sharp field boundaries and preventing the oversmoothing of fine-grained textures observed in comparison models. Similarly, in the dense tropical rainforest of Xishuangbanna (
Figure 6) and the mixed forest of HARV (
Figure 8), the SA-CycleGAN maintains structural integrity and accurate vegetation density gradients where other models fail, exhibiting severe artifacts or spectral aliasing. These findings confirm that the synergistic effect of the spatiotemporal attention module and the SSIM loss function is crucial for handling complex land cover patterns and preserving edge details in the absence of paired training data.
A critical advantage of the SA-CycleGAN is its ability to generate temporally continuous and phenologically accurate NDVI time series, acting as a robust gap-filling tool for cloudy regions. The temporal consistency analysis (
Figure 10) reveals that our model effectively filters out the noise and instability inherent in optical time series caused by residual cloud contamination. For instance, at the Zhangbei and AGRO sites, the derived NDVI trajectories closely match the upper envelope of the Sentinel-2 observations, accurately characterizing growth cycles despite fluctuations in the ground-truth data. This “robust filter” capability is further highlighted at the Xishuangbanna site (
Figure 10b), where frequent cloud cover renders optical data unreliable; the SA-CycleGAN provides a stable, high-NDVI baseline that reflects the true phenological state of the rainforest. Quantitative seasonal analysis (
Figure 11) further underscores this robustness, showing that the SA-CycleGAN maintains superior metrics (RMSE, PSNR, SSIM, and R²) across all seasons, even mitigating snow-related scattering noise during the winter months, where the performance of the other models degrades significantly.
Despite its superior performance, the SA-CycleGAN exhibits some limitations in specific scenarios. In regions with extremely high heterogeneity and subtle NDVI gradients, such as the 1E subregion of Zhangbei (
Figure 5c) and the 1D subregion of HARV (
Figure 8d), the model occasionally overestimates or underestimates vegetation density compared to Sentinel-2 data. This suggests that while the attention mechanism improves global dependency modelling, extremely fine-scale texture variations in mixed forests or sparse vegetation might still be challenging to fully resolve without higher-resolution SAR input. Additionally, although the model shows resilience to winter snow cover, performance metrics do decrease slightly compared to summer seasons (
Figure 11), indicating that snow scattering mechanisms introduce complex non-linearities that are harder to translate. Future work will focus on integrating physical scattering models or multi-frequency SAR data to better handle these extreme conditions and further improve fine-scale texture preservation.