1. Introduction
Sequential optical satellite images provide valuable observations of temporal and spatial changes for large-scale ground surveys, making them essential data sources for monitoring urban development, disasters, agriculture, and more. However, the availability of optical data is often limited due to its high susceptibility to weather conditions. Spatiotemporal fusion [1] and spatiotemporal-spectral fusion [2] can partially mitigate the impact of adverse weather by predicting missing images based on available optical data. Nevertheless, obtaining low-resolution reference optical images during prolonged rainy seasons remains difficult. In contrast, synthetic aperture radar (SAR) imaging, which utilizes microwaves to penetrate clouds, facilitates all-weather monitoring and significantly broadens the scope of remote sensing. Since SAR images are not easily interpreted, converting them into optical images can enhance their usability for downstream applications. Additionally, synthesized optical images can be employed to restore optical images that are partially obscured by clouds or haze.
Solutions have emerged that formulate one-to-one SAR-to-optical image translation as a conditional image generation problem. The Generative Adversarial Network (GAN) framework is commonly used in these works [3,4,5,6,7], where the generator follows an encoder–decoder design: convolutional neural networks (CNNs) perform feature extraction, recombination, and deconvolution, while the discriminator improves the training quality of the generator. Besides GANs and CNNs, some works have employed vision Transformers and diffusion models [8], leveraging attention mechanisms and data-distribution learning, respectively.
Sequential SAR-to-optical translation holds promise due to the steady availability of repeated SAR observations, enabling temporal understanding of land covers. To leverage temporal information, methods such as Gaussian process regression [9] and deep learning-based approaches like recurrent neural networks (RNNs) [10,11] or Transformers [12] have been explored. Unfortunately, these algorithms targeted the synthesis of time-series normalized difference vegetation index (NDVI) and cannot be used for multi-channel image generation.
In summary, there is currently no research demonstrating the feasibility of translating a SAR image sequence into an optical image sequence. The studies mentioned either performed single translations or achieved sequence translation from SAR to NDVI. To our knowledge, Peng et al. [13] conducted the only sequence translation from SAR to optical images, which requires auxiliary optical images as references and is essentially a fusion task. Clearly, direct translation from a SAR sequence to an optical sequence is challenging, as it involves modeling spatial, temporal, and spectral features simultaneously. Additionally, there is a significant lack of training data.
Multi-temporal SAR-to-optical image translation has the potential to enhance generation quality, as it harnesses the sequential properties inherent in both SAR and optical imagery. SAR images record microwave backscatter coefficients. For land covers with unchanged structures, the SAR intensity is highly correlated with atmospheric water vapor content. This water vapor content is determined by atmospheric temperature and pressure, exhibits seasonal cycles, and fluctuates with meteorological conditions. Sequential SAR imagery captures the overarching trends of seasonal changes while also dynamically resolving short-term meteorological variations. These identified trends inform the decoder to generate optical images corresponding to specific seasons and weather conditions, thus enhancing the quality of the generated output. In contrast, a single image can capture neither climatic dynamics nor seasonal trends.
This work aims to translate sequential SAR images into sequential optical images. Our method takes multiple consecutive SAR images as input and produces temporally aligned optical images. The model employs a diffusion framework to learn the data distribution of optical images. A Transformer architecture is utilized to model the temporal and spatial relationships between image patches. The process is conducted in a latent space using a variational autoencoder (VAE) to optimize memory usage. A conditional branch is designed to encode sequential SAR images, facilitating the synchronization of features between SAR and optical images. Additionally, the model incorporates the capture times of the SAR images to further guide generation. To the best of our knowledge, this work represents the first sequential SAR-to-optical image translation method that does not rely on optical images. From the perspective of satellite observation, acquiring a SAR sequence is nearly as straightforward as obtaining a single SAR image; thus, the practical value of this work remains significant despite the use of multiple SAR images.
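To make this pipeline concrete, the sketch below outlines one training step in PyTorch under our reading of the description above; the modules `vae`, `sar_encoder`, and `denoiser`, as well as the tensor shapes, are illustrative placeholders rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def training_step(vae, sar_encoder, denoiser, optical_seq, sar_seq,
                  capture_times, alphas_bar):
    """One illustrative training step of the latent conditional diffusion model."""
    B = optical_seq.shape[0]

    # 1. Compress optical frames into the latent space with a frozen VAE encoder.
    with torch.no_grad():
        z0 = vae.encode(optical_seq)                      # (B, T, c, h, w)

    # 2. Encode the SAR sequence in the conditional branch.
    cond = sar_encoder(sar_seq)                           # (B, T, c, h, w)

    # 3. Forward process: sample a step t and corrupt z0 with Gaussian noise.
    t = torch.randint(0, alphas_bar.shape[0], (B,), device=z0.device)
    eps = torch.randn_like(z0)
    a = alphas_bar.to(z0.device)[t].view(B, 1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps

    # 4. The Transformer backbone predicts the noise, conditioned on the SAR
    #    features, the diffusion step, and the encoded capture times.
    eps_hat = denoiser(zt, cond, t, capture_times)

    # 5. MSE between the added and the predicted noise.
    return F.mse_loss(eps_hat, eps)
```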
To achieve this goal, sequential SAR–optical datasets have been constructed for the first time. The SAR images are sourced from Sentinel-1, while the optical images are obtained from Sentinel-2. The scenes are located in Australia and France, yielding two datasets named SeqSen12-FCA and SeqSen12-Paris, respectively. The sequences cover a period of up to one year. Our datasets can also be utilized for single SAR-to-optical image translation, a setting that is also tested in this study.
The contributions of this work are summarized below.
- (1) It is the first study to generate multi-temporal optical images from a sequence of SAR images.
- (2) It creates the first dataset for the conversion between sequential SAR images and sequential optical images.
- (3) It proposes a conditional diffusion model to address the challenge of sequential SAR-to-optical generation.
The rest of the paper is organized as follows. Section 2 introduces related works and the principle of the conditional diffusion model. Section 3 presents the proposed methodology. Section 4 presents the experimental plans, and Section 5 provides experimental results, comparing state-of-the-art methods with the proposed method. Section 6 discusses the potential for further performance improvements. Section 7 draws conclusions based on the findings.
2. Related Work
This section surveys recent progress in SAR-to-optical image translation, with specific emphasis on sequence translation techniques. For single-image translation, it discusses a range of GAN-based, Transformer-based, and diffusion-based methods. For sequence translation, RNN- and Transformer-based approaches are reviewed. Furthermore, the review delves into conditional diffusion models, which are used in the next section for hybrid modeling.
2.1. Single SAR-to-Optical Translation
In recent years, several studies have addressed supervised SAR-to-optical image translation. These studies treated SAR images as ordinary images, framed the translation as a style transfer problem, and solved it with the encoder–decoder framework. The network structures, loss functions, and training strategies are deliberately designed in these works.
Most studies use the Generative Adversarial Network (GAN) framework to constrain the spectral accuracy of generated optical images. Wang et al. [4] proposed a bidirectional consistent adversarial network to synthesize optical images from input SAR images. Fu et al. [5] proposed a cascaded residual adversarial network for SAR-to-optical image conversion, which introduces cascaded residual connections and a GAN loss. Tan et al. [14] designed two GAN models, one responsible for denoising SAR images and the other for coloring them. Zhang et al. [15] introduced gradient information from SAR images, together with texture information such as contrast, uniformity, and correlation based on a gray-level co-occurrence matrix, into the generator to improve structural similarity. Guo et al. [16] proposed an edge-preserving constraint for SAR-to-optical image translation, which enhances the structure of the translated image using the edge information of SAR images. Yang et al. [17] utilized multi-scale receptive fields and a chromatic aberration loss to improve translation ability. Li et al. [18] proposed a multi-scale generative adversarial network based on wavelet feature learning. Wei et al. [19] proposed a GAN that combines cross-fusion inference and wavelet decomposition to preserve image structural details and enhance high-frequency band information.
The latest modeling techniques for translation include vision Transformers, diffusion models, and physics-guided explainable models. Wang et al. [6] proposed a hybrid GAN that combines a CNN and a vision Transformer. Kong et al. [20] proposed an encoder–decoder generator based on the Swin Transformer. Recently, Bai et al. [8] proposed a conditional diffusion model for the conversion of SAR to optical images; built on the diffusion process, it utilizes SAR images as constraints to convert Gaussian noise into real optical images. This diffusion-based generation model outperforms current GAN-based models in generating optical images. Other applications of diffusion models in SAR-to-optical image translation can be found in [21,22]. As for interpretable models, Zhang et al. [23] designed a third-order finite difference residual block in light of thermodynamics to efficiently extract inter-domain invariant features, and Zhang et al. [24] proposed a neural partial differential equation (Taylor central difference)-based residual block to build the translation network.
2.2. Sequence Translation from SAR to NDVI
Reconstruction from time series to time series is traditionally understood as a temporal interpolation problem, and the main sequence generation methods include local interpolation, global interpolation, and Gaussian process regression. Local interpolation methods use time series within sliding time windows to infer the temporal changes in optical remote sensing images; common examples include cubic spline interpolation, adaptive regression filtering, and the Savitzky–Golay filter. Global interpolation methods recover lost information by fitting data to pre-defined functions, including Whittaker smoothing, asymmetric Gaussian fitting, and harmonic analysis based on the Fourier transform. Gaussian process regression is a supervised learning method that can learn the interrelationships between multiple datasets [9].
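As a concrete illustration of the local interpolation family, the minimal Python example below fills cloud gaps in a toy NDVI series with a cubic spline and then smooths the result with SciPy's Savitzky–Golay filter; the series values are fabricated purely for demonstration.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import savgol_filter

# Hypothetical NDVI time series; NaN marks cloud-contaminated acquisitions.
days = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
ndvi = np.array([0.31, 0.35, np.nan, 0.48, 0.55, np.nan, 0.62, 0.60, 0.58])

# Fill the gaps with a cubic spline fitted over the valid samples ...
valid = ~np.isnan(ndvi)
spline = CubicSpline(days[valid], ndvi[valid])
filled = spline(days)

# ... then suppress residual noise with a Savitzky-Golay sliding window.
smoothed = savgol_filter(filled, window_length=5, polyorder=2)
print(smoothed.round(3))
```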
Recently, deep learning-based sequence generation methods have gradually been applied to reconstruct optical remote sensing sequences from SAR sequences. Zhao et al. [10] extended the traditional CNN-RNN model and proposed a context-sequence deep learning network that maps multiple CNN sequences to RNN sequences. The network first uses multiple CNNs to extract feature information from SAR data, and then uses RNNs to establish the connection between SAR sequences and optical remote sensing sequences. Li et al. [12] proposed an end-to-end spatiotemporal fusion method based on the Transformer, which combines SAR and optical time series to reconstruct optical remote sensing sequences over cloudy areas. Roßberg and Schmitt [11] proposed an RNN model based on a Gated Recurrent Unit (GRU), which can process sequences of variable length and effectively handle missing optical remote sensing data. A generic sketch of this CNN-to-RNN pipeline is given below.
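The following sketch pairs a small per-frame convolutional extractor with a GRU that regresses per-date NDVI from a SAR sequence; it is a simplified illustration of this family of models under our assumptions, not a reproduction of any cited architecture.

```python
import torch
import torch.nn as nn

class CnnGruNdvi(nn.Module):
    """Generic CNN-GRU regressor: SAR sequence (B, T, 2, H, W) -> NDVI sequence (B, T)."""
    def __init__(self, hidden=128):
        super().__init__()
        # Per-frame CNN feature extractor (e.g., VV/VH backscatter channels).
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # (B*T, 64)
        )
        # GRU links features across acquisition dates and handles variable T.
        self.gru = nn.GRU(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, sar_seq):
        B, T, C, H, W = sar_seq.shape
        feats = self.cnn(sar_seq.reshape(B * T, C, H, W)).reshape(B, T, -1)
        out, _ = self.gru(feats)                       # (B, T, hidden)
        return self.head(out).squeeze(-1)              # (B, T)

model = CnnGruNdvi()
pred = model(torch.randn(4, 8, 2, 64, 64))             # 4 sequences, 8 dates each
print(pred.shape)  # torch.Size([4, 8])
```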
The generation of optical sequences from SAR sequences is expected to resolve the loss of temporal resolution caused by weather, but the related research has focused on generating NDVI sequences from SAR sequences. Clearly, image generation is far more difficult than NDVI prediction. Therefore, the translation from SAR image sequences to optical image sequences remains an open problem.
2.3. Conditional Diffusion Model in the Latent Space
As shown in Figure 1, the conditional diffusion model in the latent space learns the distribution of features extracted from images using an encoder. It consists of a forward process and a backward process. In the forward process, random Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ is gradually added to the reference optical features $z_0$ to generate a noisy optical feature map $z_t$. This can be denoted as
$$ q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\, \sqrt{1-\beta_t}\, z_{t-1},\; \beta_t \mathbf{I}\big), $$
where $\beta_t$ increases gradually from 0.0001 to 0.02 with the time step $t$.
By applying a reparameterization technique, the noisy feature can be represented as
$$ z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, $$
where $\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$.
The backward process involves training the conditional diffusion model to reverse the noise degradation introduced during the forward process, denoted as
$$ p_\theta(z_{t-1} \mid z_t, S) = \mathcal{N}\big(z_{t-1};\, \mu_\theta(z_t, t, S),\, \Sigma_\theta(z_t, t, S)\big), $$
where $S$ denotes the conditional SAR features, which are concatenated with the noisy optical feature $z_t$ during model training and fed into the model.
The training of the model is constrained by the mean squared error (MSE) between the noise $\hat{\epsilon}$ predicted by the network and the noise $\epsilon$ added during the forward process. To achieve accelerated sampling, the deterministic accelerated sampling strategy proposed in [25] is adopted. One sampling step of the diffusion model is then represented as
$$ z_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{z_t - \sqrt{1-\bar{\alpha}_t}\, \hat{\epsilon}(z_t, t, S)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1-\bar{\alpha}_{t-1}}\, \hat{\epsilon}(z_t, t, S). $$
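The forward noising and the deterministic accelerated sampling step can be summarized in a few lines of code. The sketch below is a self-contained illustration of these equations with a linear noise schedule; for the toy sanity check, the true noise stands in for the trained noise predictor.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # beta_t grows from 0.0001 to 0.02
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

def q_sample(z0, t, eps):
    """Forward process: z_t = sqrt(a_bar_t) * z0 + sqrt(1 - a_bar_t) * eps."""
    a = alphas_bar[t]
    return a.sqrt() * z0 + (1 - a).sqrt() * eps

def ddim_step(zt, t, t_prev, eps_hat):
    """One deterministic (DDIM-style) update from step t to step t_prev."""
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    z0_hat = (zt - (1 - a_t).sqrt() * eps_hat) / a_t.sqrt()  # predicted clean feature
    return a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * eps_hat

# Toy sanity check: with the true noise as the "prediction", one step maps the
# fully noised feature back toward a less noisy state.
z0 = torch.randn(1, 4, 32, 32)                   # stand-in for an optical latent
eps = torch.randn_like(z0)
zt = q_sample(z0, 999, eps)
z_prev = ddim_step(zt, 999, 899, eps)
```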
7. Conclusions
In this paper, we focused on translating a SAR image sequence into an optical image sequence and solved this task with diffusion networks. This work may represent the first effort in sequence translation that incorporates time-series images along with their respective capture times. Following the diffusion framework, we employed twelve Transformers as the backbone to estimate the noise in the image features, which were preprocessed using a variational autoencoder. Additionally, a conditional branch was designed specifically for SAR sequences to extract relevant features. The capture time was encoded and integrated into the Transformers.
Since the existing translation datasets were designed for single translation, two new datasets were created for sequence translation. The Sentinel-1 GRD data served as the SAR source, while the Sentinel-2 red, green, and blue (R/G/B) data acted as the optical source. Experiments were conducted on the two datasets, covering sequence translation, single translation, and ablation studies. The translation results were compared with those of three single translation algorithms. The scores indicate that when sequential SAR images were utilized, the RMSE loss decreased by 3.26% and 22.9% on the two datasets, respectively. Including capture times helped to reduce the RMSE loss by a further 0.75% and 5.01%, respectively. In terms of visual comparison, our method demonstrates superior radiometric accuracy and spectral fidelity.