Multi-temporal Sentinel-1 and -2 Data Fusion for Optical Image Simulation


Abstract-In this paper, we present optical image simulation from synthetic aperture radar (SAR) data using deep learning based methods. Two models, i.e., optical image simulation directly from SAR data and from multi-temporal SAR-optical data, are proposed to test the feasibility. The deep learning based methods that we chose to realize the models are a convolutional neural network (CNN) with a residual architecture and a conditional generative adversarial network (cGAN). We validate our models using Sentinel-1 and -2 datasets. The experiments demonstrate that the model with multi-temporal SAR-optical data can successfully simulate the optical image, whereas the model with only SAR data as input fails. The optical image simulation results indicate the feasibility of SAR-optical information blending for subsequent applications such as large-scale cloud removal and optical data temporal super-resolution. We also investigate the sensitivity of the proposed models to the training samples, and reveal possible future directions.

I. INTRODUCTION
The optical data provided by Sentinel-2 comprise 13 spectral bands spanning the visible, near-infrared, and shortwave infrared spectrum, with a 5-day revisit time at the equator [1]. Sentinel-2 is useful in time-series analyses such as land cover change and damaged area detection. Change analysis using optical data assumes that all investigated images are cloud-free so that every pixel in the image can be classified, which is often not possible, especially over the cloudy areas of the Earth, where usually only one low-cloud image is available roughly every month. Some researchers have reportedly used data from adjacent months (previous or next) to composite the data corrupted by clouds [2], [3]. However, these methods remove only small clouds and also ignore the changes between monthly data. In order to overcome these serious limitations, it is necessary to combine other remote sensing data sources and conduct multi-source data fusion to predict cloud-free Sentinel-2 images.
The last few decades have witnessed a rapid growth in SAR data. SAR data captured by Sentinel-1 exhibit totally different characteristics from those of optical data: SAR has the ability to provide routine, day-and-night, all-weather observation, and can overcome various kinds of bad weather conditions such as clouds, rain, smoke, and fog [4]. In particular, it is expected to provide near-daily coverage over Europe and Canada [4]. Therefore, one obvious question arises: can we use SAR data to predict the optical image?
Recently, many researchers have contributed to the information fusion of SAR and optical images with different motivations. Researchers [4] have adopted the Intensity-Hue-Saturation (IHS) transform to integrate hyperspectral and topographic SAR data into a single image to enhance urban surface features. Some groups [5] have tried to remove the speckle noise from SAR data via the fusion of the two data sources. Reportedly [6], SAR and optical data can be matched by deep learning methods to generate a SAR-like image for the generation of precise ground control points.
Many researchers conduct the fusion of SAR and optical data to produce an intermediate image [6] or final application results. Whether the SAR data can be directly translated to optical data still remains an open question. In this work, we aim to investigate the possibility of optical data simulation from SAR data. We have adopted deep learning based methods for the optical simulation for the following three reasons. First, a deep CNN can efficiently capture image characteristics. Second, several effective techniques have been proposed for training CNNs, such as batch normalization (BN) [7], residual networks (ResNets) [8], and the rectified linear unit (ReLU) [9]; recently, the generative adversarial network (GAN) has been proposed and demonstrated to be useful in data generation. Third, a deep architecture can be accelerated by graphics processing units (GPUs).
We chose two methods, a CNN with ResNets and a cGAN [10], to complete the task. Equipped with these state-of-the-art data simulation algorithms, we investigated the possibility of optical image simulation from single SAR imagery acquired at a similar period, and from multi-temporal SAR-optical images (SAR imagery with side information from a previous or subsequent pair of SAR and optical images). Our experiments on Sentinel-1 and -2 data demonstrate the necessity of using multi-temporal images as input and the effectiveness of the cGAN.

II. METHODOLOGY

A. Problem Formulation
The purpose of this work is to simulate an optical image using either a single SAR image or multi-temporal SAR-optical images, as outlined in Fig. 1. Task A is the simulation of the optical image (O2) directly from the corresponding SAR image (S2). Task B is the simulation from the SAR image (S2) combined with the additional information from the previous pair of SAR and optical data (S1 and O1); Task B is also referred to as multi-temporal fusion based optical image simulation. CNN and cGAN are adopted to complete the simulation tasks, and the details of the investigated methods are presented in the subsequent sections.
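To make the two tasks concrete, they can be written as the following mappings (this formalization and its symbols are ours, not the paper's):

```latex
% f_A and f_B denote the learned generators; \hat{O}_2 is the simulated
% optical image at time T2 (assumed notation).
\hat{O}_2 = f_A(S_2)            \qquad \text{(Task A: single-date SAR-to-optical)}
\hat{O}_2 = f_B(S_2, S_1, O_1)  \qquad \text{(Task B: multi-temporal fusion)}
```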

B. cGAN
Conditional GAN (cGAN) is extended from GAN [11] and deep convolutional GAN (DCGAN) [12], and describes a minimax game between a generative model G and a discriminative model D. The generator G is trained to map the input image x and random noise z to the output image y: G : x, z → y. The discriminator D is trained to distinguish the fake image G(x, z) from the real image y. The adversarial process of the cGAN is as follows. The discriminator D tries to recognize the realistic input-real pairs as 1, i.e., D(x, y) = 1, and to detect the simulated input-fake pairs as 0, i.e., D(x, G(x, z)) = 0. From the other perspective, the generator G tries to generate a fake image that fools the discriminator D, i.e., to push D(x, G(x, z)) toward 1. If the discriminator D cannot distinguish between input-real and input-fake pairs, then the fake image generated by G can be regarded as the predicted optical image. The cGAN loss of this adversarial process is

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))].  (1)

Here, the log function is adopted to alleviate the insufficient gradient at the beginning of training [11]. In addition, the generator's objective is not only to fool the discriminator, but also to generate an image close to the real output y in the sense of the L1/L2 distance. To encourage less blurring, the L1 distance

L_L1(G) = E_{x,y,z}[ ||y − G(x, z)||_1 ]  (2)

is absorbed into the cGAN loss, resulting in the final objective function

G* = arg min_G max_D L_cGAN(G, D) + λ L_L1(G).  (3)

Here, the parameter λ controls the trade-off between the cGAN loss and the L1 loss.
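As a concrete illustration of objectives (1)-(3), the following is a minimal PyTorch sketch under two assumptions of ours (not the authors' code): the noise z enters G through dropout, and D ends in a sigmoid and scores the channel-wise concatenation of the input and a candidate output.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # implements the log terms of the cGAN loss (1)
l1 = nn.L1Loss()    # implements the L1 term (2)
lam = 100.0         # trade-off parameter lambda in (3)

def d_loss(D, x, y, fake):
    # D is pushed toward D(x, y) = 1 on input-real pairs and
    # D(x, G(x, z)) = 0 on input-fake pairs.
    real_pred = D(torch.cat([x, y], dim=1))
    fake_pred = D(torch.cat([x, fake.detach()], dim=1))
    return (bce(real_pred, torch.ones_like(real_pred)) +
            bce(fake_pred, torch.zeros_like(fake_pred)))

def g_loss(D, x, y, fake):
    # G tries to push D(x, G(x, z)) toward 1 while staying close to y in L1.
    fake_pred = D(torch.cat([x, fake], dim=1))
    return bce(fake_pred, torch.ones_like(fake_pred)) + lam * l1(fake, y)
```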

C. Network Architectures
This work requires simultaneous training of the generative and discriminative networks. As stated in the preceding section, the generator produces a fake image from the input data, and the discriminator tries to classify the input-fake and input-real pairs. The discriminator is first trained to improve its classification accuracy. The trained discriminator is then used to help train the generative network. The process alternates until convergence. The main architectures of the generator and discriminator are adopted from [12], with modules of the form Conv(convolution)-BN-ReLU. To keep the spatial size of the input and output images, the pooling step is left out and the stride size is set to 1. We used zero-padding to compensate for the spatial size reduction caused by the convolution kernels.
The discriminative network used is the same as the one introduced in [10], with a patchGAN to capture high frequencies with a reduced number of parameters. The generative network is illustrated in Fig. 2. In this figure, n64k7 means that the corresponding number of output features is 64 and the kernel size is 7. The input image is of size 256 × 256 × n (e.g., n = 2 for Task A when two polarimetric channels are used; n = 8 for Task B when four polarimetric channels and four spectral bands are used), and the output image is of size 256 × 256 × 4. ResNets have been demonstrated to be very useful in restoration tasks [13]. However, ResNets rely on identity shortcuts, which require the input and output features to have equal dimensions; this is inconsistent with our generator network (the number of input features is not equal to that of the output). Therefore, in the first three layers, the features are raised to 256 dimensions, followed by nine ResNet blocks. Each ResNet block is composed of modules of the form Conv-BN-ReLU-Drop-Conv-ReLU [14]. In this way, the ResNet blocks operate in the 256-dimensional feature space, and the network is concluded by three layers that reduce the feature dimension to 4. In particular, the Tanh function is adopted instead of ReLU at the output layer, as reported in [12].
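The description above translates into roughly the following PyTorch sketch of the generator. The per-layer feature counts and kernel sizes beyond the stated n64k7 first layer are our reading of Fig. 2, so treat this as an assumption rather than the exact network.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """One residual block in the Conv-BN-ReLU-Drop-Conv-ReLU order stated
    above [14]; stride 1 and zero padding preserve the spatial size."""
    def __init__(self, dim=256, p_drop=0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),  # the dropout that stands in for the noise z
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)  # identity shortcut in the 256-dim space

def build_generator(in_ch=8, out_ch=4, n_blocks=9):
    """Three layers raise the features to 256 dims (first layer n64k7),
    then nine ResNet blocks, then three layers reduce back to 4 bands."""
    relu = lambda: nn.ReLU(inplace=True)
    layers = [
        nn.Conv2d(in_ch, 64, kernel_size=7, padding=3), nn.BatchNorm2d(64), relu(),
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), relu(),
        nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.BatchNorm2d(256), relu(),
    ]
    layers += [ResnetBlock(256) for _ in range(n_blocks)]
    layers += [
        nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), relu(),
        nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), relu(),
        nn.Conv2d(64, out_ch, kernel_size=7, padding=3),
        nn.Tanh(),  # Tanh instead of ReLU at the output layer, as in [12]
    ]
    return nn.Sequential(*layers)
```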

D. Implementation Issues
Previous works on GANs have demonstrated the importance of using Gaussian noise as input to the generative network. In this work on cGAN, the input is x and the noise is absorbed into the dropout part, which can also produce reasonable results [10]. In our experiments, the dropout rate is set to 0.5. Mini-batch stochastic gradient descent with the Adam solver is adopted to train each model. The model is trained for 200 epochs with a batch size of 1 and a learning rate of 0.0002. As suggested in the literature [15], λ in the loss objective (3) is set to 100 to encourage both reconstruction accuracy and object sharpness simultaneously.

TABLE II
THE TRAINING AND TEST PATCHES PROVIDED BY THE IMAGES

        Iraq    Jianghan    Xiangyang
Train   561     1188        754
Test    99      165         None
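Under the stated settings (Adam, learning rate 0.0002, batch size 1, 200 epochs, dropout 0.5, λ = 100), one alternating training step might look as follows. G, d_loss, and g_loss refer to the sketches above; the discriminator, the data loader, and the Adam momentum terms shown here are simplified placeholders of ours, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Placeholder patchGAN-style discriminator: scores the concatenated
# (input, candidate-output) pair per patch; a stand-in for the network
# of [10], which is not reproduced here.
D = nn.Sequential(
    nn.Conv2d(8 + 4, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(64, 1, kernel_size=4, stride=1, padding=1),
    nn.Sigmoid(),
)
G = build_generator(in_ch=8, out_ch=4)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

# Dummy loader for illustration: one (input, reference O2) patch pair.
loader = [(torch.randn(1, 8, 256, 256), torch.randn(1, 4, 256, 256))]

for epoch in range(200):
    for x, y in loader:
        fake = G(x)  # dropout stays active at training time: it supplies z

        opt_d.zero_grad()  # 1) train D on input-real vs. input-fake pairs
        d_loss(D, x, y, fake).backward()
        opt_d.step()

        opt_g.zero_grad()  # 2) train G to fool the updated D (+ L1 term)
        g_loss(D, x, y, fake).backward()
        opt_g.step()
```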

III. EXPERIMENTS

A. Dataset
Sentinel-1 and -2 data were adopted in our experiments to confirm the possibility of optical simulation from SAR data. The data were pre-processed and co-registered with the Sentinel Application Platform (SNAP) software provided by ESA [16]. We processed the SAR images with the workflow of calibration, despeckling, and Range Doppler terrain correction, and obtained two bands of VV/VH intensities with a pixel spacing of 10 m. For the Sentinel-2 data, we chose 4 bands (R-G-B-NIR) with a ground sampling distance of 10 m for the experiments. The SAR and optical images were co-registered by reprojection. SAR and optical data pairs from three areas (Iraq, Jianghan, and Xiangyang) were used in the experiments. The acquisition time for each image is presented in Table I. The absolute difference in acquisition time between S1 and O1 (or S2 and O2) is ensured to be less than 5 days. The images from Iraq, Jianghan, and Xiangyang are of size 8460 × 5121, 10657 × 8659, and 6801 × 7651 pixels, respectively. An earthquake occurred in the Iraq area between times T1 and T2, which caused many changes in the terrain; the images from Jianghan and Xiangyang, two similar areas of China, were sensed simultaneously. The O2 images of the data pairs from each of these areas are presented in Fig. 3.

B. Training and Test Setup
The images were segmented into non-overlapping patch pairs of spatial size 256 × 256 (a tiling sketch is given below). The training data were then selected from the areas inside the red rectangles, and the test data from the blue ones. The numbers of training and test patches are summarized in Table II. Models specific to each test area were trained on the respective training dataset, as detailed in the two cases that follow the sketch.
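For reference, the following sketch shows how such non-overlapping 256 × 256 patches can be cut from a co-registered scene (the function and variable names are ours). As a sanity check, the 10657 × 8659 Jianghan image yields 41 × 33 = 1353 patches, consistent with the 1188 + 165 train/test split in Table II.

```python
import numpy as np

def to_patches(img, size=256):
    """Tile an (H, W, C) array into non-overlapping size x size patches,
    dropping the ragged border that does not fill a whole patch."""
    h, w = img.shape[:2]
    patches = [img[i:i + size, j:j + size]
               for i in range(0, h - size + 1, size)
               for j in range(0, w - size + 1, size)]
    return np.stack(patches)

# Example: paired patches for Task B from co-registered rasters
# sar1, sar2 (H x W x 2) and opt1, opt2 (H x W x 4):
# x = np.concatenate([to_patches(a) for a in (sar2, sar1, opt1)], axis=-1)
# y = to_patches(opt2)
```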
Case 1) The test patches were taken from the Iraq image. The optical image simulation results of Tasks A and B were verified with different models, i.e., CNN and cGAN (the generative model described in Fig. 2, trained without and with the adversarial loss) for Task A, and their multi-temporal counterparts MTCNN and MTcGAN for Task B.

Case 2) The test patches were taken from the Jianghan image. In this case, we performed only Task B, and trained the MTCNN and MTcGAN models with four different training sets. For the first three sets, the samples were selected from the training parts of the Jianghan, Iraq, and Xiangyang images, respectively; in this way, the simulated optical images of the MTCNN and MTcGAN methods can be compared across different training sets. We also merged all training patches together to formulate the fourth, final training set. The experimental studies conducted are thus expected to reveal the influence of different training sets on the final optical image simulation.

C. Evaluation Index
In this paper, three evaluation indices, the peak signal-to-noise ratio (PSNR), the structural similarity (SSIM), and the mean spectral angle (MSA), were used to assess the quality of the simulated optical image. For a multispectral image, we calculated the PSNR and SSIM values of each band between the simulated optical image and the reference image, and took the average [17].
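The three indices can be computed as in the following sketch, assuming scikit-image for the per-band PSNR and SSIM; the function names, the angle unit (radians), and the (H, W, B) array layout are our assumptions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def mean_psnr(sim, ref, data_range=1.0):
    # PSNR per band, averaged over the B bands of an (H, W, B) image
    return np.mean([peak_signal_noise_ratio(ref[..., b], sim[..., b],
                                            data_range=data_range)
                    for b in range(ref.shape[-1])])

def mean_ssim(sim, ref, data_range=1.0):
    # SSIM per band, then averaged
    return np.mean([structural_similarity(ref[..., b], sim[..., b],
                                          data_range=data_range)
                    for b in range(ref.shape[-1])])

def msa(sim, ref, eps=1e-12):
    # mean angle (in radians) between per-pixel spectral vectors
    dot = np.sum(sim * ref, axis=-1)
    norms = np.linalg.norm(sim, axis=-1) * np.linalg.norm(ref, axis=-1)
    return np.mean(np.arccos(np.clip(dot / (norms + eps), -1.0, 1.0)))
```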

IV. RESULTS
The training was carried out on a single GTX 1080 GPU. The PSNR, SSIM, and MSA values of the different simulation results for Case 1 are listed in Table III. The values of the three indices for the input O1 data are regarded as the baseline for the other sets. The best value for each quality index in the table is shown in bold. Table III shows that CNN and cGAN achieve worse values than the baseline for all three quality indices. This essentially means that Task A, the optical image simulation from single SAR imagery, fails to predict the image. On the other hand, the Task B methods, i.e., MTCNN and MTcGAN, achieve better values than the baseline for each index. These results indicate that, compared to the input images, MTCNN and MTcGAN can successfully simulate the optical images. Furthermore, the better index values of MTcGAN compared to MTCNN suggest the advantage of the adversarial network in our simulation task. Fig. 4 shows several patches of the inputs S2 and O1 and the reference output O2, compared with our optical image simulation results. In Fig. 4, MTCNN and MTcGAN demonstrate much better results than CNN and cGAN from a visual perspective. In fact, since SAR and optical images are totally different from each other, it is extremely difficult to learn a direct mapping between the two. The optical simulation results of Task A presented in Fig. 4 are hence blurred, and one cannot distinguish objects in these simulated images. However, with the multi-temporal fusion based optical simulation of Task B, one can learn the change information between S1 and S2, and then reconstruct the change on the basis of the O1 image. As illustrated in the red rectangle of Fig. 4, a change has happened between O1 and O2. Our goal is to transfer this change from the SAR images to the optical image, and to reconstruct it in the simulated O2 image. Following this strategy, the complexity of Task B is significantly reduced, and the optical image can be successfully simulated.
We also investigated the influence of different training samples on the final optical image simulation results of the MTCNN and MTcGAN methods in Case 2. Unfortunately, accumulating the training sets fails to solve the problem, as illustrated by the fact that the simulation results obtained with all training data together are worse than those obtained with the Jianghan data only.
The experimental part is thus concluded with the verification of two hypotheses. First, the multi-temporal fusion based optical simulation of Task B is valid and effective. Second, the GAN based method produces better results than the CNN. However, the corresponding simulation results are not perfect. As illustrated in the red rectangle of Fig. 4, MTcGAN, the best-performing method, can only simulate a blurred version of the changed object compared to the reference. Additionally, the models are sensitive to the training samples; if the training samples are not appropriate, the trained model may produce fake results.

V. CONCLUSION
In this paper, we have investigated the possibility of optical image simulation from single SAR imagery and from multi-temporal SAR-optical images. Two deep learning based methods have been designed for these tasks, i.e., a CNN with ResNets and a cGAN. We tested our models on Sentinel-1 and -2 datasets and drew the following conclusions. First, multi-temporal data fusion based optical image simulation can successfully generate optical images. The simulated optical images show more similarity to the reference optical images, both in visual and quantitative evaluations, than the input SAR and optical images. Second, the adversarial network has proved useful and effective in our task.
Despite the satisfactory performance of the multi-temporal fusion model with the cGAN method, there is still much room for improvement. The simulated optical images, especially in the parts that changed between S1 and S2, are blurred and need improvement. The selection of training samples is also a major concern for our model, since without proper samples the models may create fake optical images. Finally, our model uses information from only two time periods, i.e., T1 and T2, and it may be possible to use more to obtain better simulation results.