Sentinel-2 Cloud Removal Considering Ground Changes by Fusing Multitemporal SAR and Optical Images

Publicly available optical remote sensing images from platforms such as the Sentinel-2 satellites contribute greatly to Earth observation and research tasks. However, information loss caused by clouds largely reduces the number of usable optical images, so reconstructing the missing information is important. Existing reconstruction methods can hardly reflect real-time ground information because they mainly use multitemporal optical images as reference. To capture real-time information in the cloud removal process, Synthetic Aperture Radar (SAR) images can serve as reference images, since SAR imaging penetrates clouds. Nevertheless, existing SAR-based cloud removal methods depend on network training and therefore require large datasets. In this paper, we integrate the merits of multitemporal optical images and SAR images into a cloud removal process, built on a simple convolutional neural network, whose results can reflect changes in ground information. Although the proposed method is based on a deep neural network, it operates directly on the target image without a training dataset. We conduct several simulation and real data experiments on cloud removal in Sentinel-2 images, using multitemporal Sentinel-1 SAR images and Sentinel-2 optical images. Experimental results show that the proposed method outperforms state-of-the-art multitemporal-based methods and overcomes the dataset constraints of SAR-based methods.


Introduction
Remote sensing platforms such as the Sentinel-2 satellites provide a large number of optical observation images, which contribute greatly to tasks such as Earth monitoring [1][2][3] and agriculture [4,5]. Nevertheless, clouds cause severe information loss, which negatively impacts the further application of remote sensing images. According to statistics [6], more than half of the Earth's surface is covered by clouds, so reconstructing the missing information caused by clouds is of great value. According to the acquisition time of the reference data used in the reconstruction, traditional cloud removal methods can be classified into three families [7]: spatial-based methods, spectral-based methods and multitemporal-based methods.
Spectral-based methods use bands with intact information as reference to reconstruct bands with missing information by establishing relationships between bands. The reconstruction of dead pixels in Aqua MODIS band 6, where 15 of the 20 detectors are non-functional, is a classical case. Several spectral-based methods [8,9] solve this problem by establishing polynomial models between bands. Generally, the results of spectral-based methods have high visual quality and accuracy, but these methods cannot handle the situation where all bands have missing information. Spatial-based methods can deal with missing information in all bands. They assume that the missing information and the remaining information share the same statistical law and geometrical structure, so the missing information can be reconstructed from the remaining information. Spatial-based methods can be further sub-divided into four sub-classes: interpolation methods [10,11], propagation diffusion methods [12][13][14], variation-based methods [15,16] and exemplar-based methods [17,18]. Spatial-based methods can reconstruct missing information in all bands, but only for missing areas of small size. Even worse, the authenticity of the reconstructed parts cannot be guaranteed because the reconstruction process is based entirely on the statistics of the remaining areas. Multitemporal-based methods use homogeneous data from other times as reference to reconstruct the missing information. They outperform the above two families in the reconstruction of large missing areas. The precondition of traditional heterochronic methods is that no ground change occurs between the two periods. Multitemporal-based methods can be further divided into three sub-classes: replacement methods [19,20], learning methods [21,22] and filtering methods [23,24]. Recently, some multitemporal cloud removal methods based on deep learning have been proposed. For example, Uzkent et al. [25] introduced deep neural networks into the multitemporal cloud removal task, training a network on a large multitemporal dataset collected by the authors. Singh and Komodakis [26] provided a high-resolution multitemporal training dataset covering several scenes and proposed Cloud-GAN, based on cycle-consistent training, to remove clouds. However, these methods cannot adequately deal with the situation where no training dataset is available. In general, the accuracy and authenticity of multitemporal methods' results can be guaranteed thanks to the reference data. However, these methods cannot actually guarantee that no ground change occurs during the selected time range.
Despite the success of the above cloud removal methods, they usually use homogeneous data as reference rather than heterogeneous data such as synthetic aperture radar (SAR) data. When ground information may have changed, SAR can reflect real-time ground information, unlike multitemporal optical data, thanks to its strong penetrability through clouds, making it excellent reference data for the cloud removal task. However, the relationship between optical and SAR images is so complex that traditional methods cannot model it due to their limited fitting ability. In recent years, many new cloud removal methods that use SAR data as reference have arisen with the development of deep learning. Owing to their strong nonlinear fitting ability, deep neural networks can better model the relationship between SAR and optical images. Generally, these new methods can be divided into isochronic and heterochronic methods.
Isochronic SAR-based methods [27][28][29][30], which use contemporary SAR data as reference, attempt to train a deep neural network to establish the relationship between SAR and optical images. They then directly map the SAR images to the cloudy areas of the optical image as the cloud removal result. For example, the authors of [31] used the pix2pix [32] model to learn the relationship between paired SAR and optical images. In [27], the authors introduced the CycleGAN [33] model to train the network with unpaired SAR and optical images, and the results show good visual quality. In [30], a deep dilated network was proposed to establish the relationship between SAR and optical images more accurately. After obtaining simulated optical images from SAR images, the authors of [29] further used the information around the corrupted areas to guide the correction of the spatial and spectral information of the simulated optical images. To remove clouds in Sentinel-2 images, the authors of [28] provided a dataset including Sentinel-2 optical images, Landsat optical images and Sentinel-1 SAR images to train the network. Despite the success of the above methods, they do not perform well in areas with dense texture, because directly mapping SAR images to optical images relies too heavily on the fitting ability of the network and the quality of the training datasets.
To relieve the difficulty of the direct transformation, some studies [34][35][36][37] proposed heterochronic SAR-based methods that further add multitemporal SAR and optical images as reference. With the pix2pix model, the authors of [34] synthesized the corrupted areas.

The main contributions of this paper are summarized as follows:

• Based on a deep neural network, we propose a novel method to remove clouds in images from the Sentinel-2 satellite using multitemporal images from the Sentinel-1 and Sentinel-2 satellites. The cloud removal results can reflect the change of ground information in the reconstructed areas between the two periods.
• Different from existing SAR-based methods, which need large training datasets, the proposed method can directly operate on a given corrupted optical image without a dataset, showing that a large training dataset is unnecessary for the cloud removal task.
• Several simulation and real data experiments are conducted with multitemporal optical and SAR images from different scenes. Experimental results show that the proposed method outperforms many other cloud removal methods and has strong flexibility across different scenes.
We organize our paper as follows: In Section 2, we present the workflow of the proposed reconstruction method. In Section 3, simulation and real data experiments are conducted with Sentinel-1 and Sentinel-2 images of different scenes to verify the effectiveness and flexibility of the proposed method. Some discussions are included in Section 4. In Section 5, we summarize the proposed method and discuss future improvements.

Problem Formulation
Some notation is given here for simplicity. We use paired SAR and optical images from two periods, $t_1$ and $t_2$. The SAR and optical images from $t_1$ are denoted by $S_1$ and $O_1$; the SAR and optical images from $t_2$ are denoted by $S_2$ and $O_2$. We assume that the target optical image to be reconstructed is $O_2$ and that the other three images serve as reference. A deep neural network, denoted by $G$, is introduced in our cloud removal process. In reconstruction with contemporary SAR data, the idea of existing methods is to train the network $G$ with paired SAR and optical images from the same time and then directly transform SAR images to contemporary optical images with the trained network:

$$G^* = \arg\min_G \mathcal{L}\big(G(S), O\big), \qquad O_2^* = G^*(S_2), \tag{1}$$

where $(S, O)$ are paired SAR-optical training samples and $\mathcal{L}$ is the training loss.
These methods perform well when dealing with very-high-resolution SAR and optical pairs [27]. When it comes to medium-resolution images such as Sentinel-1 and Sentinel-2 images, the results may not be satisfying [31]. To deal with medium-resolution remote sensing images, some reconstruction methods [34][35][36] use multitemporal optical and SAR data as reference. These methods aim to train the network $G$ with the concatenation of $S_1$, $S_2$ and $O_1$ as input and $O_2$ as target, relieving the difficulty of the direct transformation from SAR to optical images. The trained network is then applied to new data. The general training and application process is

$$G^* = \arg\min_G \mathcal{L}\big(G(S_1, S_2, O_1), O_2\big), \qquad O_2^* = G^*(S_1, S_2, O_1). \tag{2}$$

These methods cannot work when there are not enough training data, due to the adversarial training strategy and the deep network structure. Furthermore, they produce unsatisfying results on images outside the domain of the training dataset, which is very common in SAR-optical tasks. These drawbacks indicate the poor flexibility of such deep learning methods in this application.
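To make the contrast with Equation (2) concrete, the following is a minimal PyTorch sketch of one supervised training step for such a fusion network; the network `G`, the L1 loss and the tensor layout are illustrative assumptions rather than the exact setups of [34][35][36].

```python
import torch
import torch.nn as nn

# Sketch of the training scheme in Equation (2): a network G maps the
# concatenation of [S1, S2, O1] to the cloud-free target O2. The L1 loss
# stands in for the full (often adversarial) objectives of these methods.

def train_step(G: nn.Module, optimizer, S1, S2, O1, O2):
    """One supervised step on a paired sample; each tensor is (B, C, H, W)."""
    optimizer.zero_grad()
    O2_pred = G(torch.cat([S1, S2, O1], dim=1))  # channel-wise concatenation
    loss = nn.functional.l1_loss(O2_pred, O2)
    loss.backward()
    optimizer.step()
    return loss.item()

# At inference, the trained G is applied directly to new multitemporal data:
#   O2_star = G(torch.cat([S1, S2, O1], dim=1))
```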

Method for Cloud Removal
Instead of the ideal training condition, in which the images $O_2$ in the training dataset have no missing information, we deal with a more general situation: the given $O_2$ has missing information and no training dataset is available. We aim to obtain $O_2^*$, the reconstruction result of $O_2$. Here we also use $S_1$, $S_2$ and $O_1$ as reference images and introduce a deep neural network, denoted by $G$. The whole framework of our method is displayed in Figure 1a.
First, we apply a cloud detection algorithm $f_{cd}(\cdot)$ to obtain the missing-information mask $M$ from $O_2$:

$$M = f_{cd}(O_2). \tag{3}$$

Then we concatenate the images in the order $O_1$, $S_1$, $S_2$ and feed them to the network $G$ to obtain a simulated output $O_2^G$:

$$O_2^G = G(O_1, S_1, S_2). \tag{4}$$

With the cloud mask $M$ and the simulated output $O_2^G$, we use the residual (cloud-free) information of $O_2$ to construct a local optimization term that constrains the simulated output $O_2^G$:

$$\mathcal{L}_{local} = \big\| (1 - M) \odot \big(O_2^G - O_2\big) \big\|, \tag{5}$$

where $\odot$ denotes the Hadamard product. With Equation (5), we expect the network $G$ to learn the ground information change between $t_1$ and $t_2$ and to impose the real-time ground information on the expected cloud removal result $O_2^G$. The process of Equations (4) and (5) is displayed in Figure 1b.
However, the remaining information in $O_2$ is local, so it cannot fully constrain the global information and continuity of the cloud removal result. To further guarantee the authenticity of $O_2^G$ and reflect the ground information changes in it, we concatenate the images in the order $O_2^G$, $S_2$, $S_1$ and feed them again to the same network $G$ for another output $O_1^G$:

$$O_1^G = G(O_2^G, S_2, S_1). \tag{6}$$

Another optimization term is then constructed from this output $O_1^G$ and $O_1$ to constrain the global information and continuity:

$$\mathcal{L}_{global} = \big\| O_1^G - O_1 \big\|. \tag{7}$$

With Equations (6) and (7), we expect this global optimization term to further constrain the network $G$ to diffuse the knowledge of ground information change from the local areas to the cloudy areas, so that the results reflect the global ground information changes between $t_1$ and $t_2$. The process of Equation (6) is displayed in Figure 1c.
For further continuity of the cloud removal result $O_2^G$, we add a total variation term to the optimization objective. The total optimization term is

$$\mathcal{L} = \mathcal{L}_{local} + \lambda_{g}\,\mathcal{L}_{global} + \lambda_{tv}\,\mathcal{L}_{TV}\big(O_2^G\big), \tag{8}$$

where $\lambda_{g}$ and $\lambda_{tv}$ balance the three terms. Finally, unlike deep learning methods that retain the network $G$ after optimization for further application, as in Equations (1) and (2), we discard the network and retain only the final optimization output as the reconstruction result:

$$O_2^{G*} = G^*(O_1, S_1, S_2). \tag{9}$$

The whole process needs no training dataset, only the target cloudy image and the reference images. Compared with deep learning methods, our method needs no transfer learning when it encounters novel images outside the domain of a training dataset, which ensures its flexibility.
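A minimal PyTorch sketch of the three optimization terms in Equations (5), (7) and (8) follows; the L1 penalty and the weight names `lam_g` and `lam_tv` are illustrative assumptions, since the text does not fix the norm or the weights.

```python
import torch

def local_loss(O2_sim, O2, M):
    """Equation (5): fit the simulated O2 to the observed cloud-free pixels.
    M is the missing-information mask (1 = cloud), so (1 - M) selects the
    residual clear areas of O2."""
    return torch.mean(torch.abs((1 - M) * (O2_sim - O2)))

def global_loss(O1_sim, O1):
    """Equation (7): the re-projected O1_sim = G([O2_sim, S2, S1]) should
    match the intact reference image O1 everywhere."""
    return torch.mean(torch.abs(O1_sim - O1))

def tv_loss(x):
    """Anisotropic total variation term for spatial continuity."""
    dh = torch.abs(x[..., 1:, :] - x[..., :-1, :]).mean()
    dw = torch.abs(x[..., :, 1:] - x[..., :, :-1]).mean()
    return dh + dw

# Equation (8), with illustrative weights lam_g and lam_tv:
# total = local_loss(...) + lam_g * global_loss(...) + lam_tv * tv_loss(O2_sim)
```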

Network Structure
In the proposed method, we use a simple network with five consecutive blocks. The first and fourth blocks each contain a convolution layer, a batch normalization layer and a non-linear activation layer. The second and third blocks are residual blocks, whose structure is displayed in Figure 1: after being processed by two convolution and batch normalization stages, the input feature map is added to the output feature map to form the block output. The residual blocks serve to avoid the vanishing gradient problem. The final block has a convolution layer and a non-linear activation layer that produces the reconstruction result. The overall network design is presented in Figure 1.
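The following PyTorch sketch illustrates a network of this shape; the channel width, kernel size, input layout (three RGB bands plus two dual-band SAR images) and final activation are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block from Figure 1: two conv+BN stages, input added to output."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class CloudRemovalNet(nn.Module):
    """Five consecutive blocks as described in the text; width 64, 3x3 kernels
    and the final Sigmoid are illustrative choices."""
    def __init__(self, in_ch=7, out_ch=3, ch=64):  # 3 RGB + 2x2 SAR bands
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            ResidualBlock(ch),
            ResidualBlock(ch),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, out_ch, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```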

Data Introduction
From the Sentinel-1 and Sentinel-2 satellites, we collected eight groups of SAR and optical images from two periods for the simulation experiments, and two groups of SAR and optical images from two periods for the real data experiments. The gap between the two periods is less than one year. Within each period, the gap between the acquisition times of the SAR and optical images is constrained to within a week, to ensure that SAR and optical images from the same period share the same ground information. The SAR images were acquired in IW mode and have two bands, VV and VH. We preprocessed the SAR images with debursting, calibration and terrain correction, and finally resampled them to a unified spatial resolution of 10 m. The optical images have a resolution of 10 m, and we used only the R, G and B bands in the experiments for simplicity. We cropped patches of size 2000 × 2000 covering the same areas from the optical and SAR images of the two periods. The acquisition times of these images are listed in Table 1. For the simulation experiments, we collected cloud-corrupted optical images from Sentinel-2 and used the Fmask 4.3 algorithm to extract the masks of clouds and their shadows. We then randomly cropped 2000 × 2000 patches from these cloud masks and selected eight patches with around 30% cloud coverage, together with their corresponding optical patches. In each of the eight groups of optical and SAR images, we selected one optical image and replaced the information in some areas with the corresponding areas of the cloudy optical patch to simulate a corrupted optical image. For the real data experiments, we directly used the Fmask 4.3 algorithm to detect the clouds and their shadows in the image to produce the cloud mask.
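The simulation protocol can be summarized in a short NumPy sketch; the array layout and the function name are illustrative.

```python
import numpy as np

# Paste cloud-covered pixels from a real cloudy Sentinel-2 patch into a
# clean target image, guided by an Fmask cloud/shadow mask.
# Image arrays are (H, W, 3); cloud_mask is (H, W) with 1 = cloud or shadow.

def simulate_corruption(clean_img, cloudy_patch, cloud_mask):
    corrupted = clean_img.copy()
    corrupted[cloud_mask == 1] = cloudy_patch[cloud_mask == 1]
    return corrupted

# A coverage check like the one used to filter the eight ~30% patches:
# coverage = cloud_mask.mean(); keep the patch if coverage is near 0.3.
```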

Evaluation Methods
We used four popular evaluation indexes to assess the reconstruction results: peak signal-to-noise ratio (PSNR), Structural Similarity index (SSIM), Spectral Angle Mapper (SAM) and Correlation Coefficient (CC). Together they evaluate the spectral and spatial fidelity of the cloud removal results. For PSNR, SSIM and CC, a higher score indicates a better result; for SAM, a lower score indicates a better result.
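For reference, the following NumPy sketch computes PSNR, SAM and CC from their standard definitions; SSIM is typically taken from an existing implementation such as scikit-image. The function names and the assumption of images scaled to [0, 1] are ours.

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio; images scaled to [0, peak]."""
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def sam(x, y, eps=1e-8):
    """Mean spectral angle (radians) between per-pixel spectra, (H, W, C)."""
    dot = np.sum(x * y, axis=-1)
    norm = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    return np.mean(np.arccos(np.clip(dot / (norm + eps), -1.0, 1.0)))

def cc(x, y):
    """Pearson correlation coefficient over all pixels and bands."""
    return np.corrcoef(x.ravel(), y.ravel())[0, 1]

# SSIM, e.g.: from skimage.metrics import structural_similarity
```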

Implementation Details
We conducted our experiments under the PyTorch 1.0 framework with one RTX 2080Ti GPU. The reconstruction result was obtained by optimizing the output of the network with the Adam optimizer. The learning rate was set to 0.001 and the number of optimization epochs was set to 2000.
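Combining the pieces above, the single-image optimization loop can be sketched as follows; `G` and the loss functions refer to the earlier sketches, and the loss weights `lam_g` and `lam_tv` are illustrative assumptions.

```python
import torch

# Single-image optimization with the stated settings (Adam, lr = 0.001,
# 2000 epochs). local_loss, global_loss and tv_loss are the earlier sketches.

def reconstruct(G, O1, S1, S2, O2, M, lam_g=1.0, lam_tv=0.1, epochs=2000):
    opt = torch.optim.Adam(G.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        O2_sim = G(torch.cat([O1, S1, S2], dim=1))      # Equation (4)
        O1_sim = G(torch.cat([O2_sim, S2, S1], dim=1))  # Equation (6)
        loss = (local_loss(O2_sim, O2, M)
                + lam_g * global_loss(O1_sim, O1)
                + lam_tv * tv_loss(O2_sim))             # Equation (8)
        loss.backward()
        opt.step()
    with torch.no_grad():                               # Equation (9)
        return G(torch.cat([O1, S1, S2], dim=1))
```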

Simulation Experiment Results
The experiment results are displayed in Figure 2. Figure 2a,b presents the simulated optical image with missing information and the multitemporal optical image from the adjacent period. Figure 2c,d shows the SAR images obtained at the same times as the optical images in Figure 2a,b. Figure 2e is the cloud mask. Figure 2f−h displays the reconstruction results of AWTC, MNSPI and WLR, respectively. Figure 2i is the result of the proposed method and Figure 2j displays the ground truth of the corrupted optical image. We selected two areas and magnified them in Figures 3 and 4.

In Figure 3, the area boxed in yellow is mainly a town area. In the ground truth image, the spectral information of the town area differs distinctly from that of the surrounding agricultural area. However, in the reference image $O_1$, the town area and the surrounding agricultural area share similar spectral information and are easy to confuse. Consequently, in the results of WLR and MNSPI, the boxed town area is reconstructed as agricultural land, because these methods can only take $O_1$ as the reference image. Even worse, AWTC obtains only blurry results in the gap areas. The proposed method, on the other hand, distinguishes the town area from the surrounding agricultural area, and its cloud removal result has accurate spectral information thanks to the reference images $S_1$ and $S_2$. This example indicates that the reconstruction process of traditional methods is mere information cloning from optical images of adjacent periods and does not take ground information change into consideration. Therefore, traditional multitemporal-based methods may not be practical in real applications. Our method can reflect the ground information changes visible in the two SAR images from the two periods.

Another example is the area displayed in Figure 4. The area boxed in red is a town area, while the surrounding area boxed in yellow is agricultural. In the ground truth image, the town area and the agricultural area have clearly different spectral information, while in the reference image $O_1$ the two areas share the same spectral information. WLR reconstructs the town area as agricultural land, and the agricultural area in the result of MNSPI has wrong spectral information. The result of AWTC is blurry because AWTC cannot deal well with large images. The proposed method clearly distinguishes the town area from the agricultural area, and its result has true spectral information.

Table 2 lists the quantitative evaluation of our method and the three traditional multitemporal-based methods. We mark the highest score of each index in bold and the second highest with an underline. Our method obtains the highest scores on all four evaluation indexes. The simulation experiments indicate that the introduction of auxiliary multitemporal SAR images is of vital importance for reflecting ground changes in the results.

Real Experiment Results
Results of the real data experiments are displayed in Figure 5. Figure 5a−d shows, respectively, the corrupted optical image, the reference optical image and the SAR images from the two periods. Figure 5e−g displays the results of AWTC, MNSPI and WLR. The result of our method is presented in Figure 5h. We magnify a selected area of Figure 5 in Figure 6 for further observation.
We can observe from the two SAR images that the ground information in the area boxed in red has changed between the two periods. Because no ground truth image is available in this set of experiments, we cannot directly compare the correctness of each method. However, in the reference images, the two areas marked by the yellow boxes share the same ground information. In the results of MNSPI and WLR, the ground objects in the two yellow boxes are different. AWTC again cannot deal with large images and obtains blurry results. The proposed method, on the other hand, produces ground objects with the same spectral information in the two yellow boxes, supporting the accuracy of our cloud removal result.

Discussion
To further explore the function of each type of reference data and each module of our method, we conduct ablation experiments on the proposed method. We then discuss the efficiency of our method.

Ablation Study about Loss Function
In this section, we analyze the contribution of each loss function in the proposed method using the eight groups of multitemporal SAR-optical images from the simulation experiments. The model containing all three loss functions is set as the baseline. We denote the model without the total variation term as W/O(TV) and the model without the global loss function as W/O(global). The local loss function contributes the bulk of the cloud removal result: without it, no result can be obtained at all, so we cannot ablate the local loss term. Table 3 lists the quantitative evaluation of the ablation models, with the highest scores marked in bold. The evaluation results indicate that all loss functions contribute positively to the proposed method. Figure 7 displays the cloud removal results of the proposed method and the two ablation models: Figure 7a displays the cloudy image; Figure 7b,c shows the results of the models W/O(global) and W/O(TV), respectively; Figure 7d,e shows the result of the full model and the ground truth. We select an area from Figure 7 and display it in Figure 8. It can be seen from Figure 8 that the global loss function contributes greatly to relieving overfitting: without it, the model may generate nonexistent ground information, boxed in yellow. Although the TV loss function contributes less than the global loss function, it also helps to relieve overfitting; Figure 8c shows that without the TV loss function some artifacts remain in the cloud removal result.
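In terms of the earlier sketch, each ablation variant simply zeroes one weight in Equation (8); the weight names are illustrative.

```python
# Ablation variants as loss-weight configurations (names are ours).
configs = {
    "baseline":    dict(lam_g=1.0, lam_tv=0.1),  # all three loss terms
    "W/O(TV)":     dict(lam_g=1.0, lam_tv=0.0),  # drop the total variation term
    "W/O(global)": dict(lam_g=0.0, lam_tv=0.1),  # drop the global loss term
    # The local term cannot be dropped: without it the optimization is
    # unconstrained on the cloudy image and no result is obtained.
}
```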

Ablation Study about Reference Data
In this part, we further analyze the function of the reference images used in the framework with the eight groups of multitemporal SAR and optical images. For fairness, we set the model with only the local loss function as the baseline, because the construction of the global loss function needs all the reference images; with the global loss function we could not compare the contribution of each reference image.
The model without $S_1$ and $S_2$ is denoted as W/O($S_1$&$S_2$); we denote the model without $S_2$ as W/O($S_2$) and the model without $O_1$ as W/O($O_1$). Figure 9 displays the results of the baseline model and the three ablation models. Figure 9a−d shows, respectively, the multitemporal SAR and optical images. Figure 9e is the cloud mask. Figure 9f−i displays the results of the three ablation models and the baseline model. Figure 9j is the ground truth image. We select an area from the above images and display it in Figure 10. It can be observed from the two SAR images that an obvious ground information change, boxed in red, occurs between the two periods. The results of the models W/O($S_2$) and W/O($S_1$&$S_2$) cannot reflect this ground information change because their reference images do not contain the ground information of $t_2$. The result of the model W/O($O_1$) reflects the ground information change and is satisfying in terms of visual evaluation. However, the shape of some ground objects, boxed in yellow, differs from the ground truth. The reason may be that W/O($O_1$) retains the corresponding spatial information of $S_2$ in the reconstructed areas of $O_2^G$. As is well known, the shapes of objects in SAR images and optical images differ due to their different imaging processes; W/O($O_1$) does not take this spatial information difference into consideration, so the spatial information of some reconstructed areas is warped. The baseline model, with the aid of all reference images, obtains results whose ground information has accurate shape, outperforming all ablation models. We also evaluate the baseline and the three ablation models quantitatively in Table 4, with the highest scores marked in bold and the second highest underlined. The baseline model achieves the highest scores on all four indexes and outperforms the three ablation models by a large margin. The ablation study indicates that all the reference images, $O_1$, $S_1$ and $S_2$, are of vital importance to the proposed method.

Time Cost
Although the proposed method can take ground information change into consideration and obtain results with good quantitative and qualitative evaluation, these achievements come at the cost of efficiency. Unlike deep learning methods that process images in a feed-forward manner with a well-trained network, the proposed method adopts an image-optimization process that works on the original images via backward propagation, because there is no extra dataset with which to train the network. To process a 2000 × 2000 image, feed-forward deep learning methods may need only several seconds, whereas the proposed method needs several minutes. Table 5 lists the time cost of the proposed method and two comparison methods on one 2000 × 2000 image: WLR takes only several seconds, MNSPI takes tens of seconds, and the proposed method takes hundreds of seconds, which is excessive. We will seek to improve the efficiency of our method in future work.

Conclusions
In this paper, we proposed a cloud removal method for optical images from the Sentinel-2 satellites. The proposed method takes advantage of multitemporal optical images from the Sentinel-2 satellites and multitemporal SAR images from the Sentinel-1 satellites as reference for the reconstruction. The whole process is conducted within a deep neural network and needs no training dataset. To the best of our knowledge, our method is the first multitemporal-based method that can reflect ground information change in practical cloud removal tasks without large training datasets. The success of the simulation and real data experiments confirms the effectiveness and applicability of the proposed method.
In future work, we will collect a large training dataset to train a deep neural network that can operate in a feed-forward manner while reflecting ground information change in the reconstruction results. Moreover, we will modify the network structure to obtain more accurate results.

Conflicts of Interest:
The authors declare no conflict of interest.