1. Introduction
Free optical imageries such as Sentinel-2 [1], GaoFen 1 (GF-1; GaoFen means high resolution in Chinese) [2], Landsat 8 [3] and the Moderate Resolution Imaging Spectroradiometer (MODIS) [4] play essential roles in current agricultural applications, e.g., crop classification [5], cropland monitoring [6], analysis of historical samples [7], and research on arable land-use intensity [8]. However, due to the limitations of their wavelengths, these optical imageries inevitably suffer from cloud and shadow contamination, which can leave insufficient earth surface reflectance data for subsequent research [5,9].
To address this problem, researchers have pursued two different approaches.
One approach is to fuse multisource [10], multitemporal [11], or multispectral [12] optical imageries. The rationale is to use cloud-free imageries or bands from a reference phase to recover the missing information in the target phase [13]. The advantage of this approach is the wide choice of multisource optical image data; moreover, the band ranges of these data are relatively similar, as is the reflectance of the ground objects they capture, which facilitates collaborative fusion among the data. The disadvantage is that suitable reference data are hard to obtain when cloudy conditions persist [14]. Thus, such methods cannot fundamentally solve the problem of missing information caused by clouds.
The other approach is to fuse synthetic aperture radar (SAR) [15] data with the target optical imageries, or to use SAR data alone. SAR has sufficient penetration capacity to capture surface features through cloud [16]. Therefore, in theory, fusing SAR and optical data can remedy the deficiency of the first approach. However, this approach has its own limitations. SAR and optical remote sensing differ fundamentally in their imaging principles, so some ground features (e.g., roads, playgrounds and airport runways) appear differently in spectral reflectance than in the SAR backscattering coefficient. It therefore remains difficult to establish a mapping relationship between the two types of data for these ground features [17]. In addition, judging by the authors' original goals, this body of literature can be roughly divided into two categories: cloud removal on optical remote sensing data, and SAR-to-optical translation. The second category was originally intended to facilitate the interpretation of hard-to-read SAR images, but we believe it can be extended to the production of cloud-free optical remote sensing data. Although the second category does not perform cloud removal on the original cloud-contaminated data, its results resemble those of the first: cloud-free optical remote sensing data are obtained in the end. The research on cloud removal using SAR data discussed below includes both categories.
In addition, in most cases, both approaches select reference data that match the target data in space and time [18,19]. In space, the closer two locations are, the more correlated their ground objects; the farther apart, the greater the difference (the First Law of Geography [20]). In time, the shorter the interval, the smaller the change between ground features. However, a few studies did not follow these rules when selecting data. Zhang et al. used a similarity correlation measurement index to score all remote sensing images of a region within a period of time, sorted them by score, and selected the highest-scoring images as input [21]. The unsupervised Cycle-Consistent Adversarial Network (CycleGAN) [22] does not need remote sensing image pairs from the same region. Although this method cannot learn a mapping function between two specific images in the way supervised learning can, it can learn a mapping between styles from a series of images in two different styles, which makes it easier to expand the dataset [23,24].
We used 'TI = "cloud removal" OR AK = "cloud removal"' as the retrieval condition and retrieved all the relevant literature through the Web of Science. We then used '(TI = "SAR" OR AK = "SAR") AND (TI = "cloud removal" OR AK = "cloud removal")' to retrieve the literature on SAR-based methods. On this basis, we further searched the citations of this body of literature to expand the set of papers. Finally, the literature on optical data cloud removal based on SAR data was collected and sorted (Figure 1). The figure shows that, among studies on cloud removal from optical remote sensing data, the first approach of integrating optical remote sensing data started earlier than the second, and more papers address it. Figure 1 also shows that SAR-based cloud removal methods for optical data only began to develop in recent years. Once generative adversarial networks (GANs) emerged, many scholars tried to use this kind of neural network to study SAR-to-optical translation.
The study of cloud removal using optical remote sensing data alone has had a long time to develop, and a number of relevant literature reviews exist [25,26,27], which supports the further development of cloud removal research. However, because research on cloud removal integrating SAR data is so recent, no dedicated review is available. The paper by Fuentes Reyes et al. [28] could be regarded as a related review, but its content focuses on the performance of the CycleGAN network in SAR-to-optical translation. We therefore hope that this review can chart the current development of cloud removal for optical remote sensing data using SAR data, especially the deep-learning-based methods emerging since 2018.
2. Literature Survey
We surveyed cloud removal methods for optical imageries using SAR published in journals and conferences. We found 26 papers published in 9 international journals (Table 1), 13 conference papers [29,30,31,32,33,34,35,36,37,38,39,40,41] and 4 papers [42,43,44,45] published in arXiv (pronounced "archive"; the X represents the Greek letter chi), an open-access repository of electronic preprints (known as e-prints) approved for posting after moderation but not peer review. Among the 26 journal papers, 14 were published in two journals: Remote Sensing and IEEE Transactions on Geoscience and Remote Sensing. A total of 84% of the papers were published after 2019, and the overall number of papers is small, which indicates that research on cloud removal from optical remote sensing data using SAR data is still at an early stage and still has great potential. We divided these papers into five types of methods, as shown in Section 3.1; the development of the methods published in journals is shown in Figure 2.
At the same time, to help scholars search for papers by keyword, we carried out text analysis on the 26 papers and extracted the words commonly used in their titles (Figure 3). The figure shows that different scholars name their methods differently in paper titles, but the most frequent words still concern cloud removal from optical remote sensing data using SAR data, such as 'SAR', 'cloud', 'optical', 'image' and 'removal'. There are also keywords about deep learning, especially generative adversarial networks, such as 'learning' and 'adversarial', which indicates that generative adversarial networks are widely applied in this field. In the following, we focus on the current situation, opportunities and challenges of using SAR data to remove clouds from optical remote sensing data based on GANs.
We also plotted a literature citation map of the journal papers with VOSviewer, as Figure 4 shows. The map contains 25 papers, because [66] cannot be found in the Web of Science Core Collection (WOSCC). The most cited paper is [62], which presents a cloud removal method that reconstructs the missing information in cloud-contaminated regions of a high-resolution optical satellite image using two types of auxiliary images, i.e., a low-resolution optical satellite composite image and a SAR image.
4. Evaluation Indicators
How the accuracy of a generated cloud-free image is evaluated is also an important issue for the relevant research. In this section, we surveyed the existing literature one by one and identified the evaluation indicators that scholars most commonly use, as shown in Table 4.
The most frequently used indicator is the Structural Similarity Index Measurement (SSIM) [77]. In terms of image composition, SSIM treats structural information as independent of brightness and contrast, reflecting the structural properties of objects in the scene, and models distortion as a combination of brightness, contrast and structure. The mean value, standard deviation and covariance are used to estimate the luminance, contrast and structural similarity. The formula is shown as Equation (10):

$$\mathrm{SSIM}(x,y)=\frac{(2\mu_x\mu_y+C_1)(2\sigma_{xy}+C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)} \quad (10)$$

where $\mu_x$ and $\mu_y$ are the mean values, $\sigma_x$ and $\sigma_y$ are the standard deviations, $\sigma_{xy}$ is the covariance between the realistic and simulated values, and $C_1$ and $C_2$ are the constants used to enhance the stability of SSIM. SSIM ranges from −1 to 1; the closer the SSIM is to 1, the more accurate the simulated image is.
The second indicator is the Peak Signal-to-Noise Ratio (PSNR) [78], a traditional image quality assessment (IQA) index. A higher PSNR generally indicates that the image is of higher quality. The formula can be defined as

$$\mathrm{PSNR}=10\cdot\log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right),\qquad \mathrm{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2$$

where $y_i$ and $\hat{y}_i$ are the realistic and simulated values for the $i$th pixel, respectively, $N$ is the number of pixels, $L$ is the maximum possible pixel value, and $\mathrm{MSE}$ is the mean square error.
The third indicator is the Spectral Angle Mapper (SAM), proposed by Kruse et al. [79] in 1993, which regards the spectrum of each pixel in an image as a high-dimensional vector and measures the similarity between two spectra by calculating the angle between the corresponding vectors. The smaller the angle, the more similar the two spectra are.
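The text above describes SAM verbally; for reference, its standard formulation for two spectral vectors $\mathbf{x}$ and $\mathbf{y}$ over $B$ bands is

$$\theta(\mathbf{x},\mathbf{y})=\arccos\left(\frac{\sum_{b=1}^{B}x_b\,y_b}{\sqrt{\sum_{b=1}^{B}x_b^2}\,\sqrt{\sum_{b=1}^{B}y_b^2}}\right)$$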
The fourth indicator is the Root Mean Square Error (RMSE) [80], which measures the difference between the actual and simulated values. The formula is defined as

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2}$$

where $y_i$ and $\hat{y}_i$ are the realistic and simulated values for the $i$th pixel, respectively, and $N$ is the number of pixels. The smaller the RMSE, the more similar the two images are.
In addition to the above four evaluation indicators, scholars also use the following indicators to evaluate accuracy: Correlation Coefficient (CC) [81], Feature Similarity Index Measurement (FSIM) [82], Universal Image Quality Index (UIQI) [83], Mean Absolute Error (MAE) [80], Mean Square Error (MSE) [84] and Degree of Distortion (DD) [56].
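For concreteness, the following is a minimal sketch of how the four most common indicators could be computed with NumPy and scikit-image; the function name `evaluate`, the `(bands, H, W)` array layout and the `data_range=1.0` setting (reflectance scaled to 0–1) are our illustrative assumptions, not specifications from the surveyed papers.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(reference, prediction):
    """Compute SSIM, PSNR, RMSE and SAM for float arrays of shape (bands, H, W)."""
    ssim = structural_similarity(reference, prediction,
                                 channel_axis=0, data_range=1.0)
    psnr = peak_signal_noise_ratio(reference, prediction, data_range=1.0)
    rmse = float(np.sqrt(np.mean((reference - prediction) ** 2)))
    # SAM: mean angle (radians) between per-pixel spectral vectors
    dot = (reference * prediction).sum(axis=0)
    norms = np.linalg.norm(reference, axis=0) * np.linalg.norm(prediction, axis=0)
    sam = float(np.arccos(np.clip(dot / (norms + 1e-8), -1.0, 1.0)).mean())
    return {"SSIM": ssim, "PSNR": psnr, "RMSE": rmse, "SAM": sam}
```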
5. Limits and Future Directions
On the basis of this literature review, we analyze the problems of existing methods and propose the following six directions for future research.
5.1. Image Pixel Depth
Remote sensing data present the reflectivity information of surface objects, which is recorded continuously on a scale from 0 to 1. Because storing floating-point data consumes a large amount of storage, the reflectivity is usually multiplied by 10,000 and stored as an integer from 1 to 10,000. The storage format of remote sensing data is usually GeoTIFF, but the image input format used for deep learning is usually PNG, an 8-bit unsigned integer format with a pixel depth of 0–255. Therefore, to ingest remote sensing data into deep learning models smoothly, most scholars save remote sensing images in PNG format.
However, this approach amounts to lossy compression: only 256 of 10,000 levels are retained, a loss rate of spectral information of 97.44%, so most of the spectral information is lost. This leads to two problems. First, the amount of information available to the model is far lower than what the data can provide, which hinders the model's extraction of spectral features; much of the information can easily be confounded, making it harder to find the relationship between the backscattering coefficient and spectral reflectance and potentially degrading the cloud removal effect. Second, the model can only output images with pixel values of 0–255. Such results are acceptable for simple visual inspection but cannot be used for subsequent quantitative calculations, because the information loss is too large.
In fact, this problem is not limited to the cloud removal field studied in this paper; it also exists in other fields that apply deep learning to remote sensing data. We do not recommend sacrificing the accuracy of the remote sensing data merely to feed it into a deep learning model, as this may substantially interfere with research. We suggest using the GDAL library to modify the data loading and data output modules of the deep learning model, instead of the usual image formats of deep learning, which will enable the model to directly read and generate data with a higher pixel depth, as the sketch below illustrates.
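As an illustration, here is a minimal sketch of such a GDAL-based loader and writer, assuming a PyTorch pipeline; the class and function names, the 0–10,000 reflectance scaling and the UInt16 output type are our assumptions for illustration, not a prescription from the surveyed literature.

```python
import numpy as np
import torch
from osgeo import gdal
from torch.utils.data import Dataset

class GeoTiffDataset(Dataset):
    """Reads multiband GeoTIFF tiles directly, avoiding the 8-bit PNG detour."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        ds = gdal.Open(self.paths[idx])
        # ReadAsArray returns (bands, rows, cols) for multiband rasters,
        # keeping the native UInt16 values (0-10,000 scaled reflectance).
        arr = ds.ReadAsArray().astype(np.float32)
        return torch.from_numpy(arr / 10000.0)  # rescale to 0-1 for the model

def write_geotiff(path, array, template_path):
    """Writes a (bands, rows, cols) array back to 16-bit GeoTIFF,
    copying the georeferencing from a template scene."""
    template = gdal.Open(template_path)
    bands, rows, cols = array.shape
    driver = gdal.GetDriverByName("GTiff")
    out = driver.Create(path, cols, rows, bands, gdal.GDT_UInt16)
    out.SetGeoTransform(template.GetGeoTransform())
    out.SetProjection(template.GetProjection())
    for b in range(bands):
        out.GetRasterBand(b + 1).WriteArray(
            np.clip(array[b] * 10000.0, 0, 10000).astype(np.uint16))
    out.FlushCache()
```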
5.2. Number of Image Bands
The root cause of this problem is the same as in Section 5.1: the professional storage formats used for remote sensing data are inconsistent with the data reading and writing formats commonly used in deep learning. Optical remote sensing data capture the reflectivity information of different electromagnetic bands: one channel for a panchromatic image, several channels for multispectral data and dozens or hundreds of channels for hyperspectral data. SAR data likewise have different channel combinations according to the polarization mode. The PNG format commonly used in deep learning can only accommodate three channels, so optical image data and SAR data are sometimes forced to drop their remaining channels, or to duplicate a channel, so that the channel count matches the three channels PNG requires.

When the remaining channels are deleted, spectral information is lost: features that deep learning could have exploited are removed artificially at the data-preparation stage, which is detrimental to training cloud removal models. In particular, clouds affect different bands differently, and fusing different bands is an effective way to remove thin clouds from images; yet, to meet the requirements of the PNG format, those bands must be deleted. We again suggest using the GDAL library to modify the data loading and output modules of the deep learning model, instead of the usual three-channel image formats, which will enable the model to directly read and generate data with the proper number of channels, as shown in the sketch below.
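On the model side, accommodating an arbitrary band count is typically a one-line change. The sketch below assumes a PyTorch convolutional generator; the band counts (13 Sentinel-2 bands, 2 Sentinel-1 polarizations) are illustrative assumptions, not values taken from the surveyed papers.

```python
import torch.nn as nn

N_OPTICAL_BANDS = 13  # e.g., all Sentinel-2 bands (illustrative)
N_SAR_BANDS = 2       # e.g., Sentinel-1 VV + VH (illustrative)

# The first convolution is the only layer tied to the number of input
# bands; replacing the 3 channels forced by PNG is enough.
first_conv = nn.Conv2d(
    in_channels=N_OPTICAL_BANDS + N_SAR_BANDS,
    out_channels=64,
    kernel_size=7,
    padding=3,
)
```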
5.3. Global Training Datasets
At present, most research uses a small amount of data from a local area for model training. However, deep learning models such as generative adversarial networks require many training samples to learn from, as well as sufficient validation samples to evaluate the model. Yet, because dataset production involves a large amount of repetitive work with little room for innovation, there are still few publicly downloadable datasets for SAR and optical remote sensing image fusion. A global dataset for optical image cloud removal based on SAR data would therefore be important for research in this field. Such a dataset needs the following characteristics:
(1) it covers multiple regions of interest across the world;
(2) it contains paired SAR and optical data;
(3) it is a multi-temporal series;
(4) it is in GeoTIFF format; and
(5) it pairs each cloud-corrupted target image with a cloud-free counterpart.
First, the dataset must cover different regions of the world as fully as possible, so that the model has the opportunity to learn features of different terrains and regions, strengthening the generalization ability of the trained model. Second, SAR and optical image data must be fused, so paired data are essential. Third, the model must have a chance to learn characteristics of the time dimension, which helps to further improve its accuracy. Fourth, the GeoTIFF format ensures that the dataset retains the complete pixel-depth and band information discussed in the previous two sections. Last, verification data must be available during cloud removal research, so that the model can be continuously trained and optimized.
5.4. Accuracy Verification for Cloud Regions
At present, the vast majority of research evaluates the cloud removal effect on whole-scene data, so the reported result is the average over the whole scene. However, for cloud removal applications, the key is to recover the areas polluted by clouds or cloud shadows; in the remaining, unpolluted areas, the original pixel information of the cloudy image can simply be reused. Therefore, judging model accuracy over the whole scene has clear limitations, and targeted accuracy evaluations should be conducted in the areas polluted by clouds or cloud shadows.
For example, Table 5 records the whole-scene precision comparison of two SAR-based cloud removal methods. Method A appears clearly superior, but when we inspect the local cloud removal effect, as shown in Figure 8, the cloud removal effect of Method B is obviously better. It is therefore difficult to judge cloud removal quality from whole-scene evaluation indicators alone. Instead, we should extract the cloud-affected area separately and compute the indicators there, which gives a more objective assessment of the cloud removal effect; a sketch follows.
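A minimal sketch of such mask-restricted evaluation, assuming a binary cloud/shadow mask aligned with the images (the function name and array conventions are our illustrative choices):

```python
import numpy as np

def masked_rmse(reference, prediction, cloud_mask):
    """RMSE restricted to cloud/shadow pixels (cloud_mask == 1).

    reference, prediction: float arrays of shape (bands, H, W)
    cloud_mask: binary array of shape (H, W), 1 = cloud or shadow
    """
    # Broadcast the 2-D mask over the band axis, keeping only flagged pixels
    selected = (reference - prediction)[:, cloud_mask == 1]
    return float(np.sqrt(np.mean(selected ** 2)))
```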
5.5. Auxiliary Data
The input data of most studies are SAR and optical image data, but few studies have paid attention to the value of cloud mask data in this field. At present, much optical data comes with corresponding cloud mask data at download time, and many open-source cloud detection algorithms exist, making cloud mask data easy to obtain. Feeding cloud mask data into the model is a worthwhile idea: it gives the model the cloud distribution in advance, enabling more targeted cloud removal operations.
Furthermore, we can enrich the types of cloud mask data. In addition to distinguishing cloud from cloud-free areas, we can further distinguish thin clouds, thick clouds, cloud shadows and other areas, and combine them with the loss function so that the results of the model are more satisfactory. Compared to Equation (7), the new loss function can measure and feed back more information to improve accuracy. The equation can be defined as

$$\mathcal{L}=\frac{1}{N_{tot}}\Big(\big\|TCM1\odot(P-T)\big\|_1+\big\|TCM2\odot(P-T)\big\|_1+\big\|SM\odot(P-T)\big\|_1+\big\|(\mathbf{1}-CSM)\odot(P-I)\big\|_1\Big)$$

where $P$, $T$ and $I$ denote the predicted, target and input optical images, respectively. $TCM1$ (thick cloud mask) has the same spatial properties as the optical images; pixel value 1 represents thick cloud, and 0 represents other areas. $TCM2$ (thin cloud mask) has the same spatial properties as the optical images; pixel value 1 represents thin cloud, and 0 represents other areas. $SM$ (shadow mask) has the same spatial properties as the optical images; pixel value 1 represents shadow, and 0 represents other areas. $CSM$ (cloud-shadow mask) has the same spatial properties as the optical images; pixel value 1 represents cloud or cloud shadow, and 0 represents cloud-free areas. $\mathbf{1}$ is a matrix of ones with the same spatial dimensions as the optical images, and $N_{tot}$ is the total number of pixels in all bands of the optical images.
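A minimal sketch of this mask-weighted L1 loss as it might look in PyTorch; the function name and the pairing of each mask with the target or input image follow the equation above, which is a reconstruction from the mask definitions rather than a formulation taken verbatim from the surveyed papers.

```python
import torch

def mask_weighted_l1(P, T, I, tcm1, tcm2, sm, csm):
    """Mask-weighted L1 loss.

    P, T, I: (batch, bands, H, W) predicted, target and input images
    tcm1, tcm2, sm, csm: (batch, 1, H, W) binary masks, broadcast over bands
    """
    n_tot = P.numel()
    loss = (torch.abs(tcm1 * (P - T)).sum()          # thick cloud vs. target
            + torch.abs(tcm2 * (P - T)).sum()        # thin cloud vs. target
            + torch.abs(sm * (P - T)).sum()          # shadow vs. target
            + torch.abs((1 - csm) * (P - I)).sum())  # cloud-free vs. input
    return loss / n_tot
```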
5.6. Loss Functions
The current studies use a variety of precision indicators to verify the results from multiple dimensions, as shown in Table 4. However, these indicators are only used for the final quantitative evaluation of the results and contribute little to the optimization of the model itself. At the same time, most studies compute the loss function with a relatively simple L1 or L2 loss over the entire image. A next step is to move these multidimensional indicators, such as SAM, into the model optimization stage (i.e., the loss function calculation) to see whether they can help the model optimize training from different dimensions.
It is worth noting that, because the loss function is computed with matrix operations, these commonly used precision index formulas must be rewritten in matrix form, which involves some mathematical and programming work. In theory, we could also put the precision index formulas into the loss function without modification, but this would greatly increase the training time (our initial estimate is a factor of up to 1000), so it is not recommended. A matrix-form sketch of a SAM loss follows.
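As an illustration, here is a minimal matrix-form SAM loss in PyTorch, assuming image tensors of shape (batch, bands, H, W); the function name and the clamping constant are our choices for numerical stability, not values from the surveyed papers.

```python
import torch

def sam_loss(pred, target, eps=1e-8):
    """Mean spectral angle (radians) between prediction and target,
    computed per pixel over the band axis; fully vectorized and
    differentiable, so it can serve directly as a loss term."""
    dot = (pred * target).sum(dim=1)               # (batch, H, W)
    norms = pred.norm(dim=1) * target.norm(dim=1)  # (batch, H, W)
    cos = torch.clamp(dot / (norms + eps), -1.0 + eps, 1.0 - eps)
    return torch.acos(cos).mean()
```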
6. Conclusions
Through literature retrieval and analysis, we find in this paper that optical image cloud removal based on SAR data is a valuable research direction. Compared to traditional methods that use optical remote sensing data alone to remove clouds, research on cloud removal integrating SAR data has risen only in recent years; after the emergence of deep learning, more scholars began studying cloud removal through SAR and optical data fusion. As a result, review papers on optical image cloud removal using SAR data are lacking. We hope that this paper will help more scholars understand the development of this field, communicate its advances, and carry out more relevant research.
In this paper, we present the main contributing journals, keywords, fundamental literature and other research in the field of optical image cloud removal using SAR data, which will help scholars search and study the relevant literature in this field. Nearly 54% of the literature in this field can be found in two sources: Remote Sensing and IEEE Transactions on Geoscience and Remote Sensing.
In this paper, we summarize the relevant literature along two dimensions: research methods and data input. We classify the research methods into five categories: conditional generative adversarial networks (cGANs), unsupervised GANs, convolutional neural networks (CNNs, not GANs), hybrid CNNs and other methods, as Table 2 shows. We outline the general principles and loss functions of these methods to help scholars understand them. At the same time, we describe the input data used by these methods and summarize three types of data input, as shown in Table 3, to help scholars understand the data and prepare for studying these methods.
Moreover, this paper documents the accuracy verification indicators used in the current mainstream literature, which can help subsequent scholars make appropriate choices. The results show that the most popular indicators are SSIM, PSNR, SAM and RMSE. We believe that these four indicators are good choices for relevant scholars to quantitatively verify the accuracy of their models.
Finally, this paper discusses the key points of the future development of this field in terms of six aspects: image pixel depth, the number of image bands, global training datasets, accuracy verification for cloud regions, auxiliary data and loss functions. We discuss the current problems in these six areas and provide our solutions. We hope these problems and solutions can inspire scholars and promote the development of optical image cloud removal using SAR data.