1. Introduction
In recent years, owing to the rapid development of remote sensing technologies, remote sensing images (RSI) have gained increasing application value in Earth observation [1], resource investigation, and environmental monitoring and protection [2]. Cloud detection is important in RSI processing since most RSI are contaminated by clouds [3,4,5], which degrades the quality of RSI and affects subsequent applications [6]. At the same time, clouds play a crucial role in Earth's radiation balance, the water cycle, and climate change [7]. Rapid and accurate cloud detection can provide an effective data source for the inversion of cloud and aerosol parameters and the study of ocean color characteristics.
Directional Polarimetric Cameras (DPC) have attracted much attention as an emerging Earth observation technology [8]. Compared with traditional optical remote sensing, polarization information is more sensitive to clouds and aerosols, which makes satellite polarimetric remote sensing more advantageous for atmospheric detection [9]. The main task of DPC is to obtain multi-band, multi-angle polarized radiation and reflection information, which is used to study the optical and physical properties of atmospheric aerosols and clouds and to observe marine waters. DPC also provides remote sensing data support for global climate change and atmospheric environment monitoring [8,10,11,12]. The purpose of this paper is to explore effective cloud detection for DPC imagery, which is the premise of cloud characteristic parameter inversion and ocean color inversion.
Existing cloud detection methods for polarized RSI, such as DPC imagery, mainly exploit the difference in reflectance and polarized reflectance between clouds and the ground to set thresholds for detecting cloud regions. However, these thresholding algorithms are often limited in performance and generalization ability. For example, certain bright surfaces with high reflectance (ice and snow regions, ocean flare regions, etc.) have very small reflectance differences from cloud regions, making threshold setting difficult. Moreover, different regions often require different thresholds. It is therefore necessary to investigate more precise and advanced cloud detection algorithms for DPC imagery.
Recently, deep neural network-based methods have been applied to RSI cloud detection, typically built on semantic segmentation networks such as Fully Convolutional Networks (FCN) [13], U-Net [14,15], and SegNet [16]. However, semantic segmentation, as a pixel-level classification process, generally requires relatively high spatial resolution, and these networks are mostly applied to high-resolution RSI. The spatial resolution of DPC imagery, in contrast, is extremely low, so such networks are not directly suitable for DPC cloud detection.
DPC imagery differs from other RSI in its multi-spectral, multi-angle, and multi-polarization characteristics. DPC acquires eight spectral bands, each observed from nine different viewing angles, and three of these bands are polarized (providing additional Stokes-vector images Q and U). Hence, DPC produces 126 observations per pixel, offering rich information for Earth observation [12,17,18]. To jointly utilize the information from multiple angles, we propose to use 3D convolution [19] to extract and exploit angle features. Furthermore, the Squeeze-and-Excitation Network (Senet) [20] is applied to automatically emphasize the essential spectral information while avoiding the loss of spectral information that a band-selection process would incur. Finally, previous studies have shown that the polarization information of clouds can be used to detect cloud pixels in certain areas (such as flare areas) [9,12,17,21]. Polarization images, as a modality different from multi-spectral images, can provide additional information for cloud detection. Inspired by this, we propose to fuse the DPC multi-spectral image and polarization images with a multi-stream architecture, where each stream corresponds to one modality.
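For illustration only, the per-pixel observation count can be grouped into the three modalities used in this paper; the array names and shapes below are hypothetical and do not reflect the actual DPC product layout:

```python
import numpy as np

# Hypothetical per-pixel layout of one DPC sample (shapes are illustrative only).
N_BANDS, N_ANGLES, N_POL_BANDS = 8, 9, 3

# Reflectance modality R: 8 bands x 9 viewing angles = 72 observations.
reflectance = np.zeros((N_BANDS, N_ANGLES), dtype=np.float32)

# Polarization modalities Q and U: 3 polarized bands x 9 angles each = 27 + 27.
stokes_q = np.zeros((N_POL_BANDS, N_ANGLES), dtype=np.float32)
stokes_u = np.zeros((N_POL_BANDS, N_ANGLES), dtype=np.float32)

total = reflectance.size + stokes_q.size + stokes_u.size
assert total == 126  # 72 + 27 + 27 observations per pixel
```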
Consequently, we propose MFCD-Net, a 3D multimodal fusion network based on cross attention, which takes the multi-angle reflectance image, polarization image Q (Stokes vector Q), and polarization image U (Stokes vector U) as inputs for cloud detection. First, we treat the viewing angle as the third dimension of the 3D convolution so that spatial-angle information is extracted completely; 3D convolution greatly improves the ability to represent multi-dimensional data [22,23]. In addition, we treat the spectral bands as the channel dimension and assign different weights to each band based on Senet. Furthermore, we adopt a cross-attention [24] fusion technique in the fusion stage, which enhances the representation of the fused features from different modalities. Finally, the overall structure of the network is similar to 3D-UNet, and the multimodal fusion operation is carried out in four stages of the down-sampling path, which not only improves spatial feature extraction but also strengthens the fused feature representation through repeated fusion.
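As a minimal sketch of the band-reweighting idea (not the exact module defined in Section 4; the 5-D tensor layout (batch, height, width, angle, band) and the reduction ratio are assumptions), a squeeze-and-excitation step over the band axis could look like the following in Keras:

```python
from tensorflow.keras import layers

def se_block(x, reduction=4):
    """Hypothetical squeeze-and-excitation over the band (channel) axis of a
    (batch, H, W, angle, band) feature tensor; the reduction ratio is assumed."""
    bands = int(x.shape[-1])
    w = layers.GlobalAveragePooling3D()(x)               # squeeze: one value per band
    w = layers.Dense(bands // reduction, activation="relu")(w)
    w = layers.Dense(bands, activation="sigmoid")(w)      # per-band weights in (0, 1)
    w = layers.Reshape((1, 1, 1, bands))(w)
    return layers.Multiply()([x, w])                       # reweight each band
```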
So far, no DPC cloud detection dataset is publicly available. To evaluate the proposed method, we use the DPC Level-1 products and the corresponding Level-2 cloud mask products as the data and labels of our dataset, respectively. The dataset contains 126 observations per pixel of the DPC imagery. The main contributions of this paper can be summarized as follows.
To the best of our knowledge, this is the first approach to introduce multimodal learning for RSI cloud detection.
The traditional thresholding algorithm has poor performance and limited generalization ability. Furthermore, given the extremely low spatial resolution of DPC imagery, conventional semantic segmentation-based methods also fail to achieve good performance. To improve detection performance, this paper proposes a 3D multimodal fusion network (MFCD-Net), which compensates for the lack of spatial features by extracting and exploiting angle, spectral, and polarization features, thus achieving good performance.
Simple concatenation and summation feature fusion methods can hardly handle the imbalance between features in multimodal data and the unequal information contained in different features [25]. To enhance feature fusion between different modalities, a cross-attention fusion module is designed in this paper. It uses the attention map derived from the polarization modality to enhance the representation of the reflectance modality.
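As a hedged sketch of this idea (the exact module is defined in Section 4 and differs in detail), an attention map computed from the polarization stream can be used to reweight the reflectance stream before the two streams are combined:

```python
from tensorflow.keras import layers

def cross_attention_fusion(refl_feat, pol_feat):
    """Illustrative cross-attention fusion: a spatial attention map derived from
    the polarization stream modulates the reflectance stream (simplified sketch)."""
    attn = layers.Conv3D(1, kernel_size=1, activation="sigmoid")(pol_feat)
    refl_enhanced = layers.Multiply()([refl_feat, attn])   # polarization-guided reweighting
    return layers.Concatenate(axis=-1)([refl_enhanced, pol_feat])
```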
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the GF-5 DPC and the corresponding cloud detection dataset. Section 4 describes our multimodal network in detail. Section 5 reports the experimental results and discusses the experimental details. Finally, Section 6 concludes this research.
2. Related Work
Up to the present, many efforts have been devoted to cloud detection in RSI, and a variety of cloud detection methods have been proposed. These methods mainly rely on spectral information, frequency information, spatial texture, and other cues, in combination with thresholding, clustering, support vector machines, neural networks, and other algorithms [26]. Roughly speaking, they can be grouped into rule-based methods and machine learning-based methods.
The earliest and most widely used rule-based cloud detection method is the spectral threshold method [27]. Several spectral threshold algorithms for DPC imagery have been developed; they use reflectance or polarized reflectance to define the threshold. For example, Jinghan Li et al. [10] proposed a multi-information cooperation (MIC) method: rather than relying on a single constant threshold, MIC utilizes dynamic thresholds simulated from multiple atmospheric models, time intervals, and underlying surfaces. Other rule-based cloud detection methods are based on textural features. Among the various texture analysis methods, the gray-level co-occurrence matrix, fractal dimension, and boundary features are the most widely used [28], since they are compatible with the texture properties of clouds. Though straightforward, rule-based methods have limited generalization capability and are deficient in terms of performance.
Machine learning algorithms that learn automatically from training data, such as random forests [29], the K-nearest-neighbor algorithm [28], and support vector machines [30], have been applied to cloud detection. Owing to their excellent performance in many vision tasks, deep learning algorithms have emerged as the most popular methods for cloud detection. Fengying Xie et al. [31] proposed a cloud detection algorithm that divides an image into super-pixels using an improved simple linear iterative clustering (SLIC) and designs a CNN with two branches to extract multi-scale features of each super-pixel for pixel classification. More cloud detection algorithms are based on semantic segmentation models. Jingyu Yang et al. [6] proposed a cloud detection neural network (CDnet) with an encoder–decoder structure, a feature pyramid module, and a boundary refinement block to detect cloud areas in thumbnails effectively. Zhenfeng Shao et al. [32] superimposed the visible, near-infrared, shortwave, cirrus, and thermal infrared bands of the Landsat-8 satellite to obtain complete spectral information and then proposed a convolutional neural network based on multi-scale features to identify thick clouds, thin clouds, and non-cloud regions. Weakly supervised cloud detection methods have also been developed. Zou et al. [33] formulated cloud detection as a mixed energy separation process of image foreground and background and used a generative adversarial framework, combined with the physical principles behind cloud imaging, to establish weak supervision for cloud images. Yansheng Li et al. [34] proposed a weakly supervised deep learning-based cloud detection (WDCD) method that uses block-level labels to reduce the labor required for annotating pixel-level labels. In the past two years, deep learning research on cloud detection has gradually matured, and the problems of cloud boundary blurring and computational complexity have become recent research hotspots. Kai Hu et al. [35] proposed Cloud Detection U-Net (CDUNet), which refines the division boundary of the cloud layer and captures its spatial position information. To reduce computational complexity without affecting accuracy, Chen Luo et al. [36] developed LWCDNet, a lightweight autoencoder-based cloud detection network. Qibin He et al. [37] proposed a lightweight network (DABNet) to achieve high-accuracy detection of complex clouds, with not only clearer boundaries but also a lower false-alarm rate.
Even though the deep learning-based methods discussed above have achieved impressive performance, they are difficult to apply to DPC imagery because of its low spatial resolution. Therefore, this paper proposes a novel MFCD-Net that uses 3D convolution, Senet, and cross-attention fusion to extract and utilize angle, spectral, and polarization information, compensating for the lack of spatial information and achieving superior performance.
5. Experiments
In this section, we comprehensively evaluate the proposed MFCD-Net on DPC imagery. Specifically, we first describe the experimental setup and the evaluation metrics. Then, we evaluate the performance of the proposed MFCD-Net qualitatively and quantitatively. Third, we further investigate the contributions of the CA module, Senet, and Res-block. Finally, we demonstrate the effectiveness of our DPC data selection and processing strategy.
5.1. Experimental Settings
5.1.1. Training Details
All networks were trained under the Keras framework and optimized with the Adam algorithm [48]. The proposed MFCD-Net is trained in an end-to-end manner. The learning rate starts from 10⁻⁵ and is then adjusted dynamically by the 'ReduceLROnPlateau' callback: when the validation loss does not decrease for three epochs, the learning rate is reduced to 0.8 times its current value. The whole training process runs for at most 100 epochs, and training ends early when the validation loss does not decrease for 30 epochs. The loss function used in the experiments is the cross-entropy loss. The comparison methods are trained with the same settings as MFCD-Net.
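The training schedule described above corresponds roughly to the following Keras settings (a sketch: model construction and data loading are omitted, and binary cross-entropy is assumed here for the two-class cloud mask):

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

def train(model, train_data, val_data):
    """Training schedule as described in Section 5.1.1 (loss choice assumed binary)."""
    model.compile(optimizer=Adam(learning_rate=1e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    callbacks = [
        # Reduce the learning rate to 0.8x when val_loss stalls for 3 epochs.
        ReduceLROnPlateau(monitor="val_loss", factor=0.8, patience=3),
        # Stop early if val_loss does not improve for 30 epochs.
        EarlyStopping(monitor="val_loss", patience=30),
    ]
    return model.fit(train_data, validation_data=val_data,
                     epochs=100, callbacks=callbacks)
```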
5.1.2. Evaluation Metrics
Commonly used semantic segmentation metrics, namely overall accuracy (OA), producer accuracy (PA), user accuracy (UA), and Mean Intersection over Union (MIoU), are employed to examine the performance of the cloud detection methods. These metrics are calculated as follows:

OA = (TP + TN) / (TP + TN + FP + FN)
PA = TP / (TP + FN)
UA = TP / (TP + FP)
MIoU = (IoU_cloud + IoU_non-cloud) / 2, with IoU_cloud = TP / (TP + FP + FN) and IoU_non-cloud = TN / (TN + FP + FN)

where TP, TN, FP, and FN denote the number of correctly identified cloud pixels, the number of correctly identified non-cloud pixels, the number of incorrectly identified cloud pixels, and the number of incorrectly identified non-cloud pixels, respectively; IoU_cloud represents the Intersection over Union (IoU) of cloud pixels, and IoU_non-cloud represents the IoU of non-cloud pixels.
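A minimal NumPy sketch of these metrics computed from a binary prediction and ground-truth mask (the function and variable names are ours, not from the paper):

```python
import numpy as np

def cloud_metrics(pred, gt):
    """Compute OA, PA, UA, and MIoU from binary cloud masks (1 = cloud, 0 = non-cloud)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)          # correctly identified cloud pixels
    tn = np.sum(~pred & ~gt)        # correctly identified non-cloud pixels
    fp = np.sum(pred & ~gt)         # non-cloud pixels labeled as cloud
    fn = np.sum(~pred & gt)         # cloud pixels labeled as non-cloud
    oa = (tp + tn) / (tp + tn + fp + fn)
    pa = tp / (tp + fn)
    ua = tp / (tp + fp)
    iou_cloud = tp / (tp + fp + fn)
    iou_clear = tn / (tn + fp + fn)
    miou = (iou_cloud + iou_clear) / 2
    return oa, pa, ua, miou
```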
5.1.3. Data Augmentation
Data augmentation can, to a certain extent, alleviate overfitting and improve the generalization ability of the model [49]. MFCD-Net is based on three-dimensional convolution, with a considerable number of parameters and a large demand for training data. Because of the low resolution of DPC imagery, obtaining a large number of labeled images is challenging; hence the dataset we use is limited and cannot by itself meet the requirements of the network model. We therefore performed data augmentation operations on the training data and labels, such as vertical flipping, horizontal flipping, and diagonal mirror flipping, to improve the robustness of the network model.
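A hedged sketch of this flip-based augmentation applied to an image/label pair (it assumes height and width are the first two axes and that blocks are square, so the diagonal mirror can be realized as an axis swap):

```python
import numpy as np

def flip_augment(image, label):
    """Return the original sample plus vertical, horizontal, and diagonal-mirror
    variants; the same transform is applied to the image block and its label."""
    samples = [(image, label)]
    samples.append((np.flip(image, axis=0), np.flip(label, axis=0)))   # vertical flip
    samples.append((np.flip(image, axis=1), np.flip(label, axis=1)))   # horizontal flip
    samples.append((image.swapaxes(0, 1), label.swapaxes(0, 1)))       # diagonal mirror
    return samples
```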
5.2. Comparative Experiment of Different Methods
The compared methods include the classical semantic segmentation networks FCN, U-Net, SegNet, PSPNet, and DeepLabV3, as well as several advanced cloud detection methods: the deformable contextual and boundary-weighted network (DABNet) [37], the lightweight cloud detection network (LWCDNet) [36], and Cloud Detection UNet (CDUNet) [35]. For an objective and effective comparison, the experimental settings are kept consistent across methods. Since the above comparison methods are based on 2D CNNs, we rearranged the reflectance and polarization images into 2D multi-channel data; the block size of the input image for the 2D-CNN-based networks is 32 × 32 × 126.
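One plausible way to perform this rearrangement is to flatten the angle and band axes into the channel dimension; the NumPy sketch below illustrates the idea, with the exact axis ordering being our assumption:

```python
import numpy as np

def to_2d_channels(refl, stokes_q, stokes_u):
    """Flatten angle and band axes into channels so a 32 x 32 DPC block becomes a
    32 x 32 x 126 multi-channel image for 2D-CNN baselines (ordering is illustrative)."""
    h, w = refl.shape[:2]
    stacked = np.concatenate([refl.reshape(h, w, -1),       # 8 bands x 9 angles = 72
                              stokes_q.reshape(h, w, -1),    # 3 bands x 9 angles = 27
                              stokes_u.reshape(h, w, -1)],   # 3 bands x 9 angles = 27
                             axis=-1)
    assert stacked.shape[-1] == 126
    return stacked
```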
Figure 10 shows the qualitative comparison results on five examples from our dataset. These examples include different backgrounds, such as sea and land regions and special ocean flare regions, as well as thick and thin cloud scenarios. The proposed MFCD-Net has the fewest misclassified pixels in all cases, showing that it has the best capacity to distinguish cloud pixels. Notably, our method has significantly fewer misclassified pixels in edge regions than the other methods, indicating that it handles the difficult problem of cloud boundary identification well.
Quantitative comparison results are shown in Table 3. Our method achieves the best performance on all evaluation metrics; in particular, MIoU is improved by at least 3.38% compared with the other methods. The results indicate that the proposed MFCD-Net outperforms the 2D-CNN-based comparison methods, showing the superiority of 3D convolution for feature extraction from multi-dimensional data. Besides the proposed MFCD-Net, U-Net and CDUNet achieved 92.95% and 93.82% OA, respectively, demonstrating that the U-shaped encoder–decoder structure has a strong ability to extract features from multi-channel data. This is why we chose an encoder–decoder structure similar to U-Net for the proposed MFCD-Net.
5.3. Ablation Study
To demonstrate the effectiveness of our designed network, we conducted ablation experiments on the DPC dataset for the Res-block, Senet, and CA modules. The evaluation performance of each configuration is listed in Table 4. The experiments were carried out on both unimodal data (reflectance input only, denoted (R) in Table 4) and multimodal data (both reflectance and polarization input, denoted (R+P) in Table 4) to eliminate the effect of the data on the comparison.
Our network structure is built on 3D-UNet. To highlight the important band information of the input data, Senet is used to assign different weights to the features of different bands. The Senet improves the OA from 93.69% to 94.18% in the unimodal case and from 94.56% to 95.03% in the multimodal case. On top of Senet, we further introduce the Res-block to replace the plain CBR-block (convolution, batch normalization, and ReLU) in the feature extraction stages of 3D-UNet, steadily improving the model by deepening the network. The experimental results show that using Res-blocks significantly improves the performance of the model on all evaluation metrics. The CA module is the core part of MFCD-Net. Unlike the commonly used feature concatenation fusion, it highlights the reflectance features with an attention map derived from the polarization features, fusing the multimodal features synergistically. The ablation networks without the CA module use feature concatenation for fusion in the multimodal case. MFCD-Net (3D-UNet + Senet + Res-block + CA) achieves the highest performance in terms of OA, MIoU, and PA compared with the other ablation networks.
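For illustration of the two block types compared in this ablation (a sketch only; kernel sizes, layer order, and the shortcut projection are assumptions, not the exact blocks defined in Section 4):

```python
from tensorflow.keras import layers

def cbr_block(x, filters):
    """Plain CBR block: 3D convolution + batch normalization + ReLU (sketch)."""
    x = layers.Conv3D(filters, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def res_block(x, filters):
    """Residual variant: two CBR-style convolutions with a 1x1x1 projection shortcut."""
    shortcut = layers.Conv3D(filters, kernel_size=1, padding="same")(x)
    y = cbr_block(x, filters)
    y = layers.Conv3D(filters, kernel_size=3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```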
5.4. Selection of Dataset
Considering DPC imagery’s multi-spectral, multi-angle, and multi-polarization properties, the selection of band and angle and the use of polarization images are very important steps. We conducted a series of experiments to illustrate the effectiveness of our data selection strategy of separating all data into three modalities and then performing multimodal fusion.
First, in the case of multi-angle data, we separately input the R (reflectance) images, Q (Stokes vector Q) images, and U (Stokes vector U) images into the single-modality MFCD-Net for experiments. Specifically, by altering the input data, we conducted the following experiments to demonstrate the effectiveness of multi-band input: 3-band R images (in various band combinations), 8-band R images, 3-band Q images, and 3-band U images. As shown in Table 5, the 8-band R image input outperforms the 3-band and single-band R image inputs. Likewise, the 3-band polarization image (Q and U) inputs are superior to the single-band polarization image inputs, demonstrating the effectiveness of multi-band data input in this research. Furthermore, the detection performance when employing only R image input, Q image input, or U image input is not as good as that of the proposed method. This shows that neither the polarization images nor the reflectance images contain all of the necessary information, confirming the efficacy of the multimodal data usage presented in this study.
Next, experiments on multi-angle and single-angle inputs were carried out with an adjusted single-modality MFCD-Net to demonstrate the importance of multi-angle data. Because the size of the third dimension (the angle dimension) of the feature maps is 1 in the single-angle case, we need to adjust the size of the convolution kernel (which must not exceed the size of the feature map's third dimension). To increase the credibility of the experiment, we added two further groups, 3-angle input and 6-angle input, for comparison with the 9-angle input used in this paper.
Table 6 shows that the 9-angle input achieved the best performance, and the detection performance improved as the number of input angles increased. This highlights the contribution of each angle and the effectiveness of using multi-angle data.
Finally, we replaced the reflectance image input with the radiance image input to show that the reflectance image carries more information than the radiance image discussed in Section 3. As shown in Table 7, the detection performance of the approach using radiance input is not as good as that of our method using reflectance input. As this experiment illustrates, reflectance data provide more information than radiance data. Consequently, we used reflectance data rather than radiance data in our dataset.