Cloud Detection of Remote Sensing Image Based on Multi-Scale Data and Dual-Channel Attention Mechanism

Abstract: Cloud detection is one of the critical tasks in remote sensing image preprocessing. Remote sensing images usually contain multi-dimensional information, which is not fully utilized by existing deep learning methods. This paper proposes a novel cloud detection algorithm based on multi-scale input and dual-channel attention mechanisms. Firstly, we remodeled the original data to a multi-scale layout in terms of channels and bands. Then, we introduced the dual-channel attention mechanism into an existing semantic segmentation network, to focus on both band information and angle information based on the reconstructed multi-scale data. Finally, a multi-scale fusion strategy was introduced to combine band information and angle information simultaneously. Overall, in the experiments undertaken in this paper, the proposed method achieved a pixel accuracy of 92.66% and a category pixel accuracy of 92.51%. For cloud detection, the proposed method achieved a recall of 97.76% and an F1 of 95.06%. The intersection over union (IoU) of the proposed method was 89.63%. Both in terms of quantitative results and visual effects, the deep learning model we propose is superior to the existing semantic segmentation methods.


Introduction
Approximately 50-70% of the earth's surface is covered with clouds, which significantly affects the atmospheric radiation budget and climate change [1]. Cloud detection has become an essential topic in remote sensing image processing, cloud climate effect research, weather forecasting, surface energy estimation, and other areas.
At present, the development of remote sensing image technology is gradually maturing. A remote sensing image contains not only rich spatial information, but also spectral information. The multi-dimensional information enables remote sensing data to play a vital role in agriculture and forestry, monitoring natural disasters, environmental pollution, urbanization, and so on.
In May 2018, the China National Space Administration (CNSA) launched the Gaofen-5 satellite with a directional polarimetric camera (DPC) [2,3]. The polarization camera has eight band channels ranging from 443 nm to 910 nm, among which the 490 nm, 670 nm, and 865 nm bands are polarized [4]. When the satellite passes over a target, it can obtain observation results at nine observation angles in each band channel. Therefore, the DPC data describes a given point from different bands and observation angles, and its spatial resolution is 3.3 km. This paper mainly focuses on cloud detection based on the Gaofen-5 DPC data.
In the deep learning field, common attention mechanisms are implemented in squeeze-and-excitation networks (SE-net) [22], the convolutional block attention module (CBAM) [23], the efficient channel attention module (ECA) [24], and so on. The SE-net proposed by Jie Hu et al. [22] is a typical implementation of the channel attention mechanism. The advantage of the SE-net module is that it is flexible and can be applied directly to existing network architectures. In addition, it has a small number of parameters.
Its critical point is to acquire the weight of each channel in the input feature layer. The CBAM proposed by Sanghyun Woo et al. [23] combines the channel attention mechanism and the spatial attention mechanism. The ECA-net proposed by Qilong Wang et al. [24] is another form of channel attention with cross-channel interactions. The band information and angle information discussed in this paper are reflected in the form of channels. To control variables, spatial attention is not used here, so the CBAM module was not adopted.
Considering the intrinsic 3D structure of the DPC data, and motivated by the 3D U-net and the channel attention mechanism, this paper proposes a 3D convolutional neural network model with a dual-channel attention mechanism, which considers the influence of different bands and different observation angles. Firstly, the data is reconstructed into two architectures from the perspective of channels and bands. Secondly, the data in the two forms is fed into two channels to train dual-channel attention models. Finally, the results learned from the two channels are fused to obtain the optimal solution for cloud detection. This scheme conforms to the data characteristics and can effectively improve the accuracy of cloud detection. The attention mechanism added at the input layer guarantees that the subsequent network is trained more effectively because of the overall extraction of information. In summary, our research contributions are as follows:
- The band information and angle information provided by the data are fully utilized. The influence of different band information and different observation angle information on experimental accuracy is also considered.
- 3D U-net is used as the benchmark network model. While classifying pixels, the texture information of clouds is preserved as much as possible, benefiting from the jump connection structure between the encoder and the decoder.
- A dual-channel attention mechanism is proposed to extract useful information from bands and angles, respectively.
The remainder of this paper is organized as follows. We will introduce the relevant work of this paper in Section 2. The specific experimental data and the model structure are presented in Sections 3 and 4, respectively. Relevant experimental results and analysis will be discussed in Section 5. Section 6 concludes this paper and outlines future work.

Three-Dimensional U-Net
The original U-net network is used for semantic segmentation of two-dimensional images, but DPC data for cloud detection is three-dimensional remote sensing images. Therefore, we need to switch from 2D U-net to 3D U-net.

U-Net
A U-net [14] comprises an encoder, a decoder, and a jump connection structure. Each encoder layer consists of two convolutional layers and one pooling layer, which is used to extract deeper features. Each decoder layer includes two convolution layers and one upsampling layer, aiming to recover the details of spatial information in the image. The jump connection links the feature layers between the encoder and decoder. The U-net splices the corresponding feature layers of the encoder and the decoder to assist the decoder in recovering details with low-level features. In Figure 1, the left half represents the encoder, the right half is the decoder, and the gray arrow in the middle denotes the jump connection.
Figure 1. U-net structure. In this figure, blue represents the convolution process, and ReLU is used as the activation function. Orange represents the down-sampling process using max pooling. Yellow is the up-sampling process. The gray arrows are the jump connection structure proposed in this paper.
U-net is widely used in remote sensing image fields [25,26]. In cloud detection, we only need to distinguish cloud regions from other regions, so shallow features are more meaningful. In U-net, the encoder contains rich shallow information, and the decoder contains rich deep details. It splices shallow features and deep features together using a jump connection structure. Compared with other models, U-net enhances the importance of shallow features in the model. In addition, U-net has a simple structure and a small number of parameters. It has an excellent generalization ability to promote a segmentation effect for different datasets.
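The splicing performed by the jump connection can be illustrated with a small NumPy sketch. The feature shapes and channel counts below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Jump (skip) connection: splice shallow encoder features with the
# upsampled deep decoder features along the channel axis.
encoder_feat = np.random.rand(64, 32, 32)   # shallow features (C, H, W)
decoder_feat = np.random.rand(64, 32, 32)   # upsampled deep features (C, H, W)

spliced = np.concatenate([encoder_feat, decoder_feat], axis=0)
print(spliced.shape)  # (128, 32, 32): channels are stacked, spatial size kept
```

The concatenated tensor is then convolved further, so the decoder can draw on both shallow and deep information.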

Three-Dimensional U-Net
DPC data has many bands and observation angles, providing plenty of information for cloud detection. For such three-dimensional images, two-dimensional operations are insufficient for feature extraction, so three-dimensional convolution [16] comes into play. Different from 2D convolution, 3D convolution slides across the width and height of the image, as well as the channel [19]. Figure 2 shows the difference between 3D convolution and 2D convolution. The convolution kernel depth of 2D convolution is consistent with the depth of the input layer, so the kernel moves only in width and height, and one kernel convolved with an image produces only one channel of output. In 3D convolution, however, the depth of the convolution kernel is smaller than that of the input layer, which allows the kernel to move in three dimensions: width, height, and depth. The output of a 3D convolution is still a 3D feature map. Using 3D convolution, we can extract not only spatial information from the data, but also band and angle information (channel information).
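The shape difference between the two operations can be sketched with NumPy sliding windows. This is a toy single-kernel convolution; the 8-channel depth and 3 × 3 kernels are illustrative assumptions:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.random.rand(8, 32, 32)  # input volume (depth, H, W), e.g. 8 channels

# 2D convolution: kernel depth equals input depth, slides over H and W only
k2d = np.random.rand(8, 3, 3)
win2d = sliding_window_view(x, (8, 3, 3))         # (1, 30, 30, 8, 3, 3)
out2d = np.einsum('abcijk,ijk->abc', win2d, k2d)
print(out2d.shape)                                # (1, 30, 30): one 2D map

# 3D convolution: kernel depth smaller than input depth, slides in 3 dims
k3d = np.random.rand(3, 3, 3)
win3d = sliding_window_view(x, (3, 3, 3))         # (6, 30, 30, 3, 3, 3)
out3d = np.einsum('abcijk,ijk->abc', win3d, k3d)
print(out3d.shape)                                # (6, 30, 30): a 3D map
```

One 2D kernel collapses the channel dimension to a single map, while the 3D kernel keeps a depth axis in its output, which is exactly what preserves band/angle structure.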

U-net is mainly used for semantic segmentation of two-dimensional images; convolution, pooling, and up-sampling in the model all adopt a two-dimensional form. Since DPC data is three-dimensional, an improved 3D version of U-net needs to be considered. The three-dimensional U-net [19] replaces all two-dimensional operations in the original U-net with three-dimensional ones, while the encoder-decoder architecture of the model and the jump connection are maintained. Unlike the original 3D U-net, we add channel attention modules to all of the down-sampling layers. In this way, the attention of the network is focused on the channels, which is beneficial to cloud detection and thus improves segmentation accuracy.

SE-Net
SE-net [22] weights each feature channel through a channel attention mechanism. It increases the weight of essential features and reduces the weight of irrelevant features to improve feature extraction; specifically, the importance of each channel is learned automatically. Figure 3 shows the channel attention mechanism of SE-net. For SE-net, the critical point is to acquire the weight of each channel in the input feature layer, so that the network pays more attention to crucial channels. Global average pooling is applied to the input, followed by two fully connected (FC) layers. After each full connection, the rectified linear unit (ReLU) and sigmoid are used as the activation functions, in the forms shown in Equations (1) and (2):

ReLU(x) = max(0, x)   (1)

σ(x) = 1 / (1 + e^(−x))   (2)

ReLU is faster to compute than other non-linear activation functions and can reduce the vanishing gradient problem. Sigmoid is a non-linear function whose range is (0, 1), used to compute the weight of each channel. Then, the output is obtained by multiplying the weight of each channel by each feature channel of the input.
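A minimal NumPy sketch of the squeeze-and-excitation step described above. Random matrices stand in for the learned FC layers, and the channel count and reduction ratio are illustrative assumptions:

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Channel attention (SE-net style) on x of shape (C, H, W).
    w1: (C//r, C), w2: (C, C//r), with reduction ratio r."""
    z = x.mean(axis=(1, 2))                    # squeeze: global average pool -> (C,)
    h = np.maximum(0.0, w1 @ z + b1)           # FC + ReLU, cf. Equation (1)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # FC + sigmoid, cf. Equation (2)
    return x * s[:, None, None]                # excitation: rescale each channel

rng = np.random.default_rng(0)
C, r = 8, 2
x = rng.random((C, 16, 16))
w1, b1 = rng.random((C // r, C)), np.zeros(C // r)
w2, b2 = rng.random((C, C // r)), np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
print(y.shape)  # (8, 16, 16): same shape, each channel scaled by a weight in (0, 1)
```

Because the sigmoid weights lie strictly in (0, 1), the block can only attenuate channels relative to one another; it never changes the spatial layout.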

Figure 2. Two-dimensional convolution process and three-dimensional convolution process.

SE-net is a self-attention mechanism. It weights each channel with information from the channel itself, reducing the dependence on external data. This gives the model more freedom to decide which channel is more valuable from the perspective of its importance to the task, so it has better generalization ability.


Data Sources and Data Formats
The experimental dataset consists of 14 remote sensing images taken with the directional polarimetric camera (DPC) on the Gaofen-5 satellite, stored in HDF format. Each image is 6084 (height) × 12,168 (width) pixels in spatial size. According to the number of spectral bands and observation angles of the DPC on Gaofen-5, each orbit consists of eight HDF files, including the radiance data of each pixel at nine different observation angles. To observe the multi-band and multi-angle remote sensing images intuitively, a visualization of the DPC data is shown in Figure 4, generated with ENVI, a professional remote sensing processing software package. This image is part of the data for the seventh observation angle of an orbit at the 670 nm band.


Data Processing
The data used in the experiment is not the original radiance data, but the top-of-atmosphere reflectance of each pixel at a specific observation angle and band, obtained through the reflectance calculation formula [27] shown in Equation (3):

R = πI / (E0 cos θ0)   (3)

where I is the normalized radiance, that is, the original data value; E0 is the solar incident irradiance; and θ0 is the solar zenith angle. Both I and E0 depend on the band and the observation angle.
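As a sketch, the formula can be computed as follows. The exact form of Equation (3) is reconstructed here from the stated symbols, assuming the standard top-of-atmosphere reflectance expression; verify against [27] before use:

```python
import numpy as np

def toa_reflectance(I, E0, theta0_deg):
    """Reconstructed Equation (3): R = pi * I / (E0 * cos(theta0)).
    I: normalized radiance (original data value),
    E0: solar incident irradiance,
    theta0_deg: solar zenith angle in degrees."""
    return np.pi * I / (E0 * np.cos(np.radians(theta0_deg)))

# Sanity check: with I = 1, E0 = pi and theta0 = 0, the reflectance is 1
print(toa_reflectance(1.0, np.pi, 0.0))  # 1.0
```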
As shown in Figure 4, there is much invalid data (the black field) in the DPC imagery, which would deteriorate the training precision of the model. Therefore, it is necessary to eliminate the regions containing invalid data. The data is cropped into small patches to make full use of the valid regions; the patch size we selected is 32 × 32.
The DPC data has eight bands and nine observation angles, so the data is 3D data containing 72 channels, as shown in Figure 5.
Considering the dual attention channel structure of the proposed model, we split the data by integrating the data with the same observation angle and the same band together, as given in Figure 6. The original data is divided into two groups. The upper group is eight 3D data blocks, representing eight bands with a size of 32 × 32 × 9, in which 32 × 32 denotes the spatial dimensions and nine corresponds to the nine angles. The lower group has a similar composition; the difference is that the data blocks are organized as nine angles, each with eight bands. Then, the two groups are fed into the proposed network as channel dimensions for training. Figure 6 shows the data preprocessing method.
Figure 6. Format of the data fed into the network; blue represents bands as the channel and yellow represents angles as the channel.
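This split can be sketched in NumPy. The (band, angle, H, W) layout of the raw patch below is an assumption for illustration:

```python
import numpy as np

# A 32 x 32 DPC patch with 8 bands and 9 observation angles (layout assumed)
patch = np.random.rand(8, 9, 32, 32)     # (band, angle, H, W)

# Upper group: bands as channels -> eight 32 x 32 x 9 blocks
by_band = patch.transpose(0, 2, 3, 1)    # (8, 32, 32, 9)

# Lower group: angles as channels -> nine 32 x 32 x 8 blocks
by_angle = patch.transpose(1, 2, 3, 0)   # (9, 32, 32, 8)

print(by_band.shape, by_angle.shape)
```

Both groups contain exactly the same 72 values per pixel; only the axis treated as the channel dimension differs.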

Data Augmentation
Due to the recent launch of Gaofen-5, the related cloud detection products are not yet mature, and the cost of manual data annotation is so high that the data is insufficient for network training. There were only 14 orbits of labeled data in the experiment, so data augmentation [14] was needed to increase the data. In this experiment, horizontal flips, vertical flips, and diagonal mirror images were used to triple the amount of original data, which alleviated over-fitting to some extent and improved the accuracy on the test data.
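The three augmentations can be sketched as follows; reading "diagonal mirror" as transposition is our assumption:

```python
import numpy as np

patch = np.arange(16).reshape(4, 4)

h_flip = np.fliplr(patch)   # horizontal flip (mirror left-right)
v_flip = np.flipud(patch)   # vertical flip (mirror up-down)
d_mirror = patch.T          # diagonal mirror (transpose), an assumed reading

# Three augmented copies per patch triple the size of the original dataset
augmented = [h_flip, v_flip, d_mirror]
print(len(augmented))  # 3
```

The same transform must also be applied to the label mask so that pixels and annotations stay aligned.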

Overall Framework of Network
In this section, we introduce our model in detail. The proposed network flow chart is shown in Figure 7. The input data is processed into the two shapes shown in Figure 6; we take the bands and observation angles as channels, respectively. The encoder-decoder network is an improvement on the 3D U-net, in which a layer of SE-net is added at the beginning to conduct the channel attention mechanism, for the purpose of emphasizing crucial channel information in the input data. This channel attention mechanism is also appended before each pooling operation to amplify the characteristic effects offered by each channel. After a series of encoder-decoder operations, the output is reshaped to a new uniform size for the convenience of the subsequent fusion operation. In the fusion stage, the reshaped features of the two branches are merged by taking the element-wise maximum. Finally, Softmax is used to classify each pixel for the final detection result.
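The final fusion stage can be sketched as follows, with toy per-pixel class scores and illustrative shapes:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

# Per-pixel class scores from the two attention branches (2 classes, 4 x 4 patch)
band_branch = np.random.rand(2, 4, 4)
angle_branch = np.random.rand(2, 4, 4)

fused = np.maximum(band_branch, angle_branch)  # element-wise maximum fusion
probs = softmax(fused, axis=0)                 # Softmax over the class axis
mask = probs.argmax(axis=0)                    # per-pixel class: 1 cloud, 0 background
print(mask.shape)                              # (4, 4)
```

Maximum fusion keeps, for each pixel and class, whichever branch responded more strongly, which matches the ablation finding that concatenation can dilute the stronger angle branch.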

The proposed network is a dual-channel attention mechanism network that takes bands and observation angles as channels, respectively. They are individually fed into the improved 3D U-net network for training. In this way, the influence of both bands and observation angles on cloud detection is considered. Meanwhile, different bands and angles are assigned different weights with the help of the dual-channel attention.

Improvement of 3D U-Net
The backbone network in this paper is the 3D U-net, developed from the 2D U-net by replacing all 2D operations with 3D operations. Unlike the original 3D U-net, we add a dropout layer and an SE-net to each down-sampling layer. Dropout regularization is added after each convolution layer to alleviate overfitting of the network. After regularization, each layer is input to SE-net for the channel attention mechanism. The pooling layer uses max pooling. Batch normalization is also introduced for the up-sampling convolution kernels on each layer. Specific parameter settings for each module are listed in Table 1. Note that the channel attention mechanism is added both at the input and at the down-sampling stage of each layer.

Loss
We adopt the binary cross entropy [14] as the loss function, given as follows:

Loss = −[x log(y) + (1 − x) log(1 − y)]   (4)

where x represents the label of the sample (the cloud area, as the positive class, is 1; the background, as the negative class, is 0), and y is the predicted probability of the positive class.
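A direct NumPy implementation of this loss, averaged over pixels, with a small epsilon clip added for numerical stability:

```python
import numpy as np

def binary_cross_entropy(x, y, eps=1e-7):
    """x: ground-truth labels (1 = cloud, 0 = background);
    y: predicted probabilities of the positive (cloud) class."""
    y = np.clip(y, eps, 1 - eps)  # avoid log(0)
    return -np.mean(x * np.log(y) + (1 - x) * np.log(1 - y))

labels = np.array([1.0, 0.0, 1.0, 0.0])
preds = np.array([0.9, 0.1, 0.8, 0.2])
print(round(binary_cross_entropy(labels, preds), 4))  # 0.1643
```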

Experiments
In this section, to evaluate the cloud detection performance of the proposed multi-input dual-attention mechanism network, we conducted extensive experiments on DPC datasets collected by Gaofen-5. Specifically, we first introduce the experimental settings and evaluation indicators. Then, the performance of the proposed modules and network variants is discussed. Finally, we compare its performance with some standard methods.

Experimental Settings
During training, all layers of the entire network were adjusted using the Adam optimizer with a learning rate of 1 × 10^−4. The batch size and number of epochs were set to 32 and 100, respectively.
Five indexes were computed to evaluate the model: pixel accuracy (PA), category pixel accuracy (CPA), recall, F1, and IoU. The specific calculation formulas are exhibited in Table 2, where TP is the number of positive-sample pixels correctly identified as positive, TN is the number of negative-sample pixels correctly identified as negative, FP is the number of negative-sample pixels misidentified as positive, and FN is the number of positive-sample pixels misidentified as negative. PA indicates the accuracy of the model, that is, the proportion of correctly identified pixels to the total number of pixels. CPA represents the proportion of truly positive samples among the samples recognized as positive by the model. Recall represents how much of the actual positive samples the classifier can predict. F1 treats CPA and recall as equally important: it is the harmonic mean of CPA and recall. IoU is the ratio of the intersection of the real and predicted values to their union.
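The five formulas summarized in Table 2 reduce to the following (pure Python; the counts used in the example call are illustrative):

```python
def segmentation_metrics(tp, tn, fp, fn):
    pa = (tp + tn) / (tp + tn + fp + fn)        # pixel accuracy
    cpa = tp / (tp + fp)                        # category pixel accuracy (precision)
    recall = tp / (tp + fn)                     # fraction of actual positives found
    f1 = 2 * cpa * recall / (cpa + recall)      # harmonic mean of CPA and recall
    iou = tp / (tp + fp + fn)                   # intersection over union
    return pa, cpa, recall, f1, iou

print(segmentation_metrics(tp=90, tn=80, fp=10, fn=20))
```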

Ablation Experiments
To verify the feasibility of the dual-channel attention mechanism with multi-scale input, ablation experiments were performed on the detection precision of the different modules and fusion strategies. The benchmark models and strategies are 3D U-net, 3D U-net + band attention (3D U-net + BA), 3D U-net + angle attention (3D U-net + AA), concatenation fusion, and maximum fusion. The experimental comparison results are shown in Table 3. Based on the reference network 3D U-net, both the band-based and the angle-based channel attention modules improve the representation ability of the features extracted by the network; compared with the benchmark network, the accuracy is significantly improved. However, the angle attention module performs better than the band attention module. We consider that, due to the extensive imaging range of remote sensing images, the angle information is more conducive to improving cloud detection accuracy than the band information: data observed from different viewing angles leads to cloud boundary deviation. The angle-based channel attention mechanism focuses on learning the feature channels with a slight deviation from the actual value, improving the accuracy of cloud detection. We can also see that band attention has the largest gap from the proposed network, while angle attention is close to our performance, which indicates that angle information is more advantageous than band information in this task. The concatenation fusion mechanism weakens the results of the angle attention. Therefore, we chose maximum fusion to obtain more valuable information from the bands and angles in subsequent experiments.

Comparative Experiments with Other Methods
To prove the performance and detection effects, we carried out experiments comparing the proposed model with other well-known models. The numerical results are demonstrated in Table 4, and the visual results are exhibited in Figure 8. In Figure 8, gray represents correctly detected cloud pixels, black represents the background, and red represents misclassified pixels. In the ground truth, white represents cloud and black represents the background.
As can be seen from Figure 8, among the deep learning methods, FCN produces the most misclassified pixels overall. Seg-net and U-net have fewer mismarks than FCN and PSP-net. The result of the proposed method is closest to the ground truth, with fewer misclassifications, especially for the pixels marked with yellow boxes. Meanwhile, our model preserves more texture features, making it visually superior to the other methods. This experiment shows that the comprehensive utilization of various information (bands and angles) enables the proposed method to fulfill cloud detection with high performance.
We can verify these observations with the quantitative analysis in Table 4. According to the evaluation indicators PA, CPA, recall, F1, and IoU, our method outperformed these benchmarks. It is worth noting that our method achieves a recall of nearly 98%, showing outstanding performance in cloud area detection. F1, the trade-off between recall and CPA, is also significantly improved, which indicates that the proposed model predicts both cloud and non-cloud regions well. The IoU is 2.4% higher than that of U-net, the best performing of the other four methods, showing that we also achieve better results at the prediction stage. Regarding efficiency, however, our prediction time on the two orbital images is almost twice that of the other methods, because two inputs are fed into the network, 3D convolution involves more complex tensor operations than 2D convolution, and an attention mechanism is added. Although the prediction time increases, the prediction accuracy is greatly improved. Table 5 shows the detection results on images dominated by background pixels, which can be regarded as the detection effect on small clouds. The value of recall changes greatly because, when background pixels dominate, non-background pixels are often predicted as background; a high recall means a low probability of such misses. The other values change little overall because the targets are all small clouds, so any misjudgment affects only a few pixels. The table also shows that our method remains superior to the comparison methods on every metric.
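For reference, the five evaluation indicators used in Tables 4 and 5 can be computed from the binary confusion counts as below. This is a minimal sketch following the standard binary-segmentation definitions, with CPA taken as cloud precision (consistent with F1 being the trade-off between recall and CPA above); it is not the authors' evaluation code.

```python
def cloud_metrics(pred, truth):
    """PA, CPA, recall, F1 and IoU for flat binary masks (1 = cloud, 0 = background)."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, truth))  # cloud hits
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, truth))  # background hits
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, truth))  # false alarms
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, truth))  # missed clouds
    pa = (tp + tn) / len(truth)             # pixel accuracy over all pixels
    cpa = tp / (tp + fp)                    # category pixel accuracy (cloud precision)
    recall = tp / (tp + fn)                 # fraction of true cloud pixels found
    f1 = 2 * cpa * recall / (cpa + recall)  # harmonic mean of CPA and recall
    iou = tp / (tp + fp + fn)               # intersection over union for cloud
    return pa, cpa, recall, f1, iou
```

Note how a background-dominated scene (Table 5) moves recall most strongly: a handful of missed cloud pixels barely changes PA (the denominator is all pixels) but directly shrinks `tp / (tp + fn)`.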
Compared with the other deep learning models, our method achieves better quantitative performance than the benchmarks, which benefits from our dual attention focusing on both angle and band information. Its results are closest to the actual values, indicating the universality and effectiveness of our method on such data. The cloud detection results on the Gaofen-5 dataset show that our network can reliably extract cloud information from remote sensing images and outperforms the baseline CNN-based methods. Its strong semantic segmentation capability for this kind of data makes it well suited to cloud detection in remote sensing imagery.

Conclusions and Future Work
This paper presents a deep learning method for cloud detection in remote sensing images based on multi-scale input and dual attention. It fully utilizes the characteristics of the data structure and gives sufficient consideration to both band information and angle information. 3D U-net was used as the basic network to combine high-level semantic information with low-level spatial information to generate cloud boundaries correctly; its encoder-decoder structure helps restore the original resolution. The fine detection precision on this dataset indicates that the proposed network could be applied to remote sensing image data of the same type from other satellites.
These experimental results prove the high precision of our method, but the work was also limited by the amount of data available, which we had to augment. In future work, we will try to address the problems of small datasets and complex manual annotation by using unsupervised or semi-supervised learning.
Author Contributions: H.L. designed and completed the experiments and drafted the manuscript. Q.Y. provided the research ideas and modified the manuscript. J.Z. and L.X. put forward the improvement suggestions for the experiment and the manuscript. All the authors assisted in writing and improving the paper. All authors have read and agreed to the published version of the manuscript.