Cloud Detection for Satellite Imagery Using Attention-Based U-Net Convolutional Neural Network

Cloud detection is an important and difficult task in the pre-processing of satellite remote sensing data. The results of traditional cloud detection methods are often unsatisfactory in complex environments or in the presence of various noise disturbances. With the rapid development of artificial intelligence technology, deep learning methods have achieved great success in many fields such as image processing, speech recognition, and autonomous driving. This study proposes a deep learning model suitable for cloud detection, Cloud-AttU, which is based on the U-Net network and incorporates an attention mechanism. The Cloud-AttU model adopts a symmetric encoder-decoder structure that fuses high-level and low-level features through skip-connections, so that the output contains richer multi-scale information. This symmetric network structure is concise and stable, and significantly enhances image segmentation. Based on the characteristics of cloud detection, the model is improved by introducing an attention mechanism that allows the model to learn more effective features and distinguish between cloud and non-cloud pixels more accurately. The experimental results show that the proposed method has a significant accuracy advantage over traditional cloud detection methods. It also achieves good results in the presence of snow/ice disturbance and other bright non-cloud objects, showing strong resistance to interference. The excellent performance of the Cloud-AttU model on cloud detection tasks indicates that this symmetric network architecture has great potential for application in satellite image processing and deserves further research.


Introduction
With the development of remote sensing technology, a large number of high-resolution satellite data have been obtained, which can be applied to land cover monitoring, marine pollution monitoring, crop yield assessment, and other fields [1][2][3][4]. However, the presence of clouds, especially thick ones, can contaminate captured satellite images and interfere with the observation and identification of ground objects [5][6][7]. As a result, clouds create many difficulties for tasks such as target identification and trajectory tracking. On the other hand, clouds are the most uncertain factor in Earth's climate system. It is estimated that clouds cover about 67% of the Earth's surface [8,9]. Clouds can affect the energy and water cycles of global ecosystems at multiple scales by influencing solar irradiation transmission and precipitation, and thus have a significant impact on climate change. Recent work has applied a convolutional neural network to the task of multi-sensor cloud and cloud shadow segmentation; the reported results show that such a model performs well across multiple satellite sensors with excellent generalization, and that adding shortwave-infrared bands can further improve the accuracy of cloud and cloud shadow segmentation. In cloud detection tasks, the number and distribution of clouds tend to exhibit very complex randomness. To achieve accurate cloud detection, attention should be focused on the areas containing clouds during the detection process. In fields such as medical image classification and object detection, the attention mechanism is a very effective method [35][36][37][38] that can allocate more processing resources to the target. The attention mechanism originates in human visual cognition: when reading text or looking at objects, humans tend to pay more attention to detailed information about the target and suppress other useless information.
Similarly, the basic idea of the attention mechanism is that the model learns to focus on important information and ignore unimportant information. Some studies have shown that attention mechanisms can improve classification performance [39][40][41][42]. Therefore, attention mechanisms from computer vision should be introduced into the construction of cloud detection models. In this study, the Cloud-AttU model, which introduces an attention mechanism on the basis of U-Net, is proposed. In the experiments, the cloud detection results improved significantly, as the attention mechanism guided the model to learn more cloud-related features.
The structure of this paper is as follows: the U-Net architecture, the attention mechanism, and the Cloud-AttU model proposed in this paper are described in Section 2. Subsequently, the experiment design and results are given in Section 3. The results of the experiment are discussed in Section 4. Finally, Section 5 summarizes the work of this paper and lists the work to be continued in the future.

Methodology
To better implement cloud detection, this study applies the attention mechanism to the U-Net network. This improvement allows the model to better focus on areas with clouds and ignore areas without clouds during the cloud detection process, ultimately improving the accuracy of cloud detection. In the following, we describe the U-Net architecture, the attention mechanism, and the Cloud-AttU method proposed in our study, respectively.

U-Net Architecture
U-Net [27] was developed on the basis of the Fully Convolutional Network architecture [43] and was first applied to biomedical image segmentation in 2015. U-Net uses a symmetric encoder-decoder structure and is one of the most popular methods for semantic segmentation of medical images. The basic U-Net architecture is shown in Figure 1, where the left half is the encoding path and the right half is the decoding path. Specifically, each block in the encoding path contains two convolutional layers and a max-pooling layer. At each down-sampling step, the number of feature channels is doubled, while max-pooling reduces the size of the feature map. Each block in the decoding path contains an up-sampling operation, a fusion operation, and two convolutional layers. As shown in Figure 1, the skip-connection transfers information from the encoding path to the decoding path, thereby improving the ability of U-Net to segment the image. Finally, a 1 × 1 convolution is applied to the multi-channel feature maps to obtain the segmentation result.

Figure 1. The basic U-Net architecture [27]. Green/yellow boxes indicate multi-channel feature maps; red arrows indicate 3 × 3 convolution for feature extraction; cyan arrows indicate skip-connection for feature fusion; downward orange arrows indicate max pooling for dimension reduction; upward orange arrows indicate up-sampling for dimension recovery.
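The encoder and decoder steps described above can be sketched in PyTorch as follows. This is a minimal illustration, not the exact implementation of [27]: padded 3 × 3 convolutions are used so that feature-map sizes match exactly at the skip-connection, and the channel counts (4 input bands, 64/128 features) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU, as in a basic U-Net block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# One encoder step: double conv, then 2x2 max pooling; channels double each step.
enc1 = DoubleConv(4, 64)            # e.g. 4 input spectral bands
pool = nn.MaxPool2d(2)
enc2 = DoubleConv(64, 128)

x = torch.randn(1, 4, 64, 64)
f1 = enc1(x)                        # (1, 64, 64, 64) - kept for the skip-connection
f2 = enc2(pool(f1))                 # (1, 128, 32, 32) - half size, double channels

# One decoder step: up-sample, concatenate with the skip feature, double conv.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
dec = DoubleConv(128, 64)
d1 = dec(torch.cat([up(f2), f1], dim=1))   # back to (1, 64, 64, 64)
```

The concatenation is the skip-connection fusion step: low-level encoder features and up-sampled decoder features are combined before the next convolution.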

Proposed Network Architecture
In this section, we describe in detail the proposed Cloud-AttU model for cloud detection tasks. The architecture is obtained by modifying the original U-Net model and introducing an attention mechanism. The overall structure of the Cloud-AttU model is shown in Figure 2. Like the original U-Net, Cloud-AttU consists of two main paths: a contracting path that encodes the entire input image, and an expanding path that gradually restores the feature maps to their original size through step-wise up-sampling. Each step of the encoder consists of a structured-dropout convolutional block and a 2 × 2 max-pooling operation. As shown in Figure 2, each convolutional layer is followed by a DropBlock, a batch normalization (BN) layer, and a rectified linear unit (ReLU). Max pooling with a stride of 2 is then applied for down-sampling, and the number of feature channels is doubled at each down-sampling step. Each step of the decoder consists of a 2 × 2 transposed convolution that performs up-sampling, a concatenation with the matching feature map of the encoder, followed by a structured-dropout convolutional block. Finally, a 1 × 1 convolution and a Sigmoid activation function are used to yield the final segmentation map.
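The Conv-DropBlock-BN-ReLU block described above might be sketched as below. Note this is an approximation: DropBlock is not part of the PyTorch core library, so `nn.Dropout2d` (channel-wise structured dropout) stands in for it here, and the drop probability is an illustrative assumption.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, drop_p=0.15):
    """Conv -> structured dropout -> BN -> ReLU, applied twice.
    nn.Dropout2d is used as a stand-in for DropBlock, which is not
    available in core PyTorch."""
    layers = []
    for ic in (in_ch, out_ch):
        layers += [
            nn.Conv2d(ic, out_ch, kernel_size=3, padding=1),
            nn.Dropout2d(drop_p),       # stand-in for DropBlock
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

block = conv_block(4, 64)
y = block(torch.randn(2, 4, 96, 96))    # spatial size preserved by padding=1
```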
In our proposed Cloud-AttU model, the encoder shares its feature maps with the decoder through skip connections. Through these connections from the lower to the higher layers, the model has more information with which to carry out its task. However, too much invalid or interfering information can degrade the performance of the model. The attention module enables the network to learn which parts of the input image are useful and which features it should focus on to accomplish its task. We weight the feature maps using an attention gate [44] to emphasize relevant features and suppress invalid ones. In attention gates, the attention coefficient α highlights the signal in the target region and weakens the background signal. Thus, attention gates do not need to crop the region of interest explicitly within the network, yet still suppress information from irrelevant background regions. The architecture of the attention gate used here is shown in Figure 3. The attention gate has two inputs: the feature maps (g) from the decoder and the feature maps (f) from the encoder. The feature maps (g) are employed as a gating signal to guide the learning of the feature maps (f). In summary, this gating signal (g) makes it possible to extract more useful information from the encoded features (f) while ignoring invalid features. The two inputs are merged pixel-by-pixel after convolution operations (W_g, W_f) and batch normalization (b_g, b_f), respectively. The merged result is then activated using the rectified linear unit (ReLU, σ1(x) = max(0, x)).
For this result, we apply a further convolution operation (W_θ) and batch normalization (b_θ). Next, the sigmoid activation function σ2(x) = 1/(1 + e^(−x)) is used to train the parameters in the gate and obtain the attention coefficient (α). Finally, the output of the attention gate is obtained by multiplying the encoder feature by the attention coefficient (α). The feature selection process in the attention gate can be expressed by the following formula:

α = σ2( W_θ( σ1( W_g g + b_g + W_f f + b_f ) ) + b_θ ),    f' = α · f

For image segmentation, the overall performance of the segmentation model is affected both by the architecture of the network and by the loss function [45]. If a different loss function is used during training, the resulting model may differ significantly, especially for problems with severe class imbalance, so choosing a suitable loss function is challenging. In many cases, only a small fraction of the pixels in satellite remote sensing images are clouds, and the ratio of cloud pixels to background pixels varies widely. If the chosen loss function does not adequately account for the properties of cloud pixels, the model is prone to misjudging background regions as cloud pixels during learning, resulting in incorrect segmentation. In the Cloud-AttU network, we choose the cross-entropy loss function to train the model and optimize it with the Adam algorithm [46,47], obtaining good results. In future studies, we will investigate other loss functions and optimize the current one. The cross-entropy loss function is defined as

L = −(1/n) Σᵢ [ yᵢ log ŷᵢ + (1 − yᵢ) log(1 − ŷᵢ) ]

where n represents the number of pixels in each image, yᵢ represents the ground truth, and ŷᵢ is the prediction result.
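The gating computation described above can be written as a small PyTorch module. This is a minimal sketch of an additive attention gate in the spirit of [44], not the authors' exact code; the channel sizes are illustrative, and the gating signal is assumed to already match the encoder feature's spatial size.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate: encoder feature f and decoder gating signal g
    are projected by 1x1 convolutions (W_f, W_g) with BN, summed, passed
    through ReLU (sigma_1), projected again (W_theta) and squashed by a
    sigmoid (sigma_2) to give the attention coefficient alpha, which
    rescales f."""
    def __init__(self, f_ch, g_ch, inter_ch):
        super().__init__()
        self.W_f = nn.Sequential(nn.Conv2d(f_ch, inter_ch, 1), nn.BatchNorm2d(inter_ch))
        self.W_g = nn.Sequential(nn.Conv2d(g_ch, inter_ch, 1), nn.BatchNorm2d(inter_ch))
        self.W_theta = nn.Sequential(nn.Conv2d(inter_ch, 1, 1), nn.BatchNorm2d(1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f, g):
        alpha = torch.sigmoid(self.W_theta(self.relu(self.W_f(f) + self.W_g(g))))
        return f * alpha        # encoder features rescaled by attention coefficients

gate = AttentionGate(f_ch=64, g_ch=64, inter_ch=32)
f = torch.randn(1, 64, 48, 48)  # encoder feature map
g = torch.randn(1, 64, 48, 48)  # decoder gating signal (already up-sampled)
out = gate(f, g)                # same shape as f
```

Because α lies in (0, 1) per pixel, background regions are attenuated rather than hard-masked, which matches the soft suppression described in the text.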

Datasets and Preparation
To train and test the proposed model, we used a dataset derived from Landsat 8 satellite data. Landsat 8 is a satellite in the Landsat series launched by the National Aeronautics and Space Administration (NASA) on 11 February 2013, with a finer band division than the previous Landsat satellites. The Landsat 8 satellite carries two sensors, the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS), and its data supports remote sensing applications in many fields. This study uses Landsat 8 OLI data, which contains nine spectral bands. The spatial resolution for bands 1 to 7 and 9 is 30 m, while that of band 8 is 15 m. The information for all bands of the OLI on Landsat 8 is shown in Table 1. In this study, four of these bands, Band 2 to Band 5, were utilized, as they belong to the common bands provided by most remote sensing satellites, such as Sentinel-2, FY-4, GF-4, etc. The dataset used in this study is the Landsat-Cloud dataset, originally created by Mohajerani et al. [48,49] from Landsat 8 OLI scenes. The dataset contains 38 scenes, of which 18 are for training and 20 for testing. The ground truths of these scenes were extracted manually. A single spectral band of a Landsat 8 scene is very large, about 9000 × 9000 pixels. It is impractical to train a fully convolutional network directly on such a large input, as it would require too many parameters and convolutional layers. To overcome this problem, the large input images need to be cropped into multiple smaller patches. Therefore, each spectral band was cut into 384 × 384 non-overlapping segments. The total number of patches for the training set and test set was 8400 and 9201, respectively.
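A minimal sketch of this cropping step is shown below, assuming a single band stored as a 2-D array; partial tiles at the right and bottom edges are simply discarded (the paper does not state how edges were handled, so this is an assumption).

```python
import numpy as np

def crop_patches(band, patch=384):
    """Cut a single spectral band into non-overlapping patch x patch tiles,
    discarding any partial tiles at the right/bottom edges."""
    h, w = band.shape
    rows, cols = h // patch, w // patch
    return [band[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            for r in range(rows) for c in range(cols)]

# A hypothetical 9000 x 9000 band yields 23 x 23 = 529 full 384 x 384 patches.
band = np.zeros((9000, 9000), dtype=np.uint16)
patches = crop_patches(band)
```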
In data pre-processing, we considered various data augmentation methods, including horizontal flipping, random rotation, and random scaling. In this study, we applied geometric translations to randomly augment the input patches.

Training Methodology
In this study, all experiments were implemented on Ubuntu 16.04 using the PyTorch framework and trained on an NVIDIA GTX 1080 Ti GPU, with PyCharm as the development environment.
In the experiment, we used the 384 × 384 pixel images of the Landsat-Cloud dataset as input to the neural network. We set the training batch size to 4 and the maximum number of training epochs to 50. We used the Adam optimizer, with a learning rate of 0.01 for the first 40 epochs and 0.005 for the last 10 epochs. The network weights were initialized from a uniform random distribution over [−1, 1]. In addition, we applied random rotation (0°, 90°, 180°, 270°) as a data augmentation method before each training epoch.
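The two-stage learning-rate schedule above can be expressed as a simple piecewise-constant adjustment of the Adam optimizer's parameter groups. The placeholder model below is an assumption for illustration only.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(4, 1, kernel_size=3, padding=1)   # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def adjust_lr(optimizer, epoch):
    """Piecewise-constant schedule from the text: lr 0.01 for
    epochs 0-39, then 0.005 for epochs 40-49."""
    lr = 0.01 if epoch < 40 else 0.005
    for group in optimizer.param_groups:
        group['lr'] = lr
    return lr

# Applying the schedule over the 50 training epochs.
schedule = [adjust_lr(optimizer, e) for e in range(50)]
```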

Evaluation Metrics
In the experiment, each completed cloud mask of a full Landsat 8 scene was compared to the corresponding ground truth (GT). The cloud mask assigns each pixel to either the cloud or the non-cloud category. The performance of the proposed model was measured quantitatively using the Jaccard index, Precision, Recall, Specificity, and Overall Accuracy, defined as follows:

Jaccard Index = TP / (TP + FN + FP)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)
Overall Accuracy = (TP + TN) / (TP + TN + FP + FN)

where FP, FN, TN, and TP represent the total number of false positive, false negative, true negative, and true positive pixels, respectively. Since the Jaccard index is related to both recall and precision, it can measure the similarity between the two sets of images; it is therefore widely used to measure the performance of image segmentation algorithms [49,50].
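The metric definitions above can be computed directly from a pair of binary masks, as in this short sketch (1 = cloud, 0 = non-cloud):

```python
import numpy as np

def cloud_metrics(pred, gt):
    """Evaluation metrics from binary cloud masks (1 = cloud, 0 = non-cloud)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    return {
        "jaccard":     tp / (tp + fn + fp),
        "precision":   tp / (tp + fp),
        "recall":      tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
    }

# Tiny worked example: one true positive, one false positive, two true negatives.
pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [0, 0]])
m = cloud_metrics(pred, gt)   # tp=1, fp=1, fn=0, tn=2
```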

Experimental Results
The designed neural network model was trained on the training dataset, and the unseen test scenes were predicted using the trained model. We evaluate the performance of the model through multiple sets of experiments. Figure 4 shows the cloud detection results for different kinds of test scenes: the first row shows the RGB images, the second row the ground truths, and the third row the proposed model's predictions. As can be seen from Figure 4, our proposed method can detect clouds regardless of whether the background is bright or dark, and the segmentation results are very close to the real cloud distribution. Furthermore, the method is also able to distinguish clouds from snow/ice and obtain segmentation results that are very close to the real scenes. These results show that the combination of the U-Net architecture and the attention mechanism is effective and that this new neural network architecture performs well in cloud detection tasks. To further validate the proposed architecture, we compare it with FCN [51], Fmask [16], the original U-Net architecture, and Cloud-Net [48]. Table 2 shows the experimental results of the different methods on the Landsat-Cloud dataset. As shown in Table 2, the Jaccard index of Cloud-AttU is 4.5% higher than that of FCN and 3.8% higher than that of Fmask. Fmask is the most widely used cloud detection method at present, with an overall accuracy rate of 94.26%; as the table shows, our proposed model exceeds its performance. Cloud-AttU is also superior to the advanced U-Net and Cloud-Net methods, suggesting that incorporating attention mechanisms into the U-Net architecture for cloud detection is very effective.
Since all models were trained on the same training set and tested on the same test set, these results demonstrate the clear superiority of the proposed architecture; in particular, the Cloud-AttU and Cloud-Net networks have significant advantages over the other methods. To visually analyze the performance of our proposed method, Figure 5 compares the cloud detection results of the Cloud-AttU and Cloud-Net networks. As shown in Figure 5, both methods yield good detection results over a general land background, with Cloud-AttU outperforming Cloud-Net. In addition, where there is ice and snow on land, the Cloud-Net model is susceptible to this interference, while Cloud-AttU is more resistant. As shown in the third row of Figure 5, the Cloud-AttU network identifies ice and snow without misclassifying them as clouds, whereas the Cloud-Net network misclassifies ice and snow on the ground as clouds. These results indicate that the Cloud-AttU model has excellent cloud detection capability and strong anti-interference capability against complex ground backgrounds, and also that cloud detection in general is susceptible to interference from factors such as ice and snow. To further analyze the performance of the Cloud-AttU model under snow/ice disturbance, we conducted several cloud detection experiments in ice and snow environments, running the same experiments with the Cloud-Net model for comparison. Figure 6 shows a portion of the experimental results, where the first column shows the RGB images.
As shown in Figure 6, the snow/ice regions that cause interference are marked with red boxes, which helps analyze the performance of the two cloud detection methods in detail. As can be seen, the Cloud-AttU model can distinguish clouds from snow/ice in complex backgrounds, whereas Cloud-Net distinguishes them poorly and treats snow/ice as clouds in many cases. From these results, it can be concluded that the proposed Cloud-AttU model can resist interference from factors such as ice and snow in complex environments; it therefore has great potential and deserves further optimization for early deployment in operational applications. Snow/ice is not the only source of interference: factors such as surface rivers, bright surfaces, and lakes can also disturb cloud detection. To analyze the model's ability to discriminate against these confounding factors, we added further validation cases. Figure 7 shows the cloud detection results under the interference of surface rivers, bright surfaces, and similar factors; even under such interference, the proposed Cloud-AttU model still performs well. These results indicate that the model is highly resistant to multiple types of interference, with good application prospects, and merits further study. At the same time, the consumption of computing resources during training is a key consideration, where the time required to train the neural network model is an important parameter. In this study, training took about 7.2 h, and the size of the saved model is about 430 MB.
In this study, because our primary focus was the combination of the attention mechanism and the U-Net architecture, insufficient attention was paid to training techniques, resulting in less efficient training. Our future work will therefore focus on accelerating training, for example by reducing the amount of training data and introducing appropriate regularization methods. By continuously optimizing the model structure and training methods, we aim to apply the model in operational settings.

Discussions
In this study, we add an attention mechanism to the U-Net network, adjust the network architecture according to the characteristics of the satellite data, choose an appropriate loss function for optimization, and finally obtain a neural network model suitable for cloud detection of satellite data. We compared the model with other well-established models, and the following discussions are obtained from the above experimental results.

(1) From the experimental results, we find that the U-Net network, the Cloud-Net network, and the Cloud-AttU network based on the U-Net architecture are significantly better than Fmask. The U-Net network adopts the symmetric Encoder-Decoder structure, which achieves the fusion of high-level features and low-level features through the skip-connection operation, making the output results contain richer multi-scale information. This symmetrical network structure is concise and stable, significantly enhancing the effect of image segmentation. The results of this study demonstrate the good performance of the U-Net architecture in cloud detection tasks, indicating that this symmetrical network architecture, which fuses multi-scale information, has great potential for applications in satellite image processing and deserves further research.

(2) From the experimental results, it was found that U-Net with the attention mechanism can achieve better cloud detection results than the original U-Net. This performance boost should benefit from the attention gate. In the attention module, the output is obtained by multiplying the feature map by the attention coefficient in the attention gate. The attention coefficients tend to take larger values in the clouded region and smaller values in the cloudless region. This mechanism makes the values of the cloudless region of the feature map smaller and the values of the target region larger, thus improving the performance of cloud detection.
From the experimental results, it can be concluded that Cloud-AttU with the attention mechanism has a stronger cloud detection capability compared to Cloud-Net and single U-Net. Cloud-AttU can better resist the interference of snow and ice, and has a stronger identification ability. It is well known that satellite remote sensing data are susceptible to interference from various noises, so data processing methods that are resistant to interference are highly desirable for satellite data. Attentional mechanisms have a clear advantage in recognizing and resisting noise interference, and thus hold great potential and research promise in numerous areas of satellite data processing.

Conclusions
Cloud detection is an important step in the pre-processing of satellite remote sensing data, and accurate cloud detection results are of great significance for improving the utilization of satellite data. With the development of artificial intelligence technology, deep learning methods have made great breakthroughs in the fields of image processing and computer vision. In this study, we propose an effective cloud detection method, the Cloud-AttU neural network model, drawing on deep learning methods. It is a new neural network model based on the U-Net architecture with an attention gate. In designing the model, we modified and optimized the network structure considering the advantages of the U-Net architecture in computer vision and the characteristics of clouds in remote sensing data. The attention gate improves the learning of target regions associated with the segmentation task while suppressing regions not associated with the task; it is therefore integrated into the proposed model to enhance the efficiency of semantic information propagation through the skip connections. We conducted a series of experiments using Landsat 8 data and confirmed that the Cloud-AttU model, which introduces the attention mechanism, works well on the cloud detection task, with performance significantly better than previous methods. The Cloud-AttU model can still achieve excellent cloud detection results in the presence of interfering information such as ice and snow. Given this success, the architecture proposed in this paper can also be applied to other areas of remote sensing imagery with appropriate modifications. In future studies, we will further optimize the proposed model and apply it to other satellite remote sensing data, such as the GF-4 and FY-4A satellites.

Acknowledgments:
The authors would like to thank the reviewers for their very useful and detailed comments, which have greatly improved the content of this paper.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: