Compact Cloud Detection with Bidirectional Self-Attention Knowledge Distillation

Abstract: The deep convolutional neural network has made significant progress in cloud detection. However, the compromise between having a compact model and high accuracy has always been a challenging task in cloud detection for large-scale remote sensing imagery. A promising method to tackle this problem is knowledge distillation, which usually lets the compact model mimic the cumbersome model's output to achieve better generalization. However, vanilla knowledge distillation methods cannot properly distill the characteristics of clouds in remote sensing images. In this paper, we propose a novel self-attention knowledge distillation approach for compact and accurate cloud detection, named Bidirectional Self-Attention Distillation (Bi-SAD). Bi-SAD lets a model learn from itself without adding additional parameters or supervision. With bidirectional layer-wise feature learning, the model obtains a better representation of the cloud's textural and semantic information, so that the cloud's boundaries become more detailed and the predictions become more reliable. Experiments on a dataset acquired by the GaoFen-1 satellite show that our Bi-SAD strikes a good balance between compactness and accuracy, and outperforms vanilla distillation methods. Compared with state-of-the-art cloud detection models, the parameter size and FLOPs are reduced by 100 times and 400 times, respectively, with only a small drop in accuracy.


Introduction
With the rapid development of remote sensing technology, many remote sensing images (RSIs) with high resolution can be obtained easily, and have been widely used in the fields of resource survey, disaster prevention, environmental pollution monitoring, urbanization studies, etc. [1,2]. However, as nearly 66% of the Earth's surface is covered with cloud [3], most RSIs suffer from different levels of cloud contamination, which not only degrades the quality of RSIs, but also results in a waste of storage and downlink bandwidth on the satellite. With on-board cloud detection, the cloud fraction in an image can be calculated, so that cloudy images are removed before transmission and only images with no cloud or a low cloud fraction are transmitted, improving transmission efficiency and data utilization. As a result, it is important to develop compact and accurate cloud detection for practical applications. In this work, we propose a novel bidirectional self-attention distillation (Bi-SAD) method for compact and accurate cloud detection. It obtains more reliable cloud detection results by reinforcing representation learning from the model itself, without additional external supervision. As illustrated in Figure 1b, Bi-SAD performs bidirectional attention map learning, which contains two mimicking flows: forward semantic information learning and backward textural information learning. To obtain more reliable predictions, the semantic information of a lower block mimics the semantic information of a deeper block during the forward procedure. Meanwhile, to obtain more detailed boundaries, the textural information of a deeper block learns from the textural information of the preceding feature block in the backward procedure. By introducing Bi-SAD, the network can strengthen its representations and acquire a significant performance gain.
Above all, the main contributions of this paper can be summarized as follows.
1. We present a novel self-attention knowledge distillation framework, called Bi-SAD, for compact and accurate cloud detection in remote sensing imagery. Compared with other deep learning-based cloud detection methods, our method greatly reduces the parameter size and FLOPs, making it more suitable for practical applications.
2. To enhance the feature learning of the cloud detection framework, we design a bidirectional distillation scheme, composed of backward boundary distillation and forward inner distillation, to obtain detailed boundaries and reliable cloud detection results. Moreover, we systematically investigate the inner mechanism of Bi-SAD and carefully analyze its complexity.
3. We conduct extensive experiments on the GaoFen-1 cloud detection dataset and achieve a good balance between model compactness and accuracy. In addition, the time point of introducing Bi-SAD and the optimization of hyperparameters in the distillation process are carefully studied, which further improves the performance on cloud detection.
The remainder of this paper is organized as follows. In Section 2, we introduce the Bi-SAD framework in detail. In Section 3, we describe the dataset and present experimental results to validate the effectiveness of our proposed method. In Section 4, we discuss the results of the proposed method. Finally, Section 5 gives a summary of our work.

Materials and Methods
In this section, we describe our method in detail. The framework of Bi-SAD is presented in Section 2.1. The generation of the semantic attention map and the textural attention map is discussed in Section 2.2. Section 2.3 presents the specific implementation process of Bi-SAD. Section 2.4 gives the complexity analysis of Bi-SAD.

Overview
The overall framework of our proposed Bi-SAD is shown in Figure 2a. It is a general distillation method and has no strict requirements on the backbone. In this paper, we use ResNet-18 [35] as the backbone to illustrate our method. We denote the conv2_x, conv3_x, conv4_x, and conv5_x of ResNet [35] as block1, block2, block3, and block4, respectively. Figure 2b,c shows where the network focuses and is based on the attention transfer operation [42]. Considering that the boundary area of the attention map contains more textural information and the inner area contains more semantic information, we define the boundary area of the attention map as the textural attention map, and the inner area of the attention map as the semantic attention map. We define Mask B and Mask I as the binary masks of the boundary area and the inner area of the prediction, respectively. The top row of Figure 2a represents the layer-wise top-down attention distillation, where the semantic attention map of a preceding block mimics that of a deeper block, e.g., block3 mimics block4 and block2 mimics block3; the bottom row of Figure 2a represents the layer-wise bottom-up attention distillation, where the textural attention map of a higher block mimics that of a lower block, e.g., block2 mimics block1 and block3 mimics block2. We get our ideas from the following facts. First, when a cloud detection network is trained properly, the attention maps of different layers capture rich and diverse information. Second, directly mimicking the full feature attention map would unavoidably introduce noise from background areas containing snow, buildings, coastlines, roads, etc. Finally, the deeper layers have more powerful semantic information, which is of vital importance for classifying clouds, and the lower layers have more detailed textural information, which is helpful for accurate localization and detailed boundaries.
Considering that it is better to mimic a region-based attention map and integrate the semantic information and textural information in the network learning, we design a bidirectional self-attention distillation.

Generation of Attention Map
As depicted in Figure 2a, for the boundaries of the clouds and the inner area of the clouds, we design a backward learning flow and a forward learning flow, respectively. First, we need to obtain the attention maps through the attention mapping function. Let us denote the activation output of the n-th layer of the network as A_n ∈ R^(C_n × H_n × W_n), where C_n, W_n, and H_n represent the channel, width, and height, respectively. The absolute value of each element in the attention map indicates the importance of that element for the output. As a result, we can design a mapping function by computing statistics of these values along the channel dimension. More specifically, we design the mapping function by summing the squared activations along the channel dimension, and denote it as F^2_sum(·). The generation of the textural and semantic attention maps is summarized in Algorithm 1: for each block i, the textural attention map is T_i = F^2_sum(A_i) · B_i and the semantic attention map is S_i = F^2_sum(A_i) · I_i, where B_i and I_i denote the downsampled boundary and inner masks, and the algorithm returns T and S. In the following, we demonstrate the generation of the textural attention map and the semantic attention map in detail.
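As a minimal sketch of the mapping function described above (function and variable names below are ours, not from the paper), F^2_sum can be implemented in PyTorch by summing the squared activations over the channel dimension:

```python
import torch

def attention_map(activation: torch.Tensor) -> torch.Tensor:
    """F^2_sum(.): collapse an (N, C, H, W) activation tensor to an
    (N, H, W) attention map by summing squared values over channels."""
    return activation.pow(2).sum(dim=1)

# A toy activation standing in for one block's output A_n.
feats = torch.randn(2, 64, 32, 32)
att = attention_map(feats)
print(att.shape)  # torch.Size([2, 32, 32])
```

Because the values are squared before summing, the resulting map is non-negative and highlights spatial positions with strong activations in any channel.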

Textural Attention Map
We use Mask B and the attention map to generate the textural attention map. For each individual image, as shown in Figure 3a, we first use the Laplacian operator to extract the boundaries of the prediction, as shown in Figure 3c. Then, we use morphological expansion to thicken the boundaries and term the result Mask B_0, as shown in Figure 3d. Figure 2c shows the generation of the textural attention map: for block i, we first generate the attention map with the mapping function, and then generate the textural attention map by multiplying the attention map with the downsampled Mask B_i (i.e., Mask B_0 downsampled to match the spatial size of feature map A_i). We can see from Figure 2c that the textural attention map contains not only the boundaries of big clouds, but also some small pieces of clouds; in these areas, the network needs refined textural information to capture the detailed boundaries of clouds.
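A rough NumPy sketch of the Mask B_0 construction, under our own assumptions (a 4-neighbour Laplacian for boundary extraction and an iterated 3x3 dilation standing in for the paper's morphological expansion; the paper does not specify the exact kernels):

```python
import numpy as np

def boundary_mask(pred: np.ndarray, iterations: int = 3) -> np.ndarray:
    """Extract cloud boundaries from a binary prediction with a Laplacian
    filter, then thicken them by iterated 3x3 dilation. `iterations` is
    the expansion hyperparameter studied later (3 was found optimal)."""
    p = np.pad(pred.astype(np.int32), 1)
    # 4-neighbour Laplacian: nonzero exactly at cloud/background edges.
    lap = (4 * p[1:-1, 1:-1] - p[:-2, 1:-1] - p[2:, 1:-1]
           - p[1:-1, :-2] - p[1:-1, 2:])
    mask = (lap != 0).astype(np.uint8)
    h, w = mask.shape
    for _ in range(iterations):  # 3x3 dilation via max over 9 shifts
        m = np.pad(mask, 1)
        mask = np.max(np.stack([m[i:i + h, j:j + w]
                                for i in range(3) for j in range(3)]), axis=0)
    return mask
```

Downsampling this mask to each block's spatial size then yields the Mask B_i used to gate the attention map.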

Semantic Attention Map
We use Mask I and the attention map to generate the semantic attention map. For each single image, as shown in Figure 3a, we subtract Mask B_0 from the prediction to generate Mask I_0, as shown in Figure 3e. Figure 2b shows the generation of the semantic attention map: for block i, similar to the generation of the textural attention map, we generate the semantic attention map by multiplying the attention map with the downsampled Mask I_i (i.e., Mask I_0 downsampled to match the spatial size of feature map A_i). As shown in Figure 2b, the semantic attention map contains the inner area of big clouds; in these areas, the network needs strong semantic information to make a reliable prediction of clouds.
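The inner-mask step above is a simple subtraction; a sketch follows, with a nearest-neighbour downsampler as one possible choice for matching each block's spatial size (the paper does not specify the interpolation method):

```python
import numpy as np

def semantic_mask(pred: np.ndarray, mask_b: np.ndarray) -> np.ndarray:
    """Mask I_0 = prediction minus Mask B_0: the inner cloud region.
    Both inputs are binary {0, 1} maps of the same size."""
    return np.clip(pred.astype(np.int32) - mask_b.astype(np.int32), 0, 1)

def downsample(mask: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour downsampling to a block's feature-map size."""
    return mask[::factor, ::factor]
```

The semantic attention map for block i would then be the block's attention map multiplied elementwise by `downsample(mask_i0, stride_i)`.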

Bidirectional Self-Attention Knowledge Distillation
The whole training procedure can be divided into two stages, i.e., the network training by itself and training with Bi-SAD added. In the former stage, the network does not yet capture useful information very well, and therefore the distillation targets, i.e., the attention maps that earlier layers mimic, are of low quality. Therefore, the network needs to learn by itself first. When the detection network is trained to a reasonable level so that the distillation targets capture useful information, we add Bi-SAD to the training. Here, we consider the network half-trained after 5000 iterations.
We term the forward semantic information learning Inner-SAD, and the backward textural information learning Boundary-SAD. The training procedure of Bi-SAD is summarized in Algorithm 2. In the following, we discuss the training procedure of Bi-SAD in detail.
In order to obtain the semantic information and textural information required by Inner-SAD and Boundary-SAD, we place an attention transformer, termed AT-TRANS, after each of block1, block2, block3, and block4 of the backbone. As shown in Figure 2a, the black dotted line represents the procedure of the attention transformer. There are several operations in the attention transformer. First, we use F^2_sum(A_n) to obtain the 2D attention map from the 3D tensor A_n. Second, as the size of the original attention map differs from that of its target, we utilize bilinear upsampling B(·) to match the spatial dimensions. Then, we obtain the area of interest by multiplying with the downsampled Mask B or downsampled Mask I: in Inner-SAD, we use the downsampled Mask I_n (the Mask I of block n), denoted M_in, to generate the semantic attention map, and in Boundary-SAD, we use the downsampled Mask B_n (the Mask B of block n), denoted M_bn, to produce the textural attention map. Finally, we use the normalization function N(·) to normalize the result. AT-TRANS is therefore represented by the function Ω(A_n, M_jn) = N(M_jn · B(F^2_sum(A_n))), where M_jn represents M_in or M_bn. In Inner-SAD, the semantic attention map S_i of each block mimics the upsampled and normalized S_{i+1} of the deeper block, giving a successive top-down layer-wise distillation loss L_inner = Σ_{n=1}^{N-1} L_d(Ω(A_n, M_in), Ω(A_{n+1}, M_i(n+1))), where L_d is usually defined as an L_2 loss, and Ω(A_{n+1}, M_i(n+1)) is the target of the top-down layer-wise distillation loss.
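A hedged PyTorch sketch of AT-TRANS and the top-down Inner-SAD loss. The function and argument names are ours; we assume the masks are already downsampled to each block's spatial size, and we detach the deeper target so that gradients flow only into the shallower block (a common knowledge-distillation choice that the paper does not state explicitly):

```python
import torch
import torch.nn.functional as F

def at_trans(feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """AT-TRANS sketch: squared-sum attention map, masked by the region
    mask (already at the feature's spatial size), then L2-normalized."""
    att = feat.pow(2).sum(dim=1) * mask              # (N, H, W)
    return F.normalize(att.flatten(1), p=2, dim=1)   # the N(.) step

def inner_sad_loss(block_feats, inner_masks):
    """Top-down Inner-SAD: S_i mimics the upsampled S_{i+1}.
    Lists are ordered block1..block4; inner_masks[i] is Mask I at
    block i's spatial size."""
    loss = torch.zeros(())
    for i in range(len(block_feats) - 1):
        s_i = at_trans(block_feats[i], inner_masks[i])
        # Upsample the deeper attention map to S_i's size (the B(.) step).
        deep = block_feats[i + 1].pow(2).sum(dim=1, keepdim=True)
        deep = F.interpolate(deep, size=block_feats[i].shape[2:],
                             mode='bilinear', align_corners=False)
        target = F.normalize((deep.squeeze(1) * inner_masks[i]).flatten(1),
                             p=2, dim=1).detach()
        loss = loss + F.mse_loss(s_i, target)        # L_d as an L2 loss
    return loss
```

Boundary-SAD would mirror this loop with the boundary masks and the mimicking direction reversed.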

Boundary-SAD
Meanwhile, as shown in Figure 2a, the red dotted line at the bottom represents the flow of backward detailed boundary information learning. During the backward learning procedure, the boundary attention map of a deeper block learns from the boundary attention map of the preceding feature block, e.g., block2 mimics block1 and block3 mimics block2. A successive bottom-up layer-wise distillation loss, whose direction is backward, is formulated as L_boundary = Σ_{n=2}^{N} L_d(Ω(A_n, M_bn), Ω(A_{n-1}, M_b(n-1))), where L_d is usually defined as an L_2 loss, and Ω(A_{n-1}, M_b(n-1)) is the target of the bottom-up layer-wise distillation loss. Besides, as the blocks represent conv2_x, conv3_x, conv4_x, and conv5_x of ResNet [35], N = 4. We do not assign different weights to different Bi-SAD paths, although it is possible. Moreover, considering that the attention maps of adjacent layers are semantically closer than those of non-neighboring layers, we mimic the attention maps of adjacent layers successively rather than along any other paths (e.g., block2 directly mimicking block4). The overall training loss of the detection model is L_total = L_gt + λ_1 L_inner + λ_2 L_boundary, where L_gt is the standard cross-entropy loss, and λ_1 and λ_2 are the distillation loss weight balancing factors.
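The overall objective is a straightforward weighted sum; a one-line sketch with the balancing factors set to the paper's values:

```python
def bi_sad_total_loss(l_gt, l_inner, l_boundary, lam1=0.5, lam2=0.5):
    """Overall objective: cross-entropy plus the two weighted distillation
    terms; lambda_1 = lambda_2 = 0.5 in the paper's experiments."""
    return l_gt + lam1 * l_inner + lam2 * l_boundary

total = bi_sad_total_loss(0.8, 0.2, 0.4)  # 0.8 + 0.5*0.2 + 0.5*0.4 = 1.1
```

In practice the three inputs would be the scalar loss tensors computed in the same training step, so the combined loss backpropagates through all three terms at once.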

Complexity Analysis of Bi-SAD
In order to evaluate the efficiency of our Bi-SAD, we analyze the complexity of its distillation operation. The computational cost of Bi-SAD mainly includes the generation of the attention maps and the learning process, whose costs we denote O_1 and O_2, respectively. Here, W_1 and H_1, W_2 and H_2, W_3 and H_3, and W_4 and H_4 are 1/4, 1/8, 1/16, and 1/32 of the width and height of the original input image, respectively, and C_1 < C_2 < C_3 < C_4. By analysis, the calculation complexity of Bi-SAD is O(4WH(C + 1)). In addition, compared with T-S methods, our Bi-SAD has a lower calculation complexity for the distillation operation, which can significantly reduce storage space and increase computing speed. Specifically, the method in [40] reaches O(WHN + 8WH(C + 1)), the method in [41] reaches O(W^2 H^2 (C^2 + 1) + WHN), and our Bi-SAD only reaches O(4WH(C + 1)), where C represents the number of channels in the feature map, N represents the number of classes, and W and H represent the width and height of the feature map, respectively.

Experiments
In this section, we comprehensively evaluate the proposed Bi-SAD on GaoFen-1 satellite images. Specifically, we first describe the experimental details. Then, we discuss the performance of Bi-SAD qualitatively and quantitatively. Finally, we conduct comparative experiments with state-of-the-art distillation models and deep learning-based cloud detection models.

Dataset
In order to quantitatively evaluate the performance of our method, we use the publicly accessible GaoFen-1 dataset released by Li et al. [8] in the experiments, which has three visible bands and a near-infrared band. The images were acquired from May 2013 to August 2016 in different global regions [8]. The resolution of the images is 16 m. The whole dataset contains 108 globally distributed scenes and covers different cloud types and land cover types, including water, urban areas, forest, barren, and ice/snow. Thus, we can comprehensively test our method under different conditions. There are only clouds and background in our experiments: small clouds, broken clouds, thick clouds, and thin clouds are all marked as clouds, while clear-sky areas, cloud shadow, and other non-cloud bright objects are marked as background, as shown in Figure 4. The 108 scenes, whose sizes are 10,000 × 9000 pixels, are divided into a training set (87 scenes) and a testing set (21 scenes).

Network Design
In the experiments, we evaluate the performance enhancement of our proposed method on a popular compact model, i.e., ResNet18 [35]. More specifically, to further increase the speed and reduce the number of parameters, we denote the vanilla ResNet18 as the 1× model, halve the channels of each layer to obtain the 0.5× model, and halve them again to obtain the 0.25× model; the 0.25× model is 41.75 MB smaller than the vanilla ResNet18.
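The parameter savings from width scaling can be illustrated with simple arithmetic (a back-of-the-envelope sketch, not the paper's exact accounting): a convolution's parameter count scales with in_channels × out_channels, so halving the width twice shrinks a layer roughly 16-fold.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Parameters of a k x k conv layer: weights plus biases."""
    return c_in * c_out * k * k + c_out

# A 64->64 stage in the vanilla (1x) ResNet18 vs. the same stage at
# 0.25x width (16->16): parameters shrink roughly quadratically.
ratio = conv_params(64, 64) / conv_params(16, 16)
print(ratio)  # ~15.9x fewer parameters in this layer
```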
Baseline Setup. We utilize the 0.25× ResNet18 as the backbone and employ 32× bilinear upsampling for prediction, which yields a very simple FCN [47]-like semantic segmentation model. We use this simple yet general model to verify the wide applicability and generalization of our proposed method.

Evaluation Metrics
We evaluate the model in terms of accuracy and efficiency. For quantitative evaluation of the detection accuracy, we use mean intersection over union (mIoU), F_1 score, and overall accuracy (OA) [47] as the measurements. Notably, a larger F_1 score indicates a better result. Besides, mIoU and OA, which indicates overall pixel accuracy, are also calculated for a comprehensive comparison with different models. Let p_ij be the number of pixels of class i predicted to belong to class j, and k be the number of classes. We calculate mIoU, OA, and F_1 as follows:

mIoU = (1/k) Σ_i p_ii / (Σ_j p_ij + Σ_j p_ji - p_ii),
OA = Σ_i p_ii / Σ_i Σ_j p_ij,
F_1 = 2 · Precision · Recall / (Precision + Recall), with Precision = CP/DP and Recall = CP/GN,

where CP is the number of pixels correctly detected as cloud, DP is the total number of pixels detected as cloud, and GN is the number of cloud pixels in the ground truth.
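These metrics reduce to a few confusion-matrix counts in the two-class (cloud/background) setting; a compact sketch (the function name is ours):

```python
import numpy as np

def cloud_metrics(pred: np.ndarray, gt: np.ndarray):
    """mIoU, OA, and F1 for binary cloud maps (cloud = 1), using the
    counts defined in the text: CP = correctly detected cloud pixels,
    DP = all pixels detected as cloud, GN = ground-truth cloud pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    cp, dp, gn = np.sum(pred & gt), np.sum(pred), np.sum(gt)
    tn = np.sum(~pred & ~gt)
    iou_cloud = cp / (dp + gn - cp)
    iou_bg = tn / (pred.size - cp)          # TN / (TN + FP + FN)
    miou = (iou_cloud + iou_bg) / 2
    oa = (cp + tn) / pred.size
    precision, recall = cp / dp, cp / gn
    f1 = 2 * precision * recall / (precision + recall)
    return miou, oa, f1
```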
For quantitative evaluation of the model efficiency, we use the execution time [48], the model size, and the calculation complexity [41]. The execution time is represented by the inference time of the network: we input 513 × 513 pixel slices of the whole testing set and take the average time per image as the inference time. The model size is expressed in terms of the number of network parameters. The calculation complexity is represented by the total floating-point operations (FLOPs) in one forward pass on a fixed input size.

Training and Testing Details
Our experiments are performed on one RTX-2080Ti GPU with PyTorch 1.1. During the training procedure, the input image is a 513 × 513 pixel slice. In order to prevent overfitting, some data augmentation methods are applied, such as random scaling and random horizontal and vertical flipping. Stochastic gradient descent with a momentum of 0.9 is selected as the optimizer. We use a "poly" learning rate policy [20] with a base learning rate of 0.002 and a power of 0.9. The loss weight balancing factors λ_1 and λ_2 are empirically set to 0.5 and 0.5, respectively. We train our network from scratch for 30,000 iterations with a batch size of 40. For the method without knowledge distillation, the total loss is the cross-entropy loss between the prediction and the ground truth. For the methods based on T-S knowledge distillation and SAD, the total loss is the cross-entropy loss [47] plus the distillation loss.
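The "poly" schedule referenced above is a simple closed-form decay; a sketch with the paper's settings plugged in:

```python
def poly_lr(base_lr: float, it: int, max_it: int, power: float = 0.9) -> float:
    """"Poly" policy: lr decays from base_lr to 0 as (1 - it/max_it)^power."""
    return base_lr * (1 - it / max_it) ** power

# With the paper's settings (base 0.002, 30,000 iterations):
print(poly_lr(0.002, 0, 30000))      # 0.002 at the start
print(poly_lr(0.002, 30000, 30000))  # 0.0 at the end
```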
During testing, we keep the original resolution of the image instead of resizing it to a fixed size, and all of our test results are kept at the same scale. In our experiments, we use sliding window detection to implement inference for every whole image. In detail, the size of the sliding window is the same as the input size in the training stage, i.e., 513 × 513 pixels.
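The sliding-window inference above can be sketched as non-overlapping tiling and stitching (our simplification; the paper does not state whether windows overlap, and `predict_fn` is a stand-in for the trained network):

```python
import numpy as np

def sliding_window_predict(image, predict_fn, win=513):
    """Tile a full-resolution image into win x win crops (matching the
    training input size), run the model on each tile, and stitch the
    per-tile predictions back together. Border tiles may be smaller."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.uint8)
    for top in range(0, h, win):
        for left in range(0, w, win):
            tile = image[top:top + win, left:left + win]
            out[top:top + win, left:left + win] = predict_fn(tile)
    return out
```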

Ablation Studies
In this section, we investigate the effects of Boundary-SAD and Inner-SAD. Besides, we also present parameter optimization experiments.

The Effect of Boundary-SAD
In order to capture cloud details more accurately, we add Boundary-SAD to the baseline model. Table 1 shows that there is a 1.28% enhancement in mIoU. As shown in Figure 5, after adding Boundary-SAD, the predictions of the boundaries of clouds and small pieces of clouds are more accurate. Boundary-SAD gives the model a better ability to capture details in clouds.

The Effect of Inner-SAD
To make more reliable predictions, we add Inner-SAD to the baseline model. Table 1 shows that the improvement in mIoU is 2.01%. As shown in Figure 6, after introducing Inner-SAD to the network, the results present fewer misclassified pixels than the baseline model. Inner-SAD gives the model a better ability to distinguish between snow/ice and clouds. Besides, as shown in Table 1, our method achieves a high accuracy of 96.72%, with a small model size, low calculation complexity, and fast inference time. When Bi-SAD is added to the baseline, the network performance is further improved, i.e., a 2.63% gain in mIoU, without increasing the number of parameters, the computational complexity, or the inference time. This proves that our method of combining forward semantic information learning and backward textural information learning is effective.

The Effect of Mimicking Direction
To investigate the effect of the mimicking direction on performance, we reverse the direction: for the boundary region of the cloud, the lower layers mimic the higher layers, and for the inner region of the cloud, the higher layers mimic the lower layers. This decreases the performance of the baseline model from 84.40% to 83.52%. This is because low-level attention maps contain more textural information, i.e., details, while high-level attention maps contain more high-level semantic information. Reversing the mimicking direction inevitably hampers the learning of crucial clues for cloud detection.

Parameter Optimization
During the generation of the masks, we use the Laplacian operator to extract the boundaries and morphological expansion to thicken them to acquire Mask B (Section 2.2), which involves a hyperparameter, i.e., the number of expansion iterations. Besides, we assume a half-trained model before we introduce Bi-SAD to the training. In this study, we conducted experiments to find suitable values for these hyperparameters.
The hyperparameter of the masks. Figure 7 shows the resulting mIoU for varying numbers of expansion iterations; we can see that when the iteration count is larger than 7, the performance of Bi-SAD falls below the baseline. This is because, as the number of expansion iterations grows, the boundary band grows in Mask B and the center region of clouds shrinks in Mask I, which may cause Mask B to contain more of the center region of clouds. If the boundary and inner areas represented by Mask B and Mask I are not accurate, the textural information and semantic information will not be learned well. As a result, when the iteration count is larger than 7, the performance drops considerably. As shown in Figure 7, a value of 3 turns out to be optimal, so in the experiments we use 3 expansion iterations to generate Mask B. The time point to add Bi-SAD. Here, we study the time point at which to add Bi-SAD. As shown in Table 2, different time points of adding Bi-SAD almost converge to the same point, and 5000 is relatively better. We believe this is caused by the quality of the distillation targets produced by the later layers and by the optimization speed. In the earlier training stage, the distillation targets produced by the later layers are of low quality, which may introduce noise into training. In the later training stage, the quality of the distillation targets is good, but as the learning rate drops, the optimization speed is slow. Besides, we find that after introducing Bi-SAD, the network converges more rapidly. In the experiments, we add Bi-SAD when the network has been trained for 5000 iterations.

Comparison with The State-of-the-Art Distillation Methods
In this section, we compare our Bi-SAD with a state-of-the-art self-distillation method and T-S distillation methods.
For the T-S distillation methods, we denote the 1.0× ResNet18 with 32× upsampling as the teacher model and the baseline model as the student model. Two state-of-the-art T-S distillation methods are selected in this experiment, namely zero+first [40] and pixel+pair [41]. For the self-distillation method, we compare our Bi-SAD with SAD [43].
The results are shown in Table 3. We find that after training with our Bi-SAD, the student model almost reaches the same performance as the teacher model, and is even better on mIoU and F_1; our Bi-SAD also outperforms the state-of-the-art distillation methods, which proves that our Bi-SAD is more powerful in cloud detection. Besides, as shown in Figure 8, since the teacher model must be trained well in advance in T-S methods, our Bi-SAD is more efficient in the training phase in terms of training time, number of parameters, and GPU memory usage. Table 3. Comparison of the state-of-the-art distillation methods in [40,41,43] with ours. The segmentation is evaluated by mIoU. "Zero" in the third row of the table represents the pixel-wise L_2 norm distillation method in [40]. "First" in the third row represents the local similarity distillation method in [40]. "Pixel" in the fourth row represents the pixel-wise probability mimicking method in [41]. "Pair" in the fourth row represents the global pair-wise distillation in [41]. "SAD" in the fifth row represents the self-attention distillation method in [43]. "Bi-SAD" in the last row represents our self-attention learning method.

Figure 8. Comparison of the self-distillation method (our Bi-SAD) and the Teacher-Student distillation method [41] in terms of parameters (measured in M), training time (measured in h), and GPU memory (measured in MB). It can be seen that the self-distillation approach requires 17× fewer parameters and 2× less training time, and reduces GPU memory usage by 40%.
Further, as shown in Figure 9, we can see that (1) with the forward mimicking and the backward mimicking in Bi-SAD, the high-level attention map not only contains rich semantic information, but also incorporates textural information, which is vital for making precise predictions; and (2) after adding self-attention distillation, the attention maps of the network become more explainable: the shape of the attention maps gets closer to the shape of the clouds, which shows that the network focuses on the clouds, so the performance is better. This phenomenon is more obvious with Bi-SAD than with SAD. When Bi-SAD is added to the baseline model, the shapes of the attention maps and the clouds are more similar, and the predictions are more precise.

Comparison with The State-of-the-Art Deep Learning-Based Cloud Detection Approaches
To comprehensively evaluate the proposed method in terms of model parameters, speed, and accuracy, we compare it with the state-of-the-art deep learning-based cloud detection models, including MFFSNet [24], CDNet [27], PSPNet [23], DeeplabV3+ [22], and MF-CNN [25], as shown in Table 4. Figure 10 quantitatively shows the accuracy, parameters, and FLOPs of these methods; we can see that although there is a small difference in performance, our method has fewer parameters, lower calculation complexity, and faster inference speed. Specifically, comparing our method with the currently most accurate model, MFFSNet, the parameter size and FLOPs are reduced by 100 times and 400 times, respectively, with a small drop in accuracy, and the speed is increased by about 7 times. This shows that our method is more conducive to practical application. Figure 10. The accuracy, parameters, and floating-point operations (FLOPs) of different deep convolutional neural networks (DCNNs) on the GaoFen-1 cloud detection dataset, including our method, MFFSNet [24], CDNet [27], PSPNet [23], DeeplabV3+ [22], and MFCNN [25]. The FLOPs are represented by the size of the corresponding labels (circle or triangle in the picture): the bigger the label, the larger the FLOPs. Compared with the state-of-the-art deep learning-based cloud detection models, our method uses fewer parameters and FLOPs to achieve comparable performance.
Besides, Figure 11 qualitatively shows that our method performs well in scenes of forest, roads, water, and coastline, and can accurately capture thin clouds, small pieces of clouds, and the boundaries of clouds. Moreover, as shown in Figure 11, for big clouds, thin clouds, small pieces of clouds, and broken clouds, our network has competitive performance with the state-of-the-art cloud detection models. Figure 11. Visual comparisons of cloud detection results on GaoFen-1 images with our method, MFFSNet [24], CDNet [27], PSPNet [23], DeeplabV3+ [22], and MFCNN [25].

Discussion
The experimental results in Sections 3.3 and 3.4 prove that the proposed method can effectively improve the performance of the cloud detection network, that our method outperforms other distillation methods, and that its training efficiency is higher. There are several reasons. First, through our proposed two mimicking flows, forward semantic information learning and backward detailed textural information learning, the model indeed enhances its representation of clouds, so the performance is improved. Second, in terms of cloud detection, compared with existing T-S distillation methods, the self-distillation method can more effectively capture useful information through the self-attention mechanism. Third, as the distillation information of our proposed method comes from different layers of the same network, no teacher model is required, so the training efficiency of our proposed method is higher.
Although our method achieves a good balance between accuracy and speed, the accuracy of the model needs to be further improved. We think the reason may be that the limited number of parameters restricts the feature learning ability of the model to some extent.
We will investigate how to design a network structure with a small amount of parameters and low computational complexity, but with strong feature extraction capabilities to further improve the performance and speed in our future work.

Conclusions
In this work, we propose a novel bidirectional self-attention distillation method for compact and accurate cloud detection. Our method makes full use of the information in low-level and high-level attention maps to improve the representation learning of DCNN-based cloud detection models. Experiments on the GaoFen-1 cloud dataset demonstrate that our method outperforms other state-of-the-art distillation methods and achieves a great trade-off between accuracy and speed. Extensive experiments and analysis demonstrate the effectiveness of our approach. In future work, we will pay more attention to further improving the accuracy of cloud detection models with limited parameters.