Article

MCNet: Multi-Scale Feature Extraction and Content-Aware Reassembly Cloud Detection Model for Remote Sensing Images

Ziqiang Yao, Jinlu Jia and Yurong Qian *
1 College of Software, Xinjiang University, Urumqi 830000, China
2 Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi 830046, China
3 Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Symmetry 2021, 13(1), 28; https://doi.org/10.3390/sym13010028
Submission received: 23 November 2020 / Revised: 12 December 2020 / Accepted: 22 December 2020 / Published: 26 December 2020

Abstract
Cloud detection plays a vital role in remote sensing data preprocessing. Traditional cloud detection algorithms have difficulty extracting features and thus produce poor results on remote sensing images with uneven cloud distribution and complex surface backgrounds. To achieve better detection results, a cloud detection method with multi-scale feature extraction and a content-aware reassembly network (MCNet) is proposed. Using pyramidal convolution and a channel attention mechanism to enhance the model's feature extraction capability, MCNet can fully extract the spatial and channel information of clouds in an image. Content-aware reassembly is used to ensure that upsampling in the network recovers enough in-depth semantic information, improving the model's cloud detection performance. The experimental results show that the proposed MCNet model achieves good results on cloud detection tasks.

1. Introduction

With the progress of remote sensing technology, remote sensing images are widely used in navigation and positioning [1,2], ground object detection [3,4], environmental surveillance [5,6], and many other fields. However, cloud cover degrades the imaging quality of remote sensing images, reducing the information available in them and negatively affecting subsequent tasks such as target detection and tracking, segmentation, and classification. Therefore, high-precision cloud detection in remote sensing images is of great significance.
Traditional remote sensing cloud detection methods mainly include physical threshold methods, methods based on cloud texture and spatial characteristics, and methods based on machine learning. Limited by the spectral bands of early remote sensing images, the ISCCP (International Satellite Cloud Climatology Project) method [7], the CLAVR (NOAA Cloud Advanced Very High Resolution Radiometer) method [8], and other physical-threshold methods have low detection accuracy. As the spectral bands of remote sensing satellites increased, multi-spectral cloud detection technology was developed: exploiting physical characteristics of clouds such as high reflectivity and low temperature, it detects clouds by filtering with physical radiation thresholds [9,10,11]. Cloud detection methods based on image spatial information [12,13] exploit the fact that radiation varies more strongly over cloud-covered areas than over clear ones, comparing the change in the neighborhood radiation value of each pixel with a threshold to achieve pixel-level cloud segmentation; such methods suit remote sensing scenes with a single background, such as sea areas or other uniform scenes. Cloud detection methods based on texture information [14,15] use an improved fractal dimension to express the difference between cloud textures and surface object textures, achieving more efficient cloud detection. Detection methods based on cloud texture and spatial information rely on fixed features and have low robustness. With the development of machine learning, traditional methods such as decision trees and support vector machines (SVM) were applied to cloud detection. The Fmask (Function of mask) method [16,17,18] uses a decision tree with optimized thresholds to classify each pixel and is widely used as a baseline for remote sensing cloud detection and evaluation. The ACCA (automated cloud cover assessment) method [19] judges whether a pixel contains cloud through decision tree thresholds. Latry et al. [20] use support vector machines to create a decision boundary for cloud detection in a mapped feature space. The accuracy of traditional and machine learning cloud detection algorithms depends on the correctness of the training data and the effectiveness of the features, and their detection accuracy on complex cloud layers is low.
In recent years, semantic segmentation combined with deep learning has achieved good results in cloud detection. The fully convolutional network (FCN) [21] is the pioneering image semantic segmentation algorithm: through end-to-end training, it extends image-level classification to pixel-level classification and realizes semantic segmentation. Mohajerani et al. [22] proposed a hybrid of an FCN and a gradient identification algorithm, applying FCNs to the cloud detection field. Mohajerani et al. [23] proposed the Cloud-Net algorithm, which achieves a better detection effect by redesigning the convolution blocks of the FCN; Cloud-Net is commonly used as the baseline for deep learning cloud detection networks. Chen et al. [24] proposed the Deeplab algorithm, which introduces dilated convolution into the FCN to expand the receptive field, extract more spatial features, and enrich the network's feature expression; however, dilated convolution loses some detailed cloud information [25], affecting local consistency. Zhao et al. [26] proposed the PSPNet algorithm, which realizes multi-scale feature fusion by embedding a pyramid pooling module (PPM) in the FCN to compensate for the FCN's insufficient feature extraction ability. Inspired by PSPNet, Yan et al. [27] proposed the MFFSNet algorithm, which uses a PPM to aggregate feature information at different scales and improve the utilization of local and global cloud features, but it easily loses part of the boundary information. Ronneberger et al. [28] proposed Unet, which uses skip connections between the encoding and decoding networks to solve the FCN's rough mask generation. Gonzales et al. [29] used a Unet pretrained with ResNet34 [30] for cloud detection, applying transfer learning to carry common feature attributes of different image types into the feature extraction step and strengthen the network's feature extraction capability. Guo et al. [31] proposed the Cloud-AttU algorithm, which adds a spatial attention mechanism to the Unet and learns a more effective feature representation through skip connections. Compared with traditional methods, these deep learning approaches can better mine the in-depth semantic information in cloud detection images and offer higher robustness and precision. However, they focus on the multi-scale information of the downsampling operation while underusing channel features; the receptive field in the upsampling operation is too small, and the upsampling process cannot be adaptively optimized according to the feature information. Good detection performance in remote sensing cloud detection has therefore not yet been achieved.
As discussed above, cloud detection in remote sensing images still faces enormous challenges: the many types of clouds and their uneven distribution make detection difficult, and backgrounds containing many cloud-like ground objects make it even harder. Currently, remote sensing cloud detection still has the following problems.
  • Remote sensing cloud images obtained in real scenes contain many surface objects (such as snow, ice, trees, and white human-made objects) with reflection characteristics similar to those of clouds, causing serious background interference. As a result, it is challenging to accurately capture clouds under heavy background interference.
  • The uneven distribution and thickness of clouds in remote sensing images lower detection accuracy.
  • Clouds are affected by shooting angles and wind speeds, resulting in different scales and various shapes, so the accuracy of cloud mask generation in complex scenes is low.
This paper proposes a deep neural network cloud detection method that combines multi-scale feature extraction based on a channel attention mechanism with content-aware reassembly to solve the above problems. Experiments show that the method has excellent detection performance. The main contributions of this article are as follows.
  • To solve the detection problems of uneven cloud thickness, uneven cloud distribution, and background interference, a pyramidal convolution residual network with an efficient channel attention (ECA) module (EPResNet-50) is proposed. It uses pyramidal convolution [32] to capture multi-scale feature information and increases the network's attention to effective channels through the ECA module [33], comprehensively considering channel and spatial characteristics and enhancing the network's feature extraction ability.
  • To solve the problem of low accuracy of cloud masks generated in complex scenes, the CARAFE (Content-Aware ReAssembly of FEatures) [34] upsampling module is introduced. The semantic information of feature maps is fully utilized for feature restoration through adaptive kernels, improving the accuracy of the generated masks.
  • We conduct comparative experiments between the proposed algorithm and current mainstream algorithms on the 38-cloud dataset [22,23]. Experimental results show that our method has better detection performance.
The other parts of the paper are organized as follows. In Section 2, we introduce an overview of the method in this paper. In Section 3, we compare and analyze the experimental results between the methods proposed in this paper and other methods. In Section 4, we provide a comprehensive conclusion.

2. Methods

This research introduces pyramidal convolution based on the efficient channel attention mechanism into the Unet to make full use of the red, green, blue, and near-infrared (RGBN) multi-spectral channel features and improve the network's feature extraction ability. Content-aware reassembly is introduced in the upsampling so that the network can adaptively optimize the upsampling operation according to the extracted feature information and improve mask generation accuracy.

2.1. MCNet

The proposed MCNet is based on an encoder–decoder structure that connects the encoder and the decoder through skip connections. The backbone of the MCNet encoder is a pyramidal convolutional residual network based on efficient channel attention (EPResNet-50). Each block of EPResNet-50 processes the input feature map at multiple scales and extracts the interdependence of local channels to improve the encoder's feature expression ability. Each step of the decoder consists of a CARAFE upsampling operation and two convolution operations. CARAFE makes full use of the in-depth semantic information of features through an adaptive kernel, strengthening the decoder's feature recovery capability and improving the accuracy of mask generation. In the skip connections, we use 1 × 1 convolutions to adjust the number of output channels of each EPResNet-50 module so that it matches the number of decoder channels, which makes the encoder and decoder convenient to combine. The MCNet conceptual structure is shown in Figure 1.
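To make the encoder–decoder wiring concrete, the following is a minimal PyTorch sketch of one decoder step together with the 1 × 1 skip-connection projection described above. The channel sizes are illustrative, and a plain bilinear upsample stands in for the CARAFE operator detailed in Section 2.3.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One MCNet decoder step (sketch): upsample, project the encoder
    skip features to the decoder width with a 1x1 convolution, then
    apply two 3x3 convolutions."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # bilinear upsample as a placeholder for CARAFE (Section 2.3)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # 1x1 convolution that aligns encoder channels with the decoder
        self.skip_proj = nn.Conv2d(skip_ch, out_ch, kernel_size=1)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        skip = self.skip_proj(skip)   # convenient encoder-decoder combination
        return self.conv(torch.cat([x, skip], dim=1))
```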

2.2. EPResNet-50

In the encoder, EPResNet-50 is used as the backbone network for feature extraction; it enhances feature extraction capability by adding pyramidal convolution and efficient channel attention, taking both channel and spatial characteristics into account.

2.2.1. Pyramidal Convolution

Due to the uneven distribution of clouds in remote sensing images and their large scale differences, the proportion of cloud coverage varies greatly between images. Traditional convolution uses a single-scale convolution kernel, cannot extract features at multiple scales, and is therefore not well suited to remote sensing cloud detection. In this paper, pyramidal convolution (PyConv) is introduced in the encoder to capture cloud information in different spaces and depths at multiple scales and improve the network's ability to detect unevenly distributed clouds with large scale differences. The pyramidal convolution operation is shown in Figure 2. For an input feature map with $FM_i$ channels, grouped convolutions with different kernel sizes $K_x$ and different depths $FM_o^x$ (where $x$ indexes the convolution group) process the input features, and the resulting features are finally spliced to generate the output.
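The grouping structure can be made concrete with a short PyTorch sketch. The kernel sizes and group counts below follow the stage-1 configuration listed in Table 1 (kernels 9/7/5/3 with groups 16/8/4/1); they are one possible configuration, not the only one.

```python
import torch
import torch.nn as nn

class PyConv(nn.Module):
    """Pyramidal convolution (sketch): parallel grouped convolutions with
    different kernel sizes, whose outputs are spliced along channels."""
    def __init__(self, in_ch, out_ch, kernels=(9, 7, 5, 3), groups=(16, 8, 4, 1)):
        super().__init__()
        assert out_ch % len(kernels) == 0
        branch_ch = out_ch // len(kernels)   # depth FM_o^x of each pyramid level
        # each group count must divide both in_ch and branch_ch
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2, groups=g, bias=False)
            for k, g in zip(kernels, groups)
        )

    def forward(self, x):
        # every branch sees the full input; outputs are concatenated
        return torch.cat([b(x) for b in self.branches], dim=1)
```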

2.2.2. ECA

Many ground objects in remote sensing images are covered by ice and snow, and the temperature of the covered area is low. As a result, the near-infrared radiation of these ground objects is similar to that of clouds, and the color of ice and snow is identical to that of clouds, so it is difficult to accurately distinguish clouds under heavy interference from similar backgrounds. To reduce this interference and improve detection accuracy, we add the channel attention module ECA to the encoder to realize local cross-channel information interaction without dimension reduction, allowing the network to focus on the aggregation of image channel features, which improves the feature expression ability of the encoder and enhances the network's discrimination ability.
The ECA module is mainly composed of global average pooling (GAP), Conv1d, and Scale operations. GAP compresses the $W \times H \times C$ input features along the spatial dimension into $1 \times 1 \times C$ global pooling information. The channel-size adaptive function $\psi(C)$,

$$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}} \tag{1}$$

gives the number of adjacent channels $k$ that participate in the channel weight calculation. The parameters $b$ and $\gamma$ are set to 1 and 2, respectively, in this article, $C$ is the number of channels, and $|\cdot|_{\mathrm{odd}}$ denotes the nearest odd number. Conv1d obtains the local cross-channel weight $\omega$ over the $k$ neighboring channels according to Equation (2); Scale multiplies the previous features by these weights, finally yielding a feature map with channel weights:

$$\omega_i = \sigma\left( \sum_{j=1}^{k} \alpha^j y_i^j \right), \quad y_i^j \in \Omega_i^k \tag{2}$$

where $\Omega_i^k$ represents the set of $k$ channels adjacent to channel $y_i$, $\alpha$ is the channel-shared parameter, and $\sigma$ is the sigmoid function. The ECA module structure is shown in Figure 3.
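A minimal PyTorch sketch of the ECA module, implementing Equations (1) and (2) with the GAP, Conv1d, and Scale steps just described:

```python
import math
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention (sketch): GAP, a 1-D convolution over
    channels with adaptively chosen kernel size k, and sigmoid scaling."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1                 # nearest odd size, Eq. (1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                         # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                    # GAP -> (N, C)
        y = self.conv(y.unsqueeze(1))             # local cross-channel interaction
        w = self.sigmoid(y).squeeze(1)            # channel weights omega, Eq. (2)
        return x * w.view(x.size(0), -1, 1, 1)    # Scale step
```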

2.2.3. EPResNet-50 Architectures

The basic model of EPResNet-50 is ResNet-50. EPResNet-50 introduces feature pyramid convolution to capture more comprehensive spatial information through multi-scale extraction of cloud features, and introduces the ECA module to aggregate feature channel information, improving the effectiveness of the feature expression by attending to the dependencies between channels. EPResNet-50 consists of a stack of similar structural blocks; its overall structure is shown in Table 1.
To extract cloud layer information at multiple scales, the depth and kernel size of the pyramidal convolution differ between the blocks of EPResNet-50. The stage-1 block of EPResNet-50 is shown in Figure 4. The input features first pass through a 1 × 1 convolution for feature mapping. The mapped features are fed into the pyramidal convolution for multi-scale feature extraction, and the number of feature channels is then adjusted to 256 by another 1 × 1 convolution. The 256-channel feature information is input to the ECA module to achieve local cross-channel interaction without dimension reduction and to aggregate channel information; finally, the aggregated features are combined with the original input to produce the output features. Compared with ResNet-50, EPResNet-50 obtains more effective features in the spatial and channel dimensions.
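The stage-1 block can be assembled from the PyConv and ECA sketches above, as in the following sketch; the batch normalization after each convolution is an assumption carried over from standard ResNet practice rather than a detail stated here.

```python
import torch.nn as nn

class EPBottleneck(nn.Module):
    """Stage-1 EPResNet-50 block (sketch, cf. Figure 4): 1x1 conv ->
    pyramidal conv -> 1x1 conv to 256 channels -> ECA, plus the
    residual shortcut. PyConv and ECA are the sketches given above."""
    def __init__(self, in_ch=256, mid_ch=64, out_ch=256):
        super().__init__()
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, mid_ch, 1, bias=False),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.pyconv = nn.Sequential(PyConv(mid_ch, mid_ch),
                                    nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True))
        self.expand = nn.Sequential(nn.Conv2d(mid_ch, out_ch, 1, bias=False),
                                    nn.BatchNorm2d(out_ch))
        self.eca = ECA(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.expand(self.pyconv(self.reduce(x)))
        out = self.eca(out)            # aggregate channel information
        return self.relu(out + x)      # combine with the original input
```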

2.3. CARAFE

The decoder recovers effective cloud features through stacked upsampling, and the feature upsampling module is a key operation in the convolutional network structure. Unet uses bilinear interpolation with a preset sampling kernel for upsampling: the semantic information of the feature map plays no role, and the receptive field is only 2 × 2, so the network cannot make full use of the information in nearby regions and the decoder cannot completely recover the effective cloud features. The CARAFE module adaptively generates the upsampling kernel from the input features, which not only makes full use of the semantic information of the feature map but also obtains a larger receptive field, enhancing the feature recovery capability of the decoder and improving mask generation accuracy.
We use the CARAFE module for upsampling. CARAFE combines a kernel prediction module with a content-aware reassembly module: it uses the underlying content information to predict adaptively optimized reassembly kernels and performs feature reassembly in the neighborhood of each pixel to achieve content-aware upsampling. The CARAFE module is shown in Figure 5.
In CARAFE, a feature map $X$ of size $C \times H \times W$ and the upsampling ratio $\sigma$ (assumed to be an integer) are input to the kernel prediction module and the content reassembly module at the same time. After the module computations, an output feature map $X'$ of size $C \times \sigma H \times \sigma W$ is obtained, realizing the upsampling of the feature map. Every position $l' = (i', j')$ on the output feature map $X'$ corresponds to a position $l = (i, j)$ on the input feature map $X$, where $i = \lfloor i'/\sigma \rfloor$ and $j = \lfloor j'/\sigma \rfloor$.
The kernel prediction module $\psi$ adaptively generates a kernel $W_{l'}$ for each target position $l'$ in the output feature map $X'$:

$$W_{l'} = \psi\left( N(X_l, k_{\mathrm{encoder}}) \right) \tag{3}$$

where $X_l$ is a pixel position on the input feature map $X$, $k_{\mathrm{encoder}}$ is the encoder kernel size, and $N(X_l, k_{\mathrm{encoder}})$ is the square region of $X$ of size $k_{\mathrm{encoder}} \times k_{\mathrm{encoder}}$ centered at $l$, the input position corresponding to the target position $l'$.
The kernel prediction module is composed of three sub-modules: a channel compressor, a content encoder, and a kernel normalizer. First, the channel compressor compresses the $C \times H \times W$ feature map to $C_m \times H \times W$. Then, the content encoder applies a convolution of kernel size $k_{\mathrm{encoder}} \times k_{\mathrm{encoder}}$ with $C_{up}$ output channels, where $C_{up} = \sigma^2 \times k_{up}^2$, yielding a feature map of size $\sigma^2 k_{up}^2 \times H \times W$. Next, this feature map is rearranged to size $k_{up}^2 \times \sigma H \times \sigma W$; finally, the softmax function of the kernel normalizer produces the normalized output.
The reassembly module uses a weighted operation to obtain the upsampled feature at pixel $l'$:

$$X'_{l'} = \phi\left( N(X_l, k_{up}), W_{l'} \right) = \sum_{n=-r}^{r} \sum_{m=-r}^{r} W_{l'}(n, m) \cdot X(i+n, j+m) \tag{4}$$

where $k_{up}$ is the size of the reassembly kernel, $N(X_l, k_{up})$ is the $k_{up} \times k_{up}$ square region of the input feature $X$ centered at $l$, the input position corresponding to the target position $l'$, $W_{l'}$ is the reassembly kernel generated by the kernel prediction module at the target location $l'$, and $r = \lfloor k_{up}/2 \rfloor$.
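A compact PyTorch sketch of CARAFE following Equations (3) and (4) is given below. The compressed channel count $C_m$ and the kernel sizes use the defaults suggested in the CARAFE paper (64, 3, and 5); treating them as the settings used here is an assumption.

```python
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """CARAFE upsampling (sketch): kernel prediction (channel compressor,
    content encoder, softmax normalizer) plus content-aware reassembly."""
    def __init__(self, c, c_m=64, k_enc=3, k_up=5, sigma=2):
        super().__init__()
        self.k_up, self.sigma = k_up, sigma
        self.compress = nn.Conv2d(c, c_m, 1)                  # C -> C_m
        self.encode = nn.Conv2d(c_m, sigma**2 * k_up**2, k_enc,
                                padding=k_enc // 2)           # C_up channels
        self.shuffle = nn.PixelShuffle(sigma)                 # -> k_up^2 x sH x sW

    def forward(self, x):                                     # x: (N, C, H, W)
        n, c, h, w = x.shape
        # --- kernel prediction module ---
        kernels = F.softmax(self.shuffle(self.encode(self.compress(x))), dim=1)
        # --- content-aware reassembly, Equation (4) ---
        # k_up x k_up neighbourhood N(X_l, k_up) of every source pixel
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)
        patches = patches.view(n, c * self.k_up**2, h, w)
        # map each target position l' to its source position l
        patches = F.interpolate(patches, scale_factor=self.sigma, mode="nearest")
        patches = patches.view(n, c, self.k_up**2, self.sigma * h, self.sigma * w)
        # weighted sum over the neighbourhood with the predicted kernels
        return (patches * kernels.unsqueeze(1)).sum(dim=2)
```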

3. Experiments

3.1. Datasets

In this paper, we use 38-cloud as the benchmark dataset for evaluating model performance. The 38-cloud dataset has 38 scenes, including 18 training scenes and 20 test scenes; each scene is provided by the Landsat 8 satellite [35]. Landsat 8 carries two sensors, the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). OLI has 9 spectral bands: the spatial resolution of bands 1 to 7 and 9 is 30 m, and that of band 8 is 15 m. TIRS has 2 spectral bands (10 and 11) with a spatial resolution of 30 m. The information for all Landsat 8 bands is shown in Table 2. The 38-cloud dataset selects the red, green, blue, and near-infrared spectral bands of Landsat 8 and provides corresponding hand-labeled segmentation masks. The scene images are cropped into 384 × 384 patches, giving 8400 patches in the training set and 9200 in the test set. We use holdout cross-validation to partition the dataset: 85% of the training patches are randomly selected for training and 15% for validation. To prevent overfitting and enhance the robustness of the model, we augment the training set with random translation, symmetry, rotation, and scaling.
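One way to realize the joint augmentation of a patch and its mask is sketched below; the probabilities and parameter ranges are illustrative assumptions, since the exact augmentation settings are not specified above.

```python
import random
import torchvision.transforms.functional as TF

def augment(patch, mask):
    """Jointly augment a 4-channel RGBN patch tensor and its cloud mask
    with random symmetry, rotation, translation, and scaling (sketch)."""
    if random.random() < 0.5:                    # horizontal symmetry
        patch, mask = TF.hflip(patch), TF.hflip(mask)
    if random.random() < 0.5:                    # vertical symmetry
        patch, mask = TF.vflip(patch), TF.vflip(mask)
    angle = random.choice([0, 90, 180, 270])     # rotation
    patch, mask = TF.rotate(patch, angle), TF.rotate(mask, angle)
    # shared random translation and scaling (illustrative ranges)
    dx, dy = random.randint(-30, 30), random.randint(-30, 30)
    scale = random.uniform(0.9, 1.1)
    patch = TF.affine(patch, angle=0.0, translate=[dx, dy], scale=scale, shear=0.0)
    mask = TF.affine(mask, angle=0.0, translate=[dx, dy], scale=scale, shear=0.0)
    return patch, mask
```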

3.2. Evaluation Metrics

We use Overall Accuracy, Recall, Precision, Specificity, Jaccard Index, and F1-Score as evaluation indicators. The Precision indicator measures the exactness of the positive predictions, and the Recall indicator measures their completeness. The Specificity index measures the completeness of the negative predictions, and the Overall Accuracy index indicates the accuracy of the binary classification. The Jaccard Index describes the similarity between the predicted mask and the real mask, and the F1-Score combines Precision and Recall. Moreover, the Jaccard Index is an important index for judging the performance of cloud detection algorithms [22,23,31].
$$\mathrm{Jaccard\ Index} = \frac{TP}{TP + FN + FP}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{Overall\ Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$F_1\text{-}\mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Among them, T P represents the number of positive samples whose judgment result is positive, T N represents the number of negative samples whose judgment result is negative, F P represents the number of negative samples whose judgment result is positive, and F N represents the number of positive samples whose judgment result is negative.
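The six indicators can be computed directly from these four counts, as in this short sketch:

```python
import numpy as np

def cloud_metrics(pred, gt):
    """Evaluation indicators from binary prediction and ground-truth
    masks (1 = cloud, 0 = background)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)        # cloud judged as cloud
    tn = np.sum(~pred & ~gt)      # background judged as background
    fp = np.sum(pred & ~gt)       # background judged as cloud
    fn = np.sum(~pred & gt)       # cloud judged as background
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "Jaccard Index": tp / (tp + fn + fp),
        "Precision": precision,
        "Recall": recall,
        "Specificity": tn / (tn + fp),
        "Overall Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "F1-Score": 2 * precision * recall / (precision + recall),
    }
```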

3.3. Experimental Details

In this study, all experiments are implemented with the PyTorch framework running on Ubuntu 16.04 with an NVIDIA RTX 2080Ti GPU, using Python 3.6. The 384 × 384 pixel RGBN 4-channel patch images from 38-cloud are used as the network input. The batch size is 8, the BCE (binary cross-entropy) loss function is used, a total of 400 epochs are trained, and the Adam (adaptive moment estimation) optimizer is used for optimization. The initial learning rate is set to 0.001 and is decayed by a factor of 0.1 at epochs 200, 300, 350, and 390.
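The training configuration above corresponds to a loop of roughly the following form; `model` and `train_loader` (batches of 8 RGBN patches with float masks) are assumed to exist, and the logits variant of BCE is used here for numerical stability.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR

device = torch.device("cuda")
model = model.to(device)                      # assumed to be defined elsewhere
criterion = nn.BCEWithLogitsLoss()            # BCE applied to raw logits
optimizer = Adam(model.parameters(), lr=1e-3)
# decay the learning rate by 0.1 at epochs 200, 300, 350, and 390
scheduler = MultiStepLR(optimizer, milestones=[200, 300, 350, 390], gamma=0.1)

for epoch in range(400):
    model.train()
    for patches, masks in train_loader:       # assumed DataLoader, batch size 8
        patches, masks = patches.to(device), masks.to(device)
        optimizer.zero_grad()
        loss = criterion(model(patches), masks)
        loss.backward()
        optimizer.step()
    scheduler.step()
```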

3.4. Experimental Result Analysis

3.4.1. Ablation Experiment

To verify the rationality of the module combination in the network, we performed eight sets of ablation experiments on the same dataset in the same computing environment. Table 3 shows the results. The first row is the Unet baseline, which uses ResNet-50 as the backbone network. When ECA, PyConv, or CARAFE is added separately, the Jaccard Index increases by 0.58%, 2.62%, and 1.46%, respectively, confirming the validity of each module on its own. Adding any two of ECA, PyConv, and CARAFE to the Unet simultaneously improves the results further than adding a single module, which verifies the mutual assistance between the three modules. With all three modules added, every index is optimal: the Jaccard Index reaches 83.05%, Precision 94.83%, Recall 86.69%, Specificity 98.67%, and Overall Accuracy 96.49%, which proves the rationality of the network structure in this article. ECA considers the interaction of local channels to enhance the importance of effective channels and extract information unique to clouds. PyConv extracts cloud spatial information from a multi-scale perspective. PyConv integrated with the ECA module comprehensively considers the spatial and channel relationships of the features, extracts comprehensive and representative cloud information, and improves the feature expression ability of the encoder. The CARAFE module adaptively generates an upsampling kernel according to the input features, making full use of the semantic information of the feature map to restore more cloud details and improving the mask generation accuracy of the decoder.

3.4.2. Comparison with Some Popular Methods

To prove the effectiveness of the MCNet algorithm, we conducted a quantitative comparison with some popular cloud detection and semantic segmentation algorithms, including the Gradient Boosting Classifier (GBC), Random Forest (RF), FCN, Fmask, Cloud-net, Att-unet, PSPNet, and Deeplabv3. Table 4 shows the cloud detection results on the 38-cloud dataset. According to these results, MCNet's Jaccard Index reaches 83.05%, Precision 94.83%, Recall 86.69%, Specificity 98.67%, and Overall Accuracy 96.49%. MCNet is 6.8%, 10.4%, 1.5%, and 0.7% higher than Unet in Jaccard Index, Precision, Specificity, and Overall Accuracy, respectively. Compared with the deep learning cloud detection algorithm Cloud-net, MCNet's Jaccard Index is 4.5% higher and its Precision 3.6% higher; compared with the second-best PSPNet, its Jaccard Index is 3.7% higher and its Precision 7.7% higher. The experimental results indicate that the proposed algorithm achieves higher accuracy under every index, demonstrating its effectiveness.
Figure 6 and Figure 7 show the qualitative comparison between MCNet and other cloud detection algorithms. In Figure 6 and Figure 7, the first column is the test pseudo-color image; the second column is the ground truth; and the third, fourth, fifth, and sixth columns are masks generated by Cloud-net, PSPNet, Unet, and MCNet, respectively, in which white represents clouds and black represents background.
The blue box in Figure 6 marks the detection result under complex clouds, the yellow box marks the detection result under similar background interference such as ice and snow, and the red box marks the detection result where complex clouds and such interference coexist. Figure 8 shows the details within the blue, yellow, and red boxes from Figure 6. The experimental results show that MCNet is less disturbed by similar backgrounds such as ice and snow; the generated mask is more faithful and closer to the ground truth, and its detection accuracy is higher than that of the other algorithms.
As seen from the experimental results in Figure 7, the masks generated by the other detection algorithms suffer from missing information and inaccurate positioning. The mask generated by MCNet retains more detailed cloud information, proving that MCNet can accurately and comprehensively capture cloud information and effectively avoid these problems.

4. Conclusions

This paper proposes a deep neural network cloud detection method that combines multi-scale feature extraction based on an attention mechanism with content-aware reassembly. To overcome the interference in cloud detection caused by cloud thickness, uneven cloud distribution, and ground objects in remote sensing images, the encoder uses EPResNet-50 to extract multi-scale spatial features and channel dependencies, improving the model's utilization of multi-spectral image features. The CARAFE module introduced in the decoder stage automatically adjusts the size of the upsampling receptive field through an adaptive kernel and makes full use of deep semantic information for feature recovery, improving mask generation accuracy. We validate the method on the 38-cloud dataset captured by the Landsat 8 satellite. The experimental results prove the MCNet algorithm's effectiveness: the Jaccard Index reaches 83.05%, Precision 94.83%, Recall 86.69%, and Specificity 98.67%, even under complex background interference, and the method also detects well in scenes with large differences in cloud distribution. In future research, we will further optimize the model: by introducing an adaptive dynamic filter, the model's adaptability will be enhanced so that it can adaptively optimize the convolution operation in different scenes.

Author Contributions

All authors worked on this manuscript together. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China under Grant 61966035, by the Autonomous Region Graduate Student Innovation Project "K-means Algorithm Initial Center Point Optimization Research under Spark Environment" (XJ2019G072), by the International Cooperation Project of the Science and Technology Department of the Autonomous Region "Data-Driven Construction of Sino-Russian Cloud Computing Sharing Platform" (2020E01023), and by the National Natural Science Foundation of China under Grant U1803261.

Conflicts of Interest

The authors declare no conflict of interest. The funders helped review the manuscript and approved the decision to publish the results.

References

  1. Yang, Y. Resilient PNT Concept Frame. J. Geod. Geoinf. Sin. 2019, 2, 1–7.
  2. Yang, Y. Concepts of comprehensive PNT and related key technologies. Acta Geod. Cartogr. Sin. 2016, 45, 505–510.
  3. Cleve, C.; Kelly, M.; Kearns, F.R.; Moritz, M. Classification of the wildland–urban interface: A comparison of pixel- and object-based classifications using high-resolution aerial photography. Comput. Environ. Urban Syst. 2008, 32, 317–326.
  4. Mena, J.B. State of the art on automatic road extraction for GIS update: A novel classification. Pattern Recognit. Lett. 2003, 24, 3037–3058.
  5. Friedrich, T.; Oschlies, A. Neural network-based estimates of North Atlantic surface pCO2 from satellite data: A methodological study. J. Geophys. Res. Ocean. 2009, 114.
  6. Hall, R.; Skakun, R.; Arsenault, E.; Case, B. Modeling forest stand structure attributes using Landsat ETM+ data: Application to mapping of aboveground biomass and stand volume. For. Ecol. Manag. 2006, 225, 378–390.
  7. Schiffer, R.A.; Rossow, W.B. The International Satellite Cloud Climatology Project (ISCCP): The first project of the world climate research programme. Bull. Am. Meteorol. Soc. 1983, 64, 779–784.
  8. Price, J.C. Land surface temperature measurements from the split window channels of the NOAA 7 Advanced Very High Resolution Radiometer. J. Geophys. Res. Atmos. 1984, 89, 7231–7237.
  9. Li, W.; Li, D. The universal cloud detection algorithm of MODIS data. In Proceedings of the Geoinformatics 2006: Remotely Sensed Data and Information, Wuhan, China, 28–29 October 2006; Volume 6419, p. 64190F.
  10. Wu, X.; Cheng, Q. Study on methods of cloud identification and data recovery for MODIS data. In Proceedings of the Remote Sensing of Clouds and the Atmosphere XII, Florence, Italy, 17–20 September 2007; Volume 6745, p. 67450P.
  11. Ren, R.; Guo, S.; Gu, L.; Wang, L.; Wang, X. An effective method for the detection and removal of thin clouds from MODIS image. In Proceedings of the Satellite Data Compression, Communication, and Processing V, San Diego, CA, USA, 2–6 August 2009; Volume 7455, p. 74550Z.
  12. Solvsteen, C. Correlation-based cloud detection and an examination of the split-window method. In Proceedings of the Global Process Monitoring and Remote Sensing of the Ocean and Sea Ice, Paris, France, 25–28 September 1995; Volume 2586, pp. 86–97.
  13. Ping, Q.W.; Qing, L.W.; Guo, L.J.; Huai, L.Y.; Jun, Z.; Min, Q.; Cheng, L. Application of Single-Band Brightness Variance Ratio to the Interference Dissociation of Cloud for Satellite Data. Spectrosc. Spectr. Anal. 2006, 26, 2011–2015.
  14. Shan, N.; Zheng, T.; Wang, Z. High-speed and high-accuracy algorithm for cloud detection and its application. J. Remote Sens. 2009, 13, 1138–1146.
  15. Chen, P.; Zhang, R.; Liu, Z. Feature detection for cloud classification in remote sensing images. J. Univ. Sci. Technol. China 2009, 5, 484–488.
  16. Zhu, Z.; Woodcock, C.E. Object-based cloud and cloud shadow detection in Landsat imagery. Remote Sens. Environ. 2012, 118, 83–94.
  17. Zhu, Z.; Wang, S.; Woodcock, C.E. Improvement and expansion of the Fmask algorithm: Cloud, cloud shadow, and snow detection for Landsats 4–7, 8, and Sentinel 2 images. Remote Sens. Environ. 2015, 159, 269–277.
  18. Qiu, S.; Zhu, Z.; He, B. Fmask 4.0: Improved cloud and cloud shadow detection in Landsats 4–8 and Sentinel-2 imagery. Remote Sens. Environ. 2019, 231, 111205.
  19. Irish, R.R.; Barker, J.L.; Goward, S.N.; Arvidson, T. Characterization of the Landsat-7 ETM+ automated cloud-cover assessment (ACCA) algorithm. Photogramm. Eng. Remote Sens. 2006, 72, 1179–1188.
  20. Latry, C.; Panem, C.; Dejean, P. Cloud detection with SVM technique. In Proceedings of the 2007 IEEE International Geoscience and Remote Sensing Symposium, Barcelona, Spain, 23–28 July 2007; pp. 448–451.
  21. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
  22. Mohajerani, S.; Krammer, T.A.; Saeedi, P. Cloud detection algorithm for remote sensing images using fully convolutional neural networks. arXiv 2018, arXiv:1810.05782.
  23. Mohajerani, S.; Saeedi, P. Cloud-Net: An end-to-end cloud detection algorithm for Landsat 8 imagery. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1029–1032.
  24. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
  25. Segal-Rozenhaimer, M.; Li, A.; Das, K.; Chirayath, V. Cloud detection algorithm for multi-modal satellite imagery using convolutional neural-networks (CNN). Remote Sens. Environ. 2020, 237, 111446.
  26. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  27. Yan, Z.; Yan, M.; Sun, H.; Fu, K.; Hong, J.; Sun, J.; Zhang, Y.; Sun, X. Cloud and cloud shadow detection using multilevel feature fused segmentation network. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1600–1604.
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015; pp. 234–241.
  29. Gonzales, C.; Sakla, W. Semantic Segmentation of Clouds in Satellite Imagery Using Deep Pre-trained U-Nets; Technical Report; Lawrence Livermore National Lab. (LLNL): Livermore, CA, USA, 2019.
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
  31. Guo, Y.; Cao, X.; Liu, B.; Gao, M. Cloud Detection for Satellite Imagery Using Attention-Based U-Net Convolutional Neural Network. Symmetry 2020, 12, 1056.
  32. Duta, I.C.; Liu, L.; Zhu, F.; Shao, L. Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition. arXiv 2020, arXiv:2006.11538.
  33. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
  34. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3007–3016.
  35. Roy, D.P.; Wulder, M.A.; Loveland, T.R.; Woodcock, C.; Allen, R.G.; Anderson, M.C.; Helder, D.; Irons, J.R.; Johnson, D.M.; Kennedy, R.; et al. Landsat-8: Science and product vision for terrestrial global change research. Remote Sens. Environ. 2014, 145, 154–172.
Figure 1. The MCNet conceptual structure.
Figure 2. Pyramidal convolution model structure.
Figure 3. ECA model structure.
Figure 4. The stage-1 block of EPResNet-50.
Figure 5. The CARAFE module structure.
Figure 6. Visual results of two overall scene images obtained by different models on the 38-cloud dataset.
Figure 7. Some patch image visual examples obtained by different models on the 38-cloud dataset.
Figure 8. Details of the blue, yellow, and red boxes in Figure 6.
Table 1. ResNet-50 and EPResNet-50 structure comparison.

Stage | Output | ResNet-50 | EPResNet-50
0 | 192 × 192 | 7 × 7, 64, s = 2; 3 × 3 max pool, s = 2 | 7 × 7, 64, s = 2; 3 × 3 max pool, s = 2
1 | 96 × 96 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 | [1 × 1, 64; PyConv4, 64: {9 × 9, 16, G = 16; 7 × 7, 16, G = 8; 5 × 5, 16, G = 4; 3 × 3, 16, G = 1}; 1 × 1, 256; ECA block, k = 3] × 3
2 | 48 × 48 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 | [1 × 1, 128; PyConv3, 128: {7 × 7, 64, G = 8; 5 × 5, 32, G = 4; 3 × 3, 32, G = 1}; 1 × 1, 512; ECA block, k = 3] × 4
3 | 24 × 24 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6 | [1 × 1, 256; PyConv2, 256: {5 × 5, 128, G = 4; 3 × 3, 128, G = 1}; 1 × 1, 1024; ECA block, k = 3] × 6
4 | 12 × 12 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 | [1 × 1, 512; PyConv1, 512: {3 × 3, 512, G = 1}; 1 × 1, 2048; ECA block, k = 3] × 3
Table 2. Landsat 8 spectral bands.

Spectral Band | Wavelength (μm) | Resolution (m)
Band 1—Ultra Blue | 0.435–0.451 | 30
Band 2—Blue | 0.452–0.512 | 30
Band 3—Green | 0.533–0.590 | 30
Band 4—Red | 0.636–0.673 | 30
Band 5—Near Infrared | 0.851–0.879 | 30
Band 6—Shortwave Infrared 1 | 1.566–1.651 | 30
Band 7—Shortwave Infrared 2 | 2.107–2.294 | 30
Band 8—Panchromatic | 0.503–0.676 | 15
Band 9—Cirrus | 1.363–1.384 | 30
Band 10—Thermal Infrared 1 | 10.60–11.19 | 30
Band 11—Thermal Infrared 2 | 11.50–12.51 | 30
Table 3. Ablation test of ECA, PyConv, and CARAFE on the 38-cloud dataset.

ECA | PyConv | CARAFE | Jaccard Index [%] | Precision [%] | Recall [%] | Specificity [%] | Overall Accuracy [%] | F1-Score [%]
– | – | – | 78.00 | 86.61 | 87.92 | 98.65 | 96.30 | 87.26
✓ | – | – | 78.58 | 90.89 | 84.40 | 98.59 | 96.37 | 87.52
– | ✓ | – | 80.62 | 94.43 | 84.50 | 98.87 | 96.47 | 89.19
– | – | ✓ | 79.46 | 90.71 | 85.95 | 98.67 | 96.15 | 88.27
✓ | ✓ | – | 80.94 | 92.76 | 86.01 | 98.48 | 96.45 | 89.26
✓ | – | ✓ | 80.63 | 92.74 | 86.06 | 98.43 | 96.50 | 89.28
– | ✓ | ✓ | 81.20 | 92.69 | 86.11 | 98.59 | 96.53 | 89.28
✓ | ✓ | ✓ | 83.05 | 94.83 | 86.69 | 98.67 | 96.49 | 90.58
Table 4. Comparative test of different methods on the 38-cloud dataset.

Method | Jaccard Index [%] | Precision [%] | Recall [%] | Specificity [%] | Overall Accuracy [%] | F1-Score [%]
GBC | 51.77 | 65.34 | 66.78 | 88.74 | 83.49 | 66.05
RF | 56.52 | 71.65 | 68.12 | 91.79 | 87.11 | 69.84
FCN | 72.17 | 84.59 | 81.37 | 98.45 | 95.23 | 82.95
Fmask | 75.16 | 77.71 | 97.22 | 93.96 | 94.89 | 86.38
Cloud-net | 78.50 | 91.23 | 84.85 | 98.67 | 96.48 | 87.92
Att-unet | 74.54 | 87.27 | 84.59 | 96.74 | 94.63 | 85.91
PSPNet | 79.36 | 87.12 | 86.09 | 96.97 | 96.08 | 86.60
Deeplabv3 | 74.20 | 92.33 | 78.30 | 98.55 | 95.45 | 84.74
Unet | 76.20 | 84.47 | 87.91 | 97.13 | 95.72 | 86.16
MCNet | 83.05 | 94.83 | 86.69 | 98.67 | 96.49 | 90.58
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
