Article

HA-Net for Bare Soil Extraction Using Optical Remote Sensing Images

1 School of Electrical & Information Engineering, Changsha University of Science & Technology, Changsha 410114, China
2 Hunan Key Laboratory of Meteorological Disaster Prevention and Reduction, Hunan Research Institute of Meteorological Sciences, Changsha 410118, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 3088; https://doi.org/10.3390/rs16163088
Submission received: 17 July 2024 / Revised: 15 August 2024 / Accepted: 18 August 2024 / Published: 21 August 2024
(This article belongs to the Special Issue AI-Driven Satellite Data for Global Environment Monitoring)

Abstract

Bare soil causes soil erosion and contributes to air pollution through the generation of dust, making timely and effective monitoring of bare soil an urgent requirement for environmental management. Although some research has addressed bare soil extraction from high-resolution remote sensing images, great challenges remain, such as complex background interference and small-scale targets. In this regard, the Hybrid Attention Network (HA-Net) is proposed for the automatic extraction of bare soil from high-resolution remote sensing images; it consists of an encoder and a decoder. In the encoder, HA-Net first uses BoTNet for primary feature extraction, producing four levels of features. The highest-level features are then fed into the constructed Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM) to adequately emphasize bare soil information in the spatial and channel dimensions. To improve the detection rate of small-scale bare soil areas, the Semantic Restructuring-based Upsampling Module (SRUM) is proposed for the decoding stage; it utilizes the semantic information of the input features and compensates for the detail lost during downsampling in the encoder. An experiment is performed on high-resolution remote sensing images from the China–Brazil Earth Resources Satellite 04A. The results show that HA-Net clearly outperforms several excellent semantic segmentation networks in bare soil extraction. The average precision and IoU of HA-Net over two scenes reach 90.9% and 80.9%, respectively, demonstrating its excellent performance and its ability to suppress interference from complex backgrounds and to handle multiscale targets. Furthermore, HA-Net may also be applied to the segmentation of other targets in remote sensing images.

1. Introduction

In the last few decades, with the rapid development of the Chinese economy, land cover has undergone unprecedented changes in speed, scale, and spatial extent [1,2]. This transformation not only threatens the health of ecosystems, but also poses severe challenges to human survival and development [3,4]. Bare soil, referring to exposed soil surfaces without vegetation or built-up structures [5], is one of the fundamental biophysical components of land cover and a key factor contributing to air pollution and soil erosion [6,7]. Large areas of bare soil, lacking the stabilizing effect of vegetation roots, become fragmented and loose in structure, leading to loss through water runoff and to serious ecological problems such as dust pollution [8]. Therefore, a precise assessment of bare soil is crucial for preventing soil erosion, managing resources, and protecting the environment.
Currently, investigations of bare soil mainly rely on manual field surveys and hierarchical reporting, which are constrained by factors such as transportation conditions and high labor costs. As a result, survey results often suffer from subjectivity, lack comprehensiveness, and have poor timeliness. Satellite remote sensing, as a macroscopic, rapid, and effective monitoring method, offers advantages such as wide coverage, high accuracy, strong real-time capability, and immunity to human interference [9,10]. It is widely used to monitor land and environmental changes, including urban expansion [11,12,13], deforestation [14,15,16], and the impacts of climate change [17,18]. In recent years, the rapid development of high-resolution satellite systems has provided powerful tools for investigating bare soil on Earth’s surface.
The bare soil index (BSI) is the most commonly used method for extracting bare soil from remote sensing imagery. Its basic principle is to use spectral information to identify and extract bare soil areas. Early studies mainly relied on the BSI for large-scale bare soil extraction. For example, in 2002, Rikimaru et al. [19] proposed a BSI that calculated the difference between the near-infrared and short-wave infrared bands to identify bare soil areas. Scholars then explored more refined bare soil indices to extract different types of bare soil in different regions. For example, in 2014, Li et al. [20] developed a new bare soil index and applied it to extract bare soil in the development areas of the Pearl River Delta region. In 2021, Nguyen et al. [21] introduced a modified bare soil index (MBI) based on Landsat 8 bands to improve the separation of bare soil, demonstrating higher accuracy in bare soil detection. Currently, researchers generally use the shortwave infrared (SWIR) and near-infrared (NIR) bands to construct the BSI [22]. In recent years, however, an increasing number of sensors retain only the red, green, blue, and near-infrared bands plus a panchromatic band, which has limited the applicability of the BSI. In addition, because the BSI typically relies on spectral information alone, it struggles to distinguish land cover types with similar spectral characteristics. It is therefore now difficult to establish an effective BSI.
In recent years, deep learning has made remarkable progress. Compared with traditional machine learning methods, deep learning techniques exhibit strong learning ability, minimal manual intervention, and good adaptability, so they are widely applied to radar and optical remote sensing images for the extraction of surface objects. For instance, deep learning networks have been constructed to extract water bodies and bridges from SAR images, achieving satisfactory results [23,24,25]. Because remote sensing images contain complex information, many current methods combine a Transformer [26] with CNNs to achieve promising results. For example, Chen et al. proposed a hybrid architecture called SRCBTFusion-Net, which integrates a Transformer and a CNN to enhance remote sensing image segmentation performance [27].
Currently, preliminary achievements have been made using deep learning for bare soil detection. However, complex interference in urban/agricultural areas [28] and the multiscale problem (especially the extraction of small-scale bare soil areas) [29] remain important challenges in bare soil extraction. Unlike surface objects such as water, roads, and buildings, which have a uniform overall structure, bare soil is often covered by a large number of other ground objects, such as roads, rocks, and buildings, which are typically unevenly distributed. This mixes bare soil with the background, causing blurred and irregular edges and posing challenges for bare soil extraction. Moreover, the texture features of complex backgrounds (e.g., agricultural areas and impermeable surfaces) are similar to those of bare soil [30], which also makes bare soil difficult to identify. In 2023, He et al. [31] introduced an attention mechanism into deep learning models to distinguish bare soil from the background. However, this CNN-based approach restricts the model's ability to recognize long-range relationships and encode global contextual information, limiting its capability to extract bare soil from more complex backgrounds. Additionally, small-scale bare soil areas arising from natural and anthropogenic factors present a challenge because of their widespread distribution and the difficulty of field investigations. Detecting such small targets is harder still because of their limited semantic information and fewer pixels, which results in relatively poor detection performance [26]. Extracting small-scale bare soil is therefore an issue that currently needs to be addressed.
To address these challenges, high-resolution (2 m) remote sensing images from the China–Brazil Earth Resources Satellite are utilized, and a deep learning network is constructed for bare soil extraction. The main contributions are as follows:
  • The Hybrid Attention Network (HA-Net) is proposed, which possesses excellent feature learning ability for bare soil and effectively suppresses background interference, achieving excellent bare soil extraction from remote sensing images.
  • By introducing attention mechanisms, the Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM) are proposed, which effectively learn multiscale bare soil information and better suppress complex background noise.
  • In the decoder, the Semantic Restructuring-based Upsampling Module (SRUM) is constructed, which compensates for the detail information lost during downsampling in the encoder and improves the detection of small-scale bare soil.
  • HA-Net not only enables high-precision automatic extraction of bare soil, but also has significant application value for the semantic segmentation of other typical targets in remote sensing images.

2. Study Regions and Data

2.1. Study Regions

The research area of this paper covers the cities of Changsha, Zhuzhou, and Xiangtan in Hunan Province, China, as shown in Figure 1. The longitude range is from 112°31′30″E to 113°13′34″E, and the latitude range is from 27.557861°N to 28.554682°N. These three cities are collectively referred to as Chang-Zhu-Tan. Located in the central and eastern part of Hunan Province, the Chang-Zhu-Tan urban agglomeration is an important component of the urban agglomeration in the middle reaches of the Yangtze River. With a total area of 28,000 km2 (with the urban area covering 18,900 km2), it serves as the core area for the development of Hunan Province [32]. With continuous regional development and the unreasonable utilization of land resources, a large amount of bare soil has been generated, leading to air pollution and soil erosion. Therefore, conducting high-precision, regular monitoring of bare soil in the research area is of great practical significance for land resource planning and environmental protection.

2.2. CBERS-04A Data

The China–Brazil Earth Resources Satellite (CBERS) program is a cooperation between the China Academy of Space Technology (CAST) and the National Institute for Space Research (INPE) of Brazil. The program agreement was signed in July 1988 with the aim of establishing a complete remote sensing system (both space and ground segments) to provide multispectral remote sensing images for both countries. The CBERS-04A satellite, also known as the High-Resolution Earth Observation Satellite 04A, was successfully launched by the China National Space Administration on 20 December 2019. It is part of the CBERS series of high-resolution Earth observation satellites, designed to provide high-quality, high-precision remote sensing data for land use, resource surveys, environmental protection, and other fields. The CBERS-04A satellite is equipped with three optical payloads: the Wide-Field Panchromatic and Multispectral Camera (WPM), the Multispectral Camera (MUX), and the Wide-Field Imager (WFI). It also carries a space environment monitoring payload (SEM) and a data collection payload (DCS). Table 1 shows the detailed parameters of the optical payloads of this satellite.

2.3. Data Set

The remote sensing data used in this study were provided by the Hunan Provincial Meteorological Bureau. The images were acquired by the CBERS-04A satellite between March and May 2023, with a spatial resolution of 2 m [33]. To enhance training efficiency, we extracted 40 images of size 5000 × 5000 pixels from the Chang-Zhu-Tan region, each exhibiting different backgrounds and complexities. These images serve as representative examples of actual ground conditions regarding bare soil. During manual labeling, each bare soil data sample was carefully examined and classified based on visible surface features. For example, bare soil surfaces with shrubs and grass were assessed for vegetation density to determine if they should be classified as bare soil. Sparse shrubs and grass do not effectively stabilize the soil, making it prone to dust emissions and soil erosion; therefore, such areas should be classified as bare soil and monitored accordingly.
Subsequently, the ground truth data and corresponding images were cropped into 512 × 512 pixel tiles. Samples were generated using a sliding window approach, and a portion underwent data augmentation, including brightness adjustment, mirroring, flipping, and noise addition. The specific effects are illustrated in Figure 2 (the corresponding labels for the augmented images are not shown). In total, 6382 samples were obtained. The training and validation sets were split in an 8:2 ratio, with additional independent images reserved for the test set. In the ground truth examples in Figure 2, red represents bare soil areas and black denotes background. These examples show that, owing to natural and anthropogenic factors, bare soil exhibits diverse morphologies and varying scales, and its textural features often resemble the background, making extraction challenging.
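For illustration, the tiling and augmentation steps described above could be implemented as in the following minimal sketch; the helper names and augmentation parameters are hypothetical assumptions, not the authors' actual preprocessing code.

```python
# Minimal sketch of the tiling and augmentation described above; helper names
# and parameter values are illustrative assumptions, not the authors' code.
import numpy as np

def tile_pair(image, label, size=512, stride=512):
    """Crop an image/label pair into size x size tiles with a sliding window."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            tiles.append((image[y:y + size, x:x + size],
                          label[y:y + size, x:x + size]))
    return tiles

def augment(image, label, rng):
    """Brightness adjustment, mirroring/flipping, and additive Gaussian noise."""
    if rng.random() < 0.5:                                    # horizontal mirror
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                                    # vertical flip
        image, label = image[::-1, :], label[::-1, :]
    image = image.astype(np.float32) * rng.uniform(0.8, 1.2)  # brightness jitter
    image += rng.normal(0.0, 2.0, image.shape)                # additive noise
    return np.clip(image, 0, 255).astype(np.uint8), label
```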

3. Methods

3.1. Overall Framework of HA-Net

Figure 3 illustrates the overall structure of the network. The training data are initially fed into the backbone network for feature extraction, which generates four feature maps at varying levels of semantic detail. The resolutions of these features are indicated in the figure. Among the outputs of the backbone network, the lowest-resolution feature map contains the richest semantic information, while the highest-resolution feature map contains the least semantic information but the most spatial detail.
Following the preliminary feature extraction, the highest-level feature map is processed by the following two modules: the Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM). The SIPM employs dilated convolutions [34] to expand the receptive field, thereby improving the representation of spatial information across the feature maps. The CIEM focuses on enhancing the information within each channel of the feature maps by decomposing and fusing channel information. The integration of these two modules results in a comprehensive extraction of the feature maps.
In the decoder stage, the Semantic Restructuring-based Upsampling Module (SRUM) performs pixel recombination and transpose convolution to achieve upsampling. This process integrates high-level and low-level features, progressively restoring the original image size and enabling the automatic extraction of bare soil areas.

3.2. The Encoder

3.2.1. Backbone

In deep learning, the backbone network is responsible for the initial extraction and encoding of critical features from input data, which is essential for processing complex visual information. Consequently, the design and selection of an appropriate backbone are fundamental to developing effective deep learning models.
In this paper, BoTNet [35] is selected as the backbone network for the framework (the structure is shown in Figure 4). BoTNet is a new model formed by incorporating a Transformer [26] into ResNet [36]. The principle involves replacing the 3 × 3 convolution in the last bottleneck block of ResNet with Multi-Head Self-Attention (MHSA). Understanding the contextual information of the entire bare soil area is crucial for accurately segmenting bare soil regions. MHSA allows the model to consider dependencies between pixels across the global scope during feature extraction, rather than being limited to local pixel windows. This facilitates a better understanding of the spatial distribution patterns of bare soil areas. Through this design, BoTNet effectively combines the strengths of ResNet and Transformer models. It not only inherits ResNet’s excellent characteristics, such as effectively preserving original features, but also enhances the model’s ability to capture global information through self-attention mechanisms. This enables BoTNet to exhibit excellent performance in various computer vision tasks, making it particularly suitable for constructing semantic segmentation networks. The input features first pass through a convolutional layer and a max-pooling layer to reduce image resolution, followed by preliminary semantic feature extraction through three ResNet Bottlenecks and one Bottleneck Transformer. In this paper, the output features of BoT-4 are used as inputs for the SIPM and the CIEM.
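To make the MHSA substitution concrete, the following is a simplified sketch of a Bottleneck Transformer block; relative position encodings and other BoTNet details are omitted, so it illustrates the principle rather than reproducing the exact BoTNet implementation.

```python
# Simplified Bottleneck Transformer block: the 3 x 3 convolution of a ResNet
# bottleneck is replaced by multi-head self-attention over the flattened
# feature map (relative position encodings omitted for brevity).
import torch
import torch.nn as nn

class BoTBlock(nn.Module):
    def __init__(self, channels=2048, heads=4, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, 1, bias=False)   # 1 x 1 bottleneck
        self.attn = nn.MultiheadAttention(mid, heads, batch_first=True)
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)   # 1 x 1 expansion
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.reduce(x)
        seq = y.flatten(2).transpose(1, 2)         # (B, H*W, mid) global tokens
        seq, _ = self.attn(seq, seq, seq)          # MHSA replaces the 3 x 3 conv
        y = seq.transpose(1, 2).reshape(b, -1, h, w)
        return torch.relu(self.norm(self.expand(y)) + x)   # residual connection
```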

3.2.2. SIPM

In semantic segmentation, attention mechanisms can help the model focus on object boundaries and key regions, thereby accurately segmenting target objects and reducing the influence of background [37]. Therefore, to enhance the network’s ability to extract multiscale bare soil areas while better learning the essential features of bare soil areas in spatial positions, the SIPM is constructed.
Generally, fully utilizing contextual information is crucial for learning spatial information, and a large receptive field makes this possible. Moreover, different receptive fields sense feature information at different scales. Hence, this module introduces multiple branches of dilated convolutions to efficiently expand the receptive field, generating spatial attention that enhances the multiscale extraction of bare soil areas and highlights bare soil information while suppressing noise at different locations.
The structure of the SIPM is illustrated in Figure 5. First, for the input F ∈ R^(C×H×W), three branches of dilated convolutions with 3 × 3 kernels are used to extract multiscale and contextual information. To obtain multiscale information while saving computation, the second and third branches compress the features twice in the channel dimension, and the multiscale features are then integrated through channel stacking. Subsequently, a 1 × 1 convolution compresses the feature size to 1 × H × W, and a Sigmoid function [38] produces a spatial attention map with values between 0 and 1. This highlights multiscale bare soil information on the attention map while suppressing noise. Finally, the attention map is elementwise multiplied with the input features F to enhance the representation of bare soil information at different locations, and the result is fused with the original input feature map, yielding the output of this module.
The expressions are illustrated in Equations (1) and (2):
SA_out = σ(Conv([D_Conv(F); D_Conv(D_Conv(F)); D_Conv(D_Conv(F))]))        (1)

Out = F ⊗ (SA_out + 1)        (2)
where F, SA_out, and Out represent the input, the attention output, and the final output of this module, respectively; Conv and D_Conv denote regular convolution and dilated convolution, respectively; [·; ·] denotes channel stacking; σ represents the Sigmoid function; and ⊗ is elementwise multiplication. The parameters of this module are shown in Table 2.
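Under these definitions, a minimal PyTorch sketch of the SIPM is given below. The branch channel widths follow the output shapes listed in Table 2, and the sketch is an illustration rather than the authors' released code.

```python
# Minimal sketch of the SIPM (Equations (1) and (2)): three dilated branches,
# channel stacking, a 1 x 1 convolution to a single-channel attention map,
# and the residual re-weighting Out = F * (SA_out + 1).
import torch
import torch.nn as nn

class SIPM(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.b1 = nn.Conv2d(channels, 128, 3, padding=3, dilation=3)
        self.b2 = nn.Sequential(
            nn.Conv2d(channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=3, dilation=3))
        self.b3 = nn.Sequential(
            nn.Conv2d(channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=6, dilation=6))   # larger dilation
        self.fuse = nn.Conv2d(384, 1, 1)    # stacked branches -> 1 x H x W

    def forward(self, f):
        stacked = torch.cat([self.b1(f), self.b2(f), self.b3(f)], dim=1)
        sa = torch.sigmoid(self.fuse(stacked))   # Equation (1)
        return f * (sa + 1)                      # Equation (2)
```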

3.2.3. CIEM

During the initial feature extraction in the backbone network, the resolution of the feature maps decreases with each downsampling operation, and detailed feature information is lost at the same time. Increasing the number of feature channels is therefore necessary to limit this loss. After its Transformer layers, BoTNet outputs high-level features with a large number of channels and rich global information, each channel encoding a specific response. Consequently, the CIEM is proposed in this paper, which employs multiscale convolutional kernels and pooled channel attention to enhance multiscale feature learning in the channel dimension and to suppress complex background information.
The structure of the CIEM is illustrated in Figure 6. The input feature F ∈ R^(C×H×W) is first split along the channel dimension into two features of size C/2 × H × W. Two depthwise separable convolutions with different kernel sizes are then applied to extract multiscale features. Subsequently, the two features are fused on the one hand and, on the other hand, average pooling and max pooling compress their spatial dimensions, after which the Sigmoid function yields two channel attentions, Y1 and Y2, each of size C/2 × 1 × 1. Y1 and Y2 are then elementwise multiplied with the fused feature sequentially, with the ReLU function [39] improving the nonlinear expression of the module. Finally, a 1 × 1 convolution restores the original channel size, yielding the final output. In this module, average pooling and max pooling aggregate the spatial information of the feature maps, so each channel attention contains a specific channel response, allowing each channel to focus on the desired information through weighting with the initial features. The principle of this module is shown in Equations (3)–(5):
DS_out_i = DS_Conv_i(Split(F))        (3)

CA_out_i = σ(Pool_i(DS_out_i))        (4)

Out = CA_out_1 ⊗ Max(0, CA_out_0 ⊗ (DS_out_0 + DS_out_1))        (5)
where F and Out denote the input and final output of this module, respectively; Split represents the operation of splitting along the channel dimension; DS_Conv_i, DS_out_i, Pool_i, and CA_out_i (i ∈ {0, 1}) represent the dual-branch depthwise separable convolutions, their outputs, the dual-branch attention pooling, and the dual-branch attention outputs, respectively; σ denotes the Sigmoid function; and ⊗ is elementwise multiplication. The parameters of this module are shown in Table 3.
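A minimal PyTorch rendering of Equations (3)–(5) and Table 3 follows. How the average- and max-pooled responses are combined before the Sigmoid is not fully specified above, so summing them is an assumption of this sketch.

```python
# Minimal sketch of the CIEM (Equations (3)-(5)); combining the avg- and
# max-pooled responses by summation is an assumption of this illustration.
import torch
import torch.nn as nn

def ds_conv(c, k):
    """Depthwise separable convolution: depthwise k x k, then pointwise 1 x 1."""
    return nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c),
                         nn.Conv2d(c, c, 1))

class CIEM(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        half = channels // 2
        self.ds0 = ds_conv(half, 3)                  # small-kernel branch
        self.ds1 = ds_conv(half, 7)                  # large-kernel branch
        self.restore = nn.Conv2d(half, channels, 1)  # back to C channels

    def forward(self, f):
        f0, f1 = torch.chunk(f, 2, dim=1)            # Split along channels
        d0, d1 = self.ds0(f0), self.ds1(f1)          # Equation (3)
        fused = d0 + d1
        ca0 = torch.sigmoid(d0.mean((2, 3), keepdim=True)
                            + d0.amax((2, 3), keepdim=True))   # Equation (4)
        ca1 = torch.sigmoid(d1.mean((2, 3), keepdim=True)
                            + d1.amax((2, 3), keepdim=True))
        out = ca1 * torch.relu(ca0 * fused)          # Equation (5)
        return self.restore(out)                     # 1 x 1 conv restores channels
```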

3.3. The Decoder

Currently, commonly used upsampling methods in deep learning include nearest-neighbor and bilinear interpolation. These methods rely solely on the spatial positions of pixels to determine the upsampling kernel, without leveraging the semantic information of the feature maps. They can be regarded as "uniform" upsampling and typically have small receptive fields (nearest-neighbor: 1 × 1; bilinear: 2 × 2). A good upsampling module should depend on the semantic information of the feature map and perform upsampling based on the input content. Therefore, the SRUM uses pixel rearrangement and transposed convolution to achieve feature upsampling, which effectively leverages the semantic information of the features.
For the high-level features input to this module, the SRUM employs two branches for upsampling. The first branch rearranges the input features using the pixel rearrangement method depicted in Figure 7. This method fully utilizes the channel information of the features by dividing the pixels at each position of the feature map into four parts along the channel dimension and then reordering them, yielding a 2× upsampling while reducing the number of channels to one-quarter of the original. The second branch uses a 3 × 3 transposed convolution (Tconv) for 2× upsampling, likewise reducing the number of channels to one-quarter. The two upsampled features are then stacked along the channel dimension and blended with a 1 × 1 convolution, and finally fused with the low-level features processed by a 3 × 3 convolution, forming the output of this module. In this upsampling process, traditional interpolation is abandoned in favor of semantic-information-based upsampling, which more effectively restores the resolution details of the features. The parameters of the SRUM are shown in Table 4.
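A minimal PyTorch sketch of the SRUM follows; pixel rearrangement is expressed with PixelShuffle, the transposed-convolution kernel and stride follow Table 4, and the class itself is an illustration rather than the authors' code.

```python
# Minimal sketch of the SRUM: PixelShuffle and a transposed convolution each
# yield a 2x-upsampled, C/4-channel map; the two are stacked, blended by a
# 1 x 1 convolution, and fused with the 3 x 3-convolved low-level features.
import torch
import torch.nn as nn

class SRUM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.shuffle = nn.PixelShuffle(2)     # C -> C/4, 2x spatial upsample
        self.tconv = nn.ConvTranspose2d(channels, channels // 4,
                                        kernel_size=2, stride=2)
        self.blend = nn.Conv2d(channels // 2, channels // 2, 1)
        self.low = nn.Conv2d(channels // 2, channels // 2, 3, padding=1)

    def forward(self, high, low):
        up = torch.cat([self.shuffle(high), self.tconv(high)], dim=1)
        return self.blend(up) + self.low(low)   # fuse high- and low-level maps
```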
According to Figure 3, multiple SRUMs are continuously used to combine high-level features with low-level features. Then, further refinement of the features is performed through a 3 × 3 convolution, followed by a 4× upsampling, resulting in the final extraction of bare soil.

4. Experiments and Results

4.1. The Stitching Strategy of the Sliding Windows

During the experiments, large remote sensing images must be segmented into smaller sample images for testing. However, segmentation often disrupts the integrity of targets, and when stitching results back together, the classifications at the junction of two adjacent windows may be discontinuous, potentially leading to detection errors in edge regions. Therefore, to obtain better results, this paper employs a sliding-window stitching strategy to test large images with the trained model. The procedure is as follows. The image under test is first segmented using a sliding window of the same size as the training samples, typically 512 × 512 pixels, with a step size of 412 pixels, leaving a 100-pixel overlap between adjacent windows, as shown in Figure 8a. After two adjacent windows are tested, the classification results in the overlapping region are averaged. During horizontal/vertical sliding, when the remaining area on the right/bottom is smaller than the step size, a full window anchored at the right/bottom border is taken instead, moving left/upward, as illustrated in Figure 8b. This repeated sliding-window method ensures that every area is thoroughly detected and yields a result image with minimal classification boundary errors.
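A compact sketch of this stitching strategy is given below; predict stands in for the trained model applied to a single window, the image is assumed to be at least one window in each dimension, and overlapping predictions are averaged as described above.

```python
# Sketch of the sliding-window stitching strategy: 512 x 512 windows, a
# 412-pixel step (100-pixel overlap), border-anchored final windows, and
# averaging of predictions in overlapping regions.
import numpy as np

def predict_large_image(image, predict, win=512, step=412):
    """`predict` maps a (win, win, bands) patch to per-pixel scores."""
    h, w = image.shape[:2]
    scores = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    ys = list(range(0, h - win + 1, step))
    xs = list(range(0, w - win + 1, step))
    if ys[-1] != h - win:
        ys.append(h - win)                 # anchor last window to bottom border
    if xs[-1] != w - win:
        xs.append(w - win)                 # anchor last window to right border
    for y in ys:
        for x in xs:
            scores[y:y + win, x:x + win] += predict(image[y:y + win, x:x + win])
            counts[y:y + win, x:x + win] += 1
    return scores / counts                 # average the overlapping regions
```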

4.2. Experimental Environment and Training Parameter Settings

The experimental software environment for this study consists of PyTorch 1.20 and Python 3.7. The hardware environment includes an Intel Xeon Silver 4210 CPU and a single NVIDIA RTX 3090 GPU (as shown in Table 5). During training, the batch size for input images is set to 8, and the network is trained for 150 epochs, saving the weights of the best epoch.

4.3. Experimental Results and the Analysis

In this paper, we adopt class pixel accuracy (CPA), intersection over union (IoU), recall, and the F1 score (F1) as the evaluation metrics. The specific mathematical expressions are as follows:
CPA = TP / (TP + FP)        (6)

IoU = TP / (TP + FP + FN)        (7)

Recall = TP / (TP + FN)        (8)

F1 = 2 × (CPA × Recall) / (CPA + Recall)        (9)
where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. To better illustrate the relationships between TP, FP, FN, and TN, we present the confusion matrix in Table 6.
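For reference, these metrics can be computed directly from binary prediction and ground-truth masks, as in the short sketch below.

```python
# Computing CPA, IoU, Recall, and F1 (Equations (6)-(9)) from boolean masks,
# where True marks bare soil pixels.
import numpy as np

def metrics(pred, truth):
    tp = np.sum(pred & truth)          # true positives
    fp = np.sum(pred & ~truth)         # false positives
    fn = np.sum(~pred & truth)         # false negatives
    cpa = tp / (tp + fp)               # class pixel accuracy (precision)
    iou = tp / (tp + fp + fn)          # intersection over union
    recall = tp / (tp + fn)
    f1 = 2 * cpa * recall / (cpa + recall)
    return cpa, iou, recall, f1
```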

4.3.1. Bare Soil Extraction by Different Networks

To validate the effectiveness of the proposed network, comparative experiments are conducted with DeepLabV3+ [40], DA-Net [41], BuildFormer [42], and YOSO [43]. Independent testing is performed on two remote sensing images of the Chang-Zhu-Tan region, whose sizes are 5000 × 4000 pixels and 5000 × 5000 pixels, respectively. These two scenes contain rich bare soil information and different complex backgrounds, making them representative test cases. To evaluate the experimental results clearly, several typical regions are selected for analysis from each of the two scenes (as shown in Figure 9). Furthermore, to demonstrate the effectiveness of the proposed network, BoTNet50 and BoTNet101 (hereinafter referred to as B50 and B101) are chosen as the backbone networks of HA-Net. Typical scene experimental results are presented in Figure 10 and Figure 11. In these figures, bare soil regions are indicated in an orange-red color, while the yellow and light blue boxes mainly demonstrate missed detections and false alarms.
Figure 10 depicts the extraction results of the different networks for three representative regions in Scene I. From Figure 10a, it can be observed that all networks produce few false alarms. However, in the complex region within the red box in (a), DA-Net misidentifies grassland as bare soil, while BuildFormer extracts the bare soil in this area incompletely. DeepLabV3+, HA-Net, and YOSO can effectively distinguish bare soil from grassland. In the areas surrounding buildings, DeepLabV3+, DA-Net, BuildFormer, and YOSO fail to extract bare soil effectively, resulting in significant omissions, which does not meet the requirements for continuous monitoring and management of bare soil. In contrast, HA-Net effectively extracts small-scale bare soil near buildings.
In the green boxes of Figure 10b,c, within the clusters of buildings, it can be observed that due to the elongated shape of roads and some buildings and their similar texture features to bare soil, all networks misclassify buildings and roads to varying degrees as bare soil. HA-Net has a relatively lower false alarm rate, with only a few roads misidentified as bare soil.
For the areas of the red boxes in Figure 10b, due to the complex mixture of bare soil and background, the comparison networks fail to effectively identify bare soil within the complex region. In the red boxes in Figure 10c, DeepLabV3+ and DA-Net incorrectly identify all areas as bare soil in complex backgrounds, while BuildFormer and YOSO can accurately extract bare soil. The HA-Net-B50 and HA-Net-B101 models can accurately extract bare soil from the red boxes in (b) and (c) despite the complex environment.
Figure 11 shows the results of the different networks for extracting bare soil in the two display areas of Scene II. For the green-boxed area of the ground truth in Figure 11a, DeepLabV3+, DA-Net, YOSO, and BuildFormer identify the background (impervious surface) as bare soil to varying degrees, indicating that these four networks fail to distinguish bare soil from the background, which is not conducive to continuous monitoring of bare soil areas. In contrast, HA-Net identifies bare soil information efficiently thanks to the MHSA, SIPM, and CIEM.
From Figure 11b, we can observe that both HA-Net models effectively identify the elongated bare soil areas on either side of the road. The inclusion of the SRUM in HA-Net likely also enhances its ability to detect scattered small bare soil areas. In contrast, the other four networks largely fail to detect these bare soil areas along the road and the small scattered patches. Additionally, BuildFormer misses many bare soil areas, which can be attributed to its lack of local receptive fields.
To better evaluate the performance of the different networks in extracting bare soil from remote sensing images, the five networks are also compared in terms of extraction accuracy metrics. Table 7 gives the extraction accuracy of the different networks in the two experiments.
Based on Table 7, HA-Net outperforms the other four networks in average CPA, IoU, Recall, and F1 across the two scenes, with improvements of 0.763%, 4.756%, 4.077%, and 2.986% over the second-best network, respectively. However, in Scene I, HA-Net shows a lower CPA than BuildFormer; according to the metric formulas, this is attributable to BuildFormer's lower false detection rate in this independent test. At the same time, BuildFormer's significantly higher missed detection rate leads to a lower Recall, indicating that it misses many bare soil areas. YOSO achieves the second-best average IoU, Recall, and F1; although lower than HA-Net's, these values still demonstrate good bare soil extraction capability. On average, DeepLabV3+ and DA-Net perform worse on these metrics than HA-Net, and their lower CPA and Recall indicate less accurate bare soil extraction. HA-Net achieves the highest F1 score in Table 7, indicating the superior overall quality of the model. Between the proposed variants, HA-Net-B101 generally exhibits slightly higher metrics than HA-Net-B50, but BoTNet101, with more convolutional layers than BoTNet50, inevitably incurs more computational cost.

4.3.2. Ablation Study

An ablation study is a systematic method that involves gradually removing different modules from a model to observe changes in its performance. This experimental design allows for a better understanding of the working principles of the model and identifies which parts play a critical role in improving its performance. Therefore, in this study, we will also conduct an ablation study to validate the effectiveness of the proposed model and assess the importance of its various components.
From Table 8, we can observe that without any modules (SIPM, CIEM, and SRUM) added, the baseline performance of BoTNet is relatively weak, with IoU slightly higher than DeepLabV3+, DA-Net, and BuildFormer. After adding the SIPM, the performance of HA-Net improves significantly. Although adding the CIEM results in less improvement compared to the SIPM, it still provides noticeable gains. When the SRUM is added alone, the performance improvement is relatively small, with HA-Net’s metrics lower than when the SIPM and the CIEM are added separately. This is because the SRUM focuses more on enhancing the network’s detail information for detecting small target areas. When both the SIPM and the CIEM are added simultaneously, the performance improvement is the most significant, approaching the complete performance of HA-Net. This indicates that the SIPM and the CIEM improve spatial and channel information, respectively, collectively enhancing the effectiveness of bare soil extraction.

4.3.3. Generalization Ability Experiments

To further validate the performance of the proposed network, we conducted qualitative and quantitative experiments using optical remote sensing data with a resolution of 1.2 m from Google Earth. The longitude range is 113.27178°E to 113.33771°E, and the latitude range is 28.19092°N to 28.23761°N. On these data, the five networks, including HA-Net, are evaluated, and the results are analyzed carefully.
Figure 12 shows the bare soil extraction results of the different networks for typical regions in Scene III. According to the results, DeepLabV3+ misses some bare soil areas along the sides of roads (the yellow boxes in Figure 12c). Additionally, both DeepLabV3+ and DA-Net (Figure 12d) incorrectly identify certain buildings as bare soil (blue boxes). BuildFormer (Figure 12e) demonstrates relatively poor generalization on this data set, with several obvious missed detection areas (the yellow boxes). YOSO (Figure 12f) performs well overall but also misses some bare soil regions, primarily along roadsides and around buildings (the yellow boxes). HA-Net-B50 (Figure 12g) exhibits a few missed detections, while HA-Net-B101 shows a notable false positive. Overall, HA-Net demonstrates excellent generalization capability in this experiment. It is also noteworthy that all networks, including HA-Net, misclassify roads as bare soil to varying degrees.
Table 9 indicates that our method achieves the best performance, with the IoU exceeding YOSO by 4.545%, the F1 score surpassing YOSO by 2.68%, CPA reaching an optimal 92.851%, and Recall attaining the highest value of 93.391%. Among the HA-Net models, HA-Net-B50 has a slightly higher CPA compared to HA-Net-B101, but the other three metrics are lower than HA-Net-B101. This table further confirms that the proposed method demonstrates strong robustness, effectively extracting bare soil across different sensors and resolution scenarios.
In summary, HA-Net can achieve better extraction performance. This allows us to accurately extract real bare soil from remote sensing images, reducing the costs associated with manual field surveys and improving the efficiency of bare soil monitoring. This contributes to the rational planning and management of bare soil.

5. Discussion

Traditional semantic segmentation tasks, such as classifying buildings and roads, generally involve regular polygonal shapes and consistent structures, making classification relatively straightforward. Identifying bare soil is more challenging, however, because of its complex shapes and varying sizes, and because it often serves as the background for other surface objects. These objects, such as rocks, buildings, and roads, are typically distributed in a disorderly manner and can disrupt the integrity of bare soil, posing significant challenges to the accuracy of bare soil identification.
In this study, we considered how to accurately identify bare soil information amidst complex backgrounds, while simultaneously extracting multiscale, especially small-scale bare soil areas. We proposed a bare soil identification framework to extract bare soil areas from CBERS-04A imagery. In the encoder, we introduced the Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM) to identify multiscale bare soil information and suppress noise information. In the decoder, we proposed the Semantic Restructuring-based Upsampling Module (SRUM) to integrate high-level semantic information and low-level detail information, thereby enhancing the extraction of small-scale bare soil. Compared to CNN and ViT models, our model combines the advantages of a CNN and a Transformer, retaining the efficient feature extraction capability of a CNN for local features and the global information capturing ability of a Transformer.
The CNN models used for comparison in this research, namely DeepLabV3+, DA-Net, and YOSO, rely solely on convolutional layers for extracting image features and do not incorporate Transformer layers. These models are limited by the small receptive fields of convolutional layers, which cannot accurately reflect the spatial continuity of bare soil in complex backgrounds. BuildFormer, as a purely ViT-type model, can achieve global receptive field coverage through MHSA when constructing long-range dependencies, but it lacks the inductive bias of CNNs; its generalization ability is therefore relatively weak, resulting in poorer performance metrics. In HA-Net, the Transformer layers extract global information, which the SIPM and the CIEM then enhance to strengthen the extraction of local bare soil information. Consequently, in bare soil recognition tasks, appropriately combining CNNs with Transformers is beneficial for bare soil detection and continuous monitoring.
Although our model demonstrates excellent performance under most testing cases, it does have certain limitations. While it performs well under standard lighting conditions, HA-Net may be impacted by variations in illumination, such as strong light or shadowed areas, which can lead to inaccuracies in feature extraction. In the future, we will focus on incorporating multispectral data to create a bare soil data set, while simultaneously introducing new attention mechanisms, thereby enhancing the model’s robustness to improve the extraction performance for bare soil in different lighting conditions.
In Figure 10b,c and Figure 11, all network models, including HA-Net, tend to misclassify roads as bare soil to some extent. This issue arises not only due to the elongated shape of the roads but also because their texture features closely resemble those of bare soil. Additionally, the accumulation of dust from bare soil, driven by numerous vehicles and wind, on the road surface has caused the color characteristics of cement roads to increasingly resemble those of bare soil, thereby presenting significant challenges for accurate bare soil identification. Therefore, a key task is to investigate how to enhance the accuracy of bare soil extraction when feature similarities of the background are very high. This task is not only crucial for bare soil identification but also has broad applicability to other remote sensing tasks that involve extracting foreground objects from visually similar backgrounds.
In this study, we used manually annotated ground truth data to train and evaluate the model. Bare soil areas often exhibit complex visual features, such as subtle variations in color, texture, and shape, which are crucial for accurate segmentation. If some annotations are incorrect, the model learns incorrect features during training; affected by these erroneous annotations, it fails to accurately capture the inherent characteristics of bare soil. In short, annotation accuracy directly impacts the model's feature extraction and generalization capabilities. Therefore, in future work, we will further optimize the annotation process and improve data quality, which can enhance the model's performance and reliability in practical applications.
In future work, we plan to extend HA-Net to other satellite data, such as data from satellites like GF-2 [44]. Additionally, considering that optical remote sensing images often have cloud coverage, we should further use remote sensing images with a certain cloud coverage rate to extract bare soil. This will facilitate the further evaluation and optimization of the HA-Net model’s performance, ensuring high accuracy in the extraction process.

6. Conclusions

In this paper, we integrate deep learning with geospatial analysis and present a framework for automatic bare soil area detection called HA-Net, which possesses robust feature extraction capabilities, and its potential application is validated in bare soil identification. The proposed method has achieved excellent results in both qualitative and quantitative analysis, achieving high-precision extraction of bare soil areas.
This paper successfully integrates deep learning with the geographical spatial information of high-resolution remote sensing images, providing a reference for other scholars in bare soil monitoring and promoting the application of deep learning in bare soil extraction. On the other hand, the approach of combining deep learning techniques with domain knowledge, particularly in geospatial information science, is beneficial for future research in remote sensing image analysis.

Author Contributions

Conceptualization, J.Z. and L.C.; methodology, J.Z.; supervision, D.D. and L.C.; software, J.Z.; validation, H.C., Y.J. and X.L.; formal analysis, J.Z. and D.D.; data curation, J.Z. and D.D.; visualization, J.Z. and X.L.; writing—original draft preparation, J.Z.; writing—review and editing, J.Z., D.D. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (grant number: 42101468).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors express their sincere gratitude to the Hunan Research Institute of Meteorological Sciences for providing the data used in this study, and also appreciate the valuable comments and suggestions provided by the anonymous reviewers for this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ellis, E.; Pontius, R. Land-use and land-cover change. Encycl. Earth 2007, 1, 1–4. [Google Scholar]
  2. Lambin, E.F.; Turner, B.L.; Geist, H.J.; Agbola, S.B.; Angelsen, A.; Bruce, J.W.; Coomes, O.T.; Dirzo, R.; Fischer, G.; Folke, C.; et al. The causes of land-use and land-cover change: Moving beyond the myths. Glob. Environ. Chang. 2001, 11, 261–269. [Google Scholar] [CrossRef]
  3. Müller, H.; Griffiths, P.; Hostert, P. Long-term deforestation dynamics in the Brazilian Amazon—Uncovering historic frontier development along the Cuiabá–Santarém highway. Int. J. Appl. Earth Obs. 2016, 44, 61–69. [Google Scholar] [CrossRef]
  4. Xian, G.; Crane, M. An analysis of urban thermal characteristics and associated land cover in Tampa Bay and Las Vegas using Landsat satellite data. Remote Sens. Environ. 2006, 104, 147–156. [Google Scholar] [CrossRef]
  5. Silvero, N.E.Q.; Demattê, J.A.M.; Amorim, M.T.A.; dos Santos, N.V.; Rizzo, R.; Safanelli, J.L.; Poppiel, R.R.; de Sousa Mendes, W.; Bonfatti, B.R. Soil variability and quantification based on Sentinel-2 and Landsat-8 bare soil images: A comparison. Remote Sens. Environ. 2021, 252, 112117. [Google Scholar] [CrossRef]
  6. Wang, W.X.; Chai, F.H.; Ren, Z.H.; Wang, X.; Wang, S.; Li, H.; Gao, R.; Xue, L.; Peng, L.; Zhang, X.; et al. Process, achievements and experience of air pollution control in China since the founding of the People's Republic of China 70 years ago. Res. Environ. Sci. 2019, 32, 1621–1635. [Google Scholar]
  7. Wuepper, D.; Borrelli, P.; Finger, R. Countries and the global rate of soil erosion. Nat. Sustain. 2020, 3, 51–55. [Google Scholar] [CrossRef]
  8. Xu, H. Dynamics of Bare Soil in A Typical Reddish Soil Loss Region of Southern China: Changting County, Fujian Province. Sci. Geogr. Sin. 2013, 33, 489–496. [Google Scholar]
  9. Liang, S.; Fang, H.; Chen, M. Atmospheric correction of landsat ETM+ land surface imagery-Part I: Methods. IEEE Trans. Geosci. Remote Sens. 2001, 39, 2490–2498. [Google Scholar] [CrossRef]
  10. Pianalto, F.S.; Yool, S.R. Monitoring fugitive dust emission sources arising from construction: A remote-sensing approach. GIsci. Remote Sens. 2013, 50, 251–270. [Google Scholar] [CrossRef]
  11. Dou, P.; Chen, Y. Dynamic monitoring of land-use/land-cover change and urban expansion in shenzhen using landsat imagery from 1988 to 2015. Int. J. Remote Sens. 2017, 38, 5388–5407. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Shen, W.; Li, M.; Lv, Y. Assessing spatio-temporal changes in forest cover and fragmentation under urban expansion in Nanjing, eastern China, from long-term Landsat observations (1987–2017). Appl. Geogr. 2020, 117, 102190. [Google Scholar] [CrossRef]
  13. Chai, B.; Li, P. Annual Urban Expansion Extraction and Spatio-Temporal Analysis Using Landsat Time Series Data: A Case Study of Tianjin. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2644–2656. [Google Scholar] [CrossRef]
  14. Schultz, M.; Clevers, J.G.P.W.; Carter, S.; Verbesselt, J.; Avitabile, V.; Quang, H.V.; Herold, M. Performance of vegetation indices from Landsat time series in deforestation monitoring. Int. J. Appl. Earth Obs. Geoinf. 2016, 52, 318–327. [Google Scholar] [CrossRef]
  15. Hamunyela, E.; Verbesselt, J.; Herold, M. Using spatial context to improve early detection of deforestation from Landsat time series. Remote Sens. Environ. 2016, 172, 126–138. [Google Scholar] [CrossRef]
  16. Pendrill, F.; Gardner, T.A.; Meyfroidt, P.; Persson, U.M.; Adams, J.; Azevedo, T.; Bastos Lima, M.G.; Baumann, M.; Curtis, P.G.; Sy, V.D.; et al. Disentangling the numbers behind agriculture-driven tropical deforestation. Science 2022, 377, eabm9267. [Google Scholar] [CrossRef]
  17. Zhu, F.; Wang, H.; Li, M.; Diao, J.; Shen, W.; Zhang, Y.; Wu, H. Characterizing the effects of climate change on short-term post-disturbance forest recovery in southern China from Landsat time-series observations (1988–2016). Front. Earth Sci. 2020, 14, 816–827. [Google Scholar] [CrossRef]
  18. Mo, Y.; Kearney, M.S.; Turner, R.E. Feedback of coastal marshes to climate change: Long-term phenological shifts. Ecol. Evol. 2019, 9, 6785–6797. [Google Scholar] [CrossRef]
  19. Rikimaru, A.; Roy, P.S.; Miyatake, S. Tropical forest cover density mapping. Trop. Ecol. 2002, 43, 39–47. [Google Scholar]
  20. Li, S.; Chen, X. A new bare-soil index for rapid mapping developing areas using landsat 8 data. ISPRS Arch. 2014, 40, 139–144. [Google Scholar] [CrossRef]
  21. Nguyen, C.T.; Chidthaisong, A.; Kieu Diem, P.; Huo, L.Z. A modified bare soil index to identify bare land features during agricultural fallow-period in southeast Asia using Landsat 8. Land 2021, 10, 231. [Google Scholar] [CrossRef]
  22. Rasul, A.; Balzter, H.; Ibrahim, G.R.F.; Hameed, H.M.; Wheeler, J.; Adamu, B.; Ibrahim, S.; Najmaddin, P.M. Applying Built-Up and Bare-Soil Indices from Landsat 8 to Cities in Dry Climates. Land 2018, 7, 81. [Google Scholar] [CrossRef]
  23. Chen, L.; Cai, X.; Xing, J.; Li, Z.; Zhu, W.; Yuan, Z.; Fang, Z. Towards transparent deep learning for surface water detection from SAR imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103287. [Google Scholar] [CrossRef]
  24. Chen, L.; Zhang, P.; Xing, J.; Li, Z.; Xing, X.; Yuan, Z. A multi-scale deep neural network for water detection from SAR images in the mountainous areas. Remote Sens. 2020, 12, 3205. [Google Scholar] [CrossRef]
  25. Chen, L.; Weng, T.; Xing, J.; Li, Z.; Yuan, Z.; Pan, Z.; Tan, S.; Luo, R. Employing deep learning for automatic river bridge detection from SAR images based on adaptively effective feature fusion. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102425. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  27. Chen, J.; Yi, J.; Chen, A.; Lin, H. SRCBTFusion-Net: An Efficient Fusion Architecture via Stacked Residual Convolution Blocks and Transformer for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  28. Zhao, H.; Chen, X. Use of normalized difference bareness index in quickly mapping bare areas from TM/ETM+. In Proceedings of the International Geoscience and Remote Sensing Symposium, Seoul, Republic of Korea, 29 July 2005. [Google Scholar]
  29. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
  30. Deng, Y.; Wu, C.; Li, M.; Chen, R. RNDSI: A ratio normalized difference soil index for remote sensing of urban/suburban environments. Int. J. Appl. Earth Obs. Geoinf. 2015, 39, 40–48. [Google Scholar] [CrossRef]
  31. He, C.; Liu, Y.; Wang, D.; Liu, S.; Yu, L.; Ren, Y. Automatic extraction of bare soil land from high-resolution remote sensing images based on semantic segmentation with deep learning. Remote Sens. 2023, 15, 1646. [Google Scholar] [CrossRef]
  32. Liu, D.; Chen, N. Satellite monitoring of urban land change in the middle Yangtze River Basin urban agglomeration, China between 2000 and 2016. Remote Sens. 2017, 9, 1086. [Google Scholar] [CrossRef]
  33. Jesus, G.T.; Itami, S.N.; Segantine, T.Y.F.; Junior, M.F.C. Innovation path and contingencies in the China-Brazil Earth Resources Satellite program. Acta Astronaut. 2021, 178, 382–391. [Google Scholar] [CrossRef]
  34. Cai, X.; Chen, L.; Xing, J.; Xing, X.; Luo, R.; Tan, S.; Wang, J. Automatic extraction of layover from InSAR imagery based on multilayer feature fusion attention mechanism. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  35. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  37. Chen, L.; Cai, X.; Li, Z.; Xing, J.; Ai, J. Where is my attention? An explainable AI exploration in water detection from SAR imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103878. [Google Scholar] [CrossRef]
  38. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  39. Daubechies, I.; DeVore, R.; Foucart, S.; Hanin, B.; Petrova, G. Nonlinear approximation and (deep) ReLU networks. Constr. Approx. 2022, 55, 127–172. [Google Scholar] [CrossRef]
  40. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  41. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019. [Google Scholar]
  42. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  43. Hu, J.; Huang, L.; Ren, T.; Zhang, S.; Ji, R.; Cao, L. You only segment once: Towards real-time panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  44. Tong, X.Y.; Lu, Q.; Xia, G.S.; Zhang, L. Large-scale land cover classification in Gaofen-2 satellite imagery. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018. [Google Scholar]
Figure 1. Study Regions in Hunan Province.
Figure 2. The sample examples of the bare soil data set. The red color represents areas of bare soil, while the black color represents background areas.
Figure 3. Overall architecture of the HA-Net.
Figure 4. Overall structure of BoTNet.
Figure 5. The structure of the SIPM.
Figure 6. The structure of the CIEM.
Figure 7. The structure of the SRUM.
Figure 8. The stitching strategy: (a) sliding window stitching strategy; (b) final stitching strategy.
Figure 9. Typical bare soil scene images.
Figure 10. Different network extraction results for typical regions in test Scene I. (a–c) correspond to the three regions shown in Scene I of Figure 9. The orange-red regions indicate areas identified as bare soil by different networks. The yellow and light blue boxes mainly demonstrate missed detections and false alarms.
Figure 11. Different network extraction results for typical regions in test Scene II. (a,b) correspond to the two regions shown in Scene II of Figure 9. The orange-red regions indicate areas identified as bare soil by the different networks. The yellow and light blue boxes mark missed detections and false alarms.
Figure 12. Different network extraction results for typical regions in Scene III: (a) Scene III; (b) ground truth for typical regions; (c–h) show the bare soil extraction results of DeepLabV3+, DA-Net, BuildFormer, YOSO, HA-Net-B50, and HA-Net-B101, respectively.
Table 1. Optical payload parameters of the CBERS-04A satellite.

| Payload | Spectral Band | Spectral Range (μm) | Spatial Resolution (m) | Swath Width (km) | Revisit Cycle (Days) |
|---|---|---|---|---|---|
| WPM | 1 | 0.45~0.90 | 2 | 90 | 31 |
| | 2 | 0.45~0.52 | 8 | | |
| | 3 | 0.52~0.59 | 8 | | |
| | 4 | 0.63~0.69 | 8 | | |
| | 5 | 0.77~0.89 | 8 | | |
| MUX | 6 | 0.45~0.52 | 17 | 90 | 31 |
| | 7 | 0.52~0.59 | 17 | | |
| | 8 | 0.63~0.69 | 17 | | |
| | 9 | 0.77~0.89 | 17 | | |
| WFI | 10 | 0.45~0.52 | 60 | 685 | 5 |
| | 11 | 0.52~0.59 | 60 | | |
| | 12 | 0.63~0.69 | 60 | | |
| | 13 | 0.77~0.89 | 60 | | |
Table 2. Parameters of the SIPM.

Input: (16, 16, 2048)

| Layer | Parameters | Output Shape |
|---|---|---|
| Branch-1 Conv2D | Filters = 128, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 128) |
| Branch-2 Conv2D | Filters = 256, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 256) |
| Branch-2 Conv2D | Filters = 128, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 128) |
| Branch-3 Conv2D | Filters = 256, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 256) |
| Branch-3 Conv2D | Filters = 128, Kernel_size = 3, Padding = 6, Dilation = 6 | (16, 16, 128) |
| Concat | None | (16, 16, 384) |
| Conv2D | Filters = 1, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 1) |
| Sigmoid | None | (16, 16, 1) |
| Dot Product and SUM | None | (16, 16, 2048) |
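Table 2 specifies the SIPM layer by layer, so it can be assembled mechanically; the following is a minimal PyTorch sketch of that assembly. Two points are our reading rather than statements from the table: Branch-1 is given 128 filters, consistent with its listed output shape and the 384-channel concatenation, and the final "Dot Product and SUM" row is interpreted as input × attention + input. Normalization and activations between convolutions are omitted.

```python
# Minimal sketch of the SIPM per Table 2 (spatial attention over 2048-channel input).
import torch
import torch.nn as nn

class SIPM(nn.Module):
    def __init__(self, in_channels: int = 2048):
        super().__init__()
        # Branch 1: a single dilated 3x3 convolution (dilation = 3).
        self.branch1 = nn.Conv2d(in_channels, 128, 3, padding=3, dilation=3)
        # Branch 2: two stacked dilated 3x3 convolutions (both dilation = 3).
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=3, dilation=3),
        )
        # Branch 3: dilation 3 followed by dilation 6 for a larger receptive field.
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=6, dilation=6),
        )
        # Fuse the 3 x 128 = 384 concatenated channels into one spatial map.
        self.fuse = nn.Conv2d(384, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branches = [self.branch1(x), self.branch2(x), self.branch3(x)]
        attn = torch.sigmoid(self.fuse(torch.cat(branches, dim=1)))
        # "Dot Product and SUM": weight the input by the map, then add it back.
        return x * attn + x

# Shape check: (batch, 2048, 16, 16) in and out, matching Table 2.
print(SIPM()(torch.randn(1, 2048, 16, 16)).shape)
```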
Table 3. Parameters of the CIEM.

Input: (16, 16, 2048)

| Layer | Parameters | Output Shape |
|---|---|---|
| Split | None | (16, 16, 1024) × 2 |
| Depthwise Separable Conv2D 1 | Filters = 1024, Kernel_size = 3, Padding = 1, Dilation = 1; Filters = 1024, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 1024) |
| Depthwise Separable Conv2D 1 | Filters = 1024, Kernel_size = 7, Padding = 3, Dilation = 1; Filters = 1024, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 1024) |
| SUM | None | (16, 16, 1024) |
| Avg Pool × 2 | Kernel_size = 16 | (1, 1, 1024) × 2 |
| Max Pool × 2 | Kernel_size = 16 | (1, 1, 1024) × 2 |
| Sigmoid × 2 | None | (1, 1, 1024) × 2 |
| Dot Product | None | (16, 16, 1024) |
| ReLU | None | (16, 16, 1024) |
| Dot Product | None | (16, 16, 1024) |
| Conv2D | Filters = 2048, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 2048) |
1 A depthwise separable convolution consists of two steps: a depthwise convolution followed by a pointwise convolution.
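Table 3 fixes the CIEM's layer inventory (channel split, 3×3 and 7×7 depthwise separable convolutions, global average and max pooling, two sigmoid gates, two element-wise products, and a 1×1 expansion convolution) but, in this flattened form, not the exact wiring between the pooled gates and the feature maps. The sketch below is therefore one plausible reading, with the gating order flagged as an assumption in the comments.

```python
# Hedged sketch of the CIEM per Table 3; the gate wiring is an assumption.
import torch
import torch.nn as nn

def ds_conv(channels: int, kernel_size: int) -> nn.Sequential:
    """Depthwise separable conv: depthwise k x k, then pointwise 1 x 1."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size,
                  padding=kernel_size // 2, groups=channels),
        nn.Conv2d(channels, channels, kernel_size=1),
    )

class CIEM(nn.Module):
    def __init__(self, in_channels: int = 2048):
        super().__init__()
        half = in_channels // 2
        self.local_branch = ds_conv(half, 3)    # smaller receptive field
        self.context_branch = ds_conv(half, 7)  # larger receptive field
        self.expand = nn.Conv2d(half, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)  # Split: 2048 -> 2 x 1024 channels
        fused = self.local_branch(x1) + self.context_branch(x2)  # SUM row
        # Two channel gates from global average and max statistics (Sigmoid x 2).
        avg_gate = torch.sigmoid(fused.mean(dim=(2, 3), keepdim=True))
        max_gate = torch.sigmoid(fused.amax(dim=(2, 3), keepdim=True))
        # Dot Product -> ReLU -> Dot Product (one plausible order), then expand.
        return self.expand(torch.relu(fused * avg_gate) * max_gate)

print(CIEM()(torch.randn(1, 2048, 16, 16)).shape)  # (1, 2048, 16, 16)
```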
Table 4. Parameters of the SRUM.

Input: (N, N, C) and (2N, 2N, C/2) 1

| Feature | Layer | Parameters | Output Shape |
|---|---|---|---|
| High-level Features | Pixel rearrangement | Upsampling factor = 2 | (2N, 2N, C/4) |
| | Transposed Conv2D | Filters = C/4, Kernel_size = 2, Stride = 2, Padding = 0, Dilation = 1 | (2N, 2N, C/4) |
| | Concat | None | (2N, 2N, C/2) |
| | Conv2D | Filters = C/2, Kernel_size = 1, Padding = 0, Dilation = 1 | (2N, 2N, C/2) |
| Low-level Features | Conv2D | Filters = C/2, Kernel_size = 3, Padding = 1, Dilation = 1 | (2N, 2N, C/2) |
| | SUM | None | (2N, 2N, C/2) |
1 The input size for high-level features is (N, N, C), and for low-level features is (2N, 2N, C/2), where N represents the input height or width, and C represents the number of input channels.
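Table 4 can be read as two parallel upsampling paths for the high-level features, a parameter-free pixel rearrangement and a learned transposed convolution, whose outputs are concatenated, fused by a 1×1 convolution, and summed with a 3×3-projected low-level skip connection. A minimal sketch under that reading, with pixel rearrangement implemented as nn.PixelShuffle (which matches the (N, N, C) → (2N, 2N, C/4) shape in the table) and normalization/activations omitted:

```python
# Minimal sketch of the SRUM per Table 4.
import torch
import torch.nn as nn

class SRUM(nn.Module):
    def __init__(self, high_channels: int):
        super().__init__()
        c = high_channels
        self.pixel_shuffle = nn.PixelShuffle(2)                   # (N,N,C) -> (2N,2N,C/4)
        self.deconv = nn.ConvTranspose2d(c, c // 4, 2, stride=2)  # learned 2x upsampling
        self.fuse = nn.Conv2d(c // 2, c // 2, kernel_size=1)      # fuse after concat
        self.low_proj = nn.Conv2d(c // 2, c // 2, 3, padding=1)   # low-level projection

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # Concatenate the two upsampled views: C/4 + C/4 = C/2 channels.
        up = torch.cat([self.pixel_shuffle(high), self.deconv(high)], dim=1)
        return self.fuse(up) + self.low_proj(low)  # SUM with the skip connection

high = torch.randn(1, 2048, 16, 16)  # (N, N, C) with N = 16, C = 2048
low = torch.randn(1, 1024, 32, 32)   # (2N, 2N, C/2)
print(SRUM(2048)(high, low).shape)   # torch.Size([1, 1024, 32, 32])
```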
Table 5. Experimental hardware and software configuration.

| Item | Configuration |
|---|---|
| Framework | PyTorch 1.20 |
| Language | Python 3.7 |
| CPU | Intel Xeon Silver 4210 |
| GPU (Single) | NVIDIA RTX 3090 |
Table 6. Confusion matrix.

| Ground Truth | Predicted: Bare Soil | Predicted: Non-Bare Soil |
|---|---|---|
| Bare Soil | TP | FN |
| Non-Bare Soil | FP | TN |
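The accuracy figures in Tables 7–9 follow from these counts; TN is not needed for the four reported metrics. CPA (class pixel accuracy) is interpreted here as the precision of the bare soil class, which is an assumption consistent with the standard definitions below. A small reference implementation with illustrative counts:

```python
# Segmentation metrics from the Table 6 confusion-matrix counts.
def segmentation_metrics(tp: int, fp: int, fn: int) -> dict:
    cpa = tp / (tp + fp)        # class pixel accuracy (precision of bare soil)
    recall = tp / (tp + fn)     # fraction of true bare soil pixels recovered
    iou = tp / (tp + fp + fn)   # intersection over union
    f1 = 2 * cpa * recall / (cpa + recall)  # harmonic mean of precision/recall
    return {"CPA": cpa, "IoU": iou, "Recall": recall, "F1": f1}

# Illustrative counts only (not values from the paper).
print(segmentation_metrics(tp=900, fp=90, fn=120))
```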
Table 7. Comparison of bare soil extraction accuracy among different networks.

| Scene | Method | CPA (%) | IoU (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|---|
| Scene I | DeepLabV3+ | 89.716 | 75.628 | 82.807 | 86.123 |
| | DA-Net | 89.196 | 74.780 | 82.228 | 85.570 |
| | BuildFormer | **92.784** | 74.572 | 79.163 | 85.434 |
| | YOSO | 90.879 | 77.118 | 83.587 | 87.081 |
| | HA-Net-B50 | 92.216 | 81.423 | 87.432 | 89.760 |
| | HA-Net-B101 | 92.595 | **81.833** | **87.564** | **90.009** |
| Scene II | DeepLabV3+ | 85.412 | 73.351 | 83.856 | 84.627 |
| | DA-Net | 86.240 | 73.743 | 83.577 | 84.887 |
| | BuildFormer | 87.463 | 74.683 | 83.635 | 85.507 |
| | YOSO | 87.315 | 75.171 | 84.386 | 85.826 |
| | HA-Net-B50 | 88.689 | 78.680 | 87.455 | 88.068 |
| | HA-Net-B101 | **89.178** | **79.969** | **88.564** | **88.870** |
| Average for Two Scenes | DeepLabV3+ | 87.564 | 74.490 | 83.332 | 85.375 |
| | DA-Net | 87.718 | 74.262 | 82.903 | 85.229 |
| | BuildFormer | 90.124 | 74.628 | 81.399 | 85.471 |
| | YOSO | 89.097 | 76.145 | 83.987 | 86.454 |
| | HA-Net-B50 | 90.453 | 80.052 | 87.444 | 88.914 |
| | HA-Net-B101 | **90.887** | **80.901** | **88.064** | **89.440** |

Bold numbers indicate the best performance in the same group of comparative experiments.
Table 8. Different strategies for the ablation study.

| Scene | Backbone | Module-1 | Module-2 | CPA (%) | IoU (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|---|---|---|
| Average for Two Scenes | BoTNet50 | None | None | 87.261 | 74.562 | 83.696 | 85.427 |
| | BoTNet101 | None | None | 87.753 | 75.233 | 84.087 | 85.866 |
| | BoTNet50 | SIPM | None | 88.785 | 77.256 | 85.642 | 87.167 |
| | BoTNet101 | SIPM | None | 89.210 | 78.018 | 86.167 | 87.648 |
| | BoTNet50 | CIEM | None | 88.478 | 76.774 | 85.341 | 86.858 |
| | BoTNet101 | CIEM | None | 88.762 | 77.478 | 85.926 | 87.305 |
| | BoTNet50 | SRUM | None | 87.694 | 75.327 | 84.261 | 85.927 |
| | BoTNet101 | SRUM | None | 88.200 | 76.150 | 84.825 | 86.460 |
| | BoTNet50 | SIPM | CIEM | 89.547 | 78.691 | 86.658 | 88.070 |
| | BoTNet101 | SIPM | CIEM | 89.729 | 79.189 | 87.090 | 88.383 |
Table 9. Comparison of generalization ability across different networks.

| Method | CPA (%) | IoU (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| DeepLabV3+ | 88.570 | 80.909 | 90.341 | 89.447 |
| DA-Net | 86.174 | 80.371 | 92.270 | 89.118 |
| BuildFormer | 92.579 | 73.352 | 77.935 | 84.628 |
| YOSO | 89.437 | 81.933 | 90.711 | 90.069 |
| HA-Net-B50 | 92.851 | 85.744 | 91.804 | 92.325 |
| HA-Net-B101 | 92.116 | 86.478 | 93.391 | 92.749 |