Article

HA-Net for Bare Soil Extraction Using Optical Remote Sensing Images

1 School of Electrical & Information Engineering, Changsha University of Science & Technology, Changsha 410114, China
2 Hunan Key Laboratory of Meteorological Disaster Prevention and Reduction, Hunan Research Institute of Meteorological Sciences, Changsha 410118, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 3088; https://doi.org/10.3390/rs16163088
Submission received: 17 July 2024 / Revised: 15 August 2024 / Accepted: 18 August 2024 / Published: 21 August 2024
(This article belongs to the Special Issue AI-Driven Satellite Data for Global Environment Monitoring)

Abstract

Bare soil causes soil erosion and contributes to air pollution through the generation of dust, making timely and effective monitoring of bare soil an urgent requirement for environmental management. Although some research has addressed bare soil extraction from high-resolution remote sensing images, great challenges remain, such as complex background interference and small-scale targets. In this regard, the Hybrid Attention Network (HA-Net) is proposed for the automatic extraction of bare soil from high-resolution remote sensing images; it consists of an encoder and a decoder. In the encoder, HA-Net first uses BoTNet for primary feature extraction, producing four levels of features. The highest-level features are then fed into the constructed Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM) to adequately emphasize bare soil information in the spatial and channel dimensions. To improve the detection rate of small-scale bare soil areas, the Semantic Restructuring-based Upsampling Module (SRUM) is proposed for the decoding stage; it utilizes the semantic information of the input features and compensates for the detail lost during downsampling in the encoder. An experiment is performed on high-resolution remote sensing images from the China–Brazil Earth Resources Satellite 04A. The results show that HA-Net clearly outperforms several excellent semantic segmentation networks in bare soil extraction. The average precision and IoU of HA-Net over two scenes reach 90.9% and 80.9%, respectively, demonstrating its excellent performance and its ability to suppress interference from complex backgrounds and to handle multiscale targets. Furthermore, HA-Net may also be applied to the segmentation of other targets in remote sensing images.

1. Introduction

In the last few decades, with the rapid development of the Chinese economy, land cover has undergone unprecedented changes in speed, scale, and spatial extent [1,2]. This transformation not only threatens the health of ecosystems, but also poses severe challenges to human survival and development [3,4]. Bare soil, referring to exposed soil surfaces without vegetation or built-up structures [5], is one of the fundamental biophysical components of land cover and a key factor contributing to air pollution and soil erosion [6,7]. Large areas of bare soil, lacking the stabilizing effect of vegetation roots, become fragmented and loose in structure, leading to loss through water runoff and to serious ecological problems such as dust pollution [8]. Therefore, a precise assessment of bare soil is crucial for preventing soil erosion, managing resources, and protecting the environment.
Currently, investigations of bare soil mainly rely on manual field surveys and hierarchical reporting, which are constrained by factors such as transportation conditions and high labor costs. As a result, survey results often suffer from subjectivity, lack comprehensiveness, and have poor timeliness. Satellite remote sensing, as a macroscopic, rapid, and effective monitoring method, offers advantages such as wide coverage, high accuracy, strong real-time capability, and immunity to human interference [9,10]. It is widely used to monitor land and environmental changes, including urban expansion [11,12,13], deforestation [14,15,16], and the impacts of climate change [17,18]. In recent years, the rapid development of high-resolution satellite systems has provided powerful tools for investigating bare soil on Earth’s surface.
The bare soil index (BSI) is the most commonly used method for extracting bare soil from remote sensing imagery. Its basic principle is to use spectral information to identify and extract bare soil areas. Early studies mainly relied on the BSI for large-scale bare soil extraction. For example, in 2002, Rikimaru et al. [19] proposed a BSI that calculated the difference between the near-infrared and short-wave infrared bands to identify bare soil areas. Scholars then explored more refined bare soil indices to extract different types of bare soil in different regions. For example, in 2014, Li et al. [20] developed a new bare soil index and applied it to extract bare soil in the development areas of the Pearl River Delta region. In 2021, Nguyen et al. [21] introduced a modified bare soil index (MBI) based on Landsat 8 bands to improve the separation of bare soil, demonstrating higher accuracy in bare soil detection. Currently, researchers generally use the shortwave infrared (SWIR) and near-infrared (NIR) bands to construct the BSI [22]. In recent years, however, an increasing number of sensors retain only the red, green, blue, and near-infrared bands plus a panchromatic band, which has limited the applicability of the BSI. In addition, because the BSI typically relies on spectral information alone, it struggles to distinguish land cover types with similar spectral characteristics. It is therefore now difficult to establish an effective BSI.
In recent years, deep learning has made remarkable progress. Compared with traditional machine learning methods, deep learning techniques exhibit strong learning ability, minimal manual intervention, and good adaptability, so they are widely applied to radar and optical remote sensing images for the extraction of surface objects. For instance, deep learning networks have been constructed to extract water bodies and bridges from SAR images, achieving satisfactory results [23,24,25]. Because remote sensing images contain complex information, many current methods combine a Transformer [26] with CNNs to achieve promising results. For example, Chen et al. proposed a hybrid architecture called SRCBTFusion-Net, which integrates a Transformer and a CNN to enhance remote sensing image segmentation performance [27].
Currently, preliminary achievements have been made using deep learning for bare soil detection. However, complex interference in urban/agricultural areas [28] and the multiscale problem (especially the extraction of small-scale bare soil areas) [29] remain important challenges in bare soil extraction. Unlike surface objects such as water, roads, and buildings, which have a uniform overall structure, bare soil is often covered by a large number of other ground objects, such as roads, rocks, and buildings, which are typically unevenly distributed. This mixes bare soil with the background, causing blurred and irregular edges and posing challenges for bare soil extraction. Moreover, the texture features of complex backgrounds (e.g., agricultural areas and impermeable surfaces) are similar to those of bare soil [30], which also makes bare soil difficult to identify. In 2023, He et al. [31] introduced an attention mechanism into deep learning models to distinguish bare soil from the background. However, this CNN-based approach restricts the model's ability to recognize long-range relationships and encode global contextual information, limiting its capability to extract bare soil from more complex backgrounds. Additionally, small-scale bare soil areas arising from natural and anthropogenic factors present a challenge because of their widespread distribution and the difficulty of field investigations. Detecting such small targets is harder still because of their limited semantic information and fewer pixels, which results in relatively poor detection performance [26]. Extracting small-scale bare soil is therefore an issue that currently needs to be addressed.
To address these challenges, high-resolution (2 m) remote sensing images from the China–Brazil Earth Resources Satellite are utilized, and a deep learning network is constructed for bare soil extraction. The main contributions are as follows:
  • The Hybrid Attention Network (HA-Net) is proposed, which possesses excellent feature learning ability for bare soil and effectively suppresses background interference, achieving excellent bare soil extraction from remote sensing images.
  • By introducing attention mechanisms, the Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM) are proposed, which effectively learn multiscale bare soil information and better suppress complex background noise.
  • In the decoder, the Semantic Restructuring-based Upsampling Module (SRUM) is constructed, which compensates for the detail information lost during downsampling in the encoder and improves the detection of small-scale bare soil.
  • HA-Net not only enables high-precision automatic extraction of bare soil, but also has significant application value for the semantic segmentation of other typical targets in remote sensing images.

2. Study Regions and Data

2.1. Study Regions

The research area of this paper covers the cities of Changsha, Zhuzhou, and Xiangtan in Hunan Province, China, as shown in Figure 1. The longitude range is from 112°31′30″E to 113°13′34″E, and the latitude range is from 27.557861°N to 28.554682°N. These three cities are collectively referred to as Chang-Zhu-Tan. Located in the central and eastern part of Hunan Province, the Chang-Zhu-Tan urban agglomeration is an important component of the urban agglomeration in the middle reaches of the Yangtze River. With a total area of 28,000 km2 (with the urban area covering 18,900 km2), it serves as the core area for the development of Hunan Province [32]. With continuous regional development and the unreasonable utilization of land resources, a large amount of bare soil has been generated, leading to air pollution and soil erosion. Therefore, conducting high-precision, regular monitoring of bare soil in the research area is of great practical significance for land resource planning and environmental protection.

2.2. CBERS-04A Data

The China–Brazil Earth Resources Satellite (CBERS) program is a cooperation between the China Academy of Space Technology (CAST) and the National Institute for Space Research (INPE) of Brazil. The program agreement was signed in July 1988 with the aim of establishing a complete remote sensing system (both space and ground segments) to provide multispectral remote sensing images for both countries. The CBERS-04A satellite, also known as the High-Resolution Earth Observation Satellite 04A, was successfully launched by the China National Space Administration on 20 December 2019. It is part of the CBERS series of high-resolution Earth observation satellites, designed to provide high-quality, high-precision remote sensing data for land use, resource surveys, environmental protection, and other fields. The CBERS-04A satellite is equipped with three optical payloads: the Wide-Field Panchromatic and Multispectral Camera (WPM), the Multispectral Camera (MUX), and the Wide-Field Imager (WFI). It also carries a space environment monitoring payload (SEM) and a data collection payload (DCS). Table 1 shows the detailed parameters of the optical payloads of this satellite.

2.3. Data Set

The remote sensing data used in this study were provided by the Hunan Provincial Meteorological Bureau. The images were acquired by the CBERS-04A satellite between March and May 2023, with a spatial resolution of 2 m [33]. To enhance training efficiency, we extracted 40 images of size 5000 × 5000 pixels from the Chang-Zhu-Tan region, each exhibiting different backgrounds and complexities. These images serve as representative examples of actual ground conditions regarding bare soil. During manual labeling, each bare soil data sample was carefully examined and classified based on visible surface features. For example, bare soil surfaces with shrubs and grass were assessed for vegetation density to determine if they should be classified as bare soil. Sparse shrubs and grass do not effectively stabilize the soil, making it prone to dust emissions and soil erosion; therefore, such areas should be classified as bare soil and monitored accordingly.
Subsequently, the ground truth data and corresponding images were cropped into 512 × 512 pixel tiles. Samples were generated using a sliding window approach, and a portion underwent data augmentation, including brightness adjustment, mirroring, flipping, and noise addition. The specific effects are illustrated in Figure 2 (the corresponding labels for the augmented images are not shown). In total, 6382 samples were obtained. The training and validation sets were split in an 8:2 ratio, with additional independent images reserved for the test set. In the ground truth examples in Figure 2, red represents bare soil areas and black denotes background. These examples show that, owing to natural and anthropogenic factors, bare soil exhibits diverse morphologies and varying scales, and its textural features often resemble the background, making extraction challenging.
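For illustration, the tiling and augmentation steps described above could be implemented as in the following minimal sketch; the helper names and augmentation parameters are hypothetical assumptions, not the authors' actual preprocessing code.

```python
# Minimal sketch of the tiling and augmentation described above; helper names
# and parameter values are illustrative assumptions, not the authors' code.
import numpy as np

def tile_pair(image, label, size=512, stride=512):
    """Crop an image/label pair into size x size tiles with a sliding window."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            tiles.append((image[y:y + size, x:x + size],
                          label[y:y + size, x:x + size]))
    return tiles

def augment(image, label, rng):
    """Brightness adjustment, mirroring/flipping, and additive Gaussian noise."""
    if rng.random() < 0.5:                                    # horizontal mirror
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                                    # vertical flip
        image, label = image[::-1, :], label[::-1, :]
    image = image.astype(np.float32) * rng.uniform(0.8, 1.2)  # brightness jitter
    image += rng.normal(0.0, 2.0, image.shape)                # additive noise
    return np.clip(image, 0, 255).astype(np.uint8), label
```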

3. Methods

3.1. Overall Framework of HA-Net

Figure 3 illustrates the overall structure of the network. The training data are initially fed into the backbone network for feature extraction, which generates four feature maps at varying levels of semantic detail. The resolutions of these features are indicated in the figure. Among the outputs of the backbone network, the lowest-resolution feature map contains the richest semantic information, while the highest-resolution feature map contains the least semantic information but the most spatial detail.
Following the preliminary feature extraction, the highest-level feature map is processed by the following two modules: the Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM). The SIPM employs dilated convolutions [34] to expand the receptive field, thereby improving the representation of spatial information across the feature maps. The CIEM focuses on enhancing the information within each channel of the feature maps by decomposing and fusing channel information. The integration of these two modules results in a comprehensive extraction of the feature maps.
In the decoder stage, the Semantic Restructuring-based Upsampling Module (SRUM) performs pixel recombination and transpose convolution to achieve upsampling. This process integrates high-level and low-level features, progressively restoring the original image size and enabling the automatic extraction of bare soil areas.

3.2. The Encoder

3.2.1. Backbone

In deep learning, the backbone network is responsible for the initial extraction and encoding of critical features from input data, which is essential for processing complex visual information. Consequently, the design and selection of an appropriate backbone are fundamental to developing effective deep learning models.
In this paper, BoTNet [35] is selected as the backbone network for the framework (the structure is shown in Figure 4). BoTNet is a new model formed by incorporating a Transformer [26] into ResNet [36]. The principle involves replacing the 3 × 3 convolution in the last bottleneck block of ResNet with Multi-Head Self-Attention (MHSA). Understanding the contextual information of the entire bare soil area is crucial for accurately segmenting bare soil regions. MHSA allows the model to consider dependencies between pixels across the global scope during feature extraction, rather than being limited to local pixel windows. This facilitates a better understanding of the spatial distribution patterns of bare soil areas. Through this design, BoTNet effectively combines the strengths of ResNet and Transformer models. It not only inherits ResNet’s excellent characteristics, such as effectively preserving original features, but also enhances the model’s ability to capture global information through self-attention mechanisms. This enables BoTNet to exhibit excellent performance in various computer vision tasks, making it particularly suitable for constructing semantic segmentation networks. The input features first pass through a convolutional layer and a max-pooling layer to reduce image resolution, followed by preliminary semantic feature extraction through three ResNet Bottlenecks and one Bottleneck Transformer. In this paper, the output features of BoT-4 are used as inputs for the SIPM and the CIEM.
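To make the MHSA substitution concrete, the following is a simplified sketch of a Bottleneck Transformer block; relative position encodings and other BoTNet details are omitted, so it illustrates the principle rather than reproducing the exact BoTNet implementation.

```python
# Simplified Bottleneck Transformer block: the 3 x 3 convolution of a ResNet
# bottleneck is replaced by multi-head self-attention over the flattened
# feature map (relative position encodings omitted for brevity).
import torch
import torch.nn as nn

class BoTBlock(nn.Module):
    def __init__(self, channels=2048, heads=4, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, 1, bias=False)   # 1 x 1 bottleneck
        self.attn = nn.MultiheadAttention(mid, heads, batch_first=True)
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)   # 1 x 1 expansion
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.reduce(x)
        seq = y.flatten(2).transpose(1, 2)         # (B, H*W, mid) global tokens
        seq, _ = self.attn(seq, seq, seq)          # MHSA replaces the 3 x 3 conv
        y = seq.transpose(1, 2).reshape(b, -1, h, w)
        return torch.relu(self.norm(self.expand(y)) + x)   # residual connection
```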

3.2.2. SIPM

In semantic segmentation, attention mechanisms can help the model focus on object boundaries and key regions, thereby accurately segmenting target objects and reducing the influence of background [37]. Therefore, to enhance the network’s ability to extract multiscale bare soil areas while better learning the essential features of bare soil areas in spatial positions, the SIPM is constructed.
Generally, fully utilizing contextual information is crucial for learning spatial information, and a large receptive field makes this possible. Moreover, different receptive fields sense feature information at different scales. Hence, this module introduces multiple branches of dilated convolutions to efficiently expand the receptive field, generating spatial attention that enhances the multiscale extraction of bare soil areas and highlights bare soil information while suppressing noise at different locations.
The structure of the SIPM is illustrated in Figure 5. First, for the input F ∈ R^(C×H×W), three branches of dilated convolutions with 3 × 3 kernels are used to extract multiscale and contextual information. To obtain multiscale information while saving computation, the second and third branches compress the features twice in the channel dimension, and the multiscale features are then integrated through channel stacking. Subsequently, a 1 × 1 convolution compresses the feature size to 1 × H × W, and a Sigmoid function [38] produces a spatial attention map with values between 0 and 1. This highlights multiscale bare soil information on the attention map while suppressing noise. Finally, the attention map is elementwise multiplied with the input features F to enhance the representation of bare soil information at different locations, and the result is fused with the original input feature map, yielding the output of this module.
The expressions are illustrated in Equations (1) and (2):
SA_out = σ(Conv([D_Conv(F); D_Conv(D_Conv(F)); D_Conv(D_Conv(F))]))        (1)

Out = F ⊗ (SA_out + 1)        (2)
where F, SA_out, and Out represent the input, the attention output, and the final output of this module, respectively; Conv and D_Conv denote regular convolution and dilated convolution, respectively; [·; ·] denotes channel stacking; σ represents the Sigmoid function; and ⊗ is elementwise multiplication. The parameters of this module are shown in Table 2.
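Under these definitions, a minimal PyTorch sketch of the SIPM is given below. The branch channel widths follow the output shapes listed in Table 2, and the sketch is an illustration rather than the authors' released code.

```python
# Minimal sketch of the SIPM (Equations (1) and (2)): three dilated branches,
# channel stacking, a 1 x 1 convolution to a single-channel attention map,
# and the residual re-weighting Out = F * (SA_out + 1).
import torch
import torch.nn as nn

class SIPM(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.b1 = nn.Conv2d(channels, 128, 3, padding=3, dilation=3)
        self.b2 = nn.Sequential(
            nn.Conv2d(channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=3, dilation=3))
        self.b3 = nn.Sequential(
            nn.Conv2d(channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=6, dilation=6))   # larger dilation
        self.fuse = nn.Conv2d(384, 1, 1)    # stacked branches -> 1 x H x W

    def forward(self, f):
        stacked = torch.cat([self.b1(f), self.b2(f), self.b3(f)], dim=1)
        sa = torch.sigmoid(self.fuse(stacked))   # Equation (1)
        return f * (sa + 1)                      # Equation (2)
```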

3.2.3. CIEM

During the initial feature extraction in the backbone network, the resolution of the feature maps decreases with each downsampling operation, and detailed feature information is lost at the same time. Increasing the number of feature channels is therefore necessary to limit this loss. After its Transformer layers, BoTNet outputs high-level features with a large number of channels and rich global information, each channel encoding a specific response. Consequently, the CIEM is proposed in this paper, which employs multiscale convolutional kernels and pooled channel attention to enhance multiscale feature learning in the channel dimension and to suppress complex background information.
The structure of the CIEM is illustrated in Figure 6. The input feature F ∈ R^(C×H×W) is first split along the channel dimension into two features of size C/2 × H × W. Two depthwise separable convolutions with different kernel sizes are then applied to extract multiscale features. Subsequently, the two features are fused on the one hand and, on the other hand, average pooling and max pooling compress their spatial dimensions, after which the Sigmoid function yields two channel attentions, Y1 and Y2, each of size C/2 × 1 × 1. Y1 and Y2 are then elementwise multiplied with the fused feature sequentially, with the ReLU function [39] improving the nonlinear expression of the module. Finally, a 1 × 1 convolution restores the original channel size, yielding the final output. In this module, average pooling and max pooling aggregate the spatial information of the feature maps, so each channel attention contains a specific channel response, allowing each channel to focus on the desired information through weighting with the initial features. The principle of this module is shown in Equations (3)–(5):
DS_out_i = DS_Conv_i(Split(F))        (3)

CA_out_i = σ(Pool_i(DS_out_i))        (4)

Out = CA_out_1 ⊗ Max(0, CA_out_0 ⊗ (DS_out_0 + DS_out_1))        (5)
where F and Out denote the input and final output of this module, respectively; Split represents the operation of splitting along the channel dimension; DS_Conv_i, DS_out_i, Pool_i, and CA_out_i (i ∈ {0, 1}) represent the dual-branch depthwise separable convolutions, their outputs, the dual-branch attention pooling, and the dual-branch attention outputs, respectively; σ denotes the Sigmoid function; and ⊗ is elementwise multiplication. The parameters of this module are shown in Table 3.
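A minimal PyTorch rendering of Equations (3)–(5) and Table 3 follows. How the average- and max-pooled responses are combined before the Sigmoid is not fully specified above, so summing them is an assumption of this sketch.

```python
# Minimal sketch of the CIEM (Equations (3)-(5)); combining the avg- and
# max-pooled responses by summation is an assumption of this illustration.
import torch
import torch.nn as nn

def ds_conv(c, k):
    """Depthwise separable convolution: depthwise k x k, then pointwise 1 x 1."""
    return nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c),
                         nn.Conv2d(c, c, 1))

class CIEM(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        half = channels // 2
        self.ds0 = ds_conv(half, 3)                  # small-kernel branch
        self.ds1 = ds_conv(half, 7)                  # large-kernel branch
        self.restore = nn.Conv2d(half, channels, 1)  # back to C channels

    def forward(self, f):
        f0, f1 = torch.chunk(f, 2, dim=1)            # Split along channels
        d0, d1 = self.ds0(f0), self.ds1(f1)          # Equation (3)
        fused = d0 + d1
        ca0 = torch.sigmoid(d0.mean((2, 3), keepdim=True)
                            + d0.amax((2, 3), keepdim=True))   # Equation (4)
        ca1 = torch.sigmoid(d1.mean((2, 3), keepdim=True)
                            + d1.amax((2, 3), keepdim=True))
        out = ca1 * torch.relu(ca0 * fused)          # Equation (5)
        return self.restore(out)                     # 1 x 1 conv restores channels
```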

3.3. The Decoder

Currently, commonly used upsampling methods in deep learning include nearest-neighbor and bilinear interpolation. These methods rely solely on the spatial positions of pixels to determine the upsampling kernel, without leveraging the semantic information of the feature maps. They can be regarded as "uniform" upsampling and typically have small receptive fields (nearest-neighbor: 1 × 1; bilinear: 2 × 2). A good upsampling module should depend on the semantic information of the feature map and perform upsampling based on the input content. Therefore, the SRUM uses pixel rearrangement and transposed convolution to achieve feature upsampling, which effectively leverages the semantic information of the features.
For the high-level features input to this module, the SRUM employs two branches for upsampling. The first branch rearranges the input features using the pixel rearrangement method depicted in Figure 7. This method fully utilizes the channel information of the features by dividing the pixels at each position of the feature map into four parts along the channel dimension and then reordering them, yielding a 2× upsampling while reducing the number of channels to one-quarter of the original. The second branch uses a 3 × 3 transposed convolution (Tconv) for 2× upsampling, likewise reducing the number of channels to one-quarter. The two upsampled features are then stacked along the channel dimension and blended with a 1 × 1 convolution, and finally fused with the low-level features processed by a 3 × 3 convolution, forming the output of this module. In this upsampling process, traditional interpolation is abandoned in favor of semantic-information-based upsampling, which more effectively restores the resolution details of the features. The parameters of the SRUM are shown in Table 4.
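A minimal PyTorch sketch of the SRUM follows; pixel rearrangement is expressed with PixelShuffle, the transposed-convolution kernel and stride follow Table 4, and the class itself is an illustration rather than the authors' code.

```python
# Minimal sketch of the SRUM: PixelShuffle and a transposed convolution each
# yield a 2x-upsampled, C/4-channel map; the two are stacked, blended by a
# 1 x 1 convolution, and fused with the 3 x 3-convolved low-level features.
import torch
import torch.nn as nn

class SRUM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.shuffle = nn.PixelShuffle(2)     # C -> C/4, 2x spatial upsample
        self.tconv = nn.ConvTranspose2d(channels, channels // 4,
                                        kernel_size=2, stride=2)
        self.blend = nn.Conv2d(channels // 2, channels // 2, 1)
        self.low = nn.Conv2d(channels // 2, channels // 2, 3, padding=1)

    def forward(self, high, low):
        up = torch.cat([self.shuffle(high), self.tconv(high)], dim=1)
        return self.blend(up) + self.low(low)   # fuse high- and low-level maps
```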
According to Figure 3, multiple SRUMs are continuously used to combine high-level features with low-level features. Then, further refinement of the features is performed through a 3 × 3 convolution, followed by a 4× upsampling, resulting in the final extraction of bare soil.

4. Experiments and Results

4.1. The Stitching Strategy of the Sliding Windows

During the experiments, large remote sensing images must be segmented into smaller sample images for testing. However, segmentation often disrupts the integrity of targets, and when stitching results back together, the classifications at the junction of two adjacent windows may be discontinuous, potentially leading to detection errors in edge regions. Therefore, to obtain better results, this paper employs a sliding-window stitching strategy to test large images with the trained model. The procedure is as follows. The image under test is first segmented using a sliding window of the same size as the training samples, typically 512 × 512 pixels, with a step size of 412 pixels, leaving a 100-pixel overlap between adjacent windows, as shown in Figure 8a. After two adjacent windows are tested, the classification results in the overlapping region are averaged. During horizontal/vertical sliding, when the remaining area on the right/bottom is smaller than the step size, a full window anchored at the right/bottom border is taken instead, moving left/upward, as illustrated in Figure 8b. This repeated sliding-window method ensures that every area is thoroughly detected and yields a result image with minimal classification boundary errors.
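A compact sketch of this stitching strategy is given below; predict stands in for the trained model applied to a single window, the image is assumed to be at least one window in each dimension, and overlapping predictions are averaged as described above.

```python
# Sketch of the sliding-window stitching strategy: 512 x 512 windows, a
# 412-pixel step (100-pixel overlap), border-anchored final windows, and
# averaging of predictions in overlapping regions.
import numpy as np

def predict_large_image(image, predict, win=512, step=412):
    """`predict` maps a (win, win, bands) patch to per-pixel scores."""
    h, w = image.shape[:2]
    scores = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    ys = list(range(0, h - win + 1, step))
    xs = list(range(0, w - win + 1, step))
    if ys[-1] != h - win:
        ys.append(h - win)                 # anchor last window to bottom border
    if xs[-1] != w - win:
        xs.append(w - win)                 # anchor last window to right border
    for y in ys:
        for x in xs:
            scores[y:y + win, x:x + win] += predict(image[y:y + win, x:x + win])
            counts[y:y + win, x:x + win] += 1
    return scores / counts                 # average the overlapping regions
```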

4.2. Experimental Environment and Training Parameter Settings

The experimental software environment for this study consists of PyTorch 1.20 and Python 3.7. The hardware environment includes an Intel Xeon Silver 4210 CPU and a single NVIDIA RTX 3090 GPU (as shown in Table 5). During training, the batch size for input images is set to 8, and the network is trained for 150 epochs, saving the weights of the best epoch.

4.3. Experimental Results and the Analysis

In this paper, we adopt class pixel accuracy (CPA), intersection over union (IoU), recall, and the F1 score (F1) as the evaluation metrics. The specific mathematical expressions are as follows:
CPA = TP / (TP + FP)        (6)

IoU = TP / (TP + FP + FN)        (7)

Recall = TP / (TP + FN)        (8)

F1 = 2 × (CPA × Recall) / (CPA + Recall)        (9)
where TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively. To better illustrate the relationships between TP, FP, FN, and TN, we present the confusion matrix in Table 6.
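For reference, these metrics can be computed directly from binary prediction and ground-truth masks, as in the short sketch below.

```python
# Computing CPA, IoU, Recall, and F1 (Equations (6)-(9)) from boolean masks,
# where True marks bare soil pixels.
import numpy as np

def metrics(pred, truth):
    tp = np.sum(pred & truth)          # true positives
    fp = np.sum(pred & ~truth)         # false positives
    fn = np.sum(~pred & truth)         # false negatives
    cpa = tp / (tp + fp)               # class pixel accuracy (precision)
    iou = tp / (tp + fp + fn)          # intersection over union
    recall = tp / (tp + fn)
    f1 = 2 * cpa * recall / (cpa + recall)
    return cpa, iou, recall, f1
```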

4.3.1. Bare Soil Extraction by Different Networks

To validate the effectiveness of the proposed network, comparative experiments are conducted with DeepLabV3+ [40], DA-Net [41], BuildFormer [42], and YOSO [43]. Independent testing is performed on two remote sensing images of the Chang-Zhu-Tan region, whose sizes are 5000 × 4000 pixels and 5000 × 5000 pixels, respectively. These two scenes contain rich bare soil information and different complex backgrounds, making them representative test cases. To evaluate the experimental results clearly, several typical regions are selected for analysis from each of the two scenes (as shown in Figure 9). Furthermore, to demonstrate the effectiveness of the proposed network, BoTNet50 and BoTNet101 (hereinafter referred to as B50 and B101) are chosen as the backbone networks of HA-Net. Typical scene experimental results are presented in Figure 10 and Figure 11. In these figures, bare soil regions are indicated in an orange-red color, while the yellow and light blue boxes mainly demonstrate missed detections and false alarms.
Figure 10 depicts the extraction results of the different networks for three representative regions in Scene I. From Figure 10a, it can be observed that all networks produce few false alarms. However, in the complex region within the red box in (a), DA-Net misidentifies grassland as bare soil, while BuildFormer extracts the bare soil in this area incompletely. DeepLabV3+, HA-Net, and YOSO can effectively distinguish bare soil from grassland. In the areas surrounding buildings, DeepLabV3+, DA-Net, BuildFormer, and YOSO fail to extract bare soil effectively, resulting in significant omissions, which does not meet the requirements for continuous monitoring and management of bare soil. In contrast, HA-Net effectively extracts small-scale bare soil near buildings.
In the green boxes of Figure 10b,c, within the clusters of buildings, it can be observed that due to the elongated shape of roads and some buildings and their similar texture features to bare soil, all networks misclassify buildings and roads to varying degrees as bare soil. HA-Net has a relatively lower false alarm rate, with only a few roads misidentified as bare soil.
For the areas of the red boxes in Figure 10b, due to the complex mixture of bare soil and background, the comparison networks fail to effectively identify bare soil within the complex region. In the red boxes in Figure 10c, DeepLabV3+ and DA-Net incorrectly identify all areas as bare soil in complex backgrounds, while BuildFormer and YOSO can accurately extract bare soil. The HA-Net-B50 and HA-Net-B101 models can accurately extract bare soil from the red boxes in (b) and (c) despite the complex environment.
Figure 11 shows the results of the different networks for extracting bare soil in the two display areas of Scene II. For the green-boxed area of the ground truth in Figure 11a, DeepLabV3+, DA-Net, YOSO, and BuildFormer identify the background (impervious surface) as bare soil to varying degrees, indicating that these four networks fail to distinguish bare soil from the background, which is not conducive to continuous monitoring of bare soil areas. In contrast, HA-Net identifies bare soil information efficiently thanks to the MHSA, SIPM, and CIEM.
From Figure 11b, we can observe that both HA-Net models effectively identify the elongated bare soil areas on either side of the road. The inclusion of the SRUM in HA-Net likely also enhances its ability to detect scattered small bare soil areas. In contrast, the other four networks largely fail to detect these bare soil areas along the road and the small scattered patches. Additionally, BuildFormer misses many bare soil areas, which can be attributed to its lack of local receptive fields.
To better evaluate the performance of the different networks in extracting bare soil from remote sensing images, the five networks are also compared in terms of extraction accuracy metrics. Table 7 gives the extraction accuracy of the different networks in the two experiments.
Based on Table 7, HA-Net outperforms the other four networks in average CPA, IoU, Recall, and F1 across the two scenes, with improvements of 0.763%, 4.756%, 4.077%, and 2.986% over the second-best network, respectively. However, in Scene I, HA-Net shows a lower CPA than BuildFormer; according to the metric formulas, this is attributable to BuildFormer's lower false detection rate in this independent test. At the same time, BuildFormer's significantly higher missed detection rate leads to a lower Recall, indicating that it misses many bare soil areas. YOSO achieves the second-best average IoU, Recall, and F1; although lower than HA-Net's, these values still demonstrate good bare soil extraction capability. On average, DeepLabV3+ and DA-Net perform worse on these metrics than HA-Net, and their lower CPA and Recall indicate less accurate bare soil extraction. HA-Net achieves the highest F1 score in Table 7, indicating the superior overall quality of the model. Between the proposed variants, HA-Net-B101 generally exhibits slightly higher metrics than HA-Net-B50, but BoTNet101, with more convolutional layers than BoTNet50, inevitably incurs more computational cost.

4.3.2. Ablation Study

An ablation study is a systematic method that involves gradually removing different modules from a model to observe changes in its performance. This experimental design allows for a better understanding of the working principles of the model and identifies which parts play a critical role in improving its performance. Therefore, in this study, we will also conduct an ablation study to validate the effectiveness of the proposed model and assess the importance of its various components.
From Table 8, we can observe that without any modules (SIPM, CIEM, and SRUM) added, the baseline performance of BoTNet is relatively weak, with IoU slightly higher than DeepLabV3+, DA-Net, and BuildFormer. After adding the SIPM, the performance of HA-Net improves significantly. Although adding the CIEM results in less improvement compared to the SIPM, it still provides noticeable gains. When the SRUM is added alone, the performance improvement is relatively small, with HA-Net’s metrics lower than when the SIPM and the CIEM are added separately. This is because the SRUM focuses more on enhancing the network’s detail information for detecting small target areas. When both the SIPM and the CIEM are added simultaneously, the performance improvement is the most significant, approaching the complete performance of HA-Net. This indicates that the SIPM and the CIEM improve spatial and channel information, respectively, collectively enhancing the effectiveness of bare soil extraction.

4.3.3. Generalization Ability Experiments

To further validate the performance of the proposed network, we conducted qualitative and quantitative experiments using optical remote sensing data with a resolution of 1.2 m from Google Earth. The longitude range is 113.27178°E to 113.33771°E, and the latitude range is 28.19092°N to 28.23761°N. On these data, the five networks, including HA-Net, are evaluated, and the results are analyzed carefully.
Figure 12 shows the bare soil extraction results of the different networks for typical regions in Scene III. According to the results, DeepLabV3+ misses some bare soil areas along the sides of roads (the yellow boxes in Figure 12c). Additionally, both DeepLabV3+ and DA-Net (Figure 12d) incorrectly identify certain buildings as bare soil (blue boxes). BuildFormer (Figure 12e) demonstrates relatively poor generalization on this data set, with several obvious missed detection areas (the yellow boxes). YOSO (Figure 12f) performs well overall but also misses some bare soil regions, primarily along roadsides and around buildings (the yellow boxes). HA-Net-B50 (Figure 12g) exhibits a few missed detections, while HA-Net-B101 shows a notable false positive. Overall, HA-Net demonstrates excellent generalization capability in this experiment. It is also noteworthy that all networks, including HA-Net, misclassify roads as bare soil to varying degrees.
Table 9 indicates that our method achieves the best performance, with the IoU exceeding YOSO by 4.545%, the F1 score surpassing YOSO by 2.68%, CPA reaching an optimal 92.851%, and Recall attaining the highest value of 93.391%. Among the HA-Net models, HA-Net-B50 has a slightly higher CPA compared to HA-Net-B101, but the other three metrics are lower than HA-Net-B101. This table further confirms that the proposed method demonstrates strong robustness, effectively extracting bare soil across different sensors and resolution scenarios.
In summary, HA-Net can achieve better extraction performance. This allows us to accurately extract real bare soil from remote sensing images, reducing the costs associated with manual field surveys and improving the efficiency of bare soil monitoring. This contributes to the rational planning and management of bare soil.

5. Discussion

Traditional semantic segmentation tasks, such as classifying buildings and roads, generally involve regular polygonal shapes and consistent structures, making classification relatively straightforward. Identifying bare soil is more challenging, however, because of its complex shapes and varying sizes, and because it often serves as the background for other surface objects. These objects, such as rocks, buildings, and roads, are typically distributed in a disorderly manner and can disrupt the integrity of bare soil, posing significant challenges to the accuracy of bare soil identification.
In this study, we considered how to accurately identify bare soil information amidst complex backgrounds, while simultaneously extracting multiscale, especially small-scale bare soil areas. We proposed a bare soil identification framework to extract bare soil areas from CBERS-04A imagery. In the encoder, we introduced the Spatial Information Perception Module (SIPM) and the Channel Information Enhancement Module (CIEM) to identify multiscale bare soil information and suppress noise information. In the decoder, we proposed the Semantic Restructuring-based Upsampling Module (SRUM) to integrate high-level semantic information and low-level detail information, thereby enhancing the extraction of small-scale bare soil. Compared to CNN and ViT models, our model combines the advantages of a CNN and a Transformer, retaining the efficient feature extraction capability of a CNN for local features and the global information capturing ability of a Transformer.
The CNN models used for comparison in this research, namely DeepLabV3+, DA-Net, and YOSO, rely solely on convolutional layers for extracting image features and do not incorporate Transformer layers. These models are limited by the small receptive fields of convolutional layers, which cannot accurately reflect the spatial continuity of bare soil in complex backgrounds. BuildFormer, as a purely ViT-type model, can achieve global receptive field coverage through MHSA when constructing long-range dependencies, but it lacks the inductive bias of CNNs; its generalization ability is therefore relatively weak, resulting in poorer performance metrics. In HA-Net, the Transformer layers extract global information, which the SIPM and the CIEM then enhance to strengthen the extraction of local bare soil information. Consequently, in bare soil recognition tasks, appropriately combining CNNs with Transformers is beneficial for bare soil detection and continuous monitoring.
Although our model demonstrates excellent performance under most testing cases, it does have certain limitations. While it performs well under standard lighting conditions, HA-Net may be impacted by variations in illumination, such as strong light or shadowed areas, which can lead to inaccuracies in feature extraction. In the future, we will focus on incorporating multispectral data to create a bare soil data set, while simultaneously introducing new attention mechanisms, thereby enhancing the model’s robustness to improve the extraction performance for bare soil in different lighting conditions.
In Figure 10b,c and Figure 11, all network models, including HA-Net, tend to misclassify roads as bare soil to some extent. This issue arises not only due to the elongated shape of the roads but also because their texture features closely resemble those of bare soil. Additionally, the accumulation of dust from bare soil, driven by numerous vehicles and wind, on the road surface has caused the color characteristics of cement roads to increasingly resemble those of bare soil, thereby presenting significant challenges for accurate bare soil identification. Therefore, a key task is to investigate how to enhance the accuracy of bare soil extraction when feature similarities of the background are very high. This task is not only crucial for bare soil identification but also has broad applicability to other remote sensing tasks that involve extracting foreground objects from visually similar backgrounds.
In this study, we used manually annotated ground truth data to train and evaluate the model. Bare soil areas often exhibit complex visual features, such as subtle variations in color, texture, and shape, which are crucial for accurate segmentation. If some annotations are incorrect, the model learns incorrect features during training; affected by these erroneous annotations, it fails to accurately capture the inherent characteristics of bare soil. In short, annotation accuracy directly impacts the model's feature extraction and generalization capabilities. Therefore, in future work, we will further optimize the annotation process and improve data quality, which can enhance the model's performance and reliability in practical applications.
In future work, we plan to extend HA-Net to other satellite data, such as data from satellites like GF-2 [44]. Additionally, considering that optical remote sensing images often have cloud coverage, we should further use remote sensing images with a certain cloud coverage rate to extract bare soil. This will facilitate the further evaluation and optimization of the HA-Net model’s performance, ensuring high accuracy in the extraction process.

6. Conclusions

In this paper, we integrate deep learning with geospatial analysis and present a framework for automatic bare soil area detection called HA-Net, which possesses robust feature extraction capabilities, and its potential application is validated in bare soil identification. The proposed method has achieved excellent results in both qualitative and quantitative analysis, achieving high-precision extraction of bare soil areas.
This paper successfully integrates deep learning with the geographical spatial information of high-resolution remote sensing images, providing a reference for other scholars in bare soil monitoring and promoting the application of deep learning in bare soil extraction. On the other hand, the approach of combining deep learning techniques with domain knowledge, particularly in geospatial information science, is beneficial for future research in remote sensing image analysis.

Author Contributions

Conceptualization, J.Z. and L.C.; methodology, J.Z.; supervision, D.D. and L.C.; software, J.Z.; validation, H.C., Y.J. and X.L.; formal analysis, J.Z. and D.D.; data curation, J.Z. and D.D.; visualization, J.Z. and X.L.; writing—original draft preparation, J.Z.; writing—review and editing, J.Z., D.D. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (grant number: 42101468).

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors express their sincere gratitude to the Hunan Research Institute of Meteorological Sciences for providing the data used in this study, and also appreciate the valuable comments and suggestions provided by the anonymous reviewers for this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ellis, E.; Pontius, R. Land-use and land-cover change. Encycl. Earth 2007, 1, 1–4. [Google Scholar]
  2. Lambin, E.F.; Turner, B.L.; Geist, H.J.; Agbola, S.B.; Angelsen, A.; Bruce, J.W.; Coomes, O.T.; Dirzo, R.; Fischer, G.; Folke, C.; et al. The causes of land-use and land-cover change: Moving beyond the myths. Glob. Environ. Chang. 2001, 11, 261–269. [Google Scholar] [CrossRef]
  3. Müller, H.; Griffiths, P.; Hostert, P. Long-term deforestation dynamics in the Brazilian Amazon—Uncovering historic frontier development along the Cuiabá–Santarém highway. Int. J. Appl. Earth Obs. 2016, 44, 61–69. [Google Scholar] [CrossRef]
  4. Xian, G.; Crane, M. An analysis of urban thermal characteristics and associated land cover in Tampa Bay and Las Vegas using Landsat satellite data. Remote Sens. Environ. 2006, 104, 147–156. [Google Scholar] [CrossRef]
  5. Silvero, N.E.Q.; Demattê, J.A.M.; Amorim, M.T.A.; dos Santos, N.V.; Rizzo, R.; Safanelli, J.L.; Poppiel, R.R.; de Sousa Mendes, W.; Bonfatti, B.R. Soil variability and quantification based on Sentinel-2 and Landsat-8 bare soil images: A comparison. Remote Sens. Environ. 2021, 252, 112117. [Google Scholar] [CrossRef]
  6. Wang, W.X.; Chai, F.H.; Ren, Z.H.; Wang, X.; Wang, S.; Li, H.; Gao, R.; Xue, L.; Peng, L.; Zhang, X.; et al. Process, achievements and experience of air pollution control in China since the founding of the People's Republic of China 70 years ago. Res. Environ. Sci. 2019, 32, 1621–1635. [Google Scholar]
  7. Wuepper, D.; Borrelli, P.; Finger, R. Countries and the global rate of soil erosion. Nat. Sustain. 2020, 3, 51–55. [Google Scholar] [CrossRef]
  8. Xu, H. Dynamics of Bare Soil in A Typical Reddish Soil Loss Region of Southern China: Changting County, Fujian Province. Sci. Geogr. Sin. 2013, 33, 489–496. [Google Scholar]
  9. Liang, S.; Fang, H.; Chen, M. Atmospheric correction of landsat ETM+ land surface imagery-Part I: Methods. IEEE Trans. Geosci. Remote Sens. 2001, 39, 2490–2498. [Google Scholar] [CrossRef]
  10. Pianalto, F.S.; Yool, S.R. Monitoring fugitive dust emission sources arising from construction: A remote-sensing approach. GIsci. Remote Sens. 2013, 50, 251–270. [Google Scholar] [CrossRef]
  11. Dou, P.; Chen, Y. Dynamic monitoring of land-use/land-cover change and urban expansion in shenzhen using landsat imagery from 1988 to 2015. Int. J. Remote Sens. 2017, 38, 5388–5407. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Shen, W.; Li, M.; Lv, Y. Assessing spatio-temporal changes in forest cover and fragmentation under urban expansion in Nanjing, eastern China, from long-term Landsat observations (1987–2017). Appl. Geogr. 2020, 117, 102190. [Google Scholar] [CrossRef]
  13. Chai, B.; Li, P. Annual Urban Expansion Extraction and Spatio-Temporal Analysis Using Landsat Time Series Data: A Case Study of Tianjin. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2644–2656. [Google Scholar] [CrossRef]
  14. Schultz, M.; Clevers, J.G.P.W.; Carter, S.; Verbesselt, J.; Avitabile, V.; Quang, H.V.; Herold, M. Performance of vegetation indices from Landsat time series in deforestation monitoring. Int. J. Appl. Earth Obs. Geoinf. 2016, 52, 318–327. [Google Scholar] [CrossRef]
  15. Hamunyela, E.; Verbesselt, J.; Herold, M. Using spatial context to improve early detection of deforestation from Landsat time series. Remote Sens. Environ. 2016, 172, 126–138. [Google Scholar] [CrossRef]
  16. Pendrill, F.; Gardner, T.A.; Meyfroidt, P.; Persson, U.M.; Adams, J.; Azevedo, T.; Bastos Lima, M.G.; Baumann, M.; Curtis, P.G.; Sy, V.D.; et al. Disentangling the numbers behind agriculture-driven tropical deforestation. Science 2022, 377, eabm9267. [Google Scholar] [CrossRef]
  17. Zhu, F.; Wang, H.; Li, M.; Diao, J.; Shen, W.; Zhang, Y.; Wu, H. Characterizing the effects of climate change on short-term post-disturbance forest recovery in southern China from Landsat time-series observations (1988–2016). Front. Earth Sci. 2020, 14, 816–827. [Google Scholar] [CrossRef]
  18. Mo, Y.; Kearney, M.S.; Turner, R.E. Feedback of coastal marshes to climate change: Long-term phenological shifts. Ecol. Evol. 2019, 9, 6785–6797. [Google Scholar] [CrossRef]
  19. Rikimaru, A.; Roy, P.S.; Miyatake, S. Tropical forest cover density mapping. Trop. Ecol. 2002, 43, 39–47. [Google Scholar]
  20. Li, S.; Chen, X. A new bare-soil index for rapid mapping developing areas using landsat 8 data. ISPRS Arch. 2014, 40, 139–144. [Google Scholar] [CrossRef]
  21. Nguyen, C.T.; Chidthaisong, A.; Kieu Diem, P.; Huo, L.Z. A modified bare soil index to identify bare land features during agricultural fallow-period in southeast Asia using Landsat 8. Land 2021, 10, 231. [Google Scholar] [CrossRef]
  22. Rasul, A.; Balzter, H.; Ibrahim, G.R.F.; Hameed, H.M.; Wheeler, J.; Adamu, B.; Ibrahim, S.; Najmaddin, P.M. Applying Built-Up and Bare-Soil Indices from Landsat 8 to Cities in Dry Climates. Land 2018, 7, 81. [Google Scholar] [CrossRef]
  23. Chen, L.; Cai, X.; Xing, J.; Li, Z.; Zhu, W.; Yuan, Z.; Fang, Z. Towards transparent deep learning for surface water detection from SAR imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103287. [Google Scholar] [CrossRef]
  24. Chen, L.; Zhang, P.; Xing, J.; Li, Z.; Xing, X.; Yuan, Z. A multi-scale deep neural network for water detection from SAR images in the mountainous areas. Remote Sens. 2020, 12, 3205. [Google Scholar] [CrossRef]
  25. Chen, L.; Weng, T.; Xing, J.; Li, Z.; Yuan, Z.; Pan, Z.; Tan, S.; Luo, R. Employing deep learning for automatic river bridge detection from SAR images based on adaptively effective feature fusion. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102425. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  27. Chen, J.; Yi, J.; Chen, A.; Lin, H. SRCBTFusion-Net: An Efficient Fusion Architecture via Stacked Residual Convolution Blocks and Transformer for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  28. Zhao, H.; Chen, X. Use of normalized difference bareness index in quickly mapping bare areas from TM/ETM+. In Proceedings of the International Geoscience and Remote Sensing Symposium, Seoul, Republic of Korea, 29 July 2005. [Google Scholar]
  29. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
  30. Deng, Y.; Wu, C.; Li, M.; Chen, R. RNDSI: A ratio normalized difference soil index for remote sensing of urban/suburban environments. Int. J. Appl. Earth Obs. Geoinf. 2015, 39, 40–48. [Google Scholar] [CrossRef]
  31. He, C.; Liu, Y.; Wang, D.; Liu, S.; Yu, L.; Ren, Y. Automatic extraction of bare soil land from high-resolution remote sensing images based on semantic segmentation with deep learning. Remote Sens. 2023, 15, 1646. [Google Scholar] [CrossRef]
  32. Liu, D.; Chen, N. Satellite monitoring of urban land change in the middle Yangtze River Basin urban agglomeration, China between 2000 and 2016. Remote Sens. 2017, 9, 1086. [Google Scholar] [CrossRef]
  33. Jesus, G.T.; Itami, S.N.; Segantine, T.Y.F.; Junior, M.F.C. Innovation path and contingencies in the China-Brazil Earth Resources Satellite program. Acta Astronaut. 2021, 178, 382–391. [Google Scholar] [CrossRef]
  34. Cai, X.; Chen, L.; Xing, J.; Xing, X.; Luo, R.; Tan, S.; Wang, J. Automatic extraction of layover from InSAR imagery based on multilayer feature fusion attention mechanism. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  35. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  37. Chen, L.; Cai, X.; Li, Z.; Xing, J.; Ai, J. Where is my attention? An explainable AI exploration in water detection from SAR imagery. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103878. [Google Scholar] [CrossRef]
  38. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  39. Daubechies, I.; DeVore, R.; Foucart, S.; Hanin, B.; Petrova, G. Nonlinear approximation and (deep) ReLU networks. Constr. Approx. 2022, 55, 127–172. [Google Scholar] [CrossRef]
  40. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  41. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019. [Google Scholar]
  42. Wang, L.; Fang, S.; Meng, X.; Li, R. Building extraction with vision transformer. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  43. Hu, J.; Huang, L.; Ren, T.; Zhang, S.; Ji, R.; Cao, L. You only segment once: Towards real-time panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  44. Tong, X.Y.; Lu, Q.; Xia, G.S.; Zhang, L. Large-scale land cover classification in Gaofen-2 satellite imagery. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018. [Google Scholar]
Figure 1. Study Regions in Hunan Province.
Figure 2. The sample examples of the bare soil data set. The red color represents areas of bare soil, while the black color represents background areas.
Figure 3. Overall architecture of the HA-Net.
Figure 4. Overall structure of BoTNet.
Figure 5. The structure of the SIPM.
Figure 6. The structure of the CIEM.
Figure 7. The structure of the SRUM.
Figure 8. The stitching strategy: (a) sliding window stitching strategy; (b) final stitching strategy.
Figure 9. Typical bare soil scene images.
Figure 10. Different network extraction results for typical regions in test Scene I. (a–c) correspond to the three regions shown in Scene I of Figure 9. The orange-red regions indicate areas identified as bare soil by different networks. The yellow and light blue boxes mainly demonstrate missed detections and false alarms.
Figure 11. Different network extraction results for typical regions in test Scene II. (a,b) correspond to the two regions shown in Scene II of Figure 9. The orange-red regions indicate areas identified as bare soil by the different networks. The yellow and light blue boxes mark missed detections and false alarms.
Figure 12. Different network extraction results for typical regions in Scene III: (a) Scene III; (b) ground truth for typical regions; (c–h) show the bare soil extraction results of DeepLabV3+, DA-Net, BuildFormer, YOSO, HA-Net-B50, and HA-Net-B101, respectively.
Table 1. Optical payload parameters of the CBERS-04A satellite.

| Payload | Spectral Band | Spectral Range (μm) | Spatial Resolution (m) | Swath Width (km) | Revisit Cycle (Days) |
|---|---|---|---|---|---|
| WPM | 1 | 0.45~0.90 | 2 | 90 | 31 |
| | 2 | 0.45~0.52 | 8 | | |
| | 3 | 0.52~0.59 | 8 | | |
| | 4 | 0.63~0.69 | 8 | | |
| | 5 | 0.77~0.89 | 8 | | |
| MUX | 6 | 0.45~0.52 | 17 | 90 | 31 |
| | 7 | 0.52~0.59 | 17 | | |
| | 8 | 0.63~0.69 | 17 | | |
| | 9 | 0.77~0.89 | 17 | | |
| WFI | 10 | 0.45~0.52 | 60 | 685 | 5 |
| | 11 | 0.52~0.59 | 60 | | |
| | 12 | 0.63~0.69 | 60 | | |
| | 13 | 0.77~0.89 | 60 | | |
Table 2. Parameters of the SIPM.

Input: (16, 16, 2048)

| Layer | Parameters | Output Shape |
|---|---|---|
| Branch-1 Conv2D | Filters = 128, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 128) |
| Branch-2 Conv2D | Filters = 256, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 256) |
| Branch-2 Conv2D | Filters = 128, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 128) |
| Branch-3 Conv2D | Filters = 256, Kernel_size = 3, Padding = 3, Dilation = 3 | (16, 16, 256) |
| Branch-3 Conv2D | Filters = 128, Kernel_size = 3, Padding = 6, Dilation = 6 | (16, 16, 128) |
| Concat | None | (16, 16, 384) |
| Conv2D | Filters = 1, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 1) |
| Sigmoid | None | (16, 16, 1) |
| Dot Product and SUM | None | (16, 16, 2048) |
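Table 2 specifies the SIPM layer by layer, so it can be assembled mechanically; the following is a minimal PyTorch sketch of that assembly. Two points are our reading rather than statements from the table: Branch-1 is given 128 filters, consistent with its listed output shape and the 384-channel concatenation, and the final "Dot Product and SUM" row is interpreted as input × attention + input. Normalization and activations between convolutions are omitted.

```python
# Minimal sketch of the SIPM per Table 2 (spatial attention over 2048-channel input).
import torch
import torch.nn as nn

class SIPM(nn.Module):
    def __init__(self, in_channels: int = 2048):
        super().__init__()
        # Branch 1: a single dilated 3x3 convolution (dilation = 3).
        self.branch1 = nn.Conv2d(in_channels, 128, 3, padding=3, dilation=3)
        # Branch 2: two stacked dilated 3x3 convolutions (both dilation = 3).
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=3, dilation=3),
        )
        # Branch 3: dilation 3 followed by dilation 6 for a larger receptive field.
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=3, dilation=3),
            nn.Conv2d(256, 128, 3, padding=6, dilation=6),
        )
        # Fuse the 3 x 128 = 384 concatenated channels into one spatial map.
        self.fuse = nn.Conv2d(384, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        branches = [self.branch1(x), self.branch2(x), self.branch3(x)]
        attn = torch.sigmoid(self.fuse(torch.cat(branches, dim=1)))
        # "Dot Product and SUM": weight the input by the map, then add it back.
        return x * attn + x

# Shape check: (batch, 2048, 16, 16) in and out, matching Table 2.
print(SIPM()(torch.randn(1, 2048, 16, 16)).shape)
```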
Table 3. Parameters of the CIEM.

Input: (16, 16, 2048)

| Layer | Parameters | Output Shape |
|---|---|---|
| Split | None | (16, 16, 1024) × 2 |
| Depthwise Separable Conv2D 1 | Filters = 1024, Kernel_size = 3, Padding = 1, Dilation = 1; Filters = 1024, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 1024) |
| Depthwise Separable Conv2D 1 | Filters = 1024, Kernel_size = 7, Padding = 3, Dilation = 1; Filters = 1024, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 1024) |
| SUM | None | (16, 16, 1024) |
| Avg Pool × 2 | Kernel_size = 16 | (1, 1, 1024) × 2 |
| Max Pool × 2 | Kernel_size = 16 | (1, 1, 1024) × 2 |
| Sigmoid × 2 | None | (1, 1, 1024) × 2 |
| Dot Product | None | (16, 16, 1024) |
| ReLU | None | (16, 16, 1024) |
| Dot Product | None | (16, 16, 1024) |
| Conv2D | Filters = 2048, Kernel_size = 1, Padding = 0, Dilation = 1 | (16, 16, 2048) |
1 A depthwise separable convolution consists of two steps: a depthwise convolution followed by a pointwise convolution.
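Table 3 fixes the CIEM's layer inventory (channel split, 3×3 and 7×7 depthwise separable convolutions, global average and max pooling, two sigmoid gates, two element-wise products, and a 1×1 expansion convolution) but, in this flattened form, not the exact wiring between the pooled gates and the feature maps. The sketch below is therefore one plausible reading, with the gating order flagged as an assumption in the comments.

```python
# Hedged sketch of the CIEM per Table 3; the gate wiring is an assumption.
import torch
import torch.nn as nn

def ds_conv(channels: int, kernel_size: int) -> nn.Sequential:
    """Depthwise separable conv: depthwise k x k, then pointwise 1 x 1."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size,
                  padding=kernel_size // 2, groups=channels),
        nn.Conv2d(channels, channels, kernel_size=1),
    )

class CIEM(nn.Module):
    def __init__(self, in_channels: int = 2048):
        super().__init__()
        half = in_channels // 2
        self.local_branch = ds_conv(half, 3)    # smaller receptive field
        self.context_branch = ds_conv(half, 7)  # larger receptive field
        self.expand = nn.Conv2d(half, in_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)  # Split: 2048 -> 2 x 1024 channels
        fused = self.local_branch(x1) + self.context_branch(x2)  # SUM row
        # Two channel gates from global average and max statistics (Sigmoid x 2).
        avg_gate = torch.sigmoid(fused.mean(dim=(2, 3), keepdim=True))
        max_gate = torch.sigmoid(fused.amax(dim=(2, 3), keepdim=True))
        # Dot Product -> ReLU -> Dot Product (one plausible order), then expand.
        return self.expand(torch.relu(fused * avg_gate) * max_gate)

print(CIEM()(torch.randn(1, 2048, 16, 16)).shape)  # (1, 2048, 16, 16)
```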
Table 4. Parameters of the SRUM.

Input: (N, N, C) and (2N, 2N, C/2) 1

| Feature | Layer | Parameters | Output Shape |
|---|---|---|---|
| High-level Features | Pixel rearrangement | Upsampling factor = 2 | (2N, 2N, C/4) |
| | Transposed Conv2D | Filters = C/4, Kernel_size = 2, Stride = 2, Padding = 0, Dilation = 1 | (2N, 2N, C/4) |
| | Concat | None | (2N, 2N, C/2) |
| | Conv2D | Filters = C/2, Kernel_size = 1, Padding = 0, Dilation = 1 | (2N, 2N, C/2) |
| Low-level Features | Conv2D | Filters = C/2, Kernel_size = 3, Padding = 1, Dilation = 1 | (2N, 2N, C/2) |
| | SUM | None | (2N, 2N, C/2) |
1 The input size for high-level features is (N, N, C), and for low-level features is (2N, 2N, C/2), where N represents the input height or width, and C represents the number of input channels.
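Table 4 can be read as two parallel upsampling paths for the high-level features, a parameter-free pixel rearrangement and a learned transposed convolution, whose outputs are concatenated, fused by a 1×1 convolution, and summed with a 3×3-projected low-level skip connection. A minimal sketch under that reading, with pixel rearrangement implemented as nn.PixelShuffle (which matches the (N, N, C) → (2N, 2N, C/4) shape in the table) and normalization/activations omitted:

```python
# Minimal sketch of the SRUM per Table 4.
import torch
import torch.nn as nn

class SRUM(nn.Module):
    def __init__(self, high_channels: int):
        super().__init__()
        c = high_channels
        self.pixel_shuffle = nn.PixelShuffle(2)                   # (N,N,C) -> (2N,2N,C/4)
        self.deconv = nn.ConvTranspose2d(c, c // 4, 2, stride=2)  # learned 2x upsampling
        self.fuse = nn.Conv2d(c // 2, c // 2, kernel_size=1)      # fuse after concat
        self.low_proj = nn.Conv2d(c // 2, c // 2, 3, padding=1)   # low-level projection

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # Concatenate the two upsampled views: C/4 + C/4 = C/2 channels.
        up = torch.cat([self.pixel_shuffle(high), self.deconv(high)], dim=1)
        return self.fuse(up) + self.low_proj(low)  # SUM with the skip connection

high = torch.randn(1, 2048, 16, 16)  # (N, N, C) with N = 16, C = 2048
low = torch.randn(1, 1024, 32, 32)   # (2N, 2N, C/2)
print(SRUM(2048)(high, low).shape)   # torch.Size([1, 1024, 32, 32])
```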
Table 5. Experimental hardware and software configuration.

| Item | Configuration |
|---|---|
| Framework | PyTorch 1.20 |
| Language | Python 3.7 |
| CPU | Intel Xeon Silver 4210 |
| GPU (Single) | NVIDIA RTX 3090 |
Table 6. Confusion matrix.

| Ground Truth | Predicted: Bare Soil | Predicted: Non-Bare Soil |
|---|---|---|
| Bare Soil | TP | FN |
| Non-Bare Soil | FP | TN |
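The accuracy figures in Tables 7–9 follow from these counts; TN is not needed for the four reported metrics. CPA (class pixel accuracy) is interpreted here as the precision of the bare soil class, which is an assumption consistent with the standard definitions below. A small reference implementation with illustrative counts:

```python
# Segmentation metrics from the Table 6 confusion-matrix counts.
def segmentation_metrics(tp: int, fp: int, fn: int) -> dict:
    cpa = tp / (tp + fp)        # class pixel accuracy (precision of bare soil)
    recall = tp / (tp + fn)     # fraction of true bare soil pixels recovered
    iou = tp / (tp + fp + fn)   # intersection over union
    f1 = 2 * cpa * recall / (cpa + recall)  # harmonic mean of precision/recall
    return {"CPA": cpa, "IoU": iou, "Recall": recall, "F1": f1}

# Illustrative counts only (not values from the paper).
print(segmentation_metrics(tp=900, fp=90, fn=120))
```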
Table 7. Comparison of bare soil extraction accuracy among different networks.

| Scene | Method | CPA (%) | IoU (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|---|
| Scene I | DeepLabV3+ | 89.716 | 75.628 | 82.807 | 86.123 |
| | DA-Net | 89.196 | 74.780 | 82.228 | 85.570 |
| | BuildFormer | **92.784** | 74.572 | 79.163 | 85.434 |
| | YOSO | 90.879 | 77.118 | 83.587 | 87.081 |
| | HA-Net-B50 | 92.216 | 81.423 | 87.432 | 89.760 |
| | HA-Net-B101 | 92.595 | **81.833** | **87.564** | **90.009** |
| Scene II | DeepLabV3+ | 85.412 | 73.351 | 83.856 | 84.627 |
| | DA-Net | 86.240 | 73.743 | 83.577 | 84.887 |
| | BuildFormer | 87.463 | 74.683 | 83.635 | 85.507 |
| | YOSO | 87.315 | 75.171 | 84.386 | 85.826 |
| | HA-Net-B50 | 88.689 | 78.680 | 87.455 | 88.068 |
| | HA-Net-B101 | **89.178** | **79.969** | **88.564** | **88.870** |
| Average for Two Scenes | DeepLabV3+ | 87.564 | 74.490 | 83.332 | 85.375 |
| | DA-Net | 87.718 | 74.262 | 82.903 | 85.229 |
| | BuildFormer | 90.124 | 74.628 | 81.399 | 85.471 |
| | YOSO | 89.097 | 76.145 | 83.987 | 86.454 |
| | HA-Net-B50 | 90.453 | 80.052 | 87.444 | 88.914 |
| | HA-Net-B101 | **90.887** | **80.901** | **88.064** | **89.440** |

Bold numbers indicate the best performance in the same group of comparative experiments.
Table 8. Different strategies for the ablation study.

| Scene | Backbone | Module-1 | Module-2 | CPA (%) | IoU (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|---|---|---|
| Average for Two Scenes | BoTNet50 | None | None | 87.261 | 74.562 | 83.696 | 85.427 |
| | BoTNet101 | None | None | 87.753 | 75.233 | 84.087 | 85.866 |
| | BoTNet50 | SIPM | None | 88.785 | 77.256 | 85.642 | 87.167 |
| | BoTNet101 | SIPM | None | 89.210 | 78.018 | 86.167 | 87.648 |
| | BoTNet50 | CIEM | None | 88.478 | 76.774 | 85.341 | 86.858 |
| | BoTNet101 | CIEM | None | 88.762 | 77.478 | 85.926 | 87.305 |
| | BoTNet50 | SRUM | None | 87.694 | 75.327 | 84.261 | 85.927 |
| | BoTNet101 | SRUM | None | 88.200 | 76.150 | 84.825 | 86.460 |
| | BoTNet50 | SIPM | CIEM | 89.547 | 78.691 | 86.658 | 88.070 |
| | BoTNet101 | SIPM | CIEM | 89.729 | 79.189 | 87.090 | 88.383 |
Table 9. Comparison of generalization ability across different networks.

| Method | CPA (%) | IoU (%) | Recall (%) | F1 (%) |
|---|---|---|---|---|
| DeepLabV3+ | 88.570 | 80.909 | 90.341 | 89.447 |
| DA-Net | 86.174 | 80.371 | 92.270 | 89.118 |
| BuildFormer | 92.579 | 73.352 | 77.935 | 84.628 |
| YOSO | 89.437 | 81.933 | 90.711 | 90.069 |
| HA-Net-B50 | 92.851 | 85.744 | 91.804 | 92.325 |
| HA-Net-B101 | 92.116 | 86.478 | 93.391 | 92.749 |