Article

Multi-Attention-Based Semantic Segmentation Network for Land Cover Remote Sensing Images

1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
2 Department of Mathematics, University of Toronto, Toronto, ON M5S 1A1, Canada
3 The First Geological Survey Exploration Institute of Henan Bureau of Geo-Exploration and Mineral Development, Zhengzhou 450001, China
4 School of Water Conservancy Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(6), 1347; https://doi.org/10.3390/electronics12061347
Submission received: 11 February 2023 / Revised: 9 March 2023 / Accepted: 9 March 2023 / Published: 12 March 2023
(This article belongs to the Special Issue Advanced Techniques in Computing and Security)

Abstract

Semantic segmentation is a key technology for remote sensing image analysis and is widely used in land cover classification, natural disaster monitoring, and other fields. Unlike traditional images, remote sensing images contain a wide variety of targets with large feature differences between them. As a result, segmentation is more difficult, and existing models suffer from low accuracy and inaccurate edge segmentation when applied to remote sensing images. This paper proposes a multi-attention-based semantic segmentation network for remote sensing images to address these problems. Specifically, we choose UNet as the baseline model and use a coordinate attention-based residual network in the encoder to improve the extraction of fine-grained features by the backbone network. In the decoder, we use a content-aware reorganization module in place of the traditional upsampling operator to improve the network's information extraction capability, and we further propose a fused attention module for feature map fusion after upsampling, aiming to solve the multi-scale problem. We evaluate the proposed model on the WHDLD dataset and our self-labeled Lushi County dataset, on which it achieves an mIOU of 63.27% and 72.83% and an mPA of 74.86% and 84.72%, respectively. Comparison and confusion matrix analysis show that our model outperforms commonly used benchmark models on both datasets.

1. Introduction

Semantic segmentation is a difficult task in remote sensing image analysis and processing [1,2]. As a classical computer vision problem, full-pixel semantic segmentation assigns each individual pixel in an image a class ID according to the object of interest to which it belongs. Such algorithms have been deployed in various fields, especially resource planning, and have proven effective. In recent years, with the rapid development of related technologies and optical sensors in the field of remote sensing, the cost of acquiring remote sensing images has fallen, the resolution has risen, and the information contained in remote sensing images has become richer [3], creating better conditions for remote sensing image information extraction. At the same time, researchers' demand for extracting detailed information from regions of interest in remote sensing images is gradually increasing, and the combination of the computer vision and remote sensing fields is becoming ever closer [4]. Land use cover, or LUCC for short, is gradually becoming one of these hot spots [5] and is currently at the center of global environmental change assessments [6].
For land use classification from remote sensing imagery, the target image must be segmented at the pixel level. The most accurate way to obtain land feature classification information is manual annotation; however, this method is costly and inefficient and cannot be applied to large amounts of data [7]. Some intuitive methods were used earlier to reduce the workload, such as threshold-based segmentation, edge-based segmentation, region-based segmentation [8], and segmentation based on specific theories. These methods rely on low-order visual information of the images themselves, such as gray level, texture, and edge information, so they usually extract only the low-level features of an image and struggle to cope with data that have richer feature information and more image classes. Therefore, traditional segmentation methods are only applicable to specific tasks, rely heavily on expertise for setting hyperparameters, and are neither very generalizable nor very efficient [9].
In recent years, deep convolutional neural networks (CNNs) [10] have been widely used in remote sensing image analysis and show good performance in segmentation tasks due to their powerful recognition ability [11]. Advances in remote sensing technology enable us to obtain higher quality images, creating sufficient conditions for the integration of this field with deep learning [12,13]. Still, unlike the semantic segmentation of traditional images, land cover classification based on deep learning remains a challenging task [14], mainly for the following reasons:
(1) Unlike ordinary images, remote sensing images have a more complex composition: they contain many kinds of targets, and the features of different targets vary greatly, so segmentation is more difficult than in ordinary scenes [15].
(2) Large differences in imaging conditions can easily cause objects with the same semantic class ID to show different features at different spatial locations, and imaging itself can introduce additional noise. For example, highways with regular shapes and colors and rural roads have different features, yet they belong to the same semantic concept [16].
(3) The network model with a high classification accuracy usually has a complex structure, high model complexity, and slow inference speed, which are not conducive to application in practical scenarios, so it is necessary to trade off between algorithm complexity and segmentation accuracy [17].
In summary, a powerful segmentation model should not only be able to accurately identify the semantic information of the target, but also enhance the edge extraction of adjacent targets, which puts higher demands on the feature extraction effect.
The powerful feature representation and learning capability of convolutional networks are more suitable for more centralized processing of large-scale data, saving labor and improving recognition efficiency. Deep neural networks have strong spatial context information mining abilities and better feature fusion effects compared with traditional methods, so more and more researchers use deep learning as the preferred method.
FCN is a framework for image semantic segmentation proposed by Jonathan Long et al. in 2015 [18]. FCN replaces all the fully connected layers of a CNN with convolutional layers, the output of the network is a heat map, and the size of the input image is not limited. SegNet [19] is also based on FCN and modifies the VGG-16 network to obtain a semantic segmentation network that is more accurate than the traditional model. DeepLab v3+ [20] is a classical architecture that eliminates repeated upsampling and proposes a multi-scale structure based on dilated (atrous) convolution, balancing model accuracy and time consumption. Ronneberger et al. proposed U-Net [21], a U-shaped network based on an encoder and decoder that uses skip-connections to recover information lost during downsampling. PSPNet [22], built around a pyramid pooling module for scene parsing, is also commonly used; this network can aggregate contextual information from different regions to mine global contextual information.
In order to investigate the application of deep learning to high-resolution remote sensing image segmentation and to gain a comprehensive understanding of the latest developments in this field, we searched the Google Scholar, WoS, and Scopus databases using the keywords “high-resolution remote sensing image”, “deep learning”, “semantic segmentation”, “network architecture”, and other related terms. Combining the results of these databases, we collected a significant number of deep learning-based methods for the segmentation of high-resolution remote sensing images. UNet++ [23] uses densely nested skip-connections to connect the encoder and decoder and adds a deep supervision mechanism to accelerate training convergence, but its large intermediate convolutions lead to high computational cost, and the dense skip-connections introduce some redundant information. FarSeg [24], proposed by Zheng et al. in 2020, addresses false positives and foreground-background imbalance using relationship-based and optimization-based foreground modeling, where the relationship-based approach is somewhat similar to the self-attention mechanism and is worthwhile. Chen proposed TransUNet [25] in 2021; through a global self-attention mechanism and skip connections, it can effectively obtain overall information, so the model can focus on more detailed features and identify fine spatial information. Attention-UNet [26] focuses more on the salient regions of the image but lacks attention to semantic concepts, leaving a large semantic gap between the encoder and decoder feature maps, so different levels of features may not be captured during decoding. SegFormer [27] is an improved approach based on the Transformer, which performs semantic segmentation by dividing an image into a series of blocks, each processed by a set of Transformer encoders and decoders; the model uses a self-attention mechanism to handle spatial relationships in images. In remote sensing image segmentation, the frequently used single-scale convolutional kernels can limit the range of feature extraction, so some researchers proposed the multi-scale fully convolutional network (MSFCN) [28], which has multi-scale convolutional kernels, a channel attention block (CAB), and a global pooling module (GPM), improving the performance and stability of the model. The DIResUNet [29] model is based on UNet and consists of an initial module, an optimized residual block, and a dense global spatial pyramid pooling (DGSPP) module; it further improves performance and stability by extracting multi-level features and global information in parallel, which is an efficient overall approach. DPPNet [30] is an efficient image segmentation model consisting of a deep pyramid pool (DPP) block and a dense block with deep multi-dilation residual connections; the model takes full advantage of the dense pyramid structure while considering multiple levels of features in prediction.
As the aforementioned discussion shows, high-resolution remote sensing images are richer in feature types and details than ordinary images and have an unbalanced target distribution. In addition, targets with small sizes or small sample counts are easily misclassified, which increases the classification difficulty for the network. The features in remote sensing images have complex structures and rich texture information; some targets of the same class have large feature differences, while some targets of different classes have similar features, which places higher requirements on the feature extraction and abstraction ability of the network. In this study, we adopt UNet as the benchmark model, combine various attention mechanisms, and add an attention fusion module for feature fusion at the skip-connection to improve the generalization ability of the model and better cope with these problems. Because of the rich contextual information of remote sensing images, the algorithm requires upsampling of non-uniformly sampled data. Therefore, in this paper we adopt an upsampling operator different from the traditional methods, CARAFE (Content-Aware Reassembly of Features), which can adaptively adjust the size and shape of the convolution kernel and, by learning the interrelationship of features, better adapt to non-uniformly sampled data and improve the segmentation performance of the network with little computation, optimizing the accuracy of target edge segmentation. The main contributions of this paper are as follows:
(1) Based on the coordinate attention mechanism, a residual network is used in the backbone to alleviate the gradient dispersion and gradient explosion problems that arise as the network deepens. At the same time, the generalization performance and fine-grained feature extraction ability of the backbone network are improved without increasing the computational cost.
(2) We propose an attention fusion module for the skip-connection to improve the network's feature fusion capability.
(3) We use the content-aware reorganization module CARAFE [31] in the decoder instead of the traditional upsampling method to improve the contextual information aggregation capability without increasing the computational effort.

2. Methods

This model uses the UNet architecture as the baseline for its structural design. The UNet model is symmetrical and follows the classical encoder-decoder structure. After the encoder performs successive convolution and downsampling, a feature map with a small resolution but a condensed high-dimensional semantic representation is generated. The upsampling part, called the expanding path, localizes the target object and is symmetric with the downsampling part. Each layer uses the copy-and-crop method to complete the skip-connection operation, and a segmentation result with the same size as the original image is obtained after continuous upsampling. The combination of high-dimensional and low-dimensional feature maps makes the information extracted by the model more comprehensive and the model more sensitive to detailed information. The skip-connection operation directly concatenates the more accurate gradient, point, line, and other information from the encoder of a given layer into the decoder of the same layer.
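As a minimal illustration of this copy-and-concatenate skip-connection, the following hedged PyTorch sketch shows one decoder step; the channel sizes and the transposed-convolution upsampling are our own illustrative choices, not the exact layer configuration of the baseline.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One UNet-style decoder step: upsample, concatenate the encoder
    feature map of the same level (skip-connection), then convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch // 2 + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # restore spatial resolution
        x = torch.cat([x, skip], dim=1)  # fuse encoder detail with decoder semantics
        return self.conv(x)

# Example: a 512-channel 16x16 decoder map fused with a 256-channel 32x32 encoder map.
block = DecoderBlock(in_ch=512, skip_ch=256, out_ch=256)
out = block(torch.randn(1, 512, 16, 16), torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```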
When applying UNet to remote sensing image segmentation, several drawbacks arise. Firstly, its architecture, based on a full convolutional neural network and multiple upsampling and downsampling operations, incurs a significant amount of memory and computational overhead. The resolution degradation caused by scaling is not conducive to improving the segmentation effect. Secondly, convolution and pooling operations tend to cause the loss of boundary information. Finally, the baseline model lacks corresponding measures to cope with the noise in the image. This task is challenging when applied to remote sensing images with a large number of targets with complex features.
To address these problems, this study introduces multiple attention mechanisms applied at different locations in the network. Additionally, an efficient upsampling method is embedded to propose a UNet based on CARAFE and multiple attention mechanisms, which we call CA-UNet.

2.1. Structure of CA-UNet

In CA-UNet, we use residual blocks and the coordinate attention module to construct the encoder, improving the fine-grained extraction of targets in the backbone network and reducing the detail loss caused by pooling layers during downsampling. For downsampling, the model uses a 3 × 3 convolution with stride 2 to improve the feature extraction capability. Instead of the traditional upsampling method, CA-UNet uses the CARAFE operator in the decoder. Attention fusion modules are also constructed to combine spatial and channel attention features in different feature layers. The overall network structure is shown in Figure 1. The functionality and construction of each module are described below.

2.2. Residual Encoder Based on Coordinate Attention

As networks deepen, the training error can stop decreasing or even increase, which is known as the degradation problem; hence, adding more layers to a model may sometimes reduce performance. The residual block [32] alleviates this problem without increasing the number of parameters or the computational complexity of the model. Part of the input in this structure bypasses the convolutional layers and is added directly to the output, which retains part of the shallow information and avoids the loss of feature information as the feature extraction network deepens; previous experiments also show that the identity shortcut connection achieves better results. The specific process is shown in Figure 2. Batch normalization (BN) is a popular regularization technique in deep neural networks that normalizes the inputs in each batch, resulting in better fitting of the network's activation function, reduced gradient vanishing or explosion, accelerated training, and improved accuracy. In residual networks, BN is often combined with the ReLU activation function to address the vanishing gradient caused by ReLU's zero gradient in the negative part. Furthermore, BN can mitigate the instability that may arise when the gradient of the ReLU activation becomes large due to large inputs, limiting the input to a suitable range and thereby improving model robustness and generalizability.
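A minimal PyTorch sketch of such a residual unit with batch normalization and ReLU is given below; the channel counts and the projection shortcut used when shapes change are illustrative assumptions rather than the exact blocks of our encoder.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual unit: two 3x3 conv-BN layers plus an identity (or
    projected) shortcut, so shallow information bypasses the convolutions."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the shape changes (stride or channel count).
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # add the skipped input, then activate
```

With stride=2, the same unit would also realize the 3 × 3, stride-2 downsampling convolution mentioned in Section 2.1, although the exact placement of that convolution in our encoder follows Figures 1 and 2 rather than this sketch.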
However, only utilizing the residual module in the backbone does not effectively improve the performance of the encoder. When we consider the computational effort and the network performance, the attention module is an essential part used to tell a model “what” and “where” to attend and it has been widely used in various deep learning models to improve network performance; the most commonly used attention modules [33] are squeeze-and-excitation (SE) attention [34] and CBAM [35]. SE block has been extensively applied in remote sensing image segmentation tasks in recent years. However, the SE block design only considers the influence of channel relationships on features, ignoring spatial location information. In remote sensing image segmentation tasks, spatial location information is crucial for generating spatially selective attention maps. The lack of spatial location information modeling may lead to the inability of attention maps to accurately represent the spatial relationships between pixels, thus affecting the segmentation accuracy. Although CBAM considers information in both spatial and channel dimensions, it is computationally intensive and increases the model’s complexity and computational cost, making it less suitable for scenes requiring high computational resources. Additionally, CBAM performance is highly dependent on the convolutional layer input, and for some special scenes or irregular remote sensing image inputs, it may not be able to extract features accurately or make effective adjustments, resulting in degraded image segmentation performances. We introduce a lightweight coordinate attention module [36], which is different from the previous mechanism. Coordinate attention is an attention mechanism that can focus on specific locations in remote sensing images that require special attention. It has been shown to improve the recognition of multi-scale target information and enhance model robustness by adjusting the weights for each location and channel to enhance the variability of features. Additionally, it does not add a significant computational cost, making it a practical choice for various applications. By using coordinate attention, a model can make more comprehensive use of image information.
As shown in Figure 3, for the coordinate information embedding part, given any input tensor, the module uses two pooling kernels (H, 1) and (1, W) to encode and calculate each channel along the horizontal and vertical coordinates, and aggregates features in the spatial direction through the transformation of the two directions to generate a pair of feature maps. For each pixel point of the feature map, a coordinate vector is generated to encode its position within the feature map. This coordinate vector is then embedded into the feature vector using a fully connected layer to enhance the location information’s influence on the feature representation. The resulting feature vector and the location-embedded coordinate vector are then concatenated and embedded into a vector of dimension C using another fully connected layer. This vector represents the pixel point’s weight on different channels. The channel-embedded vector is then normalized using a function to obtain the weight of the pixel on all channels, which is subsequently multiplied with the feature vector to obtain the weighted feature vector of the pixel. This process enables the network to learn which locations require special attention during computation. The resulting weighted feature maps are globally pooled and averaged over the channel dimensions to obtain the average of each channel. Next, the weights of each channel are obtained through two fully connected layers and the sigmoid function. Finally, these weights are multiplied with the feature maps to obtain the weighted feature maps, allowing the network to learn which channels require special attention during computation. The resulting weighted feature maps are then concatenated in the channel dimension to obtain the final feature maps. This approach captures the dependence of one spatial direction while retaining the precise information of the other, which helps the model to locate the desired area of attention. In the coordinated attention generation stage, the second transformation is used to encode location information using the globalized receptive field, which satisfies the following three conditions. Firstly, this transformation should be as simple as possible to ensure that it does not increase the computational burden; secondly, to improve the information extraction ability and conversion efficiency, the information captured in the previous stage should be fully utilized; finally, the relationship between the channels should not be ignored. This encoding process allows us to coordinate attention to more accurately locate the exact position of the target object, thus facilitating better recognition throughout the model.
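A simplified PyTorch sketch of this coordinate attention computation is given below; the reduction ratio, the minimum hidden width, and the use of ReLU in place of the non-linear activation of [36] are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Simplified coordinate attention: pool along H and W separately,
    encode the two direction-aware maps jointly, then re-weight the input."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        _, _, h, w = x.shape
        x_h = self.pool_h(x)                          # aggregate along the width
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # aggregate along the height
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * a_h * a_w  # position-aware channel re-weighting
```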

2.3. Feature Fusion Decoder Based on Attention Mechanism and CARAFE

During the down-sampling process, the resolution of the image is gradually reduced to obtain image information of different scales. This allows the network to extract features at different levels of abstraction, starting from low-level information, such as points, lines, and gradients in the underlying features, and gradually moving towards more abstract information such as contours and shapes. The entire network combines these “fine to coarse” features to obtain a comprehensive representation of the image. In the decoding stage, the upsampling operator is used to restore the feature map to its original size, enabling the network to generate precise predictions. Feature upsampling is a crucial component of modern convolutional network architecture and is particularly important in tasks such as instance segmentation. The methods commonly used in previous studies for upsampling, namely bilinear interpolation, deconvolution, and transposed convolution, have limitations when applied to feature-rich and detailed images such as remote sensing images. Bilinear interpolation uses a weighted average of neighboring pixels, but this method fails to capture the intricate texture information in the image, resulting in a less-detailed image after upsampling. Furthermore, bilinear interpolation is a global upsampling method that applies zoom to all pixels of the image, potentially introducing noise or artifacts that can degrade the accuracy of remote sensing image segmentation. Deconvolution, while capable of producing high-quality upscaled images, requires many reverse convolution operations and is computationally expensive for larger input images used in remote sensing image segmentation. Moreover, the upsampling process may cause pixel misalignment, further degrading image quality. Transposed convolution also suffers from computational intensity and produces a tessellation effect or high-frequency noise in the reconstructed high-resolution images. Therefore, alternative methods, such as sub-pixel convolution, nearest-neighbor interpolation, and pixel shuffle, have been proposed to address these issues in remote sensing image segmentation. In this model, the lightweight CARAFE [31] is used as the upsampling operator. Compared to traditional upsampling methods, CARAFE can effectively suppress the tessellation effect and noise. By introducing an asymmetric convolution kernel and deformable convolution operation, it can better preserve the spatial feature information of the input feature map and improve the quality of the reconstructed image. Instead of manually specifying the output size, the output size is adaptively adjusted according to the input feature map, making it more flexible for different remote sensing image segmentation tasks of varying sizes. Additionally, it boasts a higher computational efficiency.
As shown in Figure 4, given a feature map $\mathcal{X}$ of size $C \times H \times W$ and an upsampling rate $\sigma$, CARAFE generates a new feature map $\mathcal{X}'$ of size $C \times \sigma H \times \sigma W$. For each target position $l' = (i', j')$ of the output $\mathcal{X}'$, there is a corresponding position $l = (i, j)$ at the input, where $i = \lfloor i'/\sigma \rfloor$ and $j = \lfloor j'/\sigma \rfloor$, and $N(\mathcal{X}_l, k)$ denotes the $k \times k$ neighborhood centered at position $l$. The kernel prediction module predicts a reassembly kernel $\omega_{l'}$ for each position $l'$ based on the neighborhood of $\mathcal{X}_l$, as Equations (1) and (2) show:
$$\omega_{l'} = \psi\left(N\left(\mathcal{X}_l, k_{encoder}\right)\right) \quad (1)$$
$$\mathcal{X}'_{l'} = \phi\left(N\left(\mathcal{X}_l, k_{up}\right), \omega_{l'}\right) \quad (2)$$
Each location on $\mathcal{X}$ corresponds to $\sigma^2$ target locations on $\mathcal{X}'$, and each target location requires a reassembly kernel of size $k_{up} \times k_{up}$. The content encoder takes the compressed feature map as input and encodes its content to generate the reassembly kernels; the kernel normalizer applies a softmax function to each reassembly kernel. For each reassembly kernel $\omega_{l'}$, the content-aware reorganization module reassembles the features within the local region by means of the function $\phi$, which is simply a weighted-sum operator. For the target location $l'$ and the corresponding square region $N(\mathcal{X}_l, k_{up})$ centered at $l = (i, j)$, the reassembly process is shown in Equation (3):
$$\mathcal{X}'_{l'} = \sum_{n=-r}^{r}\sum_{m=-r}^{r} \omega_{l'}(n, m) \cdot \mathcal{X}_{(i+n,\, j+m)}, \quad r = \lfloor k_{up}/2 \rfloor \quad (3)$$
Within the neighborhood $N(\mathcal{X}_l, k_{up})$, each pixel contributes to the upsampled pixel $l'$ according to the feature content rather than the spatial distance. Since the relevant regions receive more attention, the semantic expressiveness of the reassembled feature map is stronger. The parameters of the three submodules of the kernel prediction module are set as follows. The channel compressor uses a 1 × 1 convolutional layer to compress the input feature channels from $C$ to $C_m$. The content encoder uses a convolutional layer of size $k_{encoder} \times k_{encoder} \times C_m \times C_{up}$, where increasing $k_{encoder}$ enlarges the receptive field of the encoder; however, the computational complexity grows rapidly with the kernel size, so, following the original authors' experience, we set $k_{encoder} = k_{up} - 2$ to trade off performance and efficiency. Finally, in the kernel normalizer, each $k_{up} \times k_{up}$ reassembly kernel is spatially normalized with a softmax function before being applied to the input feature map.
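To make the reassembly concrete, here is a hedged, naive PyTorch sketch of CARAFE built from standard operators (unfold, pixel shuffle, softmax); the official implementation uses a dedicated memory-efficient CUDA kernel, and the compressed channel width, $k_{up}$, and $k_{encoder}$ values below are illustrative defaults.

```python
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-aware reassembly sketch: predict a k_up x k_up kernel for every
    upsampled position, normalize it with softmax, and use it to reassemble
    the corresponding source neighborhood."""
    def __init__(self, channels, c_mid=64, sigma=2, k_up=5, k_encoder=3):
        super().__init__()
        self.sigma, self.k_up = sigma, k_up
        self.compressor = nn.Conv2d(channels, c_mid, 1)              # channel compressor
        self.encoder = nn.Conv2d(c_mid, sigma * sigma * k_up * k_up,
                                 k_encoder, padding=k_encoder // 2)  # content encoder

    def forward(self, x):
        n, c, h, w = x.shape
        s, k = self.sigma, self.k_up
        # 1. Kernel prediction: one k*k kernel per output position, softmax-normalized.
        kernels = self.encoder(self.compressor(x))                   # (N, s*s*k*k, H, W)
        kernels = F.pixel_shuffle(kernels, s)                        # (N, k*k, sH, sW)
        kernels = F.softmax(kernels, dim=1)
        # 2. Reassembly: gather each source k*k neighborhood and apply the kernel.
        patches = F.unfold(x, k, padding=k // 2)                     # (N, C*k*k, H*W)
        patches = patches.view(n, c, k * k, h, w)
        patches = F.interpolate(
            patches.view(n, c * k * k, h, w), scale_factor=s, mode="nearest"
        ).view(n, c, k * k, s * h, s * w)                            # each source patch feeds s*s outputs
        out = (patches * kernels.unsqueeze(1)).sum(dim=2)            # weighted sum over the k*k window
        return out                                                   # (N, C, sH, sW)
```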
The skip-connection of UNet is an effective way to retain feature information at different scales; however, simple concatenation has drawbacks: it may lose critical feature information and may not handle the relationship between feature maps of different resolutions effectively, leading to the loss of detailed information. Remote sensing images have a higher resolution than traditional images and contain more feature scales, so cross-layer connection methods need to be selected and optimized carefully when applied to remote sensing image segmentation in order to address the limitations of simple concatenation. This paper designs an attention fusion module that combines different attention modules to enhance recognition capability, strengthen the distinction between targets at different scales, refine the features between different classes of targets, and address the multi-scale problem.
As shown in Figure 5, this module takes feature maps extracted from the encoder and decoder as inputs. The final feature map is restored to the same spatial resolution as the input image, and the output of the attention mechanism is re-connected with the feature map after the corresponding operation. All three convolutional kernels are 1 × 1 in size, and the information from multiple channels is fused through channel conversion to enhance the network's information extraction capability.
For feature maps with different dimensions and characteristics, an appropriate attention mechanism should be selected so that the network can more easily locate and refine the regions of interest. In U-shaped networks, shallow feature maps have large resolutions, where the spatial feature distribution has a significant impact on fusion. To address this, we employ a spatial attention-based fusion module in CA-UNet. First, we reduce the number of channels by passing the input feature map through a 1 × 1 convolutional layer to reduce the computational effort. Next, we compute the Query (the output of the previous layer), the Key (the feature associated with the Query), and the Value (the object to be weighted and averaged, representing the intermediate feature map of the current layer) using three different 1 × 1 convolutional layers. We then compute the attention score as Attention Score = Query × Key. The attention score is normalized by softmax to obtain the Attention Map, whose values range from 0 to 1 and indicate the weight of each position under the attention mechanism. We weight and fuse the Attention Map with the Value to obtain the final output feature map. The fusion is performed in the channel or spatial dimension, and we use a 1 × 1 convolutional layer in this paper.
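The sketch below illustrates one way this Query/Key/Value fusion could be realized in PyTorch; treating the decoder map as the Query source, the encoder map as the Key/Value source, and re-connecting the attention output residually are our own assumptions about details that Figure 5 leaves implicit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionFusion(nn.Module):
    """Sketch of the spatial-attention fusion at the skip-connection: the decoder
    map provides the Query, the encoder map provides the Key and Value, and a
    softmax-normalized attention map re-weights the encoder features."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(1, channels // 8)
        self.query = nn.Conv2d(channels, reduced, 1)   # from decoder features
        self.key = nn.Conv2d(channels, reduced, 1)     # from encoder features
        self.value = nn.Conv2d(channels, channels, 1)  # object to be weighted and averaged
        self.proj = nn.Conv2d(channels, channels, 1)   # output 1x1 fusion convolution

    def forward(self, enc, dec):
        n, c, h, w = enc.shape
        q = self.query(dec).flatten(2).transpose(1, 2)   # (N, HW, C')
        k = self.key(enc).flatten(2)                     # (N, C', HW)
        v = self.value(enc).flatten(2)                   # (N, C, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)        # (N, HW, HW) attention map
        out = torch.bmm(v, attn.transpose(1, 2)).view(n, c, h, w)
        return self.proj(out + dec)                      # re-connect with the decoder map
```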
For the smaller-scale, high-dimensional feature maps, the features are first compressed along the channel dimension and then fused using the channel attention module. In this study, we adopt an efficient channel attention mechanism, since traditional channel attention modules that capture the dependencies of all channels are not always necessary or efficient. The model uses ECANet [37], which removes the fully connected layers of the original channel attention module and learns directly via a 1D convolution on the globally average-pooled features. The size of the convolution kernel determines how many neighboring channels are considered when computing each attention weight, i.e., the coverage of cross-channel interaction. The process is shown in Figure 6, in which two convolutional layers are used: a 1 × 1 convolutional layer for feature map compression and a 3 × 3 convolutional layer for generating the attention map. In addition, the padding of the convolution kernel is set to 1 to keep the input and output feature maps the same size. The weights of each position in the attention map are mapped by the sigmoid activation function.
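As a reference, the core ECA operation (global average pooling followed by a 1D convolution across channels and a sigmoid) can be sketched as follows; the kernel size of 3 matches the padding-1 setting above, while the extra 1 × 1 compression layer of our fusion module is omitted from this minimal sketch.

```python
import torch
import torch.nn as nn

class ECAAttention(nn.Module):
    """ECA-style channel attention: global average pooling followed by a 1D
    convolution across channels (no fully connected layers), then a sigmoid."""
    def __init__(self, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # The kernel size controls how many neighboring channels interact per weight.
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x):
        n, c, _, _ = x.shape
        y = self.avg_pool(x).view(n, 1, c)               # (N, 1, C): one descriptor per channel
        y = torch.sigmoid(self.conv(y)).view(n, c, 1, 1)
        return x * y                                     # channel-wise re-weighting
```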

3. Experiments and Results

This section introduces the two datasets and experimental settings used in our study. We also compare the performance of commonly used models on these two datasets to verify the effectiveness of CA-UNet. To validate our model fairly, we conducted experiments on the public WHDLD dataset and on the multi-class land cover dataset we produced for Lushi County.

3.1. Datasets

3.1.1. WHDLD Datasets

The WHDLD dataset [12] is a widely used public dataset in remote sensing image segmentation experiments. The Gaofen-1 satellite captures images of Wuhan City in panchromatic mode, which images a single wavelength band, typically within the visible or near-infrared range, to achieve high spatial resolution. These images are then combined with multi-spectral images captured at multiple wavelength bands to obtain RGB images with high resolution and color accuracy through image fusion and resampling. The spatial sampling step of the panchromatic images is 2 m per pixel. Each image is cropped to 256 × 256 pixels and labeled into 6 categories: bare soil, building, road, water, pavement, and vegetation. The dataset contains 4940 images in total. Figure 7 displays sample images and their corresponding labels.

3.1.2. Lushi County Datasets

The multi-class land cover dataset of Lushi County was produced by processing and labeling high-resolution remote sensing images of Lushi County. The dataset is also an RGB dataset. An example of the original remote sensing imagery is shown in Figure 8; the image resolution is 2 m × 2 m, the imagery was pre-processed with ArcGIS software, and LabelMe was used for data labeling.
The multi-class land cover remote sensing image dataset of Lushi County was labeled by us according to the Third Land Survey Manual, combined with high-resolution remote sensing images of Lushi County with a spatial resolution of 2 m/pixel. The images were cropped to 256 × 256 pixels and labeled into 5 categories: trees, road, building, water, and field, with a total of 5100 images. Figure 9 displays sample images and their corresponding labels.
From the two figures above, we can see that this multi-class land cover dataset differs from datasets used for general semantic segmentation: its scenes are more complex, there are more categories to recognize, and there are large differences between different targets, such as farmland, road, and field features. In addition, targets of the same class can also show large feature differences, such as highways versus countryside dirt roads, which makes segmentation much more difficult. How to cope with the chaotic distribution of targets and the large feature differences therefore becomes an important problem.

3.2. Evaluation Metrics

In this experiment, four commonly used semantic segmentation evaluation metrics are selected for evaluating the performance of the classification algorithm and comparing it with other advanced models. The calculation method of each metric is described below.
In semantic segmentation, the intersection over union (IOU) of a class is the ratio of the intersection to the union of the true labels and the predicted values for that class. The mean intersection over union (mIOU) is the average of the IOU values of all classes on the dataset. In this experiment, mIOU is selected as the main metric, as it reflects the overall accuracy and consistency. Equation (4) illustrates the calculation of mIOU, where TP denotes pixels whose true value is positive and whose prediction is positive, FN denotes pixels whose true value is positive and whose prediction is negative, FP denotes pixels whose true value is negative and whose prediction is positive, and TN denotes pixels whose true value is negative and whose prediction is negative.
$$mIOU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP}{FN + FP + TP} \quad (4)$$
The mean pixel accuracy (MPA) calculates the percentage of correctly classified pixels for each category separately, then sums and averages these values, as in Equation (5).
$$MPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP + TN}{FN + FP + TP + TN} \quad (5)$$
Precision indicates the proportion of samples identified by the model as positive that are truly positive, and recall indicates the ratio of the number of samples correctly identified as positive to the total number of positive samples, as shown in Equations (6) and (7).
$$P = \frac{TP}{TP + FP} \quad (6)$$
$$R = \frac{TP}{TP + FN} \quad (7)$$
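For reference, a common way to compute these metrics from a per-pixel confusion matrix is sketched below in Python/NumPy; computing the per-class pixel accuracy as TP over the total pixels of that class is an assumption about how the formulas above are evaluated in practice, reflecting the usual implementation of MPA.

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Accumulate a (num_classes x num_classes) confusion matrix from flat label/prediction arrays."""
    mask = (label >= 0) & (label < num_classes)
    return np.bincount(
        num_classes * label[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)

def miou_and_mpa(cm):
    """Per-class IOU and pixel accuracy from a confusion matrix, averaged over classes."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp           # predicted as class i but actually another class
    fn = cm.sum(axis=1) - tp           # actually class i but predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1)
    pa = tp / np.maximum(cm.sum(axis=1), 1)
    return iou.mean(), pa.mean()
```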

3.3. Implementation Details

In this experiment, all models were tested in an environment configured with CUDA 10.0, PyTorch 1.8.0, and Python 3.6. To ensure fair experimental conditions, uniform data augmentation, including random cropping and rotation, was applied to the experimental data, and the final input size was 256 × 256 × 3 for both datasets. All experiments were run on a server equipped with a Tesla P100 GPU with 16 GB of video memory and Windows 10 as the operating system. The adaptive learning rate optimizer Adam was used for network updates and parameter optimization because of its low memory requirements and high computational efficiency. The initial learning rate during training was 0.001, with a randomly sampled batch size of 8 and 150 epochs. Cross-validation was used for training.
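A hedged sketch of this training configuration is given below; the stand-in model, the random tensors standing in for the augmented 256 × 256 crops, and the cross-entropy objective are assumptions, since the exact model class, data pipeline, and loss function are not spelled out above.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Conv2d(3, 6, kernel_size=1).to(device)          # placeholder for CA-UNet (6 WHDLD classes)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, initial learning rate 0.001
criterion = nn.CrossEntropyLoss()                          # assumed per-pixel objective

num_epochs, batch_size = 150, 8
for epoch in range(num_epochs):
    model.train()
    for _ in range(10):                                    # stand-in for the augmented data loader
        images = torch.randn(batch_size, 3, 256, 256, device=device)
        masks = torch.randint(0, 6, (batch_size, 256, 256), device=device)
        optimizer.zero_grad()
        loss = criterion(model(images), masks)
        loss.backward()
        optimizer.step()
```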

3.4. Results and Analysis

3.4.1. Experimental Results Obtained on the WHDLD Dataset and Lushi County Dataset

CA-UNet was used to conduct experiments on the two high-resolution remote sensing image datasets under the same conditions and showed good performance on both; our model obtained an mIOU of 63.27% on the WHDLD dataset, 4.03 percentage points higher than the baseline model. The experimental metrics for each category in the WHDLD dataset are shown in Table 1.
The experimental metrics for each category in the Lushi County dataset are shown in Table 2.

3.4.2. Comparison of Predicted Results

In order to further compare the experimental results of our method with the baseline UNet, we used the training results of the two models on the WHDLD dataset and the Lushi County dataset to predict some complex scenes and visualize the experimental results; the prediction results are shown in Figure 10 and Figure 11.
The segmentation results for Lushi County are shown in Figure 10, and, according to the experimental results, the segmentation effect of the present model is significantly improved. In the Lushi County dataset, as shown in the black box, the original model classifies large areas of water and the adjacent land boundary into the wrong category, which is well mitigated in the improved model; the original model also misclassifies trees as agricultural land. In addition, the improved model performs better in classifying target edges, as shown in the second row of Figure 10, where the forest boundary is handled better and the misclassified area is smaller.
The segmentation results on the WHDLD dataset are shown in Figure 11. The classification results on slender targets are significantly better than those of U-Net; as shown in the black box in the first row, the vegetation near the edge of the image is segmented better, while UNet loses these targets entirely. The edge segmentation of buildings is also more accurate. In the small-target recognition results in the second row, our model extracts the contours of small targets more accurately, while U-Net does not recognize these targets at all. In the third row, where roads, buildings, and bare soil are adjacent to each other, our model performs better: it not only correctly identifies the targets but also depicts the adjacent edges more accurately.
To further analyze the improvements made by our model, we evaluated its performance on the more convincing WHDLD public dataset in comparison to the baseline models. A confusion matrix was used to display the performance of each pixel when classified, enabling the evaluation of the segmentation accuracy of the whole image. By analyzing the categories that were more commonly misclassified by the model, we were able to target specific areas for improvement and enhance the overall performance of the model. The experimental results of the baseline models UNet and CA-UNet on the WHDLD dataset are presented in Table 3 and Table 4, respectively.
After analysis, we observed that the percentage of pixels correctly classified for all target types was improved by 1.69%, 3.96%, 9.59%, 1.75%, 11.29%, and 1.85%, respectively, with significant improvements in the accuracy of small targets such as roads and buildings. Moreover, the percentage of pixels misclassified by the model decreased.

3.4.3. Comparison with Other Methods

To further validate the effectiveness of our model, we selected several classical models and recently proposed methods by researchers for semantic segmentation of remote sensing images to conduct a comparative analysis. The experimental results are presented in Table 5 and Table 6, respectively.
U-Net is used as the baseline model; its encoder continuously reduces the resolution during downsampling to obtain image information at different scales, with the extracted information gradually changing from points, lines, and gradients in the bottom layers to contours and more abstract information in the top layers. Its skip-connection structure fuses information at the same scale, which is equivalent to adding more detailed information and is beneficial for the information-rich task of remote sensing image semantic segmentation; this explains its good performance. The improved networks that followed make the following contributions. PSPNet has a hierarchical global prior, contains modules with different scale information between different sub-regions, fuses four pyramid-scale features, and improves the contextual inference ability of the model. SegNet's innovation lies in the way the decoder upsamples its lower-resolution feature maps using the pooling indices recorded in the encoder, which eliminates the need to learn the upsampling; in U-Net, the missing context at the image border is instead extrapolated by mirroring the input image. For remote sensing images with richer features, UNet++ indirectly fuses multiple features at different levels through operations such as short connections and up/down sampling and achieves better results on both datasets. However, although UNet++ indirectly fuses features of different receptive fields, it only fuses information from the adjacent layer without fusing higher layers, so its decoder is not sufficiently fine-grained and marginal information may be lost in the segmentation results; for remote sensing datasets rich in edge information, this can be considered a major flaw. DeepLabv3+ uses a spatial pyramid to capture rich contextual information through pooling operations at different resolutions and uses an encoder-decoder structure to gradually recover a clear image; its ASPP effectively captures multi-scale information thanks to parallel convolution layers with multiple atrous rates, which makes the model perform better on multi-scale objects, and its segmentation results are better than those of SegNet and PSPNet. SegFormer, an improved Transformer-based approach that uses a self-attention mechanism to process spatial relationships in images, achieved mIOUs of 60.37 and 70.17 on the two datasets. The multi-scale fully convolutional network MSFCN obtained mIOUs of 60.37 and 70.31, respectively. The residual-structure-based ResUNet obtained mIOUs of 55.35 and 67.72. DPPNet, consisting of deep pyramid pool (DPP) blocks and dense blocks connected by deep multi-dilation residuals, obtained mIOUs of 60.92 and 67.98. Our proposed CA-UNet obtained the best results on both datasets, with mIOUs of 63.27 and 72.83, respectively.

3.4.4. Ablation Experiments

To verify the validity of the different components of our proposed model, we performed ablation experiments on two datasets. The results obtained are shown in Table 7, where we use VGG16 as the backbone network of the baseline model, commonly used in various previous models.
We evaluate the contribution of each module on the two multi-class land cover datasets by adding the improved modules one at a time while keeping all other experimental conditions unchanged. Experiment 1 reports the results of the baseline UNet. Experiment 2 adds the coordinate attention-based residual structure to the encoder; the mIOU improves by 1.34% and 2.27% over UNet, respectively. Experiment 3 builds on Experiment 2 by replacing the upsampling operator in the decoder with the lightweight operator CARAFE; CARAFE gives better results with a significantly lower number of parameters, and the mIOU improves by a further 2.14% and 1.89%, respectively. Experiment 4 introduces the attention fusion module in the decoder on top of Experiment 3, and the mIOU improves by another 0.55% and 1.48%, respectively. The ablation experiments show that, by adding multiple attention modules and using a more efficient upsampling operator, CA-UNet achieves a significant improvement in segmentation performance on multi-class land cover datasets.

4. Discussion

We provide a segmentation method for the land cover classification of high-resolution remote sensing images. In remote sensing images, the wide variety of target types inherently puts the network's feature extraction capability to a great test; in addition, the high reflectance variation of pixels within the same category makes it more difficult for the model to recognize objects of interest, so contextual information extraction is crucial for enhancing feature recognition. This is why we introduce CARAFE, which not only reduces the number of parameters and eases the computational burden but also improves the feature integration capability of the model, balancing parameter count against feature extraction capability. Additionally, our model includes several attention modules, such as coordinate attention and a fused attention module based on spatial and channel attention, which effectively improve the feature representation ability of the model, as supported by the experimental and prediction results. Through comparison and ablation experiments, we demonstrate that our model outperforms the baseline model UNet, effectively alleviating the inaccurate segmentation of small targets in remote sensing images and improving the accuracy of adjacent target boundary segmentation.

5. Conclusions

We propose a U-shaped network (CA-UNet) based on a multi-attention mechanism and CARAFE for the semantic segmentation of high-resolution remote sensing images. We construct a residual encoder based on the coordinate attention mechanism in the backbone. The residual block enhances the nonlinear fitting ability of the network, combines with the coordinate attention to enhance the focus on spatial location information, and more accurately captures key features. This helps to identify small targets and boundaries between different targets in remote sensing segmentation tasks and improves the multi-scale information feature extraction ability of the model.
The skip-connection is an important structure of UNet, and we embed an attention fusion module in this position to combine spatial attention and channel attention. The comprehensive use of features of different scales is crucial when segmenting remote sensing images, as the size and shape of different targets are different, and this module can further improve the performance of the model. Finally, we use the CARAFE operator to replace the traditional upsampling module. This operator is less computationally intensive and obtains more accurate contextual information through adaptive feature reorganization. It alleviates the information loss problem when upsampling and is very effective for segmenting targets of different scales and sizes of remote sensing images.
Our method achieved an mIOU of 72.83% on the Lushi County land cover dataset, which is much higher than the 67.19% of the baseline UNet, and it also has advantages over the other commonly used models in the experiments. On the WHDLD dataset, our model achieved an mIOU of 63.27%, higher than the 59.24% of UNet, and also achieved better results than the other commonly used models. Our model performs well on multi-class remote sensing image datasets, especially multi-class land cover datasets, with high segmentation accuracy and strong generalizability. However, it also has limitations, such as handling larger image resolutions and recognizing finer, denser targets. In addition, the segmentation of remote sensing images with strong noise remains a difficult problem.
We plan to conduct further research focusing on more efficient algorithm designs that are easier to deploy and on fusing multi-modal data, such as radar and spectral data, to improve segmentation accuracy and robustness. Finally, we aim to combine the method with other algorithms, such as path planning, to realize end-to-end applications.
Semantic segmentation research has already been widely applied in the medical, military, and autonomous driving fields, and the deep learning-based segmentation of high-resolution remote sensing images is gaining more and more attention from researchers [38]. With wide practical significance, it has been applied in agriculture [39,40,41,42], environmental monitoring, urban planning, and other fields.

Author Contributions

Methodology, J.J. Conceptualization, J.J. and X.S. Software, J.J. Validation, J.J., H.Y. and Y.T. Formal analysis, J.J. and H.Y. Investigation, J.J. and X.S. Resources, X.S. Data curation, J.J. and J.S. Writing—original draft preparation, J.J. Writing—review and editing, X.S. and J.S. Visualization, J.J. and Q.K. Supervision, X.S. Project administration, X.S. Funding acquisition, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key Research and Development Program of China (No.2021YFD1700904-2) and the Youth Science and Technology Innovation Project of Henan Bureau of Geological and Mineral Exploration and Development [2021] No.4.

Data Availability Statement

The Lushi datasets presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Martinez-Gonzalez, P.; Garcia-Rodriguez, J. A survey on deep learning techniques for image and video semantic segmentation. Appl. Soft Comput. 2018, 70, 41–65. [Google Scholar] [CrossRef]
  2. Niu, R.G.; Sun, X.; Tian, Y.; Diao, W.H.; Chen, K.Q.; Fu, K. Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5603018. [Google Scholar] [CrossRef]
  3. Majid, M.; Habib, S.; Javed, A.R.; Rizwan, M.; Srivastava, G.; Gadekallu, T.R.; Lin, J.C.W. Applications of Wireless Sensor Networks and Internet of Things Frameworks in the Industry Revolution 4.0: A Systematic Literature Review. Sensors 2022, 22, 2087. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, L.F.; Zhang, L.P. Artificial Intelligence for Remote Sensing Data Analysis: A Review of Challenges and Opportunities. IEEE Geosci. Remote Sens. Mag. 2022, 10, 270–294. [Google Scholar] [CrossRef]
  5. Kwan, C.; Ayhan, B.; Budavari, B.; Lu, Y.; Perez, D.; Li, J.; Bernabe, S.; Plaza, A. Deep Learning for Land Cover Classification Using Only a Few Bands. Remote Sens. 2020, 12, 2000. [Google Scholar] [CrossRef]
  6. Zhu, Z.; Woodcock, C.E. Continuous change detection and classification of land cover using all available Landsat data. Remote Sens. Environ. 2014, 144, 152–171. [Google Scholar] [CrossRef] [Green Version]
  7. Joshi, N.; Baumann, M.; Ehammer, A.; Fensholt, R.; Grogan, K.; Hostert, P.; Jepsen, M.R.; Kuemmerle, T.; Meyfroidt, P.; Mitchard, E.T.A.; et al. A Review of the Application of Optical and Radar Remote Sensing Data Fusion to Land Use Mapping and Monitoring. Remote Sens. 2016, 8, 70. [Google Scholar] [CrossRef] [Green Version]
  8. Vali, A.; Comai, S.; Matteucci, M. Deep Learning for Land Use and Land Cover Classification Based on Hyperspectral and Multispectral Earth Observation Data: A Review. Remote Sens. 2020, 12, 2495. [Google Scholar] [CrossRef]
  9. Maxwell, A.E.; Warner, T.A.; Fang, F. Implementation of machine-learning classification in remote sensing: An applied review. Int. J. Remote Sens. 2018, 39, 2784–2817. [Google Scholar] [CrossRef] [Green Version]
  10. Cheng, G.; Xie, X.X.; Han, J.W.; Guo, L.; Xia, G.S. Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3735–3756. [Google Scholar] [CrossRef]
  11. Kwan, C.; Gribben, D.; Ayhan, B.; Bernabe, S.; Plaza, A.; Selva, M. Improving Land Cover Classification Using Extended Multi-Attribute Profiles (EMAP) Enhanced Color, Near Infrared, and LiDAR Data. Remote Sens. 2020, 12, 1392. [Google Scholar] [CrossRef]
  12. Shao, Z.F.; Zhou, W.X.; Deng, X.Q.; Zhang, M.D.; Cheng, Q.M. Multilabel Remote Sensing Image Retrieval Based on Fully Convolutional Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 318–328. [Google Scholar] [CrossRef]
  13. Ayhan, B.; Kwan, C. Tree, Shrub, and Grass Classification Using Only RGB Images. Remote Sens. 2020, 12, 1333. [Google Scholar] [CrossRef] [Green Version]
  14. Zhu, X.X.; Tuia, D.; Mou, L.C.; Xia, G.S.; Zhang, L.P.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef] [Green Version]
  15. Chen, J.; Yuan, Z.Y.; Peng, J.; Chen, L.; Huang, H.Z.; Zhu, J.W.; Liu, Y.; Li, H.F. DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1194–1206. [Google Scholar] [CrossRef]
  16. Hong, D.F.; Gao, L.R.; Hang, R.L.; Zhang, B.; Chanussot, J. Deep EncoderDecoder Networks for Classification of Hyperspectral and LiDAR Data. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5500205. [Google Scholar] [CrossRef]
  17. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  18. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  19. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  20. Chen, L.C.E.; Zhu, Y.K.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 833–851. [Google Scholar]
  21. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  22. Zhao, H.S.; Shi, J.P.; Qi, X.J.; Wang, X.G.; Jia, J.Y. Pyramid Scene Parsing Network. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
23. Zhou, Z.W.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J.M. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the 4th International Workshop on Deep Learning in Medical Image Analysis (DLMIA)/8th International Workshop on Multimodal Learning for Clinical Decision Support (ML-CDS), Granada, Spain, 20 September 2018; pp. 3–11.
24. Zheng, Z.; Zhong, Y.F.; Wang, J.J.; Ma, A.L. Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4095–4104.
25. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
26. John, D.; Zhang, C. An attention-based U-Net for detecting deforestation within satellite sensor imagery. Int. J. Appl. Earth Obs. Geoinf. 2022, 107, 102685.
27. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203.
28. Li, R.; Zheng, S.Y.; Duan, C.X.; Wang, L.B.; Zhang, C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo-Spat. Inf. Sci. 2022, 25, 278–294.
29. Priyanka; Sravya, N.; Lal, S.; Nalini, J.; Reddy, C.S.; Dell'Acqua, F. DIResUNet: Architecture for multiclass semantic segmentation of high resolution remote sensing imagery data. Appl. Intell. 2022, 52, 15462–15482.
30. Sravya, N.; Priyanka; Lal, S.; Nalini, J.; Reddy, C.S.; Dell'Acqua, F. DPPNet: An Efficient and Robust Deep Learning Network for Land Cover Segmentation From High-Resolution Satellite Images. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 128–139.
31. Wang, J.Q.; Chen, K.; Xu, R.; Liu, Z.W.; Loy, C.C.; Lin, D.H. CARAFE: Content-Aware ReAssembly of Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
32. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
34. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
35. Woo, S.H.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
36. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
37. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
38. Wang, L. Boost Precision Agriculture with Unmanned Aerial Vehicle Remote Sensing and Edge Intelligence: A Survey. Remote Sens. 2021, 13, 4387.
39. Abdullahi, H.S.; Sheriff, R.E.; Mahieddine, F. Convolution neural network in precision agriculture for plant image recognition and classification. In Proceedings of the 2017 7th International Conference on Innovative Computing Technology (INTECH), Luton, UK, 16–18 August 2017.
40. Fawakherji, M.; Youssef, A.; Bloisi, D.D.; Pretto, A.; Nardi, D. Crop and Weeds Classification for Precision Agriculture Using Context-Independent Pixel-Wise Segmentation. In Proceedings of the 2019 Third IEEE International Conference on Robotic Computing, Naples, Italy, 25–27 February 2019.
41. Beyaz, A.; Koc, D.G. Meta-Learning-Based Prediction of Different Corn Cultivars from Color Feature Extraction. Tarim Bilim. Derg. 2021, 27, 32–41.
42. Beyaz, A.; Özkaya, M.T. Canopy analysis and thermographic abnormalities determination possibilities of olive trees by using data mining algorithms. Not. Bot. Horti Agrobot. Cluj-Napoca 2021, 49, 12139.
Figure 1. The structure of CA-UNet.
Figure 2. The structure of the residual block.
Figure 3. The structure of coordinate attention.
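As a reading aid for Figure 3, the following is a minimal PyTorch sketch of a coordinate attention block in the spirit of Hou et al. [36]; the reduction ratio, normalization, and activation choices are illustrative assumptions, not the exact configuration used in CA-UNet's encoder.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention: direction-aware pooling followed by
    per-height and per-width attention weights that rescale the input."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.size()
        # Aggregate features along the width and height directions separately.
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)                        # (n, c, h+w, 1)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                          # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))      # (n, c, 1, w)
        # Broadcasting applies position-sensitive attention along both axes.
        return x * a_h * a_w
```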
Figure 4. The overall framework of CARAFE.
Figure 5. Structure of the Attention Fusion Module.
Figure 6. Diagram of the efficient channel attention (ECA) module, where the kernel size k is adaptively determined via a mapping of the channel dimension C.
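The adaptive kernel size mentioned in the caption follows the channel-to-kernel-size rule proposed in ECA-Net [37]. A minimal sketch of that mapping is given below, assuming the commonly used defaults gamma = 2 and b = 1 rather than values confirmed by the authors.

```python
import math

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive 1D convolution kernel size k = psi(C) from ECA-Net."""
    t = int(abs((math.log2(channels) + b) / gamma))
    # Force the kernel size to the nearest odd value so the convolution stays centered.
    return t if t % 2 == 1 else t + 1

# Example: 64 channels -> k = 3, 256 channels -> k = 5, 512 channels -> k = 5.
```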
Figure 7. The labels and images in the WHDLD dataset.
Figure 8. High-resolution remote sensing images of Lushi County.
Figure 9. The labels and images in the Lushi dataset.
Figure 10. Images of the segmentation results obtained on the Lushi dataset. (a) Image; (b) Label; (c) Result of UNet; (d) Result of CA-UNet.
Figure 11. Images of the segmentation results obtained on the WHDLD dataset. (a) Image; (b) Label; (c) Result of UNet; (d) Result of CA-UNet.
Table 1. Results on the WHDLD dataset.

Class        IOU     MPA     P       R
road         85.38   89.51   92.49   91.74
vegetation   56.59   71.62   90.87   73.75
water        95.35   94.76   97.95   97.29
pavement     63.75   79.36   79.13   76.64
building     74.96   85.03   84.47   86.94
bare soil    66.85   78.48   79.65   80.62

P and R stand for precision and recall, respectively; all values are reported as percentages (%).
Table 2. Results on the Lushi County dataset.

Class      IOU     MPA     P       R
road       68.26   75.51   76.40   70.91
forest     84.31   89.92   90.53   87.67
water      95.14   97.77   98.02   98.13
farmland   66.30   79.48   78.05   77.10
building   50.11   61.56   64.52   67.75

P and R stand for precision and recall, respectively; all values are reported as percentages (%).
Table 3. Confusion matrix obtained from UNet's experiments on the WHDLD dataset.

              bg    Vegetation   Water       Road        Pavement     Building    Bare Soil
vegetation    445   2,699,432    10,511      327,809     362,515      18,029      1187
water         294   6389         1,156,906   69,189      143,168      14,691      3438
road          925   364,557      124,375     2,142,061   602,204      100,887     3570
pavement      319   326,270      127,097     793,271     14,211,726   136,487     95,768
building      78    36,925       23,157      123,030     266,001      886,706     20,780
bare soil     48    1487         5639        2402        296,287      21,735      6,846,989

bg is shorthand for background; rows correspond to ground-truth classes and columns to predicted classes.
Table 4. Confusion matrix obtained from CA-UNet's experiments on the WHDLD dataset.

              bg    Vegetation   Water       Road        Pavement     Building    Bare Soil
vegetation    229   2,757,283    8516        338,425     292,811      21,946      718
water         74    6153         1,212,076   66,220      92,918       13,869      2765
road          297   313,049      64,541      2,462,053   430,014      63,720      4905
pavement      161   300,475      117,836     521,161     14,484,436   158,983     117,886
building      119   34,083       18,795      63,993      180,175      1,039,717   19,795
bare soil     13    858          3107        2071        172,749      15,776      6,980,013

bg is shorthand for background; rows correspond to ground-truth classes and columns to predicted classes.
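As an aid to reading Tables 3 and 4 alongside Tables 1 and 2, the per-class metrics can be derived directly from such a confusion matrix. The sketch below is illustrative rather than the evaluation code used in the paper, and it assumes a square matrix over the evaluated classes with rows as ground truth and columns as predictions.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """cm[i, j]: number of pixels of ground-truth class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # pixels wrongly assigned to each class
    fn = cm.sum(axis=1) - tp   # pixels of each class assigned elsewhere
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return iou, precision, recall

# mIOU is then the mean of the per-class IoU values over the evaluated classes.
```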
Table 5. Results obtained on the WHDLD dataset.

Method       mIOU    MPA     P       R
UNet         59.24   71.57   73.51   70.73
UNet++       60.81   72.96   73.40   71.25
DeepLabv3+   61.01   72.49   73.87   72.01
SegNet       55.67   67.92   69.70   67.25
PSPNet       58.46   72.62   73.19   70.57
SegFormer    61.50   73.02   74.25   71.36
MSFCN        60.37   72.54   73.62   72.18
DPPNet       60.92   72.81   74.03   72.66
ResUNet      55.35   67.57   69.41   68.39
CA-UNet      63.27   74.86   76.57   74.95
Table 6. Results obtained on the Lushi dataset.

Method       mIOU    MPA     P       R
UNet         67.19   81.49   84.91   81.42
UNet++       69.85   82.72   82.63   71.25
DeepLabv3+   68.23   81.60   83.37   81.28
SegNet       66.47   79.29   80.70   78.25
PSPNet       67.64   82.38   82.27   81.65
SegFormer    70.17   81.11   83.55   80.93
MSFCN        70.31   81.90   84.24   80.77
DPPNet       67.98   79.12   81.94   78.41
ResUNet      67.72   80.46   82.62   82.54
CA-UNet      72.83   84.72   86.79   84.95
Table 7. Results of ablation experiments.

                                  Exp1           Exp2           Exp3           Exp4
Coordinate Attention + Residual   ×              ✓              ✓              ✓
CARAFE                            ×              ×              ✓              ✓
Attention Fusion Module           ×              ×              ×              ✓
mIOU (%, WHDLD/Lushi)             59.24/67.19    60.58/69.46    62.72/71.35    63.27/72.83
Parameters (M)                    19.43          19.85          18.09          18.53