High-Resolution Remote Sensing Image Segmentation Algorithm Based on Improved Feature Extraction and Hybrid Attention Mechanism

Abstract: Segmentation of high-resolution remote sensing images is one of the most active topics in deep learning. Compared to ordinary images, high-resolution remote sensing images have higher intra-class diversity and lower inter-class separability. Additionally, the objects in these images are complex and small. When classical segmentation networks are applied to remote sensing images, they suffer from problems such as inaccurate segmentation of object edges, inconsistent segmentation of different types of objects, low detection accuracy, and a high false detection rate. This paper proposes a new hybrid attention model (S-CA), a new coordinate efficient channel attention module (C-ECA), and a new small-target feature extraction network (S-FE). The S-CA model enhances important spatial and channel features in shallow layers, allowing for more detailed feature extraction. The C-ECA model uses convolutional layers to capture complex dependencies between channels, thereby better capturing feature information at each position and reducing redundancy in feature channels. The S-FE network captures the local feature information of different targets more effectively. It enhances the recognition and classification of various targets and improves the detection rate of small targets. The algorithm is used for segmentation in high-resolution remote sensing images. Experiments were conducted on the public dataset GID-15, which is based on Gaofen-2 satellite remote sensing images. The experimental results demonstrate that the improved DeepLabV3+ segmentation algorithm for remote sensing images achieved a mean intersection over union (mIoU), mean pixel accuracy (mPA), and mean precision (mP) of 91.6%, 96.1%, and 95.5%, respectively. The improved algorithm is more effective than current mainstream segmentation networks.


Introduction
In recent years, with the swift evolution of artificial intelligence technology and machine vision, deep learning has found extensive applications in image segmentation. In the interpretation of remote sensing images, semantic segmentation (SS) is a crucial technique that is widely used in fields such as land cover mapping, land cover change detection, urban design and planning, and environmental monitoring. Semantic segmentation assigns each pixel in an image to a class, such as person, car, or tree, in order to identify and distinguish different objects or parts of the image [1].
High-resolution remote sensing images are a type of remote sensing data with high spatial resolution, which can provide more detailed and accurate surface information. These images are usually acquired through remote sensing platforms such as aerial or satellite systems and have a wide range of applications. The resolution of high-resolution remote sensing images typically ranges from 1 to 10 m, which is more detailed than traditional remote sensing imagery. It can be obtained using sensing technologies such as ... Remote sensing images offer extensive details of ground objects, effectively showcasing their spatial structure and textural characteristics. The wealth of contextual information they provide serves as valuable data support for semantic segmentation. However, the complexity and diversity of high-resolution remote sensing image data, along with their rich feature set and immense scale, make segmentation of remote sensing images particularly difficult [3,4].
Although existing methods continue to improve target segmentation in remote sensing images, they still have shortcomings. On the one hand, they do not integrate global pixel information, lose information easily, and overlook the contextual information between categories. On the other hand, target segmentation is made more challenging by the complex and changeable scenes in remote sensing images and by variations in lighting and imaging angle. In addition, similar categories are easily confused. These factors hinder existing methods from accurately acquiring the category information of target boundary pixels, resulting in blurred boundaries.
Electronics 2023, 12, 3660
To tackle these challenges, we propose an improved network. The network employs an encoder-decoder architecture to progressively recover spatial information. We add the C-ECA attention model in the decoder stage. The C-ECA attention model can not only calculate channel attention maps but can also learn complex dependencies between channels by using a convolution layer. To integrate contextual information from different scales, the features extracted by the encoder are processed by two modules: the S-CA hybrid attention module and the Atrous Spatial Pyramid Pooling (ASPP) module. Furthermore, the shallow features obtained through the encoder also pass through the S-FE network, which further refines the model's ability to recognize local small targets.
The proposed network is comprehensive and integrates deep semantic information, global context information, and local feature information.
This paper makes the following main contributions: (1) To overcome the shortage of shallow feature information in high-resolution remote sensing images, a hybrid attention module (S-CA) is proposed. The S-CA model combines channel attention (CA) and spatial attention (SA), so it can process spatial and channel features simultaneously. Different positions of the input features are assigned weights so that important spatial position information and channel information are strengthened, which makes the model pay more attention to these positions and improves the expressive ability of the features.
(2) To address the low recognition accuracy and high false detection rate of local small targets in high-resolution remote sensing images, we present a small-target feature extraction network (S-FE) based on a feature extraction algorithm. The shallow feature maps are divided into multiple blocks, where each block represents a local region in the image, and features are extracted from each block. This better captures the local feature information of different objects and makes the features within the model easier to analyze and understand. It improves the recognition and classification of different ground objects and increases small-target segmentation accuracy.
(3) To address the many complex fused features in high-resolution remote sensing images, an improved coordinate efficient channel attention (C-ECA) module based on the convolutional block attention module (CBAM) is proposed. The C-ECA model strengthens important channels and suppresses unimportant channels by learning the weight of each channel. It adaptively adjusts the weights of different channels according to their importance in different image regions to enhance the expressive ability of the entire feature map. It helps the model acquire more image details and features and improves the accuracy and robustness of the model.

Related Work
Traditional image segmentation usually relies on shallow features such as texture, pixel values, colors, and geometric forms. These methods include the threshold method, region method, and edge detection method. Traditional methods are algorithmically simple and fast but do not extract high-level semantic features, cannot meet high-precision demands, and generalize poorly [5–7]. Before deep learning became popular, machine learning algorithms were applied to the semantic segmentation of remote sensing images. In 2004, Mitra et al. employed an enhanced support vector machine (SVM) algorithm to perform supervised pixel classification for distinguishing different types of land cover [8]. In 2005, Poggi et al. implemented a tree-structured Markov random field (MRF) algorithm for the supervised segmentation of remote sensing images [9]. In 2015, Zhang et al. introduced an approach to SAR image segmentation [10]. Their improved conditional random field (CRF) approach allowed for the extraction of spatial structures of varying scales within the image context, leading to more precise detection of image edges. In 2017, Sun et al. classified pixels in remote sensing images using an ensemble learning method in combination with random forest. Additionally, they employed an improved CRF technique to further refine the classification results [11].
Convolutional neural networks (CNNs) have found extensive applications in the field of computer vision, and their performance in classification tasks is far better than that of traditional methods [12]. Long et al. made a significant contribution to image semantic segmentation by introducing the fully convolutional network (FCN) at CVPR in 2015 [13]. This approach achieved breakthrough results compared to traditional algorithms, opening the door to segmentation based on deep learning. In the same year, Ronneberger et al. put forward the U-Net model [14], a fully symmetric encoder-decoder network for the semantic segmentation of medical images. In this model, spatial information and image semantics are obtained by the down-sampling of the encoder, and the resolution of the feature map is recovered by the up-sampling of the decoder. Detailed information of the image is extracted by cross-layer fusion of the feature maps. Although it performs well in the medical field, it is not suitable for semantic prediction of indoor and outdoor scenes.
In 2016, the DeepLabV2 model was proposed based on the DeepLabV1 network [15]. It used atrous (hole) convolution in place of some pooling operations for feature extraction, introduced the atrous spatial pyramid pooling (ASPP) module for multi-scale feature extraction, and used a conditional random field (CRF) to refine boundary details. In 2017, DeepLabV3 improved the ASPP module of the V2 network to form an end-to-end network structure and removed the CRF boundary refinement module [16].
In 2017, a team led by Zhao developed the pyramid scene parsing network (PSPNet), which is designed to achieve highly accurate image semantic segmentation. It makes full use of global context information while keeping the details of the image. Through its pyramid pooling module, PSPNet obtains semantic information at different scales, which helps the model better understand the overall structure and semantics of images [17].
DeepLabV3+, proposed in 2018, uses the V3 network as an encoder and builds an encoder-decoder model with an ASPP module by adding a decoder with a simple structure. It achieves better segmentation results, but problems of intra-class misidentification and rough boundary prediction remain [18,19].
In 2019, Microsoft Research Asia proposed the high-resolution network (HRNet) [20]. It maintains a high-resolution representation through a unique parallel structure, and its effectiveness was verified in human pose estimation. However, the parallel network greatly increases the model's complexity while improving the deep network's fitting capacity. In the same year, Wang et al. combined a multi-connection ResNet model with a category-specific attention model, which extracts features beneficial to the segmentation targets and thus improves the segmentation accuracy [21]. In 2020, Du et al. adopted DeepLabV3+ and OBIA technology [22], which can divide the image into objects at different levels and further improve the segmentation accuracy. In the same year, Li's team integrated multiple attention mechanisms into the U-Net architecture for segmenting remote sensing images [23]. Zeng and his colleagues incorporated a feature cross-attention module into DeepLabV3+, which fuses features of different scales and levels, thus improving and refining the segmentation results [24]. In 2021, Liu and colleagues introduced a segmentation framework for high-resolution remote sensing images. This framework integrates the attention mechanism and adaptive weighting, focusing on two key modules: the adaptive multi-scale module (AMSM) and the adaptive fusion module (AFM) [25]. It uses adaptive weighting to fuse various feature information, which further improves the accuracy and robustness of segmentation. After that, a team led by Tian Zhi, Huang Tong, and He Tong applied the transformer structure to semantic segmentation and proposed the SegFormer network [26]. Through experiments, they demonstrated that SegFormer is superior in both accuracy and computational efficiency. By capturing global context and long-range dependencies, SegFormer improves the precision of segmentation results. This research provides new ideas and paradigms for efficient visual segmentation.

Materials and Methods
This research is based on DeepLabV3+, which was selected as the base model. Xception is the backbone network of the DeepLabV3+ model. DeepLabV3+ primarily uses atrous (hole) convolution and multi-scale pooling for semantic segmentation. Atrous convolution expands the receptive field to capture contextual information better, while multi-scale pooling extracts feature information at different scales to adapt to targets of various sizes. Additionally, DeepLabV3+ not only introduces the encoder-decoder structure but also adopts the ASPP module, which extracts richer feature information and can operate at different resolutions, further improving the segmentation accuracy. The Xception model is a depthwise separable convolution (DSC) network proposed by Chollet. It is based on Inception v3, replaces the Inception modules with depthwise separable convolutions, and adds ResNet-style skip connections. The DSC structure of Xception helps the DeepLabV3+ model maintain high accuracy while significantly reducing the number of parameters and improving computational efficiency [27]. This enables the DeepLabV3+ model to better understand the semantic information of images and generate high-quality semantic segmentation results. The DeepLabV3+ model is shown in Figure 2.
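To make the parameter saving of depthwise separable convolution concrete, the following minimal sketch compares the weight counts of a standard convolution and of its depthwise separable counterpart. The layer sizes (a 3 × 3 kernel with 256 input and 256 output channels) and the function names are illustrative assumptions, not values taken from Xception or from this paper, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution followed by a 1 x 1
    pointwise convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out


# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 256, 256
std = standard_conv_params(k, c_in, c_out)
dsc = depthwise_separable_params(k, c_in, c_out)
print(std, dsc, round(std / dsc, 2))
```

For this hypothetical layer the standard convolution has 589,824 weights and the separable version 67,840, roughly 8.7 times fewer.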

Because high-resolution remote sensing images are very large, they contain a great deal of pixel information. If an image is fed directly into the network for training, the network will compress it multiple times, which causes some details to be lost. Remote sensing images also contain a large amount of context information. If the overlap between image blocks is too small, the semantic connection between the segmented image blocks is weak, which leads to the loss of context information of ground objects [28]. Following previous studies [29–31], we set the input size of our model's image data to 512 × 512. On the one hand, increasing the input image size raises the model's computational and memory cost, which slows down its processing speed. On the other hand, if the input image is too small, the number of training images increases significantly, adversely affecting training efficiency. Therefore, we divided 10 images with a pixel size of 6800 × 7200 into 7280 images, each with a pixel size of 512 × 512, using a sliding step size of 256 pixels. Since there is a lot of small object information in high-resolution remote sensing images, it is difficult to identify these objects in conventional training.
Therefore, the S-FE module is used to split the feature map into blocks, and every block is then restored to the size of the original feature map. In this way, small targets are enlarged, and feature extraction can be carried out to obtain more image features of small targets. Before feature extraction, the S-CA hybrid attention mechanism module is used to make the features more precise and expressive. The S-CA model emphasizes important information in the image and suppresses noise and irrelevant information, enhancing the generalization ability and effectiveness of the model. In addition, the C-ECA attention mechanism module is added after feature fusion. The C-ECA model screens the numerous and complex fused features to reduce the complexity of the model, making the model pay more attention to the important spatial and channel information in the image. Considering features at different scales at the same time enables the model to learn and use the information within the image more effectively. The improved model is shown in Figure 3.
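The tiling figures quoted earlier in this section (10 images of 6800 × 7200 pixels, 512 × 512 tiles, a stride of 256 pixels, and 7280 resulting tiles) can be checked with a short calculation. The sketch below assumes that the last window along each axis is padded or shifted so that the image border is fully covered. The paper does not state how borders are handled, but this assumption reproduces the reported count. The function names are hypothetical.

```python
import math


def tiles_per_axis(length, tile=512, stride=256):
    """Window positions needed to cover `length` pixels, assuming the
    last window is padded or shifted to reach the border (assumption)."""
    return math.ceil((length - tile) / stride) + 1


def count_tiles(height, width, tile=512, stride=256):
    """Number of tiles cut from one image of the given size."""
    return tiles_per_axis(height, tile, stride) * tiles_per_axis(width, tile, stride)


per_image = count_tiles(6800, 7200)
print(per_image, 10 * per_image)
```

Under this assumption each image yields 26 × 28 = 728 tiles, so 10 images yield 7280 tiles, matching the number given in the text. Discarding the incomplete border windows instead would give 25 × 27 = 675 tiles per image.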
The improved image segmentation algorithm is implemented as a combination of three parts: the DeepLabV3+ network model, the small-target feature extraction network, and the improved attention modules S-CA and C-ECA. The network structure of the algorithm is depicted in Figure 3.
The specific steps for implementing the network are outlined below: (1) The shallow features obtained from the image through the Xception backbone are divided into three inputs. In Branch 1, the extracted features pass through the S-CA hybrid attention mechanism module and then through the ASPP module. The S-CA module weights the spatial structure and channel information of the feature map to obtain important spatial and channel information, which improves the accuracy and generalization of convolutional neural networks. The processed feature map is fed into the ASPP module, which uses convolutions with different dilation rates to perform multi-scale feature fusion. This process extracts rich contextual information and yields effective high-level features.
(2) Branch 2 uses the S-CA hybrid attention mechanism module to improve the accuracy and generalization capacity of the CNN. It makes the important information in the shallow features more prominent. Branch 3 uses the S-FE module to enrich the feature information of local small targets and improve the recognition accuracy of small targets. After that, the deep features obtained by Branch 1, the shallow features obtained by Branch 2, and the small-target features obtained by Branch 3 are superimposed and fused.
(3) The fused feature map passes through the C-ECA attention mechanism module, which adaptively adjusts the weights of different channels. After 3 × 3 convolution and up-sampling, more image details and features are captured and gradually refined, the spatial information is restored, and finally, the segmentation result map is obtained.
While introducing the attention mechanisms, the proposed algorithm fuses shallow features, deep features, and local small-target features. This gives the fused feature map richer global context information and improves the accuracy and robustness of target detection.

Spatial-Channel Attention (S-CA)
In the segmentation of remote sensing images, it is necessary to identify the types and boundaries of various ground objects. Because remote sensing images are high in resolution, complex, and large in scale, their semantic segmentation is particularly challenging. Traditional semantic segmentation models perform poorly in the presence of high intra-class diversity and low inter-class separability. To solve this problem, this paper presents the spatial-channel attention (S-CA) hybrid attention mechanism.
Spatial attention (SA) weights spatial information and emphasizes important spatial positions in the image. A spatial attention map is generated through convolution and sigmoid functions, and each pixel position in the input feature map is weighted. Attention is focused on valuable areas in the image, such as object edges, which avoids paying too much attention to noise and background and makes important areas more prominent [32].
Channel attention (CA) weights the channel information and emphasizes the important channel features in the image. A channel attention vector is generated through global average pooling and a multi-layer perceptron, and each channel in the input feature map is weighted to make the important channels more prominent [33], improving the expressive ability of the features.
Therefore, by combining the SA and CA models, we can effectively use both spatial and channel features in remote sensing images and thereby increase the precision of semantic segmentation. The S-CA hybrid attention mechanism connects SA and CA in parallel. The two components are applied independently to the input features, and their outputs are fused to yield the final feature representation. The structure is shown in Figure 4.
The input feature map F ∈ R^{C×H×W} generates a two-dimensional spatial attention feature map M_S ∈ R^{1×H×W} via the SA mechanism, as shown in Formula (1). The input feature map F ∈ R^{C×H×W} generates a one-dimensional channel attention feature map M_C ∈ R^{C×1×1} through the CA mechanism, as shown in Formula (2). The M_S generated by the SA mechanism and the M_C generated by the CA mechanism are each multiplied by the input feature map F to obtain F_1 and F_2, respectively, as shown in Formula (3). The outputs F_1 and F_2 of the element-wise multiplication are superimposed and fused, and the fused feature map is the output of the S-CA mechanism, as shown in Formula (4).
The S-CA hybrid attention mechanism can process spatial and channel information simultaneously with low computational complexity, and the channel and spatial information it obtains is more comprehensive and independent. It can capture key information from multiple angles, thus improving the feature expression ability and generalization ability.
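The parallel structure described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the learned convolution of the SA branch is replaced by a per-pixel mean over channels, the learned multi-layer perceptron of the CA branch is replaced by the raw global average of each channel, and "superimposed and fused" is read as element-wise addition. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def s_ca(feat):
    """Parallel spatial and channel attention on a C x H x W nested list.

    Returns a nested list of the same shape.
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    # M_S (H x W): sigmoid of the per-pixel mean over channels.
    # Stand-in for the learned convolution (assumption).
    m_s = [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c)
            for j in range(w)] for i in range(h)]
    # M_C (length C): sigmoid of each channel's global average.
    # Stand-in for the learned multi-layer perceptron (assumption).
    m_c = [sigmoid(sum(map(sum, feat[k])) / (h * w)) for k in range(c)]
    # F_1 = M_S * F and F_2 = M_C * F, fused here by element-wise
    # addition (assumption about how the two outputs are combined).
    return [[[m_s[i][j] * feat[k][i][j] + m_c[k] * feat[k][i][j]
              for j in range(w)] for i in range(h)] for k in range(c)]


# Toy input: 2 channels of size 2 x 2, all ones.
feat = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
out = s_ca(feat)
print(len(out), len(out[0]), len(out[0][0]))
```

The output keeps the shape of the input, which is what allows the S-CA module to be placed in front of ASPP or the decoder without changing the surrounding layers.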

Small-Target Feature Extraction Network (S-FE)
The S-FE network divides the remote sensing image into several adjacent blocks with different semantics. During feature extraction, it obtains feature maps with different local information, which better captures the spatial feature information at different positions. Each block of the feature map is enlarged to the size of the map before splitting via an up-sampling operation, so that small targets in remote sensing images are enlarged into large targets. This improves the discriminative power of the network during feature extraction, as well as the accuracy and robustness of semantic segmentation of remote sensing images. By introducing the small-target feature extraction module, the DeepLabV3+ network model can better capture target feature information in local areas, thus paying more attention to small targets. The S-FE network improves the segmentation of remote sensing images and further improves segmentation accuracy. The specific cutting method is shown in Figure 5.
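The split-and-enlarge step described above can be illustrated on a single-channel feature map. The sketch below assumes a 2 × 2 grid of blocks and nearest-neighbor up-sampling. Both are illustrative choices: the actual cutting scheme is the one shown in Figure 5, and the interpolation method is not specified in the text. The function names are hypothetical.

```python
def split_blocks(fmap, rows=2, cols=2):
    """Split an H x W map (nested list) into rows * cols adjacent blocks,
    returned in row-major order. H and W must be divisible by the grid."""
    h, w = len(fmap), len(fmap[0])
    bh, bw = h // rows, w // cols
    return [[row[c * bw:(c + 1) * bw] for row in fmap[r * bh:(r + 1) * bh]]
            for r in range(rows) for c in range(cols)]


def upsample_nearest(block, fy, fx):
    """Enlarge a block by integer factors using nearest-neighbor repetition."""
    return [[v for v in row for _ in range(fx)]
            for row in block for _ in range(fy)]


# Toy 4 x 4 feature map with values 0..15.
fmap = [[4 * i + j for j in range(4)] for i in range(4)]
blocks = split_blocks(fmap)
enlarged = [upsample_nearest(b, 2, 2) for b in blocks]
print(blocks[0])
print(enlarged[0])
```

Each enlarged block has the same 4 × 4 size as the original map, so an object that occupied one cell of the original now covers four cells in its block.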

Traditional semantic segmentation algorithms for remote sensing images mainly use convolution to extract shallow features that capture the spatial semantic information of images. However, relative to the whole remote sensing image, such features capture only the information of some large targets and cannot focus attention on the features of local small targets. For a more accurate extraction of local target feature information, we introduce the small-target feature extraction network (S-FE) into the DeepLabV3+ network.
Most image segmentation models today use simple convolutional operations to obtain the shallow features of images. However, these simple convolution operations are not effective at extracting the middle-level feature information needed for remote sensing image segmentation. To better obtain the feature information of small-target details in remote sensing images, and inspired by the structures of Inception V3 and ResNet, we use 1 × 1 and 3 × 3 filters to extract feature information at various scales from small targets that would otherwise receive little attention. Additionally, we incorporate a residual block structure, which not only enhances the precision of feature extraction but also avoids the vanishing gradient problem. This ensures that the neural network can more effectively learn the characteristics of remote sensing images. The structure is shown in Figure 6.
Most of the image segmentation models today adopt simple convolutional operations to extract features to obtain the shallow features of images.However, these simple convolution operations cannot achieve an ideal effect in extracting the middle-level feature information of remote sensing image segmentation.In order to better obtain the feature information of small-target details in remote sensing images, inspired by the structure of Inception V3 and ResNet, 1 × 1 and 3 × 3 filters are used to extract small targets with low attention to extract feature information at various scales.Additionally, we incorporate a residual block structure, not only enhancing the precision of feature extraction but also averting the issue of gradient vanishing.This ensures that the neural network can more effectively acquire the characteristics of remote sensing images.The structure is shown in Figure 6. Figure 6 visually represents the S-FE network structure, where an initial convolution operation of 1 × 1 size is applied to the feature map.This operation reduces the channel count, algorithmic complexity, and model computation volume, effectively guarding against overfitting.At the same time, in the deep features branch, two dilated convolution filters with the size of 3 × 3 and the expansion convolution rates of 2 and 4 are connected in series.It expands the perceptual range of the network model, improves the perception ability of the network, and effectively preserves the feature information in remote sensing images.Then, the semantic information of the shallow features obtained from Branch 1 and the deep features extracted from Branch 2 are combined into a new feature map.The residual structure can improve the feature extraction accuracy and avoid the gradient's disappearance simultaneously.At the same time, through a global pooling operation, the dimension of the fused feature map is reduced, and the number of model parameters is also reduced.This endeavor augments the model's 
aptitude for training and overall generalization capabilities, and it is multiplied with the original remote sensing image features map.Through this feature extraction method, the feature information of local small objects can be extracted more effectively, which makes up for the shortcomings of traditional image segmentation methods in dealing with small objects and improves the accuracy and generalization ability.
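To make the effect of the serial dilated convolutions concrete, the following minimal sketch computes the receptive field of the deep-feature branch (3 × 3 kernels with dilation rates 2 and 4). It assumes stride 1 for both layers, which the text does not state, and the function names are illustrative only.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def stacked_receptive_field(layers: list[tuple[int, int]]) -> int:
    """Receptive field of stride-1 convolutions applied in series.

    Each layer is (kernel_size, dilation). Stride 1 is an assumption;
    the text does not state the stride of the S-FE convolutions.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel(kernel_size, dilation) - 1
    return field


# Deep branch as described: 3x3 with rate 2, then 3x3 with rate 4.
deep_branch = [(3, 2), (3, 4)]
plain_branch = [(3, 1), (3, 1)]  # same depth without dilation, for comparison

print(stacked_receptive_field(deep_branch))   # 13
print(stacked_receptive_field(plain_branch))  # 5
```

Under these assumptions, the two dilated layers see a 13 × 13 neighborhood, compared with 5 × 5 for two ordinary 3 × 3 layers, without adding any weights. A preceding 1 × 1 convolution does not change the receptive field.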

Coordinate Efficient Channel Attention Module (C-ECA)
A simple channel attention mechanism only makes the network attend to important channel information and pays little attention to the spatial coordinates of important features. The CBAM can capture and express semantic information at different levels by adaptively adjusting the spatial and channel weights of feature maps [34]. However, when dealing with the intricate feature information in fused feature maps, integrating the CBAM increases model complexity and still does not extract the crucial features efficiently. Inspired by the CBAM, we propose fusing coordinate attention with efficient channel attention, so that the network adaptively adjusts the channel and spatial coordinate information of the feature map while also reducing the redundancy of feature channels.
The coordinate attention (CA) algorithm has attracted much attention in image processing in recent years. It is a lightweight attention mechanism that automatically learns which areas of the image matter under the global coordinate system and weights them accordingly. It can effectively capture important areas and features of images or targets, thus improving the precision of identifying and locating objects in images [35].
Efficient channel attention (ECA) automatically selects important channels for feature learning, which yields a more effective feature representation and faster training. ECA also introduces a multi-scale attention mechanism, which can automatically select relevant information on a spatial scale and fuse multi-scale features to obtain a more accurate feature representation. ECA adopts a cross-channel interaction strategy that does not involve any dimensionality reduction, which avoids the potential negative effects of dimensionality reduction on channel attention learning [36].
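The cross-channel interaction without dimensionality reduction can be illustrated with a small pure-Python sketch: global average pooling yields one descriptor per channel, a shared 1D kernel slides across neighboring channels, and a sigmoid turns the result into per-channel weights. The nested-list tensor layout, the function names, and the fixed kernel values are assumptions for illustration; in a trained network the kernel is learned, and this is not the authors' implementation.

```python
import math


def global_average_pool(feature_map):
    """Reduce a C x H x W nested list to one mean value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def conv1d_same(values, kernel):
    """1D cross-correlation over the channel axis with zero padding.

    An odd kernel length keeps the output the same length as the input.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(values))
    ]


def eca(feature_map, kernel):
    """Reweight channels using local cross-channel interaction."""
    pooled = global_average_pool(feature_map)
    logits = conv1d_same(pooled, kernel)
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    scaled = [
        [[w * v for v in row] for row in channel]
        for w, channel in zip(weights, feature_map)
    ]
    return scaled, weights


# Toy input: 4 channels of 2 x 2; kernel values are illustrative, not learned.
x = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
out, weights = eca(x, kernel=[0.2, 0.5, 0.2])
print(len(weights), [round(w, 3) for w in weights])
```

Because the channel descriptor keeps its full length C throughout, no bottleneck layer is involved; each weight depends only on a channel and its immediate neighbors.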
The coordinate efficient channel attention (C-ECA) module not only calculates the channel attention map but also learns the complex dependencies between channels by using a convolution layer. It can better capture the feature information at each position, reduce the redundancy of feature channels, and improve the generalization and robustness of the model. Figure 7 illustrates the architecture of the C-ECA module.
The CA mechanism generates the attention feature map $X_C$ from the input feature map $X \in \mathbb{R}^{C \times H \times W}$, as shown in Formula (5). The input feature map $G \in \mathbb{R}^{C \times H \times W}$ ($X = G$) generates the attention feature map $G_C$ through the channel attention mechanism.
The attention feature map $X_C$ obtained by the CA mechanism is multiplied element-wise with the original feature map to obtain the feature $F'$. $F'$ is then used as the input of the ECA mechanism, and the result is used to obtain $F''$, as shown in Formula (6).
Through this element-wise multiplication, the C-ECA mechanism fuses the feature maps more compactly. It realizes feature interaction, better captures the interactions between channels at various positions, and reduces the redundancy of feature channels, thus improving the performance, generalization, and robustness of the network.

Experiments and Analysis
Sections 4.1 and 4.2 briefly introduce the experimental setup and the datasets used in the experiments, together with a short overview of the data preprocessing. Sections 4.3 and 4.4 then present the ablation experiments and the comparative experiments, respectively. Finally, Section 4.5 introduces the anti-interference experiment of the model.

Experimental Environment
Table 1 shows the specific configuration parameters of the experimental environment.

Experimental Datasets
The Gaofen Image Dataset (GID) has high intra-class diversity and low inter-class separability. Each Gaofen-2 satellite remote sensing image is 6800 × 7200 pixels and is labeled at the pixel level by experts in remote sensing interpretation [37]. GID-15 includes 15 categories: paddy fields, irrigated land, dry cropland, garden land, arbor forest, shrub land, natural meadow, artificial meadow, industrial land, urban residence, rural residence, traffic land, rivers, lakes, and ponds. Since its launch in 2014, GF-2 has been used in important applications such as land surveys, environmental monitoring, crop estimation, and construction planning. This experiment used the GID-15 dataset. Ten remote sensing images with 6800 × 7200 resolution were expanded into 7280 images with 512 × 512 resolution by sliding-window cropping and randomly divided into a training set (90%) and a verification set (10%). Training and verification were conducted under the same hyperparameters. After extensive training, we found that the model converged best after 400 epochs. The model used the stochastic gradient descent (SGD) optimizer, with an initial learning rate of 7 × 10⁻³ and a minimum learning rate of 7 × 10⁻⁵. This configuration yielded the best results in our observations.
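The tile count can be sanity-checked with a short sketch. The image size, tile size, image count, and 90/10 split come from the text; the stride of 256 pixels and the rule of shifting the last window back to the image border are assumptions, chosen here because together they reproduce the reported 7280 tiles. The paper does not state the stride, so this is one plausible reading rather than the authors' procedure.

```python
import math


def window_starts(length: int, window: int, stride: int) -> list[int]:
    """Start offsets of sliding windows that cover [0, length).

    The last window is shifted back so that it ends at the image border.
    """
    count = math.ceil((length - window) / stride) + 1
    return [min(i * stride, length - window) for i in range(count)]


def tiles_per_image(height: int, width: int, window: int, stride: int) -> int:
    """Number of window positions over a height x width image."""
    return len(window_starts(height, window, stride)) * len(
        window_starts(width, window, stride)
    )


# 6800 x 7200 and 512 come from the text; the stride of 256 is an assumption.
per_image = tiles_per_image(6800, 7200, window=512, stride=256)
total = 10 * per_image
n_val = total // 10          # 10% held out for verification
n_train = total - n_val      # 90% for training

print(per_image, total, n_train, n_val)  # 728 7280 6552 728
```

With these assumptions each image yields 26 × 28 = 728 tiles, so ten images give 7280 tiles, of which 6552 would be used for training and 728 for verification.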

Ablation Experiment
To verify the effectiveness of the S-FE network and the improved attention mechanism modules proposed in this paper, we conducted ablation experiments under the same experimental environment to evaluate the impact of different models on the segmentation results of high-resolution remote sensing images. DeepLabV3+ was selected as the benchmark model in the ablation experiment, and the input image resolution was set to 512 × 512. We obtained the ablation results by conducting 400 epochs of training. We report the results of the ablation experiment in Table 2.

Comparative Experiment
To demonstrate the effectiveness of the improved model, we compared it with several popular segmentation algorithms. The outcomes of this comparison are shown in Table 3 and Figure 8. The model was first compared with the classical FCN semantic segmentation algorithm and then with the PSPNet algorithm, the DeepLabV3 algorithm, and the DeepLabV3+ model. The improved DeepLabV3+ algorithm showed a clear improvement over the other algorithms.

Anti-Interference Experiment
High-resolution remote sensing images are often affected by factors such as water surface fluctuations, cloud cover, and meteorological disturbances. These factors may cause noise, stripes, and other interference in the images, which reduces the accuracy and robustness of the models. Anti-interference experiments make it possible to evaluate the performance of algorithms in complex environments, which helps improve the robustness and adaptability of models and makes the algorithms more suitable for real-world applications.
To verify the stability of the improved model compared with other algorithms, additive white Gaussian noise (AWGN) with sigma = 5, 10, and 15 was added to the test and verification sets of GID-15, and the models' resistance to interference was compared. Sigma represents the variance of the Gaussian noise: the larger the sigma value, the stronger the noise. We found that when sigma > 20, the image quality decreased significantly and the model could not correctly recognize the targets. Therefore, we considered the range of 0-20 as the effective interval and selected 0, 5, 10, and 15 as the observation points for the experimental comparison. The results are shown in Table 4 and Figure 9.
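The noise injection described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `add_awgn` is hypothetical, the image is a plain nested list of 8-bit grey values, and sigma is passed to the generator as a standard deviation in pixel units. If sigma denotes the variance, as the text states, its square root would be passed instead.

```python
import random


def add_awgn(image, sigma, seed=None):
    """Add zero-mean Gaussian noise to an 8-bit image given as nested lists.

    sigma is used as the standard deviation in pixel units (an assumption).
    Values are rounded and clipped to the valid range 0..255.
    """
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


# Flat grey test image; the seed is fixed only to make the run repeatable.
clean = [[128] * 64 for _ in range(64)]
for sigma in (5, 10, 15):
    noisy = add_awgn(clean, sigma, seed=0)
    diffs = [n - c for nr, cr in zip(noisy, clean) for n, c in zip(nr, cr)]
    rms = (sum(d * d for d in diffs) / len(diffs)) ** 0.5
    print(sigma, round(rms, 2))
```

The printed root-mean-square deviation should track the chosen sigma closely, which is a quick way to confirm that the intended noise level was actually applied before evaluating a model on the perturbed images.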

When adding AWGN = 5, the image clarity was not greatly affected, and the mIoU of the models decreased by only 8.8%, 8.6%, 3%, 2.1%, 3.6%, and 2.1%, respectively. Except for the FCN and PSPNet models, the models were hardly affected by noise. When AWGN was increased to 10, the performance of all models dropped significantly. However, because DeepLabV3, DeepLabV3+, SegFormer, and our model all use multi-scale feature extraction networks, which extract features in more detail, their mIoU decreased by only 21.8%, 15.4%, 16.5%, and 14.4%, respectively. Our model adds an S-FE network that extracts features for small objects multiple times, so it retains an advantage in small-object feature extraction, resulting in a smaller decrease in mIoU. When AWGN was increased to 15, the image noise became very noticeable. Even with multi-scale feature extraction networks, the models' classification ability decreased rapidly and the error rate increased under the influence of noise. The mIoU of the models decreased by 51.2%, 53.9%, 37.2%, 30.5%, 34.8%, and 27.8%, respectively. For our model, the S-CA and C-ECA attention modules enhanced its ability to recognize and locate important features. Compared with the other models, it showed a slower decline in classification ability and stronger resistance to interference within a certain range.

Results and Analysis
Sections 5.1 and 5.2 analyze the results of the ablation experiments and the comparative experiments, respectively. Section 5.3 presents visualization experiments for the proposed model, with an in-depth analysis and comparison of the segmentation results of the different models. Finally, Section 5.4 visualizes the anti-interference experiment introduced in Section 4.5 and analyzes the results in detail.

Analysis of Ablation Experiment Results
According to Table 2, compared with the basic DeepLabV3+ network model, adding the S-FE, S-CA, and C-ECA modules to the network increased the mIoU by 1.2%, 1.5%, and 1.7%, respectively. When we added S-FE and S-CA to the network at the same time, there was obvious growth, 3.1% and 3.4% higher than the basic model, respectively. When we added all three modules to the DeepLabV3+ network, its accuracy reached 91.5%, 4.2% higher than the basic model. This higher precision noticeably improved the segmentation of high-resolution remote sensing images.

Analysis of Comparative Test Results
The outcomes of mIoU, mPA, and mP of different image segmentation methods on the GID-15 dataset are presented in Table 3.
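For reference, the three metrics can be computed from a class confusion matrix as sketched below. The definitions used here are the common ones (per-class IoU, per-class pixel accuracy as recall, and per-class precision, each averaged over classes) and are assumed to match those used in the paper, which this section does not spell out. The function name and the toy counts are illustrative only.

```python
def segmentation_metrics(confusion):
    """Compute (mIoU, mPA, mP) from a square confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    Classes absent from both truth and prediction are skipped.
    """
    n = len(confusion)
    ious, recalls, precisions = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted otherwise
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true otherwise
        if tp + fn + fp == 0:
            continue
        ious.append(tp / (tp + fn + fp))
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)

    def mean(values):
        return sum(values) / len(values)

    return mean(ious), mean(recalls), mean(precisions)


# Toy 3-class example with made-up counts.
toy = [
    [50, 2, 3],
    [4, 40, 1],
    [1, 2, 30],
]
miou, mpa, mp = segmentation_metrics(toy)
print(f"mIoU={miou:.3f} mPA={mpa:.3f} mP={mp:.3f}")
```

Because IoU penalizes both false positives and false negatives in the same ratio, mIoU is never higher than mPA or mP for the same predictions, which is consistent with the ordering of the values reported for GID-15.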
The classification precision of DeepLabV3 was the worst of all networks. The DeepLabV3 network has a larger receptive field due to the introduction of the ASPP structure but neglects small objects. PSPNet, a relatively novel semantic segmentation model, holds certain advantages in high-resolution segmentation. Nevertheless, it still has considerable room for improvement in handling imbalanced categories and indistinct boundaries. FCN, as a pioneer of semantic segmentation, has some advantages in performance and accuracy. However, FCN can assign different category labels to different parts of the same object, so it does not preserve the shape and boundary of objects well. Although DeepLabV3+ combines multi-scale information and is suitable for different scenes and tasks, its classification accuracy for various ground objects is not ideal. SegFormer adopts the transformer architecture, which efficiently captures dependencies between pixels. This allows the model to strike a good balance between computational efficiency and accuracy. However, there is still room for improvement in recognizing small objects.
Our approach integrates the S-CA and C-ECA attention modules and the S-FE network. This gives the algorithm richer contextual information on remote sensing images and more detailed processing of important features, and it clearly improves the recognition accuracy of small targets. On the GID-15 dataset, the mIoU, mPA, and mP reached 91.6%, 96.1%, and 95.2%, respectively, achieving the best overall classification results. This shows that introducing the S-CA, C-ECA, and S-FE modules helped improve the precision of the model under higher intra-class diversity and lower inter-class separability, and in small-target recognition.

Model Visualization Analysis
The prediction results of FCN, PSPNet, DeepLabV3, DeepLabV3+, SegFormer, and our model on the GID-15 dataset are shown in Figure 10. To compare the segmentation performance of these networks from several angles, we randomly selected six representative scenes from the dataset, recorded as Scenes 1 to 6.
Scene 1 includes natural meadows and irrigated land and is used to evaluate the classification performance of different networks on larger features. Scene 2 is a complex environment with a large number of small targets and is used to assess the detection performance of different networks on small targets. Scene 3 contains over 90% irrigated land and small amounts of rural residence, paddy fields, and garden land, so its sample categories are unevenly distributed. This scene is used to evaluate the classification performance of different networks under class imbalance. Scene 4 includes various categories, such as industrial land, irrigated land, urban residence, rivers, and artificial meadow, and is used to evaluate the classification performance of the networks in complex scenes. In Scene 5, garden land intermingles with irrigated land, providing a setting to evaluate classification under low inter-class separability. Finally, Scene 6 is dominated by irrigated land and industrial land, with intricately annotated edges. This scene is used to assess how well the networks refine segmentation boundaries.
These six scenes allow the segmentation results of the different models to be compared from multiple perspectives. Figure 10 presents the original image, the labels, and the segmentation results of the various networks for each scene. The six scenes show that: (1) Although the different segmentation models all perform well on large targets, our model produces more detailed segmentation boundaries for large targets. (2) Our model is clearly more accurate than the other models on small targets, and the recognition rate of small targets is noticeably improved. (3) Compared with the other well-performing models, our model also improves on scenes with imbalanced samples, especially in recognizing minority categories. (4) Our model also segments well in complex environments and is particularly strong under low inter-class separability. Compared with the other models, the improved network brings clear gains in large-category classification, small-target detection, higher intra-class diversity, lower inter-class separability, and classification in complex environments.

Anti-Interference Experiment Visualization Analysis
We added AWGN (sigma = 5, 10, 15) to the GID-15 test and verification sets and randomly selected one image. The anti-interference segmentation results of FCN, PSPNet, DeepLabV3, DeepLabV3+, SegFormer, and our model are shown in Figure 11.

Figure 1. Imaging of high-resolution remote sensing images. Bright blue represents ponds, blue represents rivers, red represents industrial land, green represents paddy fields, and black represents the background.

Figure 8. (a) Statistical line chart of model comparison experiment. (b) Statistical bar chart of model comparison experiment.

Figure 9. Statistical chart of model anti-interference experiment.

Figure 10. The classification comparisons of various networks on the GID-15 dataset: (a) the original image, (b) the labels, (c) the FCN result, (d) the PSPNet result, (e) the DeepLabV3 result, (f) the DeepLabV3+ result, (g) the SegFormer result, and (h) our model result. The dotted circles mark the areas of interest for comparing the segmentation of the target.

Figure 11. The classification comparisons of networks under different AWGN: (1) the FCN result, (2) the PSPNet result, (3) the DeepLabV3 result, (4) the DeepLabV3+ result, (5) the SegFormer result, and (6) our model result. The dotted rectangles mark the areas of interest for comparing the segmentation of the target.
When AWGN = 5, PSPNet was the most disturbed, and the other models also had classification errors, while our model was almost unaffected. When AWGN = 10, except for the DeepLabV3+ model and our model, which had a small number of segmentation errors, the other models already had a large number of segmentation errors. When AWGN = 15, only our model and the SegFormer model could correctly distinguish most scenarios,
depicts the architecture of the S-CA hybrid attention mechanism model.

Table 3. Comparative experiments of different segmentation algorithms.

Table 4. mIoU changes of models under different AWGN.