Firstly, since the number of parameters in the backbone network largely determines the overall parameter size of the model, MobileNetV3 was used as the backbone for feature extraction to keep the overall network lightweight. Of the four effective feature layers obtained from the backbone, the final feature layer, Conv4, had a shape of (16, 16, 256) and was connected to an ASPP module with five parallel branches. To enhance the expressiveness of the feature map and facilitate the detection of targets at different scales, atrous convolution was used to fuse local and global features. Secondly, the reinforced feature layer output by the ASPP module was used for reverse progressive upsampling. Combined with the attention mechanism LNCA-Net, feature fusion and upsampling were performed step by step with the remaining three feature layers of the backbone to obtain a final feature layer in which all features were fused. Finally, this feature layer was used to classify each feature point, equivalent to a predictive classification of each pixel, forming the prediction output module.
3.1. Lightweight and Efficient Backbone Network
For deep learning algorithms designed for semantic segmentation of high-resolution remote sensing images, it is especially critical that the network itself be lightweight and efficient, given the memory and power constraints of the deployment device. How to reduce the computation of a network while maintaining accuracy has received widespread attention, and the MobileNet series performs well in this respect. MobileNetV3 [39], building on the first two generations, V1 [40] and V2 [41], utilizes the depthwise separable convolution of MobileNetV1 and combines the linear bottleneck and inverted residual structure of MobileNetV2 with a squeeze-and-excitation network (SE-Net) [42]. Since the overall computation of the neural network depends mainly on the number of parameters in the backbone, it is essential to choose a lightweight backbone. In this paper, the B-neck structure of MobileNetV3 was concatenated to form the backbone network, as shown in Figure 3, to improve the execution speed of the network.
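The parameter savings from depthwise separable convolution can be illustrated with a quick calculation; the channel counts below are illustrative, not taken from the network itself:

```python
# Parameter count of a standard 3x3 convolution vs. a depthwise separable
# convolution (depthwise 3x3 + pointwise 1x1), as used in the MobileNet family.
# Bias terms are omitted for simplicity.

def standard_conv_params(k, c_in, c_out):
    # Every output channel has a full k x k x c_in kernel.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1 x 1 convolution mixing channels.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 32, 64)        # 18432
dsc = depthwise_separable_params(3, 32, 64)  # 2336
print(std, dsc, round(std / dsc, 2))         # roughly 7.9x fewer parameters
```

The ratio approaches 1/c_out + 1/k² of the standard cost, which is why depthwise separable convolution dominates lightweight backbone design.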
The B-neck was composed of an inverted residual block: a 1 × 1 standard convolution first raised the dimension, features were then extracted by a depthwise separable convolution followed by the SE-Net attention mechanism, and finally a 1 × 1 standard convolution reduced the dimension and produced the output. This design lost less information when high-dimensional information passed through the activation function. In addition, the B-neck was divided into two structures according to the stride. The shortcut connection was only available when the stride was 1, i.e., when the input features had the same shape as the output features. We constructed the feature extraction backbone network in Figure 2 by combining these two kinds of B-neck.
In the feature extraction process, a preliminary feature map was first obtained by one standard convolution; each subsequent feature layer was then compressed and deepened by combining one stride-2 block with multiple stride-1 blocks. A total of four feature layers, namely Conv1, Conv2, Conv3, and Conv4, were obtained, with shapes of (128, 128, 32), (64, 64, 64), (32, 32, 128), and (16, 16, 256), respectively. These preliminary feature layers then needed to be processed by an enhanced feature extraction network to deepen the model's sampling capability for features.
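The four feature shapes above can be reproduced by simple shape bookkeeping, assuming a 256 × 256 input (implied by the quoted shapes) and exactly one stride-2 block per stage:

```python
# Spatial sizes of the four backbone feature layers: each stage applies one
# stride-2 block, halving height and width. The channel counts follow the
# shapes quoted in the text; the 256 x 256 input size is an assumption.

def backbone_shapes(input_size=256, channels=(32, 64, 128, 256)):
    shapes = []
    size = input_size
    for c in channels:
        size //= 2          # one stride-2 block per stage
        shapes.append((size, size, c))
    return shapes

print(backbone_shapes())
# → [(128, 128, 32), (64, 64, 64), (32, 32, 128), (16, 16, 256)]
```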
3.2. Enhanced Feature Extraction Network
The enhanced feature extraction network proposed in this paper aimed to increase the richness of the feature map and was composed of an ASPP module and a reverse progressive attention feature fusion network. The ASPP module evolved from the spatial pyramid pooling (SPP) module [43]. It samples features with convolution kernels at different scales, enabling accurate and efficient classification of regions at arbitrary scales. This fusion of local and global feature information can enhance the correlation between features in the spatial dimension. However, using pooling layers alone in SPP to enlarge the receptive field also decreases the resolution and leads to the loss of detailed information. To solve this problem, atrous convolution can be used instead of pooling layers to achieve a larger receptive field while reducing the loss of resolution [44].
The ASPP module shown in Figure 4 has five branches. The first three use 3 × 3 atrous convolutions with rates of 6, 12, and 18, respectively. The last two branches use a standard convolution with a 1 × 1 kernel, the difference being whether or not they pass through a global average pooling layer. The features of the five branches are then merged, and the number of channels is compressed by a 1 × 1 standard convolution to obtain Conv4_1 with the shape (16, 16, 256). A structure of parallel atrous convolutions with different rates is often used in semantic segmentation because it can enlarge the receptive field and capture multi-scale information without increasing the number of parameters. However, by the nature of atrous convolution, the sampled pixels are disjoint and independent of each other, lacking dependencies, which can cause local information loss and leave distant features uncorrelated. Therefore, the last two branches of the ASPP module proposed in this paper were designed to compensate for this drawback by enhancing the global and local information interaction through global average pooling and a standard convolution with a 1 × 1 kernel.
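The multi-scale receptive fields of the three atrous branches follow from the standard relation k_eff = k + (k − 1)(r − 1), where r is the dilation rate; a quick check for kernel size 3:

```python
# Effective kernel size of a 3 x 3 atrous (dilated) convolution at the three
# rates used in the ASPP branches: inserting r-1 zeros between kernel taps
# enlarges the receptive field without adding any parameters.

def effective_kernel(k, rate):
    return k + (k - 1) * (rate - 1)

for rate in (6, 12, 18):
    print(rate, effective_kernel(3, rate))
# rate 6 -> 13, rate 12 -> 25, rate 18 -> 37
```

All three branches keep the 9 weights of a plain 3 × 3 kernel while covering windows of 13, 25, and 37 pixels, which is what lets ASPP capture targets at different scales cheaply.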
High-resolution remote sensing images contain targets of different sizes. Although the feature layer Conv4 of the backbone carries high semantic information after processing by the ASPP module, its size is small; the larger feature layers, by comparison, still retain rich feature details. More feature layers therefore need to be combined to improve the accuracy of the results. However, blindly upsampling and concatenating multiple feature layers directly is flawed, because redundant background details interfere with detection.
It thus becomes necessary to use an attention mechanism to weight the features at all pixel locations at each scale. As shown in the RPA-Net of Figure 2, each backbone feature layer was attentionally enhanced and then combined with the upsampled previous feature layer, so that a gradual, reverse progressive decoding generated more features for semantic segmentation.
The core idea of the attention mechanism was to let the model learn to focus on the critical information and ignore the unimportant information, i.e., to learn a weight distribution from the relevant feature map and then apply the obtained weights to the original feature map. The weighting can be applied in the spatial domain, transforming the spatial information of the image to extract the critical data. It can also be applied in the channel domain, assigning each channel a weight that represents the channel's relevance to the necessary information; the greater the importance, the higher the weight.
In our previous work, we proposed the lightweight residual convolutional attention network (LRCA-Net) [45], a spatial-channel hybrid attention whose core idea is to improve the channel attention module of CBAM [46] by using 1D convolution instead of fully connected layers and adding a residual structure. This improvement raises performance considerably, but it does not improve the spatial attention module, shown in Figure 5.
The spatial attention module pools the input feature of shape C × H × W by maximum and average pooling, concatenates the two resulting maps into a channel pool of shape 2 × H × W, convolves it with a standard 7 × 7 convolutional layer, and applies a sigmoid to obtain the spatial attention map, as in Equation (1):

$$A_s(F') = \sigma\big(k^{7\times 7}([\mathrm{MaxPool}(F');\ \mathrm{AvgPool}(F')])\big), \quad (1)$$

where $A_s$ denotes the spatial attention module, $F'$ denotes the channel-refined feature, $\sigma$ denotes the sigmoid function, and $k^{7\times 7}$ represents a convolution with kernel size 7 × 7.
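A minimal NumPy sketch of the spatial attention of Equation (1), with a random placeholder kernel standing in for the learned 7 × 7 weights:

```python
import numpy as np

# CBAM-style spatial attention: channel-wise max and average pooling, a 7x7
# convolution over the resulting 2-channel map, and a sigmoid. The kernel is
# a random placeholder, not a trained parameter.

rng = np.random.default_rng(0)

def spatial_attention(feat, kernel):
    # feat: (C, H, W); kernel: (2, 7, 7)
    pooled = np.stack([feat.max(axis=0), feat.mean(axis=0)])  # (2, H, W)
    pad = 3  # 'same' padding for a 7x7 kernel
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = feat.shape[1:]
    out = np.empty((h, w))
    for i in range(h):          # naive convolution loop for clarity
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * kernel)
    return 1.0 / (1.0 + np.exp(-out))  # sigmoid -> attention map in (0, 1)

feat = rng.standard_normal((8, 16, 16))
attn = spatial_attention(feat, rng.standard_normal((2, 7, 7)) * 0.1)
print(attn.shape)  # (16, 16)
```

Each output value here depends only on a 7 × 7 neighborhood, which is exactly the locality limitation discussed next.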
Analysis shows that the drawback of such a spatial attention module is obvious: a single 7 × 7 convolution is limited by the size of its kernel, so it can only capture features of adjacent points and lacks remote interaction between any two more distant positions. How to use the attention structure in the spatial dimension to integrate global information effectively, by performing autocorrelation on the global feature map, thus became the direction of this improvement.
Non-local neural networks provide an autocorrelation matrix algorithm for capturing long-range dependencies [47]. Along this line, we improved the spatial attention module, as shown in Figure 6. First, three copies of the input feature map $F'$ of shape H × W × C were generated by 1 × 1 standard convolutions and reshaped into N × C, where N is equal to H × W, as in Equation (2):

$$R_i = \mathrm{reshape}\big(\mathrm{Conv2d}(F')\big), \quad i = 1, 2, 3, \quad (2)$$

where $\mathrm{Conv2d}$ represents a standard convolution operation with a 1 × 1 kernel, $\mathrm{reshape}$ is the reshaping function, and the shapes of $R_1$, $R_2$, and $R_3$ are N × C, where N is equal to H × W.
After transposing $R_3$ to obtain $R_3^{\mathsf{T}}$ and multiplying it with $R_2$, a softmax layer yielded the spatial attention map; this was multiplied with $R_1$ and reshaped to obtain $R_4$ of shape H × W × C, as in Equation (3). Finally, $R_4$ was added to the input feature map $F'$ to obtain the spatial-refined feature $F''$, as in Equation (4):

$$R_4 = \mathrm{reshape}\big(S(R_2 R_3^{\mathsf{T}})\, R_1\big), \quad (3)$$

$$F'' = F' + R_4, \quad (4)$$

where $R_3^{\mathsf{T}}$ represents the transpose of $R_3$, $S$ represents the softmax function, and $F''$ represents the spatial-refined feature.
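Equations (2)–(4) can be sketched in NumPy, modeling the 1 × 1 convolutions as per-pixel channel mixing with random placeholder weight matrices:

```python
import numpy as np

# Non-local spatial attention, Equations (2)-(4). A 1x1 convolution on an
# H x W x C map is equivalent to multiplying each pixel's channel vector by
# a C x C matrix; W1, W2, W3 below are random stand-ins for learned weights.

rng = np.random.default_rng(1)
H, W, C = 4, 4, 8
N = H * W

F = rng.standard_normal((H, W, C))                  # input feature F'
W1, W2, W3 = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

# Equation (2): 1x1 conv + reshape to (N, C)
R1 = (F @ W1).reshape(N, C)
R2 = (F @ W2).reshape(N, C)
R3 = (F @ W3).reshape(N, C)

# Equation (3): attention map S(R2 R3^T), applied to R1
logits = R2 @ R3.T                                   # (N, N) similarity
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax
R4 = (attn @ R1).reshape(H, W, C)

# Equation (4): residual connection gives the spatial-refined feature F''
F_refined = F + R4
print(F_refined.shape)  # (4, 4, 8)
```

Every row of `attn` weights all N positions, so each refined pixel aggregates information from the whole map rather than a fixed local window.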
In contrast to the previous spatial attention module, in which the standard convolution involved only a weighted sum of the pixel values around a given location, the improved autocorrelation matrix operation computes the value at a location as a weighted sum of the values at all locations. For the feature at a specific location, the features at all locations are aggregated and updated by a weighted sum, where the weights are determined by the similarity of the features at the corresponding pair of locations. Thus, non-local attention, which associates features between two pixels at any distance in the image, helps the network model accomplish the semantic segmentation task; an illustration of the dependencies of global context information compared with local information is shown in Figure 7.
By replacing the previous spatial attention module with the proposed one, the overall structure of the improved attention mechanism LNCA-Net was obtained, as shown in Figure 8. Since the design does not change the feature map size, LNCA-Net can be inserted into arbitrary network structures, and the progressive attentional feature fusion structure was designed in combination with this attention mechanism.
Converting an image from low to high resolution can introduce distortion. With the skip connection structure, the combined distortion was relatively negligible and more detailed information was retained. Such a progressive feature fusion structure can directly pass gradients, points, lines, and other inputs from the same-shape feature layer in the encoder, after attention enhancement, to the decoder, which is equivalent to adding more detailed information when judging the target region and is beneficial for obtaining more accurate segmentation results.