Dual Path Attention Net for Remote Sensing Semantic Image Segmentation

Abstract: Semantic segmentation plays an important role in being able to understand the content of remote sensing images. In recent years, deep learning methods based on Fully Convolutional Networks (FCNs) have proved to be effective for the semantic segmentation of remote sensing images. However, the rich information and complex content make the training of networks for segmentation challenging, and the available datasets are necessarily constrained. In this paper, we propose a Convolutional Neural Network (CNN) model called Dual Path Attention Network (DPA-Net) that has a simple modular structure and can be added to any segmentation model to enhance its ability to learn features. Two types of attention module are appended to the segmentation model, one focusing on spatial information and the other focusing upon the channel. Then, the outputs of these two attention modules are fused to further improve the network's ability to extract features, thus contributing to more precise segmentation results. Finally, data pre-processing and augmentation strategies are used to compensate for the small number of datasets and uneven distribution. The proposed network was tested on the Gaofen Image Dataset (GID). The results show that the network outperformed U-Net, PSP-Net, and DeepLab V3+ in terms of the mean IoU by 0.84%, 2.54%, and 1.32%, respectively.


Introduction
Semantic segmentation is a fundamental aspect of computer vision research. Its goal is to assign a category label to each pixel in an image. Together with other kinds of deep learning research, it plays an important role in the recognition of different types of land cover in remote sensing images [1][2][3]. Recognizing the information an image contains is a key part of remote sensing image interpretation. Semantic segmentation is widely used in land cover mapping and monitoring, urban classification analysis, tree species identification in forest management, etc. [4][5][6][7][8][9][10][11][12]. To accomplish it, land cover types need to be distinguished in terms of "same object, different spectrum", or "same spectrum, different object". For instance, "lake" and "river" are two different types of land cover, but in remote sensing, they can have a similar appearance. Places with a high density of buildings or a low density of buildings may still both be classified as urban residential areas. In addition, the boundaries between different types of land cover are intricate and irregular, which makes the remote sensing segmentation task even more difficult. Thus, discrimination between features at a pixel level is essential.
In recent years, the state-of-the-art in semantic segmentation networks has progressed enormously [13][14][15]. One way to solve the above issues is by using a recurrent neural network to capture long-range contextual information. This kind of network can achieve remarkable results. For instance, a directed acyclic graph recurrent neural network [16] can capture the rich contextual information present in local features. However, although this method is very effective, it depends heavily on long-term learning over large quantities of labeled data. Obtaining such a large number of remote sensing image segmentation labels is very difficult, so the method is of limited practical utility for the segmentation of remote sensing images.
Another effective way of tackling the issues described above is to use self-attention mechanisms. These are popular and simple to adapt to semantic segmentation tasks because of their varied and flexible structure [17][18][19][20][21][22]. Self-attention mechanisms focus on local features by generating weight feature maps and fusing them with downsampled feature maps. This may involve having one or more modules built upon a basic backbone, with each module focusing on things such as the channel or spatial information. However, downsampled feature maps can lose a lot of spatial information, and the direct capture of the original spatial information is currently not feasible. Yet, having very precise spatial information is crucial for the effective segmentation of remote sensing images.
To address the above issues, we propose here a novel self-attention mechanism model, called a Dual Path Attention Network (DPA-Net), which is designed for remote sensing semantic segmentation. It uses two attention modules: a total spatial attention module (TSAM) to capture spatial information and a channel attention module to capture the channel information separately. The two modules can easily be appended to other segmentation models such as PSP-Net [23]. At present, there are many methods for the efficient extraction of different kinds of feature information. However, the input of almost all spatial attention methods is the feature map after sampling. As mentioned above, compared with the original image, the downsampled feature map contains a lot less spatial information. Therefore, this kind of spatial attention is inevitably inefficient, as it is unable to fully utilize the spatial information in the data. Consequently, instead of the downsampled feature map, we changed the input of the spatial attention method to the original image. In the total spatial attention module, spatial information is captured from the original image according to the self-attention mechanism mentioned above. The output of the TSAM is a single-channel weight matrix. Each pixel of the output can be updated again by fusing according to the corresponding weight, with the weight itself being generated by the module. After being fused with the final feature map of DPA-Net, the TSAM will provide a weight for each pixel. During the training, the network pays higher attention to the areas with larger weights. This means that each pixel has its own focus in the network. For the channel attention module, the self-attention mechanism captures the channel information according to the channel maps. As with the total spatial attention module, it generates a weight factor. The feature maps are updated by integrating this weight factor.
Once the two modules have completed their operations, two feature maps are obtained that contain spatial information and channel information, respectively. Then, these two feature maps are aggregated to generate the final output.
It is worth emphasizing that although the proposed method is more effective than the original self-attention method, it does not significantly change the memory footprint. Overall, it solves the conventional problems associated with self-attention mechanisms in a straightforward way. First of all, the TSAM makes its calculations on the basis of the original image. When compared to downsampled feature maps, original remote sensing images contain more spatial information. Secondly, the output of the two modules acts on the last feature map in the model. Thus, the two modules can control the back propagation of the entire model. In addition, the simplicity of the module structure makes it easy for it to be used with any segmentation model. To verify the effectiveness of our method, we conducted experiments with U-Net, PSP-Net, and DeepLab V3+ [24,25] on the Gaofen Image Dataset (GID) [26]. It improved the mean IoU for each model by 0.84%, 2.54%, and 1.32%, respectively.
The main contributions of the paper can be summarized as follows:

• We propose a Dual Path Attention Network (DPA-Net) that uses a self-attention mechanism to enhance a network's ability to capture key local features in the semantic segmentation of remote sensing images.

• A total spatial attention module is used to extract pixel-level spatial information, and a channel attention module is proposed to focus on different features. After the dual path feature extraction has taken place, the performance of the semantic segmentation is significantly improved.

Methods
In this section, we first present the overall framework of our network; then, we introduce the two attention modules, which capture spatial and channel-related contextual information. The section concludes with a description of how the output from the two modules is aggregated to give the final output.

Overview
For regular semantic segmentation, the scene for segmentation will include a variety of objects of diverse scales with different lighting that are visible from different viewpoints. However, because samples in different remote sensing images are captured at the same shooting angle and distance, the boundary problem can be considered as more than just a multi-scale and multi-angle problem. In a remote sensing image, there will be many different types of land cover. In general, different types of land cover have their own spectral and structural characteristics, which are visible in different brightness values, pixel values, or spatial changes in remote sensing images. On account of the complexity of the composition, nature, distribution, and imaging conditions of the surface features, remote sensing images can be thought of in terms of "same object, different spectrum" and "same spectrum, different object". There are also "mixed pixels", in which two or more kinds of land cover occur within a single pixel or the instantaneous field of view, making the work of recognition in remote sensing images even more complex. All of these factors can affect the accuracy of the result. To deal with this, our proposed method seeks to enhance the aggregated channel and spatial features separately, thus improving the feature representation for remote sensing segmentation.
Our method can be used with any semantic segmentation model, such as U-Net, PSP-Net, etc. Taking PSP-Net as an example, its basic structure is shown in Figure 1 [23]. The input image (a) is fed into a Convolutional Neural Network (CNN) to obtain the feature map of the last convolutional layer (b). Then, a pyramid parsing module (c) is used to get different sub-region representations, followed by upsampling and concatenation layers to form the final feature representation. This contains both local and global context information. Finally, a convolutional layer is used to get the per-pixel prediction (d) according to the required representation.
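As a concrete illustration of the pyramid parsing step, the sketch below implements a minimal Pyramid Pooling Module in NumPy. It is a simplification rather than PSP-Net's actual implementation: the bin sizes here are powers of two so that nearest-neighbour upsampling reduces to a simple repeat (PSP-Net uses bins of 1, 2, 3, and 6), and the per-branch 1×1 convolutions and the final fusion convolution are omitted.

```python
import numpy as np

def adaptive_avg_pool(F, b):
    # Average-pool a (C, H, W) map down to (C, b, b); assumes b divides H and W.
    C, H, W = F.shape
    return F.reshape(C, b, H // b, b, W // b).mean(axis=(2, 4))

def ppm(F, bins=(1, 2, 4, 8)):
    # Minimal Pyramid Pooling Module: pool the feature map at several bin
    # sizes, upsample each pooled map back to (H, W) by nearest-neighbour
    # repetition, and concatenate all branches with the input along channels.
    C, H, W = F.shape
    branches = [F]
    for b in bins:
        pooled = adaptive_avg_pool(F, b)                           # (C, b, b)
        up = pooled.repeat(H // b, axis=1).repeat(W // b, axis=2)  # (C, H, W)
        branches.append(up)
    return np.concatenate(branches, axis=0)  # (C * (1 + len(bins)), H, W)
```

The bin-1 branch carries the global context (each channel's mean over the whole map), while the finer bins keep progressively more local context, which is the intuition behind the module.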

The general structure of DPA-PSP-Net is shown in Figure 2. We employed a pretrained ResNet50 [48] and used a dilated strategy [38] for the backbone. Drawing upon the structure of ResNet50, the proposed framework has four residual blocks, a Pyramid Pooling Module (PPM), a channel attention module, and a spatial attention module. We removed the down-sampling operation and employed dilated convolutions in the last two residual blocks instead, which is identical to the process used in PSP-Net. Thus, the size of the final feature map was 1/8 of the scale of the input image. Given an input image with a size of 256 px × 256 px, we used ResNet50 to get the feature map, F1, while the weighting factor for the spatial attention, Ws, was obtained by the spatial attention module. F1 was fed into the PPM and the channel attention module, respectively, to obtain the feature map, F2, after up-sampling, and the weighting factor for channel attention, Wc. Finally, F2 was multiplied by Wc and Ws, respectively, to obtain the channel attention-weighted feature map, FC, and the spatial attention-weighted feature map, FS. Then, FC and FS were aggregated to get the final output.

Total Spatial Attention Module
The effectiveness of the feature extraction is directly related to the accuracy of the results in remote sensing image segmentation. Features can be obtained by using contextual information. However, many studies [23,41] have shown that local features generated by traditional FCNs can lead to the wrong classification of objects and inaccurate prediction of object shapes. The attention mechanism plays an important role in the human visual system. When confronted with complex scenes, human beings can quickly focus their attention on significant aspects and prioritize them. As with the human visual system, a computer-based attention mechanism can focus the computing power of a network on key features, so that important features can be extracted from remote sensing images more effectively and redundant information can be set aside. To enhance the local feature extraction ability for difficult remote sensing images, we have developed a total spatial attention module (TSAM). This module can capture the spatial boundary information of remote sensing images, which makes it easier to extract the boundary features and refine other adaptive features, while suppressing less important information. The structure of the module is very simple, and it can be embedded in any network to improve a network's feature learning ability. Numerous methods for handling spatial attention already exist [20,22]. However, in our spatial attention module, the input is the original image data rather than the feature map, F1. In view of the high-resolution character of remote sensing images and the complex spatial information they contain, the accuracy of the boundary information is of vital importance. As a network deepens, the receptive field gradually expands, the semantic information becomes increasingly advanced, the feature map becomes smaller and smaller, and the spatial information is constantly reduced.
The size of the feature map, F1, is only 1/8 of the input data, so a lot of spatial information has been lost. Therefore, the original image is a better resource for capturing the important spatial information in a remote sensing image.
The structure of the total spatial attention module is illustrated in Figure 3a. The input data are the remote sensing image, I ∈ R 4×H×W , which is the same as the input data in ResNet. Input I first passes through the layers conv3×3, BN [49], and ReLU, with channel number C, to generate the feature map, A ∈ R C×H×W . Then, A is fed into the conv1×1, BN, and ReLU layers to obtain the next feature map, B ∈ R 1×H×W . Feature map B passes through another conv1×1 layer to generate the feature map, C ∈ R 1×H×W . Finally, a sigmoid function is used to get the spatial attention weighting factor, Ws ∈ R H×W . The process is as follows:

A = ReLU(BN(Conv3×3(Io)))
B = ReLU(BN(Conv1×1(A)))
C = Conv1×1(B)
Ws = σ(C)

ISPRS Int. J. Geo-Inf. 2020, 9, x FOR PEER REVIEW
where Io denotes the original remote sensing image, and A, B, and C are the corresponding feature maps in Figure 3a. In this way, each value, ws, in Ws is between 0 and 1. This can be regarded as the weight of each corresponding pixel in the original image, reflecting the pixel's relative importance. This simple method makes it possible to generate a position weight with the same width and height as the original image, with the network enhancing the pixel-level local feature extraction ability with almost no increase in computation. Thus, more effective remote sensing scene features can be extracted, thereby improving the classification performance.
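The TSAM pipeline above can be sketched in NumPy as follows. This is an illustrative simplification under stated assumptions: the BN layers are omitted, the weight tensors are random placeholders rather than learned parameters, and a naive convolution loop stands in for an optimized implementation.

```python
import numpy as np

def conv2d(x, w, pad):
    # Naive 'same'-style 2D convolution.
    # x: (Cin, H, W); w: (Cout, Cin, k, k); pad: zero-padding on each border.
    Cout, Cin, k, _ = w.shape
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((Cout, H, W))
    for i in range(H):
        for j in range(W):
            patch = xp[:, i:i + k, j:j + k]            # (Cin, k, k)
            out[:, i, j] = np.tensordot(w, patch, axes=3)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tsam(image, w1, w2, w3):
    # image: (4, H, W) NIR-RGB input; BN layers omitted in this sketch.
    A = relu(conv2d(image, w1, pad=1))   # conv3x3 + ReLU -> (C, H, W)
    B = relu(conv2d(A, w2, pad=0))       # conv1x1 + ReLU -> (1, H, W)
    C = conv2d(B, w3, pad=0)             # conv1x1        -> (1, H, W)
    return sigmoid(C)[0]                 # Ws: (H, W), each value in (0, 1)
```

The returned Ws has the same height and width as the input image, matching the text's claim that the module assigns a weight to every pixel of the original image.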

Channel Attention Module
There are some commonplace problems in remote sensing image datasets. These include the uneven distribution of samples and the varying complexity of different kinds of land cover. When a model is trained, as the network deepens, the semantic information becomes increasingly sophisticated. Each channel in the final advanced semantic features can be seen as a summary of different types of land cover. We introduced a channel attention module (CAM) to enhance the feature channels with similar values occurring in the same image location. If the same position in an image has similar values for different channels, it means that there may be at least two types of feature, with little or no difference between them. The output of the CAM aims to make the relationship between similar channels more obvious. The CAM can capture different kinds of important information in a remote sensing image relating to different channels in the high-level semantic feature map. This facilitates the extraction of key features, refining the balance in the adaptive feature extraction. The input for the CAM is the feature map, F1. This contains the highest-level semantic features in the whole model.
The structure of the CAM is illustrated in Figure 3b. The input data are the feature map, F1 ∈ R 512×H/4×W/4 . The input, F1, first passes through a 3 × 3 convolutional layer, a BN layer, and a ReLU layer, with the channel number, C, to generate the feature map A ∈ R C×H×W . Then, Global Average Pooling is used to obtain the feature map, B ∈ R C×1×1 . Then, B is fed into a 1 × 1 convolutional layer to get the feature map, C ∈ R C×1×1 . Finally, we use a sigmoid function to get the channel attention weighting factor, Wc ∈ R C×1×1 . The process can be summarized as follows:

A = ReLU(BN(Conv3×3(F1)))
B = GAP(A)
C = Conv1×1(B)
Wc = σ(C)

where F1 is the corresponding feature map in Figure 2; and A, B, and C are the corresponding feature maps in Figure 3b. As with Ws, each value in Wc is between 0 and 1. This can be regarded as the weight of each category, which reflects the feature extraction difficulty. By using this simple method to generate channel weights, the network can focus on more complex types of feature extraction, reduce the redundant information, and improve the land cover type classification.
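The channel attention path can be sketched in NumPy as below. The Global Average Pooling and sigmoid steps are exact; the leading conv3×3 + BN + ReLU stage is assumed to have already produced the C-channel map A, and the 1×1 convolution on a 1×1 spatial map is written as the matrix product it reduces to. The weight matrix is a random placeholder, not a learned parameter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cam(A, w):
    # A: (C, H, W) feature map, taken here as the output of the
    #    conv3x3 + BN + ReLU stage (omitted in this sketch).
    # w: (C, C) weights of the 1x1 convolution.
    B = A.mean(axis=(1, 2))             # Global Average Pooling -> (C,)
    Cv = w @ B                          # 1x1 conv on a 1x1 map == matrix product
    return sigmoid(Cv)[:, None, None]   # Wc: (C, 1, 1), each value in (0, 1)
```

Because Wc has shape (C, 1, 1), multiplying it with a (C, H, W) feature map rescales each channel uniformly, which is exactly the per-category weighting described in the text.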

Feature Aggregation
By using the above modules, the important information in high-resolution remote sensing images can be extracted more effectively. To make full use of the contextual information, the features are aggregated after the attention weights have been applied. This involves multiplying the output of the two modules (Ws and Wc) element-wise with the feature map, F2 (the PSP-Net output in our example), to get two feature maps of the same size, C × H × W. One is the feature map after application of the channel attention weight, FC. The other is the feature map after application of the spatial attention weight, FS. The feature aggregation is completed by summing the corresponding elements in FC and FS. It should be emphasized that the two attention modules are very simple and can be directly used in any segmentation model. They do not significantly increase the computational load, but they can significantly improve a network's performance.

Experiments
In this section, we first introduce the Gaofen Image Dataset (GID) and explain how the model was implemented. Then, we present how a comprehensive experiment was conducted on the GID dataset to evaluate our proposed method and to compare its semantic segmentation performance against other state-of-the-art algorithms.

Dataset Description
The high-resolution image dataset, GID [26], is a large-scale land cover dataset. It was constructed from GF-2 satellite images. As a result of its large coverage, wide distribution, and high spatial resolution, it has a number of advantages over existing land cover datasets. GF-2 is the highest resolution civil terrestrial observation satellite in China at present, so the image clarity of the dataset is exceptional. The categories covered by the dataset are also both varied and typical, so the characterization of the land cover types is representative of the distribution of the land cover in most parts of China. At the same time, the complexity of the land cover types makes the dataset especially valuable for research. The GID dataset consists of two parts: a large-scale classification set and a fine-grained land cover classification set. The large-scale classification set contains 150 GF-2 images annotated at pixel level. The fine-grained classification set consists of 30,000 multi-scale image blocks and 10 pixel-level annotated GF-2 images. We deliberately chose to use the GID dataset with 16 kinds of land cover, which are more difficult to train on. Each image is 6800 px × 7200 px, with 4 NirRGB channels and high-quality pixel-level labels for the 16 types of land cover. The 16 types of land cover are as follows: industrial land; urban residential; rural residential; traffic land; paddy field; irrigated land; dry cropland; garden plot; arbor woodland; shrub land; natural grassland; artificial grassland; river; lake; pond; and other categories. Figure 4 shows the distribution of the types of land cover.

Dataset Preprocessing
Due to the uneven distribution of the different types of land cover in the GID dataset and the fact that the images are very large, the dataset needed to be preprocessed, so that the training could be more effective. First of all, we manually cropped the 10 images to 1000 px × 1000 px to serve as a validation set, keeping the change in the distribution as small as possible. The reason for selecting the validation set in this way was that our method is a fully convolutional network (FCN), so it is not sensitive to the size of the input image. Moreover, remote sensing images are often very large, so we chose larger images for validation. Manual selection can also ensure a balanced distribution. The distribution of the types of land cover in the validation set is shown in Figure 5.

After removing the validation set from the GID dataset, 15,000 images were randomly cropped from the original images to 256 px × 256 px to create a training set. As there was an uneven distribution of different types of land cover, we cropped areas containing the types with the lowest distribution in the GID, such as garden plots and artificial grassland, from the original images, giving about nine images of different sizes. Then, 1000 256 px × 256 px images were randomly cropped from these nine images and added to the training set to improve the distribution. Thus, the final training set was made up of 16,000 images with a size of 256 px × 256 px, as shown in Figure 6. Figure 7 shows the training set's distribution. We did not use a test set, as the size of the dataset was too small. Although there are 16,000 images in the training dataset, they were randomly cropped from the rest of the GID dataset, so there is some overlap between the images in the training set. This overlapping cropping is itself a form of data augmentation; a larger training set would in any case have been difficult to train with, as it would take up too much memory.
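The random-cropping procedure can be sketched as follows. This is a hedged illustration, not the authors' code: the crop size matches the text, but the function name, crop count, and RNG handling are our own choices.

```python
import numpy as np

def random_crops(image, label, size=256, n=16, rng=None):
    # image: (C, H, W) remote sensing image; label: (H, W) pixel-level mask.
    # Returns n aligned (image crop, label crop) pairs of size x size pixels.
    # Crops may overlap, which itself acts as a mild form of augmentation.
    rng = rng or np.random.default_rng()
    C, H, W = image.shape
    out = []
    for _ in range(n):
        y = int(rng.integers(0, H - size + 1))
        x = int(rng.integers(0, W - size + 1))
        out.append((image[:, y:y + size, x:x + size],
                    label[y:y + size, x:x + size]))
    return out
```

To rebalance rare classes as described above, the same function can simply be run again with a higher n on the images that contain the under-represented land cover types.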
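The random patch extraction described above can be sketched as follows; the function names and the NumPy array layout are our own illustration, not the authors' released code:

```python
import numpy as np

def random_crop(image, label, size=256, rng=None):
    """Crop a random size x size patch from an (H, W, C) image and its (H, W) label."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = label.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return (image[top:top + size, left:left + size],
            label[top:top + size, left:left + size])

def build_training_set(images, labels, n_patches, size=256, rng=None):
    """Draw n_patches random (patch, mask) pairs; overlap between patches is allowed."""
    if rng is None:
        rng = np.random.default_rng(0)
    patches = []
    for _ in range(n_patches):
        idx = int(rng.integers(0, len(images)))
        patches.append(random_crop(images[idx], labels[idx], size, rng))
    return patches
```

In this scheme, 15,000 patches would be drawn from the main image pool and a further 1,000 from the crops containing the rare classes.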

Data Augmentation
High-resolution remote sensing images can easily cause network overfitting because it is hard to obtain a sufficient number of labeled images. The limited number of land cover types in the small GID dataset also made the network training more difficult. Therefore, a data augmentation strategy was employed to enhance the generalizability of the network. We used Albumentations (https://github.com/albumentations-team/albumentations) to augment the dataset and applied the HorizontalFlip, VerticalFlip, RandomRotate90, and Transpose transforms to enrich the training dataset. This also gave the features extracted by the network rotation invariance. ElasticTransform, Blur, and Cutout were also applied to every image during training to suppress the likelihood of the network capturing insignificant features. The probability for all of the above operations was 0.5.
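The geometric part of this pipeline can be approximated in NumPy as below; in practice the Albumentations transforms were used, and this sketch only illustrates how each operation is applied to the image and its label mask together:

```python
import numpy as np

def augment(image, label, p=0.5, rng=None):
    """Apply horizontal/vertical flips, a random 90-degree rotation, and a
    transpose, each with probability p, identically to the (H, W, C) image
    and its (H, W) label mask."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < p:                         # horizontal flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < p:                         # vertical flip
        image, label = image[::-1, :], label[::-1, :]
    if rng.random() < p:                         # random 90-degree rotation
        k = int(rng.integers(1, 4))
        image, label = np.rot90(image, k), np.rot90(label, k)
    if rng.random() < p:                         # transpose (swap H and W)
        image, label = image.transpose(1, 0, 2), label.T
    return np.ascontiguousarray(image), np.ascontiguousarray(label)
```

Because the same random decision is applied to the image and the mask, the pixel-to-label correspondence is preserved under every operation.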

Implementation Details
We used the pixel accuracy (Acc), mean IoU, and F1-score as performance evaluation metrics for the semantic segmentation results. Pixel accuracy is the number of correctly classified pixels divided by the total number of pixels in the image. It can be calculated as follows:

$$Acc = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \quad (10)$$

where k is the number of foreground categories; p_ii is the number of pixels predicted correctly; and p_ij is the number of pixels that belong to class i but are predicted to belong to class j.
With regard to semantic segmentation, the mean IoU is the mean, over categories, of the intersection-over-union between the ground truth and the predicted segmentation. It is a valuable measure of segmentation performance: results fall in the range 0 to 1, with a higher value indicating better segmentation. The mean IoU can be calculated as follows:

$$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (11)$$

where k is the number of foreground categories; p_ii is the number of pixels predicted correctly; and p_ij and p_ji are the false positive and false negative counts, respectively.

Another indicator is the F1-score, the weighted harmonic mean of the precision and recall. The recall and F1-score can be obtained as follows:

$$Recall = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ji}} \quad (12)$$

$$F1 = \frac{2 \times Acc \times Recall}{Acc + Recall} \quad (13)$$

where Acc is the pixel accuracy mentioned above; p_ii is the number of pixels predicted correctly; and p_ji denotes the false negative counts.

After augmenting the training set using the above method, we set the training period to 100 epochs for all of the experiments and employed NVIDIA Apex for mixed-precision training. We used a weight decay of 0.00001 and a momentum of 0.9. All of the model backbones were ResNet-50 pretrained on ImageNet, to facilitate the ablation experiments. Cross-entropy loss was used at the end of the model to supervise the final results. It can be calculated as follows:

$$L = -\frac{1}{n} \sum_{i=1}^{n} p_{gt} \log(p_{pre})$$

where n is the total number of pixels; p_gt is the ground truth of pixel p_i; and p_pre is the prediction for pixel p_i. The base learning rate was set to 0.15 and decreased to 0.00001 through cosine annealing until the end of training. The experiments were run on Ubuntu 18.04 with an NVIDIA RTX 2080 Ti GPU, implemented in PyTorch, and optimized with stochastic gradient descent (SGD).
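Under the definitions above, all three metrics can be computed from a confusion matrix, as sketched below; the F1 aggregation here is the common macro-averaged form, since the paper does not fully specify how the per-class scores are combined:

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """p[i, j]: number of pixels of true class i predicted as class j."""
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(p):
    """Eq. (10): correctly classified pixels over all pixels."""
    return np.diag(p).sum() / p.sum()

def mean_iou(p):
    """Eq. (11): per-class p_ii / (sum_j p_ij + sum_j p_ji - p_ii), averaged."""
    inter = np.diag(p).astype(float)
    union = p.sum(axis=1) + p.sum(axis=0) - inter
    return np.nanmean(inter / union)

def f1_score(p):
    """Macro-averaged F1 from per-class precision and recall
    (the paper's exact aggregation may differ)."""
    tp = np.diag(p).astype(float)
    precision = tp / np.maximum(p.sum(axis=0), 1)
    recall = tp / np.maximum(p.sum(axis=1), 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return f1.mean()
```

Accumulating one confusion matrix over the whole validation set and reading the metrics off it is equivalent to, and cheaper than, scoring each image separately.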

Ablation Study of Total Spatial Attention Module-Related Improvements
Numerous approaches have used channel and spatial attention modules in recent years [20,22,26]. Most use the feature map, F1, as input (see Figure 2). In our total spatial attention module (TSAM), the basic idea of a spatial attention module is modified by relocating the module and simplifying its structure so as to keep the overall method simple. The role of a CAM is well established, so there is no need to repeat studies of the CAM here. Our experiments therefore focused primarily on the potential improvements arising from the TSAM.

Experiment 1: Effect of High-Level Semantic Information on the TSAM
As the TSAM extracts features from the original image, a lack of high-level semantic information might limit its effectiveness. To assess this possibility, we extracted a spatial weighting factor matrix from the backbone and fused it with the TSAM's output to inject high-level semantic information. A method without any high-level spatial attention (HLSA) was then compared with methods using high-level spatial attention in various ways. We used PSP-Net for the experiment (DPA-PSP-Net), because DPA-Net can be appended to any network. The experiment showed that using the TSAM on the original image without HLSA was sufficiently effective, delivering 82.75% for Acc and 67.92% for the mean IoU. HLSA did not improve the network performance, so we did not employ it. The experimental results are shown in Table 1.

Experiment 2: Effect of the TSAM Location
As a network deepens, the feature map becomes smaller and the spatial information decreases. This was the basis of our reasoning that it would be more effective to capture the spatial information from the original image. To verify this assumption, we computed the TSAM at three different locations in the model: at the beginning, in the middle, and at the end of the backbone. ResNet consists of five blocks in series; we chose the original image, the output of ResNet's 3rd block, and the output of its 5th block as the TSAM input. The corresponding feature maps were 1, 1/4, and 1/8 times the size of the original image. As shown in Table 2, the performance of the TSAM improved as the input size increased, confirming our initial conjecture. To assess the effect of the TSAM's depth, we tested different numbers of parameters to find the most efficient structure. We only changed the number of layers before the layer that reduces the feature map to a single channel; in other words, we kept the last two 1 × 1 convolutions and increased or decreased the number of 3 × 3 convolutions. The experimental results show that the TSAM performed best with three layers. Table 3 shows the results for models using different numbers of layers.
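A minimal PyTorch sketch of a TSAM with this structure is given below; the channel widths and normalization layers are our assumptions, and only the depth (the number of 3 × 3 convolutions) and the final two 1 × 1 convolutions follow the description above:

```python
import torch
import torch.nn as nn

class TSAM(nn.Module):
    """Total spatial attention module sketch: a stack of 3x3 convolutions
    followed by two 1x1 convolutions that squeeze the features to a
    single-channel spatial weighting map over the original image."""
    def __init__(self, in_channels=3, mid_channels=64, num_3x3=3):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(num_3x3):                 # the depth studied in Table 3
            layers += [nn.Conv2d(c, mid_channels, 3, padding=1),
                       nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True)]
            c = mid_channels
        layers += [nn.Conv2d(c, mid_channels // 2, 1), nn.ReLU(inplace=True),
                   nn.Conv2d(mid_channels // 2, 1, 1)]   # single-channel map
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # Spatial weighting factor in (0, 1), same H x W as the input image.
        return torch.sigmoid(self.body(x))
```

Because the input is the full-resolution image rather than a backbone feature map, the resulting weighting factor keeps the original spatial resolution.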

Ablation Study for Both Attention Modules
In order to assess any differences between the two modules' effects on remote sensing semantic segmentation performance, we conducted experiments with different combinations. The results are shown in Table 4. Table 4 makes evident the performance improvements brought about by using both the CAM and the TSAM. Compared with a baseline PSP-Net, applying a CAM delivered a mean IoU of 66.90% and an F1-score of 71.45%, improvements of 1.52% and 7.48%, respectively. Employing just a TSAM increased the mean IoU to 67.37% and improved the F1-score by 6.07%. However, the biggest performance improvement came from using both modules together. When we integrated the CAM and TSAM, the mean IoU was 67.92%, 2.54% higher than the baseline, and the F1-score was 72.56%, 8.59% higher than the baseline. These experimental results confirm that the dual path attention approach with both modules is a more effective strategy for improving the performance of semantic segmentation models on remote sensing images.
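The dual path combination can be sketched as follows; the elementwise-addition fusion is an assumption, as the paper only states that the two weighted feature maps are fused:

```python
import numpy as np

def dual_path_fusion(feat, spatial_w, channel_w):
    """Weight the segmentation output by the spatial and channel attention
    factors separately, then fuse the two paths by elementwise addition.

    feat:      (C, H, W) segmentation model output
    spatial_w: (1, H, W) spatial weighting factor from the TSAM
    channel_w: (C, 1, 1) channel weighting factor from the CAM
    """
    spatial_path = feat * spatial_w      # broadcast over channels
    channel_path = feat * channel_w      # broadcast over spatial dims
    return spatial_path + channel_path
```

Each path scales the same features along a different axis, which is why the two modules capture complementary information and combine well.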
We also considered the feasibility of a Squeeze-and-Excitation (SE) operation, so we added SE operations to the CAM and TSAM for comparative experiments. The structure is illustrated in Figure 8. For the TSAM, we added a convolutional layer to reduce its size to H/2 × W/2 and then used an upsampling operation to restore it to the original size. For the CAM, we used a fully connected layer to reshape its size to C/2 × 1 and then restore it. The output of the TSAM was also changed to 16 channels, i.e., the same as the number of land cover types. The experimental results are shown in Table 5. Table 5. DPA-Net performance using the structure in Figure 8. We noticed that the performance of the TSAM using SE operations was not always as good as expected. The structure of the attention module also became more complex, while its performance did not improve. The mean IoU results for the SE operations were lower than those of our method by 0.59%, 1.21%, and 0.44% for U-Net, PSP-Net, and DeepLab V3+, respectively. The F1-score results for the SE operations were lower than those of our proposed method by 1.30%, 6.81%, and 0.21%, respectively. This may be because the function of the SE operation is to remove redundant information, whereas our CAM focuses on category features and leaves little redundancy to remove. The purpose of using the original image as the TSAM input is to obtain better resolution, retain finer spatial features, and provide better localization. Therefore, the SE operation may not be well suited to a TSAM.
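The SE-style channel path used in this comparison can be sketched as below; the fully connected weights w1 and w2 are placeholders, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_weights(feat, w1, w2):
    """Squeeze-and-Excitation-style channel weighting: global average pool
    to a C-vector, squeeze to C/2 with one FC layer, restore to C with
    another, then gate with a sigmoid.

    feat: (C, H, W) feature map
    w1:   (C/2, C) squeeze weights; w2: (C, C/2) restore weights
    Returns (C, 1, 1) channel weights in (0, 1)."""
    squeeze = feat.mean(axis=(1, 2))          # (C,)  global average pooling
    hidden = np.maximum(w1 @ squeeze, 0.0)    # (C/2,) ReLU bottleneck
    weights = sigmoid(w2 @ hidden)            # (C,)
    return weights.reshape(-1, 1, 1)
```

The bottleneck is what discards redundant channel information; as noted above, that is useful in a generic backbone but works against a CAM that already concentrates on category features.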

Comparison with Different Models
In view of the small amount of GID data, we used augmentation to offset the potential problem of network overfitting. To verify the validity of our chosen augmentation method, we conducted experiments where we trained DPA-Net on the U-Net, PSP-Net, and DeepLab V3+ semantic segmentation models, using the original dataset and the augmented dataset. The results are shown in Table 6. The results indicate that the augmentation strategy we employed was effective. The semantic segmentation mean IoU increased to 67.07%, 67.92%, and 67.37% for U-Net, PSP-Net, and DeepLab V3+, respectively. The F1-score increased to 65.75%, 72.56%, and 67.31% for the above techniques, respectively. This suggests that augmentation strategies can enhance the scope for network generalization by enriching the data.
To verify the effectiveness of our method in relation to actual remote sensing image segmentation tasks, we compared it against more mainstream methods based on self-attention mechanisms. These methods were Non-Local NN, SE-Net, CBAM, and DA-Net. The results of the experiment are shown in Tables 7 and 8.

The experimental results show that DPA-PSP-Net provided the most effective semantic segmentation, with SE-Net the next most effective. The mean IoUs for the supposedly stronger CBAM and DA-Net were only 65.45% and 64.67%, respectively. Non-Local NN and SE-Net had better F1-scores, but these were still lower than that of DPA-PSP-Net. This confirms that the segmentation of remote sensing images differs from normal scene segmentation, so DPA-PSP-Net may have an advantage over existing methods. To further assess the effectiveness of the proposed method, we compared the mean IoU and F1-score for each type of land cover when using the three models, U-Net, PSP-Net, and DeepLab V3+, with and without DPA-Net.
As shown in Tables 9 and 10, every model performed better with DPA-Net than on its own. Note in particular that although PSP-Net on its own had lower mean IoU results than U-Net and DeepLab V3+, DPA-PSP-Net outperformed every other approach. The same is true for the F1-scores. Another point to note is that because shrubbery woodland made up such a small part of the distribution, no network captured its key features well, so every approach produced poor results for it. However, DPA-Net still improved the segmentation model. Several visual comparisons using PSP-Net as an example are shown in Figure 9. The output of the TSAM is shown in the rightmost column of Figure 9. Although the TSAM's input is the original image, its output does not appear to contain much noise. In some regions, such as the "lake" in the second row and the "dry cropland" in the last row, the details and boundaries are even clearer. These results demonstrate the effectiveness of the visualized TSAM weighting factors.
To further assess the contribution made by the TSAM to DPA-Net, we visualized the differences in the output of DPA-Net under different forms of attention. We randomly selected a test image, shown in Figure 10. We first compared the output of DPA-Net with just the CAM against its output with both the TSAM and the CAM, saving the feature maps output by the softmax function. The size of these two feature maps was (C, H, W). Then, we took the L1 norm of the difference between these two feature maps along the C dimension, yielding a heat map of size (1000, 1000), shown in the right-hand column of Figure 10. This heat map shows the difference in output between DPA-Net with and without the TSAM: the brighter the highlight, the greater the TSAM's contribution. In the images, the river and lake areas are relatively pronounced, meaning that the contribution of the TSAM was especially significant in these regions. This heat map makes the TSAM's contribution to the overall prediction evident.
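This visualization step can be sketched as follows, reading the description above as the L1 norm of the difference between the two probability maps along the class dimension:

```python
import numpy as np

def tsam_contribution_heatmap(probs_with, probs_without):
    """L1 norm, over the class dimension, of the difference between the
    softmax outputs of DPA-Net with and without the TSAM.

    probs_*: (C, H, W) softmax probability maps.
    Returns an (H, W) heat map; brighter values mark the pixels the TSAM
    changed most."""
    diff = np.abs(probs_with - probs_without)    # per-class absolute difference
    return diff.sum(axis=0)                      # L1 norm along C
```

Applied to two (16, 1000, 1000) probability maps, this yields the (1000, 1000) heat map described above.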
We also counted the multiply-accumulate operations (MACs) for DPA-Net and the number of parameters required, and compared them with the original U-Net, PSP-Net, and DeepLab V3+ models. As can be seen in Table 11, the MACs only increased by 0.07 G, 0.223 G, and 0.069 G, respectively, across the three models, and the number of parameters only increased by 0.075 M, 0.077 M, and 0.075 M, respectively. This shows that, compared with the original methods, DPA-Net increases the memory footprint by only a small amount.
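For a single convolutional layer, the MAC and parameter counts behind figures like Table 11 follow the standard formulas, sketched below:

```python
def conv2d_macs_and_params(c_in, c_out, k, h_out, w_out, bias=True):
    """Multiply-accumulate and parameter counts for one k x k conv layer.

    MACs:   every output pixel of every output channel accumulates
            c_in * k * k products.
    Params: each output channel owns a c_in * k * k kernel plus a bias."""
    macs = c_in * k * k * c_out * h_out * w_out
    params = c_out * (c_in * k * k + (1 if bias else 0))
    return macs, params
```

For example, a 3 × 3 convolution from 3 to 64 channels on a 256 × 256 map costs about 0.113 G MACs and 1792 parameters, the same order of magnitude as the per-model overheads reported in Table 11.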


Conclusions
In this paper, we have proposed a Dual Path Attention Network (DPA-Net) for the semantic segmentation of remote sensing images. It can be used with any segmentation model without significantly increasing the memory footprint or number of parameters. A remote sensing image is first processed by the backbone and by a total spatial attention module to obtain a feature map and a spatial weighting factor. A CAM is then computed from the feature map to obtain the channel weighting factor. Finally, the output of the segmentation model is multiplied by the spatial weighting factor and the channel weighting factor separately, giving two feature maps that capture different aspects of the features, and these two feature maps are fused to obtain the final DPA-Net output. The proposed network was tested and found to improve the performance of various state-of-the-art segmentation models on the GID dataset. We believe the performance can be further improved by refining the structure of the two attention paths, so this will be the focus of our future work.