Improved U-Net Remote Sensing Classification Algorithm Fusing Attention and Multiscale Features

Abstract: The selection and representation of classification features in remote sensing images play crucial roles in image classification accuracy. To effectively improve the classification accuracy, an improved U-Net remote sensing classification algorithm fusing attention and multiscale features is proposed in this paper, called spatial attention-atrous spatial pyramid pooling U-Net (SA-UNet). This framework connects atrous spatial pyramid pooling (ASPP) with the convolutional units of the encoder of the original U-Net in the form of residuals. The ASPP module expands the receptive field, integrates multiscale features in the network, and enhances the ability to express shallow features. Through the fusion residual module, shallow and deep features are deeply fused, and the characteristics of both are further used. The spatial attention mechanism combines spatial with semantic information so that the decoder can recover more spatial information. In this study, the crop distribution in central Guangxi Province was analyzed, and experiments were conducted based on Landsat 8 multispectral remote sensing images. The experimental results showed that the improved algorithm increases the classification accuracy from 93.33% to 96.25%. The segmentation accuracies of sugarcane, rice, and other land increased from 96.42%, 63.37%, and 88.43% to 98.01%, 83.21%, and 95.71%, respectively. The agricultural planting area results obtained by the proposed algorithm can be used as input data for regional ecological models, which is conducive to the development of accurate and real-time crop growth change models.


Introduction
Remote sensing technology plays an important role in agricultural monitoring, geological survey, military survey, and target detection [1][2][3][4]. As an application of remote sensing technology, land cover classification is a popular and challenging research topic. Accurate land cover classification is important for agricultural production and grain yield assessment, urban planning and construction, and ecological change monitoring [5][6][7]. Researchers have proposed multiple classifiers for land cover classification from remote sensing images [8], and classification methods based on optical satellite images can be broadly grouped into spectral-based and spectral-spatial methods [9]. Spectral-based methods use the spectral values obtained from remote sensing images as features and apply statistical clustering or machine learning algorithms, including support vector machines (SVMs) [10], maximum likelihood [11], and random forest [12], to classify pixels. Spectral-spatial methods combine the spectral values of each pixel with auxiliary information constructed from the image neighborhood to form feature vectors for pixel-by-pixel classification of remote sensing images [13].
Although the above classifiers can accurately classify land cover, they still need to be improved for precision agriculture classification. With the development of deep learning, U-Net variants with attention mechanisms have been proposed that prune the redundant features passed through skip connections and highlight the salient features in specific local regions. Self-attention was originally used for NLP tasks. Due to the excellent performance of the Transformer [34], Alexey et al. [36] proposed the Vision Transformer (ViT) using the encoder part of the Transformer. Although ViT performs well on large datasets such as ImageNet, its memory consumption is large, so ViT has high hardware requirements. To improve efficiency, Chen et al. [37] proposed TransUNet, which uses image blocks from the CNN feature map as the input sequence of ViT and then combines it with U-Net. The decoder upsamples the output features of ViT and combines them with the CNN feature map of the same size to recover spatial information, which effectively improves the semantic information of the image. Because the Swin Transformer [38], with its powerful global modeling capability, has shown superior performance on several large visual datasets, some researchers have combined the Swin Transformer with U-Net to achieve semantic segmentation [20,21,39]. The Swin Transformer used by ST-UNet [20] was paralleled with a CNN to improve the accuracy of small-scale target segmentation by combining the global features of the Swin Transformer with the local features of the CNN through an aggregation module.
The algorithms described above, with their different optimization approaches for different classification tasks, provided reference ideas for this study. In experiments aimed at further improving the classification accuracy of remote sensing images, deepening the network was found to reduce the spatial resolution and disperse the spatial information. Therefore, in this paper, an improved U-Net remote sensing classification algorithm that integrates attention and multiscale features is proposed. First, the algorithm uses dilated convolutions of different scales to expand the receptive field so that the network effectively integrates multiscale features and enhances shallow features. Second, through the fusion residual module, the shallow and deep features are deeply fused, and the characteristics of both are effectively used. Third, to integrate more spatial information into the upsampled feature maps, a spatial attention module (SAM) is used to fuse the feature maps obtained from skip connections with the upsampled feature maps, enhancing the combination of spatial and semantic information.
In this paper, U-Net is combined with ASPP and SAM and optimized for land cover classification from Landsat 8 remote sensing images. The main contributions include (1) exploring the effect of different hierarchical features of U-Net on the ground cover classification of 30 m resolution Landsat 8 images; (2) introducing the ASPP module through residual connections with the original U-Net, which not only increases the fusion of multiscale features but also enhances the expression of shallow information; (3) introducing SAM to obtain a spatial weight matrix from the feature maps with richer spatial information, so that the spatial weight matrix acts on the corresponding semantic feature maps to produce feature maps combining spatial and semantic information; and (4) conducting a dynamic change analysis of land use in the study area, focusing on the dynamic change in the crop planting area.
The rest of the paper is structured as follows: Section 2 introduces the experimental data, the data processing methods, and the effects of different level features on Landsat 8 image ground cover classification. The proposed improved U-Net model is also described. Section 3 outlines the experimental results in detail. Section 4 provides a discussion based on the experimental results, and conclusions are drawn in Section 5.

Study Area Overview
Guangxi Province is located in South China and has a subtropical monsoon climate with warm temperatures, abundant rainfall, and plentiful sunshine. Guangxi receives some of the most abundant precipitation in China, so the province is suitable for cultivating fruits and crops, including citrus, mango, banana, lychee, sugarcane, and rice. The landscape is generally composed of six categories: mountains, hills, terraces, plains, rocky hills, and water surfaces. The area selected for this study is located in the central part of the Guangxi Zhuang Autonomous Region, mainly including Xingbin District, Heshan District, and Xincheng County of Laibin City; Liujiang District of Liuzhou City; and Shanglin County of Nanning City. The study area spans 108°19′9″E–109°47′26″E and 23°3′29″N–24°23′54″N (Figure 1). Its unique climatic and geographical conditions make sugarcane and rice the main crops in Guangxi, where both have two seasons per year. Spring-sown sugarcane is planted from January to March and harvested from May to August; autumn-sown sugarcane is planted from August to September and harvested in December. The first rice season runs from sowing in April to harvesting in July, and the second season runs from July to October. The sugarcane area in the region accounts for about 60% of the total sugarcane cultivation area in China. Sugarcane planting in the study area is mostly contiguous, exceeding 64% of the total agricultural land. As such, accurately and effectively determining the planting area of sugarcane is important for local agricultural development, precise management, and yield estimation.

Field Sampling and Remote Sensing Image Preprocessing
To determine the distribution of the actual ground object types in the study area, field sampling was performed in October, when crops were growing, and samples of different ground object types were obtained through field data collection and observation. When collecting sugarcane and rice samples in the field, contiguous planting areas of more than 900 m² were preferentially selected, and the obtained data were used for accumulating prior knowledge and for accuracy verification.
In this study, a multispectral image covering the study area taken by the Landsat 8 satellite on 2 October 2019, with a resolution of 30 m and 11 bands, was used as the data source. Landsat 8 carries the Operational Land Imager (OLI) and the Thermal Infrared Sensor (TIRS). The OLI includes nine bands with a spatial resolution of 30 m, including a 15 m panchromatic band, with an imaging width of 185 × 185 km. The TIRS includes two separate thermal infrared bands with a resolution of 100 m. To obtain more effective image information, the images were preprocessed with ENVI software for geolocation, radiometric calibration, atmospheric correction, mosaicking, and cropping. After that, the sample library data were obtained through a combination of indoor supervised classification and field validation, which yielded 22,868,430 samples. The dataset was divided as shown in Figure 2; dataset I was used for all experiments and dataset II was used to further verify the validity of SA-UNet. As shown in Figure 3, 60% of the samples were used for training, 20% for validation, and 20% for testing. To ensure that each slice contained no background, a small range of intersection areas was included in the training, validation, and test sets, which did not affect the results. To increase the diversity of training samples, a sliding window method was used to crop the images into 256 × 256 blocks, with the window sliding 32 pixels each time to ensure 224 overlapping pixels between adjacent blocks. The training set was cropped into 3060 samples of 256 × 256 pixels, and the validation and test sets were each cropped into 72 samples of 256 × 256 pixels.

U-Net++ [28] and U-Net3+ [41] use dense skip connections to achieve a full-scale U-Net.
However, for Landsat 8 remote sensing images, denser skip connections do not necessarily lead to higher-accuracy classification of ground objects: some skip connections may even have a negative impact. Therefore, the data from the study area were used to analyze in depth the impact of different feature levels on U-Net's classification of Landsat 8 images. The results are shown in Figure 5. (1) The U-Net without skip connections performed the worst, with an overall accuracy 18.62% lower than that of the original U-Net. The decoder of 'U-Net-None' is not combined with the information of the encoder and only upsamples the feature map to restore the input size, so information is significantly lost. (2) The different levels of features made different contributions to the classification results: the accuracy of U-Net was 93.33%, that of U-Net-L1 was 93.08%, and that of U-Net-w/o L1 was 89.69%. Compared with L2, L3, and L4, L1 contributed the most to U-Net. Therefore, shallow features play an important role in the classification results. (3) U-Net-w/o L1 was 3.39% less accurate than U-Net-L1: only the first-level skip connections positively affected U-Net. When the second-, third-, or fourth-level skip connections were removed individually, the accuracy did not drop but instead increased, especially when the fourth-level skip connections were removed. The reason may be that the high-level features were not suitable for feature fusion. Therefore, features at different levels made different contributions to the results, so the expressive power of features with large contributions should be enhanced, and that of features with negative effects should be reduced. These results demonstrate the importance of enhancing shallow feature representation as well as improving semantic information.

SA-UNet
In response to the problems identified in Section 2.3, the SA-UNet network structure was designed. The overall structure of SA-UNet is shown in Figure 6. SA-UNet includes two parts, an encoder and a decoder, which are combined through skip connections. The overall structure includes five residual modules, five ASPP modules, and four SAMs. Each convolutional layer of the backbone network is accompanied by a batch normalization layer and a ReLU layer. In the ASPP module, each atrous convolution is followed by a ReLU layer; the pooling size is 2 × 2, and the transposed convolutions use a 2 × 2 kernel with a stride of 2.

ASPP
The ASPP module is shown in Figure 7 and has five parallel branches. The first branch is an ordinary 1 × 1 convolutional layer; the second, third, and fourth branches use 3 × 3 dilated convolutions with dilation rates of 6, 12, and 18, respectively. The fifth branch applies global average pooling, producing an output of shape (batchsize, in_channel, 1, 1), then changes the number of channels through a 1 × 1 convolution, and finally uses bilinear interpolation to restore the feature map to the input size. The features obtained from the five branches are concatenated along the channel dimension, so there are five times as many channels as in a single branch; a 1 × 1 convolution then reduces the number of channels to obtain the final output. The ASPP module uses dilated convolutions, which insert holes into ordinary convolutions to expand the receptive field, alleviating the decrease in spatial resolution caused by max pooling layers [42]. Therefore, by expanding the receptive field and integrating multiscale features, the expressive ability of shallow features is enhanced. In addition, by combining the ASPP module with the backbone network through a residual structure, the shallow and deep features are deeply fused, and the characteristics of both are used more effectively.
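As an illustration, the five-branch structure described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the class and layer names are ours, and attaching batch normalization to each branch is an assumption; the 1 × 1 branch, the dilation rates 6/12/18, the pooled branch, and the final 1 × 1 projection follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Hypothetical sketch of the five-branch ASPP module described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Branch 1: ordinary 1x1 convolution
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        # Branches 2-4: 3x3 dilated convolutions with rates 6, 12, 18
        self.dilated = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for d in (6, 12, 18)])
        # Branch 5: global average pooling -> 1x1 conv -> bilinear upsampling
        self.gap = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.ReLU(inplace=True))
        # Concatenation yields five times the channels; reduce with a 1x1 conv
        self.project = nn.Conv2d(5 * out_ch, out_ch, 1, bias=False)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.branch1(x)] + [b(x) for b in self.dilated]
        feats.append(F.interpolate(self.gap(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))

x = torch.randn(2, 64, 32, 32)
y = ASPP(64, 64)(x)
print(y.shape)  # torch.Size([2, 64, 32, 32])
```

Because the padding of each 3 × 3 branch equals its dilation rate, all five branches preserve the spatial size, so their outputs can be concatenated directly.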

SAM
The spatial attention mechanism focuses on where the task-relevant information is located. In remote sensing images, the types of ground objects are diverse and their distributions are complex, and using SAM to aggregate semantic and spatial information is a way to improve the discrimination of ground objects. SAM draws on the idea of CBAM [31]: it first uses the feature map of the spatial information path (X_SP) to obtain the spatial feature weight map W_S, then multiplies the semantic information path feature map (X_SE) by W_S at the corresponding spatial locations to obtain representative features, and finally sums these with X_SE at the corresponding spatial locations to obtain the fused output.
To learn the spatial weights, two channel-wise feature descriptors, X_SP^avg ∈ R^(1×H×W) and X_SP^max ∈ R^(1×H×W), are first obtained by applying average pooling and max pooling along the channel axis of the feature map. Then X_SP^avg and X_SP^max are concatenated, and a 7 × 7 convolution is used to generate the spatial attention map. Finally, the spatial attention map is scaled to 0~1 using the sigmoid function to obtain the spatial feature weight map W_S. The spatial attention is calculated as follows:

W_S = σ( f^(7×7)( [AvgPool(X_SP); MaxPool(X_SP)] ) )
X_out = W_S ⊗ X_SE + X_SE

where X_SE and X_SP represent the semantic and spatial information path feature maps, respectively; σ represents the sigmoid function; f^(7×7) represents a convolutional operation with a kernel size of 7 × 7; ⊗ denotes element-wise multiplication; AvgPool represents the average pooling of each pixel along the channel axis; and MaxPool represents the max pooling of each pixel along the channel axis.
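A minimal PyTorch sketch of this attention computation (module and variable names are ours; the 7 × 7 convolution, channel-wise pooling, sigmoid scaling, and residual sum follow the text):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Hypothetical sketch of the CBAM-style SAM described above."""
    def __init__(self):
        super().__init__()
        # 7x7 convolution over the two pooled descriptors -> one weight map
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x_sp, x_se):
        # Channel-wise average and max pooling of the spatial path X_SP
        avg = torch.mean(x_sp, dim=1, keepdim=True)   # (N, 1, H, W)
        mx, _ = torch.max(x_sp, dim=1, keepdim=True)  # (N, 1, H, W)
        # Sigmoid scales the attention map to 0~1
        w_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        # Weight the semantic path X_SE and add it back at each location
        return x_se * w_s + x_se

sam = SpatialAttention()
x_sp = torch.randn(2, 64, 32, 32)
x_se = torch.randn(2, 64, 32, 32)
out = sam(x_sp, x_se)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```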

Loss Function
For multiclassification tasks, the cross-entropy loss function is usually chosen due to its superior performance. The cross-entropy loss compares the predicted class with the target class for each pixel and is expressed as follows:

L = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{M} y_ic · log(p_ic)

where N is the number of samples; M represents the number of classes; y_ic is the sign function (0 or 1), taking a value of 1 when the true class of sample i is equal to c and 0 otherwise; and p_ic is the predicted probability that sample i belongs to class c.

Evaluation Metrics
To evaluate the semantic segmentation of remote sensing images, four evaluation metrics based on the confusion matrix were used: Accuracy, the ratio of the number of correct predictions to the number of all predictions; Precision, the ratio of correct positive predictions to all positive predictions; mean intersection over union (mIoU), which averages the intersection over union (IoU) of each class and thus reflects the overall prediction performance of the model more accurately than IoU; and the Kappa coefficient, an indicator of the consistency between two variables. The formulas for these quantitative assessment metrics are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
mIoU = (1/k) Σ_{i=1}^{k} TP_i / (TP_i + FP_i + FN_i)
Kappa = (p_o − p_e) / (1 − p_e)

where TP is the number of positive samples classified correctly, FP is the number of negative samples misclassified as positive, TN is the number of negative samples classified correctly, FN is the number of positive samples misclassified as negative, k is the number of classes, p_o is the overall accuracy, and p_e is the hypothetical probability of chance agreement, computed from the marginal totals of the confusion matrix.
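All four metrics can be computed from a single confusion matrix; the NumPy sketch below shows the standard formulas (the function name and the toy matrix are ours, not from the paper).

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute Accuracy, per-class Precision, mIoU, and Kappa from a
    confusion matrix cm where cm[i, j] counts true class i predicted as j."""
    total = cm.sum()
    tp = np.diag(cm)                                  # correct per class
    accuracy = tp.sum() / total
    precision = tp / cm.sum(axis=0)                   # per predicted class
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp) # TP / (TP + FP + FN)
    miou = iou.mean()
    p_o = accuracy                                    # observed agreement
    # Chance agreement from the row/column marginal totals
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return accuracy, precision, miou, kappa

# Toy 3-class confusion matrix (illustrative counts)
cm = np.array([[50, 2, 3],
               [4, 40, 1],
               [2, 3, 45]], dtype=float)
acc, prec, miou, kappa = metrics_from_confusion(cm)
print(acc)  # 0.9
```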

Experimental Environment
In the experiments, ENVI software was used to label regions of interest based on the outdoor sampling points and prior knowledge, and a supervised classification method was used to obtain the label maps needed for the experiments. In the training phase, the cross-entropy loss function was selected, the batch size was set to 8, the maximum number of epochs was 20, and Adam was selected as the optimizer. Adam combines an adaptive learning rate with momentum-based gradient descent, which alleviates the effects of gradient oscillation. The initial learning rate was 0.01, and the learning rate decayed to 0.1 times its previous value every five epochs. All experimental code was implemented in Python 3.9 with PyTorch 1.10.2, and the model was trained on Windows 10 with an Intel i5-8500 CPU and an NVIDIA GeForce RTX 3060 12 GB GPU.
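The training configuration above maps onto a standard PyTorch optimizer and scheduler; the sketch below is illustrative (the stand-in model is a placeholder for SA-UNet, and applying the 0.1× decay per epoch is our reading of the text).

```python
import torch

model = torch.nn.Conv2d(3, 8, 3)  # placeholder stand-in for SA-UNet
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Learning rate decays to 0.1x its previous value every five epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(20):  # maximum epoch = 20
    # ... the training loop over batches of size 8 would go here ...
    optimizer.step()     # placeholder step preceding the scheduler step
    scheduler.step()

print(optimizer.param_groups[0]['lr'])  # 0.01 * 0.1**4 ~= 1e-06
```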

Results
The details of the sample pool in the study area are shown in Tables 1 and 2, where 60% was used for training, 20% for validation, and 20% for testing. The following experiments were conducted using the sample pool data in Tables 1 and 2. Due to space limitations, construction land, other land, and bare land are abbreviated as CL, OL, and BL, respectively. The test set true-color images and labeled maps are shown in Figure 8. To select a suitable learning rate, the proposed model was trained with learning rates of 0.1, 0.01, and 0.001; the experimental results are shown in Table 3, where the values of Acc, mIoU, and Kappa were maximal for a learning rate of 0.01.

Ablation Study
To evaluate the proposed network structure and the performance of its two key modules, ablation experiments were performed on dataset I in the study area using U-Net as the base network.
(1) Effect of ASPP: The results are shown in Table 4. The ASPP module was introduced into U-Net in the form of residuals to segment the test set images. The overall accuracy increased by 2.92%, the mIoU increased by 7.35%, and the Kappa coefficient increased by 4.06%. In particular, the recognition accuracies of rice (+17.03%), other land (+5.57%), and bare land (+6.53%) considerably improved. The recognition accuracies of sugarcane (+1.63%), water (+3.48%), construction land (+1.6%), and forest (+1.89%) also improved. This verifies the effectiveness of integrating the ASPP module into U-Net in the form of residuals. As shown in the first row of Figure 9, the bridges in the construction land were segmented after ASPP was added to U-Net, whereas the original U-Net failed to segment them. The second row of Figure 9 shows that U-Net+ASPP segmented rice more accurately than U-Net, and sugarcane, rice, and forest were segmented more easily and accurately. The results show that combining ASPP with residual units enables the network to focus not only on global information but also on detailed information.
(2) Effect of SAM: Table 4 also shows the results of combining SAM with U-Net on the test set images. The accuracy increased by 2.2%, the mIoU increased by 4.68%, and the Kappa coefficient improved by 3.07%. SAM combined spatial with semantic information and increased the network's recognition rates for sugarcane (+0.77%), rice (+14.75%), water (+2.75%), construction land (+0.75%), forest (+1.8%), other land (+5.15%), and bare land (+2.13%). The misclassification rates of sugarcane and rice decreased by 1.4% and 10.16%, respectively. The confusion matrix showed that the numbers of misclassified samples both within and between the groups of sugarcane, rice, and forest decreased. Comparing the data in Table 4 shows that SAM had the most obvious effect on forest recognition accuracy. With the addition of SAM, the network more easily distinguished categories with large between-group differences, but less easily distinguished categories with small within-group differences. The first row of Figure 9 shows that the inclusion of SAM facilitated distinguishing construction land from water; the second row of Figure 9 shows that SAM was more effective than U-Net for segmenting concentrated rice planting areas, but did not improve the recognition of areas where rice planting was scattered. The results show that SAM enhanced the intergroup differences and improved the segmentation between features with large differences.
(3) Effect of ASPP+SAM: Table 4 shows that combining ASPP and SAM with U-Net not only reduced the misclassification within groups but also reduced the misclassification between groups. The recognition accuracies of sugarcane (+1.59%), rice (+19.84%), water (+3.45%), construction land (+1.59%), other land (+7.28%), and bare land (+6.68%) increased, with the greatest impact on rice and other land. As shown in the third row of Figure 9, the contour of the other land segmented by U-Net+ASPP+SAM was more accurate, as was the extraction of small areas of rice. Therefore, SA-UNet can focus on information at different scales to increase the accuracy of the classification of ground objects. Although the accuracy of U-Net+ASPP+SAM is the same as that of U-Net+ASPP, the corresponding mIoU and Kappa coefficients are slightly higher for the former.

Comparison of Multiple Methods
To evaluate the performance of SA-UNet in land cover classification, comparative experiments and analyses were conducted with U-Net [22], U-Net++ [28], SAR-UNet [42], Res-UNet++ [24], Attention-UNet [18], UCTransNet [29], and Swin-UNet [39]. U-Net++ improves full-scale feature fusion on the basis of U-Net, while SAR-UNet and Res-UNet++ introduce residual structures and attention modules into U-Net. Attention-UNet adds an attention mechanism on the basis of U-Net. UCTransNet adds a channel-wise cross-fusion transformer to U-Net. Swin-UNet is a U-shaped network composed purely of Swin Transformer blocks. The proposed SA-UNet incorporates ASPP into U-Net in the form of residuals and combines semantic and spatial information through a spatial attention mechanism in the decoding process. None of the above methods were pretrained.
The segmentation results of the different algorithms on the test set of dataset I are shown in Table 5. The proposed SA-UNet outperformed the other algorithms in all three overall metrics: Acc (96.25%), mIoU (89.33%), and Kappa (94.84%). The segmentation accuracies of SA-UNet for rice (83.21%), water (97.46%), construction land (97.02%), other land (95.71%), and bare land (88.45%) were also the highest among the compared algorithms. U-Net++ used a full-scale fusion approach, but its generalization ability was worse than that of U-Net. SAR-UNet achieved the highest segmentation accuracy for sugarcane, at 98.74%, but its segmentation results for rice (−7.13%), construction land (−5.67%), and other land (−8.79%) were inferior to those of U-Net. Res-UNet++ only slightly improved the segmentation accuracy of sugarcane (+1.27%), water (+0.58%), and forest (+0.14%) compared with U-Net; although Res-UNet++ had slightly higher Acc and Kappa than U-Net, its mIoU was lower. Attention-UNet performed similarly to U-Net. UCTransNet slightly improved the segmentation of water (+0.43%), construction land (+0.48%), other land (+1.95%), and bare land (+0.77%) compared with U-Net, but produced unsatisfactory results for rice. Swin-UNet obtained lower Acc (−0.09%), mIoU (−4.7%), and Kappa (−1.24%) than U-Net. SA-UNet performed the best overall, improving the segmentation of sugarcane (+1.59%), rice (+19.84%), water (+3.45%), forest (+1.21%), other land (+7.28%), and bare land (+6.68%), indicating that SA-UNet provides advantages in the land cover classification of Landsat 8 remote sensing images. Table 6 shows the segmentation results of the different methods on the test set of dataset II: the Acc (96.62%), mIoU (89.20%), and Kappa (95.16%) of SA-UNet were still the highest among the compared methods.
Figure 10 shows the segmentation results of the different methods on the remote sensing images of the test set in the study area. Overall, the image segmentation results of the algorithms were broadly similar. The proportions of construction land, forest, and sugarcane in the test set were relatively large; the planting of sugarcane and rice was relatively concentrated; and the proportions of the other land types were smaller. The segmentation results for the test set are shown in Figure 11, and the effectiveness of SA-UNet can be seen by combining Table 6 with Figure 11 (all of the following experiments, except those in Figure 11 and Table 6, used dataset I). Comparing Figure 10 with Figure 12 shows that the differences between the algorithms are mainly reflected in local areas. Locally, rice distribution was scattered in this image, as indicated by the yellow and red boxes; for this scattered rice distribution, only the proposed algorithm produced results closely matching the ground truth, while the other algorithms all produced visible errors. SAR-UNet segmented rice poorly and mixed rice with forest. As seen from the colored boxes, some networks had difficulty identifying bridges over rivers: U-Net, U-Net++, and UCTransNet failed to segment the bridges from the water. SAR-UNet, Res-UNet++, Attention-UNet, and Swin-UNet were also unsatisfactory, and only SA-UNet showed merely minor differences from the ground truth. As seen from the black box, U-Net, U-Net++, SAR-UNet, and UCTransNet were unable to distinguish sugarcane from forest in detail, easily misclassifying sugarcane as forest, whereas Swin-UNet and SA-UNet distinguished sugarcane from forest more accurately.

Land Use Change in Study Area
In this study, four scenes of Landsat 8 remote sensing images, dated 13 October 2015, 28 October 2017, 2 October 2019, and 13 October 2021, were downloaded from the USGS website (https://earthexplorer.usgs.gov/, accessed on 1 January 2021) to classify the land cover of the study area. Higher-resolution imagery, field collection data, and prior knowledge were used for supervised classification to obtain the 2015, 2017, 2019, and 2021 sample base data for the study area. The methods in Table 5 and dataset I were used to analyze the four phases of images. Finally, the proposed algorithm was used to classify and evaluate the four phases of images in turn for land cover classification. The results of the multi-method segmentation of the test sets for 2015, 2017, 2019, and 2021 are shown in Figures 10 and 13-15, respectively. As shown by the evaluation metrics in Tables 5 and 7-9, the Acc, mIoU, and Kappa values of the four-phase images met the needs of the study.
The spatial distributions of land use and land use changes are shown in Figure 16. From the spatial analysis of land use distribution, the study area was mostly covered by forest, the planting areas of sugarcane and rice were relatively concentrated, sugarcane was the most widely planted among crops, the main rivers ran through the whole study area, large cities were mainly distributed on both sides of the main rivers, and lakes and reservoirs were randomly distributed throughout the whole study area. From the analysis of land use changes, the conversions of sugarcane plantation areas into other land, forest plantation areas into other land, and bare land into other land were more common.
The land use and land use change rates for the seven-year period are shown in Table 10. Combining the results in Table 10 with those in Figure 16, the proportion of forest cover in the study area was found to be the largest in all four images, followed by sugarcane and construction land. From the analysis of land use change, the areas of the two major crops, sugarcane and rice, changed the most every year. The area of bare land increased each year because, first, distinguishing bare land from freshly planted fruit trees and crops in Landsat 8 remote sensing images is difficult and, second, because of human activities. In 2019-2021, the water area changed by −60.2%, indicating that a drought occurred during the period, causing the reservoir as well as some tributaries to dry up. The overall change in construction land was small, with basically no change in large cities and only sporadic changes in villages, factories, and concrete surfaces. In 2019-2021, other land (−79.99%) and sugarcane (−28.16%) were mainly transformed into forest (+19.74%), restoring the forest area to that observed in 2015 and suggesting that green ecological awareness is increasing.

Crop Land Use Change
The classification results of the remote sensing images from 2015, 2017, 2019, and 2021 show that the proposed algorithm increases the classification accuracy of sugarcane, rice, and other land. Therefore, the proposed algorithm was used to monitor the dynamic changes in crops over the seven-year period. The crop acreage in the four periods in the study area was analyzed and compared; the classification results and the dynamic changes in crops are shown in Figure 17. The statistics of the monitored regional changes in crop cultivation area are shown in Table 10.

Discussion
In Landsat 8 remote sensing image classification, U-Net is more accurate than U-Net++. Section 2.3 shows that not all skip connections benefit the classification result, so the full-scale fusion strategy adopted by U-Net++ has a negative effect on the overall land cover classification accuracy. However, full-scale fusion is better at enhancing intergroup differences: when the differences between feature types are small, ablation experiments must be conducted on the skip connections to select the most appropriate scale features for fusion, whereas when the differences between feature types are large, full-scale fusion is beneficial for increasing intergroup differences. The channel attention mechanism increases SAR-UNet's attention to sugarcane, which leads to a sugarcane segmentation accuracy of 98.74%. Although SAR-UNet includes ASPP modules in the transition layer of the network to increase semantic information, Section 2.3 shows that increasing the expressiveness of shallow features is more beneficial in the classification of Landsat 8 remote sensing images. Res-UNet++ has a structure similar to SAR-UNet, except that Res-UNet++ adds a spatial attention mechanism in the decoder to focus on the key regions of the feature maps. Unlike the spatial attention module used in this study, Res-UNet++ combines two feature maps of different sizes to generate spatial attention weights and obtain new semantic information. This is consistent with the idea of Attention-UNet, which reduces the redundant features transmitted by skip connections and highlights the salient features of specific local regions. The analysis of the results shows that Res-UNet++ is slightly more accurate than Attention-UNet. The structures of SAR-UNet, Res-UNet++, and Attention-UNet and their results show that adding channel attention to the encoder and then adding spatial attention to the decoder has little impact on the classification results.
UCTransNet replaces the traditional skip connections with the CTrans module, which uses a transformer to cross-fuse multiscale information across the features of the four levels of U-Net, reducing the negative effects produced by some skip connections. CTrans works best for segmenting the two land cover types with the most distinctive characteristics, namely forest and construction land, and may also produce accurate results if used for road segmentation. Swin-UNet adapts Swin-Transformer to a U-shaped structure to achieve semantic segmentation. Because the stacked attention layers of Swin-Transformer can substantially improve the expressiveness of the model, the forest segmentation results of Swin-UNet were better than those of the other comparison methods. In the 30 m resolution remote sensing image land cover classification task, the shallow features strongly influenced the experimental results, but as the network deepened, the spatial resolution decreased and spatial information was dispersed. Therefore, the proposed SA-UNet uses atrous (dilated) convolutions at different rates to expand the receptive field and enable the multiscale fusion of features, increasing the ability to express shallow features. In addition, SA-UNet fuses shallow with deep features through residual fusion modules to effectively exploit the characteristics of both shallow and deep features. To integrate more spatial information into the upsampled feature map, the feature map obtained from the skip connection is fused with the upsampled feature map using the spatial attention module, enhancing the combination of spatial and semantic information. Compared with U-Net, SA-UNet improved the classification accuracy of all ground objects. As rice is a major food crop, accurately extracting rice growing areas from remote sensing images is important; however, the rice segmentation accuracy of the proposed method is still insufficient and needs further improvement.
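The ASPP-with-residual idea described above can be sketched as follows. This is a minimal PyTorch illustration, assuming standard ASPP dilation rates (1, 6, 12, 18) and a simple additive residual; the paper's exact rates and fusion details may differ.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel dilated 3x3 convolutions
    at different rates enlarge the receptive field without downsampling,
    and a 1x1 convolution fuses the multiscale responses."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding == dilation keeps the spatial size unchanged
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats)

class ResidualASPPBlock(nn.Module):
    """Connects ASPP to an encoder convolution unit in residual form,
    so shallow features are retained and enriched with multiscale context."""
    def __init__(self, ch):
        super().__init__()
        self.aspp = ASPP(ch, ch)

    def forward(self, x):
        return x + self.aspp(x)  # residual fusion of shallow + multiscale features

x = torch.randn(1, 64, 32, 32)
y = ResidualASPPBlock(64)(x)
print(y.shape)  # torch.Size([1, 64, 32, 32]) -- spatial size preserved
```

Setting the padding equal to the dilation rate is what lets each branch keep the input's spatial size, so all branches can be concatenated directly and the residual addition is shape-compatible.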

Conclusions
In this study, the main focus was on improving the ability to express shallow features in remote sensing images and enhancing the effective combination of spatial and semantic information, so as to obtain global contextual information and improve the segmentation of ground objects. To this end, U-Net and an ASPP module were fused by means of residual connections. This fusion not only expands the receptive field through atrous convolutions at different dilation rates, promotes the fusion of multiscale features, and enhances the expressiveness of shallow features, but also enables the deep fusion of shallow and semantic features, which mitigates the mutual interference of locally complex feature types. In addition, a spatial attention module was used to fuse the feature maps obtained from the skip connections with the upsampled feature maps, which alleviates the inadequate use of spatial information in the upsampling process. The results showed that the proposed SA-UNet produced more accurate land cover classification results from Landsat 8 remote sensing images in the study area than U-Net, U-Net++, SAR-UNet, Res-UNet++, Attention-UNet, UCTransNet, and Swin-UNet, with an accuracy of 96.25%. In future work, the features passed through the skip connections should be further optimized so that the model can enhance both intergroup and intragroup differences.

Data Availability Statement:
The SA-UNet code and the 2019 data set for the study area are available at https://github.com/Yancccccc/SA-UNet (accessed on 9 June 2022).

Conflicts of Interest:
The authors declare no conflict of interest.