Improved U-Net Remote Sensing Classification Algorithm Based on Multi-Feature Fusion Perception

Abstract: The selection and representation of remote sensing image classification features play crucial roles in image classification accuracy. To improve the classification accuracy of ground features, this paper proposes an improved U-Net framework based on multi-feature fusion perception. The framework adds a channel attention module to the original U-Net framework (CAM-UNet) and cascades the shallow features with the deep semantic features, replaces the classification layer of the original U-Net network with a support vector machine, and finally uses the majority voting game theory algorithm to fuse the multifeature classification results into the final classification result. This study took the forest distribution in Xingbin District, Laibin City, Guangxi Zhuang Autonomous Region as the research object and used Landsat 8 multispectral remote sensing images. By combining spectral features, spatial features, and advanced semantic features, it overcame the influence on the classification results of the reduction in spatial resolution that occurs as the network deepens. The experimental results showed that the improved algorithm can improve classification accuracy: with the improvement, the overall segmentation accuracy and the segmentation accuracy of forestland increased from 90.50% to 92.82% and from 95.66% to 97.16%, respectively. The forest cover results obtained by the proposed algorithm can be used as input data for regional ecological models, which is conducive to the development of accurate and real-time vegetation growth change models.


Introduction
Remote sensing technology plays an important role in the fields of crop monitoring, geological investigation, and precision agriculture [1][2][3]. Carbon balance has always been a topic of concern worldwide, and forest resources largely contribute to the global carbon balance, so it is necessary to accurately monitor the dynamic changes of forest resources [4]. However, the use of remote sensing images to identify different features with high accuracy, and to classify and count various kinds of feature information, is a popular and difficult research point in remote sensing information extraction. The essence of the image-specific target segmentation challenge in remote sensing is to construct a target feature space and its mapping model. The current mainstream remote sensing classification methods mainly include traditional machine learning methods and semantic segmentation methods based on deep learning, and the corresponding algorithms will be introduced in the following section.

Related Work
Traditional remote sensing image classification methods, such as the k-means clustering method [5], watershed algorithm [6], and active contour model [7], manually extract feature values corresponding to targets in a remote sensing image space to form a feature space and construct a mapping model from the feature space to the target space. However, the mapping model from the feature space to the target space is a high-dimensional, strongly nonlinear relationship, which is difficult to implement using manual methods. Thus, some scholars have proposed learning-based remote sensing image segmentation methods to establish mapping models through sample learning. Dong et al. [8] introduced a number of single complementary features combined with back propagation (BP) neural networks to improve the accuracy of single tree detection. Sun et al. [9] introduced a Mahalanobis Distance kernel to improve the classification performance of support vector machines (SVMs) for remote sensing images. Li et al. [10] combined spectral features, vegetation indices, texture features, and topography to establish a random forest model to identify the forest types in the HeiLongJiang Cap Mountains. The aforementioned early remote sensing image classification methods mainly use the low-level features of images for model training. However, there is an insufficient utilization of feature information, which needs to be improved for feature refinement classification, and it is difficult to distinguish complex feature types.
The semantic segmentation method based on deep learning is applied to remote sensing image classification and shows good performance. Deep convolutional neural networks (CNNs) can automatically extract different classes of features in remote sensing images [11][12][13][14][15] with good accuracy. Kussual et al. [16] proposed a multi-level deep learning method for land cover and crop type classification using multitemporal multisource satellite images to classify 11 classes of crops, such as wheat, corn, and sunflower. Alshehhi et al. [17] combined low-level features with high-level semantic features extracted by CNNs to classify roads and buildings in cities. Csillik et al. [18] used CNNs to identify citrus trees from UAV images. Nowadays, deep neural networks are highly capable of image feature extraction, and extreme learning machines (ELMs) and SVMs, which are traditional linear classifiers, have strong capabilities in classification. Therefore, the use of ELMs or SVMs in classification has been proposed to improve accuracy after the feature extraction of CNNs is in effect. Wang et al. [19] proposed a CNN and ELM fusion method, where a CNN is used for feature extraction and an ELM is used as a classifier. Cao et al. [20] designed a combined CNN and SVM method to identify ships. Meng et al. [21] used a CNN to classify remote sensing images of wetlands and compared it with methods based on spectral SVM and texture and spectral SVM. Sun et al. [22] designed a seven-layer CNN structure, trained the samples with the CNN, and then used an SVM to classify remote sensing images and tested them with volcanic ash clouds. The aforementioned studies used CNNs for feature extraction and used ELM and SVM classifiers to improve classification accuracy. However, these studies only extracted the features of one layer and did not consider the features of different layers together. Long et al. [23] proposed a fully convolutional network (FCN) model. 
The FCN model replaces the fully connected layers in a CNN with convolutional layers, so it can accept an input of arbitrary size and produce an output of the corresponding size. The FCN also extends classification from the image level to the pixel level. Fu et al. [24] proposed an FCN-CRF (fully convolutional network-conditional random field) remote sensing classification algorithm with an average improvement in accuracy of 2% compared to the FCN. SegNet [25] uses unpooling in the decoder to upsample a feature map back to the input scale. Although this operation helps to maintain the integrity of the semantic information, it ignores neighborhood information when unpooling is performed on low-resolution feature maps. U-Net [26] was initially applied to segmentation in the medical imaging domain and was later applied in several other domains, such as remote sensing, because of its practicality and its ability to learn from small data volumes. Consequently, several U-Net based networks and improved U-Net networks have been used in remote sensing image segmentation studies [27][28][29][30]. Deeplabv3 [31] uses ResNet50 [32], InceptionResNetV2 [33], MobileNet [34], or Xception [35] as a backbone network to extract features. The extracted features are used as the input of the atrous spatial pyramid pooling (ASPP) module; the output of the ASPP module is upsampled through bilinear interpolation and concatenated with the features extracted from the backbone network, and bilinear interpolation is then performed again to achieve semantic segmentation.
Reducing the interference of redundant information and extracting discriminative features in a limited sample are also a challenge in remote sensing image classification. The attention mechanism tells us where to focus our attention [36], and weighting the features using the attention mechanism is an effective approach [37]. Because U-Net requires less data and has excellent segmentation in several domains, many networks add the attention mechanism to the original U-Net to focus on important features. Attention-UNet [38] proposed the attention-gate structure, which implements the attention mechanism by supervising the features of the next level to the features of the previous level. To alleviate the gradient disappearance problem, the traditional convolutional blocks in U-Net are replaced with residual structures [39][40][41]. EAR-UNet [39] uses EfficientNetB4 [42] as an encoder based on the U-Net framework, replaces the convolutional blocks in the decoder with residual blocks, and adds the attention-gate structure in the jump connection. SAR-UNet [40] replaces the convolution of U-Net with the residual module based on the U-Net framework, while introducing the Squeeze and Excitation (SE) block [42] in the encoder and replacing the transition and output layers with the ASPP module. Res-UNet [41] also replaces the convolution of U-Net with the residual module, replaces the upsampling operation with bilinear interpolation, and finally introduces a CRF to postprocess the network output. However, the above methods only use deep learning features for classification, and the method of classifying different categories using a single feature needs to be improved because the salient features of different categories in remote sensing images are not the same.
Previous studies mainly use the residual module or the attention mechanism to improve and optimize the U-Net network, or use a CNN to extract crop features and then use an SVM or ELM as the classifier; these works provided the research idea for this paper. This paper proposes an improved network structure, CAM-UNet, which adds the channel attention module (CAM) to the original U-Net framework and replaces the classification layer of the original U-Net network with an SVM. The network is used to select three different levels of features, which are cascaded as the input of the SVM. Finally, the majority voting game theory algorithm is applied to the classification results of the SVM to obtain the final classification result. This algorithm offers a new approach to improving the classification accuracy of remote sensing images.

Study Area Overview
The study area selected for this study is located in Xingbin District, Laibin City, Guangxi Zhuang Autonomous Region (GZAR) (108°43′43″ E to 109°36′7″ E and 23°15′58″ N to 24°4′38″ N) (Figure 1). The study area has a subtropical monsoon climate. The unique climatic and geographical factors make sugarcane one of the major crops in Guangxi, and its planting area accounts for approximately 60% of the national total. The planting area of sugarcane in the study area is more than 80% of the total agricultural land, so it is important to accurately and effectively obtain the planting area of sugarcane for local agricultural development, accurate management, and yield estimation.

Field Sampling and Remote Sensing Image Preprocessing
To obtain the distribution of actual feature types in the study area, a total of 2876 sample points of different feature types were obtained through field data collection and field observations (Table 1). During the field collection of sugarcane and rice samples, priority was given to continuous planting areas larger than 900 m². The acquired data were used to accumulate a priori knowledge and to verify accuracy at a later stage. In this study, Landsat 8 multispectral images covering the study area, with a resolution of 30 m and 11 bands, acquired from 2 to 8 October 2019, were used as the data source (Figure 1). These images were taken during the peak growth period of sugarcane and rice. To obtain more effective image information, the images were preprocessed with geolocation, radiometric calibration, atmospheric correction, mosaicking, and cropping. The sample library data were then obtained through a combination of indoor supervised classification and field validation; the sample library contained 4,874,817 samples, of which 60% were used for training, 20% for validation, and 20% for testing (Figure 1). In this paper, high-resolution Sentinel-2 satellite data were used as an aid to validate classification accuracy. The Sentinel-2 Level-C image of 3 October 2019 was downloaded from the USGS website, and its multispectral data were first corrected for atmosphere, topography, and cirrus clouds using the Sen2Cor software. Subsequently, the SNAP software was used to upsample the bands, increasing the resolution to 10 m, and to convert them to ENVI format. The 12 bands of the multispectral image were then fused using the ENVI software, the Seamless Mosaic tool was used to mosaic the image, and the vector data of the study area were imported for cropping.
Finally, the latitude and longitude information of the field collected data was imported into the corresponding Sentinel-2 images of the study area to obtain the sample data of the corresponding location, and the classification result map based on the Sentinel-2 images was obtained by supervised classification and accuracy verification of the sample data.

Improvements to U-Net
As a network goes further, semantic information becomes richer, but spatial resolution becomes lower. To maintain the spatial resolution and semantic features, the U-Net [27] model uses the skip connection operation to fuse the feature maps of different levels. In this paper, we added a channel attention mechanism based on U-Net, used the model trained by the CAM-UNet network to extract three different levels of features, put them into SVM classification, and then analyzed the voting results after the majority voting game to obtain the final results and evaluate classification accuracy. A flowchart of the proposed multifeature fusion perception algorithm framework is shown in Figure 2. The framework consists of four main components: the U-Net model, CAM, SVM classifier, and majority voting game module.

U-Net Model
The U-Net model is an end-to-end semantic segmentation network, named for its symmetrical structure, which resembles the letter U. The U-Net model consists of an input layer, convolutional layers, pooling layers, transposed convolutional layers, activation functions, and an output layer. The convolutional layer uses multiple convolution kernels with a size of 3 × 3 and a stride of 1 to perform the convolution operation, and its output is a feature map. In the convolution process, all input information shares a set of weights (weight sharing), which significantly reduces the number of training parameters and increases the computational speed. Convolution also has the ability of local perception, which improves neural network signal transmission to a certain extent. The activation operation introduces nonlinearity into the neural network, which lets the network better fit nonlinear mappings and improves the expressiveness of the model. Commonly used activation functions include sigmoid, Tanh, and ReLU. The U-Net model chooses ReLU as its activation function, which is defined as
$$\mathrm{ReLU}(x) = \max(0, x)$$
ReLU has a one-sided suppression capability: it passes positive inputs through unchanged and outputs zero for inputs less than zero. This capability speeds up network training while converting dense features into sparse features, effectively improving the robustness of the features, and the sparse features are mapped into a high-dimensional feature space with stronger linear separability. The essence of transposed convolution is upsampling, and the feature map is restored to the original image size by multiple transposed convolution operations. The U-Net model uses the skip connection to retain the information at each level and improve the generalization ability of the network.
During upsampling, each upsampled feature map is concatenated along the channel dimension with the downsampled feature map of the same level, which effectively fuses the image detail information with the contour information. Finally, the feature vectors are mapped to the desired number of classes using a 1 × 1 convolution kernel. The loss function is the performance metric that is optimized by varying the weights of the neural network, and it indicates how close the predicted value is to the true value. U-Net uses a boundary-weighted loss function, defined as
$$E = \sum_{x \in \Omega} w(x) \log\left(p_{\ell(x)}(x)\right)$$
where $p_{\ell(x)}(x)$ is the Softmax output for the labeled class at pixel $x$, $\ell: \Omega \to \{1, \ldots, k\}$ is the label value of the pixel point, and $w(x)$ is the boundary weight of the pixel point.
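As a minimal sketch of the ReLU activation and the boundary-weighted pixel-wise loss described above, the following pure-Python example evaluates both on a handful of pixels. The function names are hypothetical, the per-pixel weights are supplied by the caller rather than derived from boundaries, and the sum is negated so that a smaller value is better; that sign convention is an assumption, not something stated in the text.

```python
import math

def relu(x):
    """One-sided suppression: positive inputs pass through, others become 0."""
    return max(0.0, x)

def softmax(scores):
    """Numerically stable Softmax over one pixel's class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, labels, weights):
    """Weighted pixel-wise cross-entropy.

    logits  : one list of class scores per pixel
    labels  : true class index per pixel
    weights : per-pixel weight w(x), assumed to be given
    Returns the negated weighted log-likelihood (assumed sign convention).
    """
    loss = 0.0
    for scores, label, w in zip(logits, labels, weights):
        p_true = softmax(scores)[label]
        loss -= w * math.log(p_true)
    return loss

if __name__ == "__main__":
    logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
    print(weighted_cross_entropy(logits, labels=[0, 2], weights=[1.0, 2.0]))
```

Pixels with a larger weight contribute more to the loss, which is how a boundary weight map can emphasize pixels near class borders.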

Classifier
The logistic regression layer of the traditional U-Net model uses the Softmax function to achieve classification, which is based on the principle of regression, and its loss function is a probabilistic model considering global data. It normalizes the data in the feature space and presents the classification results in the form of probabilities. It is defined as follows. Let there be N classes of sample data. The output of the final convolution layer is $Y = (y_1, y_2, \ldots, y_N)^T$, and the output after the Softmax calculation is $S = (s_1, s_2, \ldots, s_N)^T$, where
$$s_i = \frac{e^{y_i}}{\sum_{j=1}^{N} e^{y_j}}, \quad i = 1, 2, \ldots, N$$
SVM has superior performance compared to Softmax. The basic idea of an SVM is the introduction of a kernel function, which maps linearly inseparable features to a high-dimensional feature space and thus makes the feature data linearly separable. In essence, an SVM searches for the optimal classification hyperplane, and this hyperplane does not change when non-support-vector samples change. In Softmax, however, a change in any sample leads to a change in the decision plane.
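The point that only support vectors determine the SVM decision can be illustrated with a small sketch of the kernel decision function $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$. The RBF kernel, the `gamma` value, and the coefficients below are illustrative assumptions; the text does not specify which kernel or parameters were used.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2) (kernel choice is assumed)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, samples, labels, alphas, bias, gamma=0.5):
    """Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    Samples with alpha_i == 0 are not support vectors and are skipped,
    so moving them cannot change the decision value.
    """
    total = bias
    for s, y, a in zip(samples, labels, alphas):
        if a == 0.0:
            continue
        total += a * y * rbf_kernel(s, x, gamma)
    return total

if __name__ == "__main__":
    labels = [-1, 1, 1]
    alphas = [1.0, 1.0, 0.0]  # hypothetical dual coefficients
    query = (1.8, 1.8)
    d1 = svm_decision(query, [(0, 0), (2, 2), (5, 5)], labels, alphas, 0.0)
    d2 = svm_decision(query, [(0, 0), (2, 2), (9, 9)], labels, alphas, 0.0)
    print(d1, d2)  # identical: the third sample is not a support vector
```

The sign of the returned value gives the predicted class for a binary problem; a multiclass SVM would combine several such decision functions.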

Channel Attention Module
In recent years, the channel attention mechanism has been used for image classification and segmentation with significant success, and it has obtained good results in the field of remote sensing image segmentation [43]. To obtain a more effective feature map, channel attention is introduced to extract image features adaptively before the maximum pooling layer. The specific operation of the CAM is as follows. A feature map $F$ of size $H \times W \times C$ is passed through global average pooling and global maximum pooling to obtain $F^c_{avg}$ and $F^c_{max}$, each of size $1 \times 1 \times C$. $F^c_{avg}$ and $F^c_{max}$ are then fed into a shared network consisting of two fully connected layers and an activation layer. The output size of the first fully connected layer is $\mathbb{R}^{1 \times 1 \times C/r}$, where $r$ is the reduction ratio, and the second fully connected layer restores the size to $\mathbb{R}^{1 \times 1 \times C}$. Finally, the two outputs of the shared network are combined by the Add function and fed into the sigmoid function to obtain the channel attention map $M_c \in \mathbb{R}^{1 \times 1 \times C}$. The channel attention module is shown in Figure 3. Channel attention is calculated as follows:
$$M_c(F) = \sigma\left(W_1\left(W_0\left(F^c_{avg}\right)\right) + W_1\left(W_0\left(F^c_{max}\right)\right)\right)$$
where $\sigma$ is the sigmoid function and $W_0$ and $W_1$ are the weights of the first and second fully connected layers of the shared network, with the activation applied after $W_0$.
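The steps of the channel attention module can be sketched in plain Python on a tiny feature map. This is a minimal illustration under stated assumptions: the activation in the shared network is taken to be ReLU, the fully connected layers have no bias terms, the weight matrices `W0` and `W1` are hypothetical, and rescaling each channel by its attention weight is the conventional way to apply the map rather than a detail given in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def shared_mlp(v, W0, W1):
    """Two fully connected layers: C -> C/r (ReLU, assumed) -> C."""
    hidden = [max(0.0, h) for h in matvec(W0, v)]
    return matvec(W1, hidden)

def channel_attention(feature_map, W0, W1):
    """feature_map: list of C channels, each an H x W nested list.

    Returns one attention weight in (0, 1) per channel.
    """
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    mx = [max(max(row) for row in ch) for ch in feature_map]
    a = shared_mlp(avg, W0, W1)   # shared weights for both pooled vectors
    m = shared_mlp(mx, W0, W1)
    return [sigmoid(x + y) for x, y in zip(a, m)]

def apply_attention(feature_map, weights):
    """Rescale every channel by its attention weight (assumed usage)."""
    return [[[w * v for v in row] for row in ch] for ch, w in zip(feature_map, weights)]

if __name__ == "__main__":
    fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 2.0]]]  # C = 2, H = W = 2
    W0 = [[1.0, 1.0]]       # C/r x C with r = 2 (hypothetical weights)
    W1 = [[1.0], [-1.0]]    # C x C/r
    weights = channel_attention(fmap, W0, W1)
    print(weights)
    print(apply_attention(fmap, weights))
```

Because the same `W0` and `W1` process both pooled vectors, the module adds few parameters while still letting the network weight channels differently.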

Feature Extraction and Fusion
The three most meaningful levels of features of the original image are extracted by the network model and put into the SVM for classification, and the final classification is obtained by voting on the three classification results. In this study, to select the preferred features at different levels, we first extracted images with a size of 256 × 256 containing all classification labels from the remote sensing images as experimental samples. Then, for each convolutional layer in the network model, an SVM was trained separately on the features that the layer extracted from the samples. The training, validation, and testing samples accounted for 60%, 20%, and 20% of the total sample size, respectively. Through empirical comparison, the features extracted from Layers 2, 56, and 57 of the network model were finally selected for SVM classification, and the classification results were then subjected to the voting game.
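The fusion of the three SVM classification results can be sketched as per-pixel plurality voting. This is only a plain majority vote: the game-theoretic details of the voting algorithm are not given in the text, and the tie-breaking rule used here (fall back to one designated classifier when all three disagree) is an assumption introduced for illustration.

```python
from collections import Counter

def majority_vote(predictions, fallback=0):
    """Fuse per-pixel labels from several classifiers.

    predictions : list of label sequences, one per classifier, equal length
    fallback    : index of the classifier trusted when the vote is tied
                  (hypothetical tie-breaking rule)
    """
    fused = []
    for votes in zip(*predictions):
        counts = Counter(votes).most_common()
        top_label, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        fused.append(votes[fallback] if tied else top_label)
    return fused

if __name__ == "__main__":
    shallow = ["forest", "rice", "water"]
    deep_a = ["forest", "sugarcane", "water"]
    deep_b = ["rice", "sugarcane", "bare"]
    print(majority_vote([shallow, deep_a, deep_b]))
```

With three voters, a label wins whenever at least two classifiers agree, so a single feature level that misclassifies a pixel can be overruled by the other two.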

Experimental Environment
In deep learning networks, hyperparameters such as the learning rate, the minibatch size used in each iteration, and the gradient clipping threshold need to be obtained through empirical debugging. The learning rate is a hyperparameter that controls the convergence speed of the model. The lower the learning rate is, the more slowly the loss function changes and the slower the convergence is, but a low learning rate helps ensure that the best local accuracy is reached. Conversely, if the learning rate is too high, local minima will be missed; a gradient threshold is therefore usually set to enable gradient clipping and suppress the gradient explosion caused by a very high learning rate. After several rounds of experimental debugging, the learning rate was set to 0.01, and each minibatch contains pixel patches with a size of 256 × 256. Training runs for 30 epochs, and each epoch comprises 1000 iterations with one minibatch extracted per iteration, for a total of 30,000 iterations. The gradient threshold and gradient decay rate are 0.05 and 0.0001, respectively. The experiments were implemented in Matlab 2021b, and the experimental environment consists of an Intel(R) Core(TM) i5-8500 CPU and an NVIDIA GeForce RTX 2060 GPU.
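The training schedule and the role of the gradient threshold can be checked with a short sketch. The constants are taken from the settings above; clipping by the L2 norm of the gradient is an assumption, since the text gives the threshold but not the clipping method.

```python
import math

EPOCHS = 30
ITERATIONS_PER_EPOCH = 1000
GRADIENT_THRESHOLD = 0.05

def total_iterations(epochs, per_epoch):
    """Total number of minibatch updates over the whole run."""
    return epochs * per_epoch

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold.

    Gradients already within the threshold are returned unchanged.
    (L2-norm clipping is an assumed method.)
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

if __name__ == "__main__":
    print(total_iterations(EPOCHS, ITERATIONS_PER_EPOCH))
    print(clip_by_norm([3.0, 4.0], GRADIENT_THRESHOLD))
    print(clip_by_norm([0.01, 0.02], GRADIENT_THRESHOLD))
```

Clipping preserves the direction of the gradient and limits only its magnitude, which is why it can tame an exploding gradient without changing where the update points.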

Results
Information on the data categories of the sample pool in the study area is shown in Table 2. Sixty percent was used for training, with 20% for validation and 20% for testing. The following experiments were conducted using the sample pool data in Table 2. To verify the effect of MinibatchSize on the network, it was designed with parameter optimization, and overall accuracy (OA), average accuracy (AA), and kappa coefficient were used as evaluation indexes of classification performance. The specific experimental results are shown in Table 3. MinibatchSize is the size of the small batch processing for each training iteration. The larger the MinibatchSize is, the longer it takes for each iteration, but within a certain reasonable range, the larger the MinibatchSize is, the more accurate its determined descent direction is. When MinibatchSize = 8, each iteration takes 1 s, and when MinibatchSize = 16, each iteration takes 2 s. Although it takes twice as long, the overall accuracies of the test set and the real set is improved by 1.66% and 0.5%, respectively, as seen from the comparison of the classification accuracy of U-Net. The accuracies of forestland in the test set and real set were improved by 0.87% and 0.23%, respectively. Because this experiment is for the fine classification of crops, it is worth spending twice as much time to train and improve accuracy. To verify the effect of different network structures on classification accuracy, this paper compared the U-Net structure with only the channel attention mechanism added (CAM-UNet) with the U-Net structure with both the residual units and attention mechanism added (Res-CAM-UNet) for experimental analysis, and the corresponding experimental results are shown in Table 4. Table 4 shows the effect of residual units on CAM-UNet. 
Res-CAM-UNet has a higher classification accuracy in sugarcane and construction land compared to CAM-UNet, with an improvement in the test and validation sets: 6.53%, 10.74%, 2.18%, and 5.86%, respectively. However, the classification accuracies obtained in forestland and rice were lower and decreased in the test and validation sets: 30.48%, 41.18%, 3.73%, and 6.71%, respectively.
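The three evaluation indexes used in these comparisons can be computed from a confusion matrix as sketched below. The matrix values are made up for illustration, and AA is taken here as the mean of the per-class accuracies (correct samples of a class divided by all samples of that class), which is the usual definition but is not spelled out in the text.

```python
def classification_metrics(cm):
    """Compute OA, AA, and the kappa coefficient.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Every class is assumed to have at least one sample.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    correct = sum(cm[i][i] for i in range(k))
    oa = correct / n                                   # overall accuracy
    aa = sum(cm[i][i] / row_totals[i] for i in range(k)) / k  # average accuracy
    pe = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1 - pe)                       # agreement beyond chance
    return oa, aa, kappa

if __name__ == "__main__":
    example = [[40, 10], [20, 30]]  # hypothetical two-class confusion matrix
    print(classification_metrics(example))
```

OA is dominated by the largest classes, whereas AA weights every class equally, so reporting both makes it easier to see when a gain comes mostly from forestland or sugarcane.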

Comparison of Multiple Methods
To evaluate the performance of deep network methods with multifeature fusion perception in remote sensing classification, the improved algorithm proposed in this paper was tested and compared with U-Net [26], SegNet [25], Attention-UNet [38], SAR-UNet [40], Res-UNet [41], Deeplabv3 + ResNet50, Deeplabv3 + Xception, and Deeplabv3 + MobileNet [31]. Tables 5 and 6 show the classification accuracies of the different methods for the test set and validation set of the remote sensing images of the study area. The algorithm proposed in this paper had the highest values of OA, AA, and kappa among the compared algorithms. In the test set, the OA was improved by 1.16% after adding an SVM to U-Net, which improved the classification accuracies of forestland, rice, and water bodies. Compared with the original U-Net, the algorithm proposed in this paper improved in each category, and the improvement was greater for rice, water bodies, construction land, forestland, bare land, and other cultivated land: 14.2%, 1.38%, 1.75%, 1.5%, 1.22%, and 5.93%, respectively. The OA was improved by 2.32%. When compared with U-Net + SVM, the OA of the algorithm proposed in this paper was improved by 1.16%, the classification accuracy of each category was slightly improved, and the improvement was more obvious for sugarcane and other cultivated land: 2.41% and 3.17%, respectively. The overall classification accuracy of Deeplabv3+, SegNet, and SAR-UNet was poor. The classification accuracy of SAR-UNet for sugarcane and construction land was the highest. The classification accuracy of Deeplabv3+ was approximately 92% for forestland and 81% for sugarcane. Res-UNet had a higher classification accuracy for water bodies and forestland than other networks after adding the SVM.
In the validation set, CAM-UNet + SVM still performed outstandingly: its classification accuracy was superior to those of the other networks except for sugarcane and construction land, and its overall classification accuracy and kappa value were the highest.
Tables 7 and 8 show the confusion matrices of the algorithm proposed in this paper, and it can be seen that the percentages of forestland, sugarcane, and rice misclassified as each other were large. The probabilities of the misclassification of forestland and rice as sugarcane in the test set were 3.12% and 1.02%. The probabilities of the misclassification of forestland and sugarcane as rice were 18.24% and 2.19%. The probabilities of the misclassification of sugarcane and rice as forestland were 1.39% and 0.71%. The probabilities of the misclassification of forestland and rice as sugarcane in the validation set were 5.71% and 1.41%. The probabilities of the misclassification of forestland and sugarcane as rice were 16.71% and 3.86%. The probabilities of the misclassification of sugarcane and rice as forestland were 1.08% and 0.88%. Sugarcane, rice, and forestland are confused with one another because sugarcane is the most planted cash crop in the study area, covering most of the planting area, and woodland covers almost half of the study area. This spatial distribution results in the interleaving of the woodland, rice, and sugarcane planting areas. Restricted by the 30 m resolution of the Landsat 8 remote sensing images, the scattered woodlands were not easily distinguished from the rice and sugarcane plantation areas. This also caused the mixing of agricultural land and forestland. The construction land contains cities, villages, and roads. Many rural roads are made of different materials, including stone, dirt, cement, and asphalt. The difference in materials causes some rural roads to be classified as bare land. Villages are surrounded by cropland and woodland, so a single pixel at 30 m resolution may contain construction land, woodland, and cropland, which then cannot be distinguished well.
Figure 4 shows the results of the remote sensing image segmentation of the study area by different methods. As shown in Figure 4, most of the construction land in the study area is concentrated in the central part, and small towns and villages are scattered. The study area is mainly dominated by forests and sugarcane, with few other crops, more concentrated rice cultivation land, and a very low percentage of bare land. In this paper, the higher-resolution interpreted data of the Sentinel-2A satellite covering the study area were used to verify the accuracy of the Landsat 8 classification results, and the classification categories and total accuracy of both were roughly the same at the county scale.

Land Use Change in Laibin
In this study, the 2 November 2010, 14 April 2015, and 2 October 2019 Landsat series images were downloaded from the USGS website (https://earthexplorer.usgs.gov/, accessed on 8 February 2022) to carry out feature classification of the study area, where the 2010 images were Landsat 7 images and the 2015 and 2019 images were Landsat 8 images. To obtain more effective image information, the images were preprocessed with geolocation, radiometric calibration, atmospheric correction, mosaicking, and cropping. Owing to the sensor failure of the Landsat 7 satellite on 31 May 2003, the Landsat 7 images acquired since then suffer from strip loss; after strip repair, six bands were finally obtained. In this study, higher resolution images and field collection data were used for supervised classification to obtain the sample library data of the study area for 2010 and 2015, and the algorithm proposed in this paper was used to classify and evaluate the images of the study area for 2010, 2015, and 2019. Tables 9-11 show the confusion matrices of the remote sensing images of the study area in 2010, 2015, and 2019, respectively, based on the algorithm proposed in this paper; the OAs were 94.02%, 90.41%, and 93.62%, respectively, which meet the needs of the study. The spatial distribution of land use and land use change are shown in Figure 5. In terms of spatial distribution, forestland and sugarcane are the main land use types in the study area, followed by construction land, which is mainly concentrated in the central part, with the rest scattered across the study area. The main rivers run through the whole study area, and the lakes and reservoirs are distributed more evenly. Rice, bare land, and other arable land occupy a small area. In terms of land use change, there is more conversion of sugarcane to forestland and more conversion of other arable land to forestland and sugarcane. The types of land use changes in the last decade are shown in Table 12.
The area of forestland has the largest ratio to the total area of the study area, followed by sugarcane, other land, construction land, bare land, and rice, and water bodies have the smallest share of the total area of the study area. In terms of land use changes, the areas of forestland, rice, and other cultivated land increased by 25.74%, 116.15%, and 255.37%, respectively. By contrast, the areas of sugarcane, construction land, water bodies, and bare land decreased by 20.15%, 30.13%, 16.16%, and 50.17%, respectively.
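The area statistics and change percentages reported here can be derived from two classified maps as sketched below. The label lists are made up for illustration, the function names are hypothetical, and the pixel area assumes the 30 m Landsat resolution, so one pixel covers 900 m², or 0.0009 km².

```python
from collections import Counter

PIXEL_AREA_KM2 = 30 * 30 / 1_000_000  # one 30 m x 30 m pixel in km^2

def class_areas(labels, pixel_area_km2=PIXEL_AREA_KM2):
    """Area per class, from a flat sequence of per-pixel class labels."""
    return {c: n * pixel_area_km2 for c, n in Counter(labels).items()}

def percent_change(area_start, area_end):
    """Relative change in area between two dates, in percent."""
    return (area_end - area_start) / area_start * 100.0

def transition_counts(before, after):
    """Count pixels for every (class before, class after) pair."""
    return Counter(zip(before, after))

if __name__ == "__main__":
    # Hypothetical four-pixel maps for two dates.
    before = ["sugarcane", "sugarcane", "forest", "rice"]
    after = ["forest", "sugarcane", "forest", "forest"]
    a0, a1 = class_areas(before), class_areas(after)
    print(percent_change(a0["forest"], a1["forest"]))
    print(transition_counts(before, after))
```

The transition counts give the conversions between land use types (for example sugarcane to forestland), while the percent change summarizes the net gain or loss of each class.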

Changes in Forest Dynamics
From the classification results of the remote sensing images of the study area for the three periods of 2010, 2015, and 2019, it can be seen that the algorithm proposed in this paper has the highest classification accuracy for forestland. Therefore, the algorithm was used to monitor the dynamic change of forest resources in the past 10 years, and the forest areas for the three periods in the study area were analyzed and compared. The classification results and dynamic change of forests are shown in Figure 6, and the forest change monitoring area statistics are shown in Table 12. The results of the forest change monitoring in the study area show that the forest area was 1580.59608 km² [...] government's policy of returning farmland to forest, people planted trees, which made the overall forest area increase significantly. The forest area decreased overall in 2015-2019, mainly because the eucalyptus trees planted under the policy of returning farmland to forest absorbed a large amount of groundwater, which caused drought in some areas. Therefore, the government introduced a new policy to encourage farmers to plant trees that benefit the ecological environment more than eucalyptus trees, so it is normal for the forest area to decrease in places and for some forestland to be converted into cropland.

Discussion
The traditional U-Net fuses multilayer features while upsampling, and the fused features are then trained. By contrast, the multifeature fusion approach adds the channel attention mechanism to U-Net, fuses the multilayer features into the network model, extracts the optimal three levels of features for SVM classification, and performs the majority voting game on the three classification results. The algorithm adopts a multifeature cascade to reduce the problem of gradient dispersion and introduces channel attention to assign feature weights, which is a substantial improvement over U-Net. Remote sensing image classification faces many difficulties, from image acquisition through to classification, and the sample size strongly influences the results.
As shown in Table 5, the segmentation accuracy of U-Net is higher than that of Deeplabv3+ and SegNet, which are basic semantic segmentation models, so U-Net was chosen as the backbone network. In SAR-UNet, the features obtained after the SE module are combined with the upsampled features, which gives good results for objects with a very distinct single feature, such as sugarcane, cities, and water bodies, and is more beneficial for binary classification problems. In this paper, the feature map obtained by CAM-UNet was only connected to the pooling layer without being combined with the upsampled features, which made the attention domain larger and more beneficial for the multiclass classification of remote sensing images. Combination through skip connections brought no significant improvement in the classification accuracy of the multispectral remote sensing images in the study area compared with U-Net; such a design may be more suitable for super-resolution remote sensing images. The OA improved after the U-Net network was combined with the SVM. Thus, classifying features extracted at different levels and then performing the majority voting game reduced misclassification to a certain extent.
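The fusion of the three classification results described above can be illustrated with a plain per-pixel majority vote. This is a simplified sketch and not the exact majority voting game used in this paper: the three label maps stand in for the SVM outputs at three feature levels, their names and values are hypothetical, and ties are resolved in favor of the lowest class index, which is an assumption made here.

```python
import numpy as np


def majority_vote(label_maps, n_classes):
    """Per-pixel majority vote over several label maps of identical shape.

    Ties are resolved in favor of the lowest class index (an assumption).
    """
    stacked = np.stack(label_maps, axis=0)  # shape: (n_voters, H, W)
    # Count the votes each class receives at every pixel: (n_classes, H, W).
    votes = np.stack([(stacked == c).sum(axis=0) for c in range(n_classes)], axis=0)
    return votes.argmax(axis=0)


# Hypothetical label maps predicted from three feature levels.
shallow = np.array([[0, 1],
                    [2, 2]])
middle = np.array([[0, 1],
                   [1, 2]])
deep = np.array([[1, 1],
                 [1, 0]])

fused = majority_vote([shallow, middle, deep], n_classes=3)
print(fused)
```

With three voters, a pixel is assigned a class whenever at least two of the three classifiers agree, so an isolated error made at one feature level is outvoted by the other two.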
The classification accuracy for woodland and sugarcane planting areas in this study was relatively satisfactory, but, according to the data sampled in the field, a comparatively large amount of woodland was misclassified as sugarcane planting area. Woodland is mixed into large areas of sugarcane and rice, and sugarcane and rice are planted around large areas of woodland. Further separating the planting areas of woodland, sugarcane, and rice on Landsat 8 remote sensing images is a direction that needs further research.

Conclusions
U-Net suffers from insufficient information utilization and pays insufficient attention to some features. In this study, to improve and optimize U-Net, we combined it with a CAM and replaced the classifier of the original U-Net with an SVM. The CAM-UNet model was used to extract multiple features from the study area, and the SVM, in turn, was used to classify these features. The final classification results were obtained by applying the majority voting game to the classification results of each feature, and the accuracy of the classification results was evaluated and analyzed with the field research data. We used a multifeature cascade to reduce gradient dispersion and added a CAM to each convolutional unit of the U-Net encoder so that the network learns image features adaptively and focuses more on important features. The results showed that the improved deep network algorithm with multifeature fusion perception achieves better classification results on images of the study area than U-Net, SegNet, Deeplabv3+, Attention-UNet, SAR-UNet, and Res-UNet. Adding the channel attention mechanism to the U-Net encoder can effectively improve network performance, and using the classification results of the SVM for the majority voting game can reduce misclassification and improve classification accuracy, especially in forestland monitoring. This improved deep network algorithm based on multifeature fusion perception can better identify feature information and can effectively improve the classification accuracy of remote sensing images. It can provide a new technical reference for remote sensing image classification.

Conflicts of Interest:
The authors declare no conflict of interest.