A Semantic Segmentation Method Based on AS-Unet++ for Power Remote Sensing of Images

To plan power transmission lines automatically, a key step is to precisely recognize the feature information in remote sensing images. Considering that this feature information has different depths and that the feature distribution is not uniform, this paper proposes a semantic segmentation method based on a new network, AS-Unet++. First, atrous spatial pyramid pooling (ASPP) and the squeeze-and-excitation (SE) module are added to the traditional Unet, such that the receptive field is expanded and the important features are enhanced; the result is called AS-Unet. Second, an AS-Unet++ structure is built from AS-Unets with different numbers of layers, such that the feature extraction parts of each AS-Unet are stacked together. Compared with Unet, the proposed AS-Unet++ automatically learns features at different depths and determines the depth with optimal performance. Once the optimal number of network layers is determined, the excess layers can be pruned, which greatly reduces the number of trained parameters. The experimental results show that the overall recognition accuracy of AS-Unet++ is significantly improved compared to Unet.


Introduction
Recently, the automatic planning of power transmission lines has attracted many researchers. Based on geographic information, such as houses, roads, forests, and rivers, selection algorithms can plan a reasonable transmission line. Hence, accurate geographic information is essential for the automatic construction of transmission lines [1,2]. With the progress of satellite technology, the resolution of images acquired by remote sensing satellites is constantly improving. The different types of feature information in remote sensing images can satisfy the requirements for the planning of transmission lines [3]. In the literature, the methods for feature recognition from remote sensing images can be divided into two kinds. One is based on classical image threshold segmentation and manual map marking. The other, image semantic segmentation, is based on deep learning.
The traditional image segmentation method is based on the color, shape, texture, and other features of the image, which leads to the image content being divided into different regions according to the edge features [4][5][6]. The classification results can be optimized to some extent by extracting geometric information from the image and combining it into class-by-class pixels [7]. However, the traditional image segmentation method can only split the target and background of the image [8], which ignores other feature information from remote sensing images [9].
With the development of computer technology, using deep learning for image recognition and semantic segmentation has become a research hotspot. Deep learning technologies continue to advance, and existing methods are constantly optimized, gradually improving performance and robustness. For example, pedestrian recognition in videos is disturbed by many factors, such as environmental changes and occlusions. Recent work adopts an adaptive interference elimination framework to model the motion path of each pedestrian, which can effectively solve these interference problems [10]. For forest smoke recognition, several techniques have improved recognition ability and thus the early warning of forest fires: capturing detailed and informative features from the input images and using pixel-level supervision to learn discriminative feature representations [11], introducing label relevance and multi-directional interactions to improve recognition accuracy, and using enhanced deformable convolution to extract more accurate feature representations [12].
With the progress of satellite technology, the image resolution of remote sensing satellites is gradually increasing. In this case, with the development of deep learning and graphics processing units [13,14], the semantic segmentation of remote sensing images has become a hotspot in the study of transmission line planning [15]. Semantic segmentation based on fully convolutional neural networks (FCNN) can accept input images of arbitrary size [16]. By modifying the FCNN, many different semantic segmentation networks have been proposed in the literature, such as SegNet [17,18], Unet [19], and Deeplab [20]. Although the segmentation accuracy of the target is improved to some extent, the processing of detailed features is still unsatisfactory. For example, in the structure of Deeplabv3+, the down-sampling operation leads to the loss of some of the image information [21], which results in low recognition accuracy for small targets in high-resolution semantic segmentation.
Semantic segmentation has been widely applied in power line construction. Using semantic segmentation to identify power equipment in high-voltage transmission line and substation scenarios can realize the automatic inspection of power systems. Detecting power transmission infrastructure from aerial images using deep learning [22] and identifying objects such as buildings, roads, forests, and rivers in remote sensing images using neural networks [23] provide effective information for transmission line planning. However, there is relatively little research on using neural networks to identify and segment object elements for transmission line planning, and the networks used are mostly basic ones, which have significant drawbacks. Improving these networks so that they accurately identify buildings, roads, forests, and rivers is therefore of great significance for transmission line planning.
Unet is an improved neural network based on FCNN [24]. Compared with FCNN, Unet has higher sensitivity to image details, higher processing accuracy, and the ability to realize spatial consistency by considering pixel-to-pixel relationships. However, because Unet down-samples with strided convolution and pooling operations, spatial information and detail about the targets are lost. Hence, it is not ideal for extracting targets that occupy a small share of the pixels in remote sensing images. In addition, the Unet structure is a neural network with a fixed number of layers, which can only extract features at a fixed depth [25]. That is to say, it is difficult for Unet to extract the features of different depths that remote sensing images contain.
To overcome the down-sampling problem of Unet, in this paper, we first add the ASPP (atrous spatial pyramid pooling) and the squeeze-and-excitation (SE) module [26] to Unet such that the receptive field can be expanded and important features can be enhanced. This improved Unet is called AS-Unet in this paper. Second, note that the information features contained in remote sensing images are of different depths, and networks with different numbers of layers perform differently on features of different depths. The choice of the number of layers therefore has a great impact on the performance of image segmentation. To solve this problem, this paper combines AS-Unets with different numbers of layers in accordance with the network structure of Unet++; the result is called AS-Unet++. The advantage of the improved network is that it can automatically determine the depth with optimal performance for features of different depths. Once the optimal number of network layers is fixed, the redundant layers can be pruned, which greatly reduces the number of trained parameters. Finally, by comparing and analyzing the experimental data, the trained AS-Unet++ shows better recognition accuracy in the semantic segmentation of remote sensing images compared to Unet. It will provide reliable remote sensing image data for the automatic planning of transmission lines.

AS-Unet++ for Remote Sensing of Images
In this section, we introduce the main construction process of the proposed AS-Unet++. In practice, there are two main factors that affect image segmentation accuracy. One is that the feature information from remote sensing images has different depths. The other is that the feature distribution is not uniform, and the proportion of different elements is not the same. To this end, our proposed AS-Unet++ is built with the following steps.
Step 1: design of AS-Unet. Add the ASPP and SE modules to Unet such that the receptive field can be expanded and important features can be enhanced.
Step 2: construction of AS-Unet++. The choice of the number of layers has a great impact on the performance of image segmentation. Unet++ is a network structure that combines and connects Unets of 2 to 5 layers; compared to Unet, Unet++ allows the network to automatically learn features of different depths and determine the depth with optimal performance. Following the same idea as the structure of Unet++, AS-Unets of two to five layers are combined such that they share a common left feature extraction part. Each AS-Unet is a Unet that incorporates the ASPP and SE modules, and the combination is the AS-Unet++ studied in this paper. It is shown in Figure 1. Next, we will give a detailed introduction to this new network structure.

Unet
Unet is an improved network based on FCNN [27], with a symmetric U-shaped structure. It mainly consists of two parts, feature extraction and feature enhancement, as shown in Figure 2. The feature extraction part is on the left side and the feature enhancement part is on the right side. The network has five layers.
The number above each layer in Figure 2 is the number of feature channels contained in that layer, and the number to the left of each layer is the size of the layer.
(1) Feature extraction. Different layers are connected to each other with a 2 × 2 max-pooling layer, which is labeled by the green arrow in Figure 2. The size of the image is halved every time it passes through the max-pooling layer. Since padding is not set, some feature information is lost if the size is odd. Therefore, the size of the input image must be set carefully so that the image length and width remain an even number of pixels. As the number of channels in the convolutional layers increases, the number of feature channels in the image increases.
(2) Feature enhancement. The layers are connected by a 2 × 2 inverse max-pooling up-sampling layer, which is labeled by the purple arrow in Figure 2. The size of the image is doubled after each up-sampling. Each layer of the feature enhancement network fuses the features from the left feature extraction part, which is shown by the gray arrow. However, the image on the left side is larger than that on the right side, so some cropping is needed before feature fusion.
Each layer in the two parts conducts a 3 × 3 convolution operation, which is labeled by the brown arrow in Figure 2. It is followed by a ReLU activation function layer. The convolution is performed in valid mode, with a stride of 1 and a 3 × 3 convolution kernel. Since padding is again not set, the image size is reduced by two pixels after each convolution operation. The output of the last layer is classified by a 1 × 1 convolution layer, which is shown by the magenta arrow in Figure 2.
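The size bookkeeping described above can be checked with a short calculation. The following is a minimal sketch, not the authors' implementation. It assumes two unpadded 3 × 3 convolutions per layer (a value taken from the original U-Net design, not stated in this section), 2 × 2 max-pooling between layers, and 2× up-sampling. The function name and the 572-pixel example input are illustrative.

```python
def unet_output_size(size: int, layers: int = 5, convs_per_layer: int = 2) -> int:
    """Trace the side length of a square image through an unpadded Unet.

    Assumption: every layer applies `convs_per_layer` valid 3x3 convolutions.
    """
    # Feature extraction path: convolutions, then 2x2 pooling between layers.
    for level in range(layers):
        size -= 2 * convs_per_layer  # each valid 3x3 convolution removes 2 pixels
        if size <= 0:
            raise ValueError("input image is too small for this depth")
        if level < layers - 1:
            if size % 2:
                raise ValueError(f"odd size {size} before pooling would lose pixels")
            size //= 2
    # Feature enhancement path: 2x up-sampling, then convolutions.
    for _ in range(layers - 1):
        size = 2 * size - 2 * convs_per_layer
    return size


if __name__ == "__main__":
    print(unet_output_size(572))
```

Under these assumptions the output is smaller than the input, which is why the left-side feature maps have to be cropped before fusion, and an input whose size becomes odd before a pooling step is rejected.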
In the original structure of Unet, the down-sampling methods, such as strided convolution and pooling, lead to the loss of target spatial information and detailed features. Hence, the extraction of targets with small pixel proportions in remote sensing images is not ideal. The Unet structure is a neural network with a fixed number of layers (usually five) and can only extract features of a fixed depth. Therefore, it is difficult for it to adapt to the information features of different depths in remote sensing images.

AS-Unet Structure
Remote sensing images contain complex feature information, the feature distribution is not uniform, and the proportion of different elements is not the same. In the original Unet structure, down-sampling methods such as strided convolution and pooling lead to the loss of target spatial information and detail. To solve this problem, the ASPP and SE modules are added to Unet. The resulting network is called AS-Unet; it expands the receptive field such that the loss of information is reduced and the important features are enhanced. The structure of AS-Unet is shown in Figure 3. The SE module is added to the feature fusion part of each layer, where it automatically evaluates the importance of each feature channel. A different weight coefficient is applied to each channel such that important features are strengthened and unimportant features are suppressed. An ASPP module is added after the convolution and activation function in the last layer of the network; it expands the receptive field such that information loss is minimized and the network's ability to capture multi-scale information is improved.

Atrous Spatial Pyramid Pooling (ASPP)
ASPP uses parallel atrous (dilated) convolutional layers with several different sampling rates, where the features extracted at each sampling rate are further processed in separate branches and fused to generate the final result. This expands the receptive field while ensuring that the resolution is not degraded by the Unet down-sampling operation. Meanwhile, it enhances the ability to capture multi-scale contextual information. Figure 4 illustrates the specific structure of ASPP. The r = 6, 12, and 18 in Figure 4 represent convolution kernels with atrous rates of 6, 12, and 18, respectively.
ASPP constructs convolutional kernels with different receptive fields by using different atrous rates and obtains multi-scale contextual information through parallel branches [28]. The information from different scales is integrated by concatenation [29]. The atrous convolution can be written as:
$$O[j] = \sum_{n} I[j + r \cdot n] \, f[n]$$
where $O[j]$ is the output of the convolution operation performed on the pixel with index $j$, $I$ is the input feature map, $r$ is the atrous rate of the convolution kernel, $f$ is the convolution kernel with weights, and $n$ is the convolution kernel position index.
The receptive field (RF) size can be changed by varying the atrous rate $r$ and is calculated as follows:
$$RF = k + (k - 1)(r - 1)$$
where $k$ is the size of the convolution kernel and $r$ is usually chosen to be 6, 12, or 18. Too large a value leads to overly sparse sampling of the input signal, so that the sampled positions are too far apart to be correlated.
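To make the effect of the atrous rate concrete, the following sketch evaluates a one-dimensional atrous convolution and the size of the resulting receptive field. It is an illustration only. It assumes the standard relation k + (k - 1)(r - 1) for a kernel of size k, works in one dimension on plain Python lists, and uses hypothetical function names.

```python
def effective_kernel_size(k: int, r: int) -> int:
    """Receptive field of one atrous convolution with kernel size k and rate r."""
    return k + (k - 1) * (r - 1)


def atrous_conv1d(signal, kernel, rate):
    """Valid-mode 1D atrous convolution: O[j] = sum_n I[j + rate * n] * f[n]."""
    span = (len(kernel) - 1) * rate  # distance between first and last tap
    if len(signal) <= span:
        raise ValueError("signal is shorter than the dilated kernel")
    return [
        sum(signal[j + rate * n] * weight for n, weight in enumerate(kernel))
        for j in range(len(signal) - span)
    ]


if __name__ == "__main__":
    for r in (6, 12, 18):
        print(r, effective_kernel_size(3, r))
    print(atrous_conv1d([1, 2, 3, 4, 5], [1, 1, 1], rate=2))
```

The kernel keeps three weights at every rate; only the spacing between the sampled positions grows, which is how ASPP widens the receptive field without adding parameters or down-sampling.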

Squeeze-and-Excitation (SE)
The feature distribution of remote sensing images is not uniform. For example, when house elements are segmented, the training effect on part of the training set is poor because house elements make up only a small proportion of some training images.
The squeeze-and-excitation module exploits the relationship between the feature maps of different channels such that specific semantic features can be strengthened [30]. The SE module assigns a different weight to each channel, so that channels containing important features are strengthened and channels with unimportant features are weakened [31], which optimizes the training effect. The SE structure is shown in Figure 5. The SE module contains two main operations, squeeze and excitation, denoted $F_{sq}$ and $F_{ex}$ in Figure 5. Here, $f_c$ is the feature map of channel $c$, $H$ is the height of the feature map, $W$ is the width of the feature map, $z$ is the $1 \times 1 \times c$ feature vector, $s$ is the weight, and $\tilde{f}_c$ is the weighted feature channel. The squeeze operation is represented by the following equation:
$$z_c = F_{sq}(f_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j)$$
The squeeze operation is performed by global average pooling over the width and height of each feature map, which yields a scalar that represents the global receptive field [32]. The feature maps with dimensions $W \times H \times c$, which contain the global information, are compressed into the $1 \times 1 \times c$ feature vector $z$, such that the generated channel statistic $z$ contains contextual information. This alleviates the channel dependency problem.
The excitation operation fits the nonlinear relationship between the channels through two fully connected layers and two activation functions. To reduce the computational effort, the first fully connected layer compresses the $c$ channels, and a ReLU function is then used as the activation function. The second fully connected layer restores the number of channels to $c$, and the weights $s$ are obtained by applying the sigmoid activation function. $s$ is calculated as follows:
$$s = F_{ex}(z, \omega) = \sigma(\omega_2 \, \delta(\omega_1 z))$$
where $\omega_1$ and $\omega_2$ represent the parameters of the first and second fully connected layers, $\delta$ represents the ReLU activation function, and $\sigma$ represents the sigmoid activation function.
The original channels are then weighted by the obtained weights $s$, that is, each channel $f_c$ is multiplied by its weight $s_c$. Effective feature channels receive larger weights, and ineffective or unimportant feature channels receive smaller weights.
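The squeeze, excitation, and weighting steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the fully connected layers have no bias terms, the feature maps are nested lists, and the names `se_block`, `w1`, and `w2` are hypothetical.

```python
import math


def se_block(channels, w1, w2):
    """Apply squeeze-and-excitation to a list of C feature maps (each H x W).

    w1 has shape (C_reduced x C) and w2 has shape (C x C_reduced);
    both are assumed weight matrices of bias-free fully connected layers.
    """
    # Squeeze: global average pooling, one scalar per channel.
    z = [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in channels]
    # Excitation: FC -> ReLU -> FC -> sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden)))) for row in w2]
    # Weighting: scale every pixel of channel c by its weight s[c].
    weighted = [
        [[s_c * pixel for pixel in row] for row in fmap]
        for s_c, fmap in zip(s, channels)
    ]
    return weighted, s


if __name__ == "__main__":
    maps = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
    _, weights = se_block(maps, w1=[[1.0, 0.0]], w2=[[1.0], [-1.0]])
    print(weights)
```

In this toy call the second fully connected layer pushes the two channels in opposite directions, so the first channel is amplified relative to the second, which is the strengthening and suppression behavior described above.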

Unet++

Unet with Different Depths for Each Layer in Unet++
The feature information contained in remote sensing images is rich and the distributed categories are not uniform, which means that the information is characterized by different depths. Unet models of different depths will have different performances.
Unet++ is a network structure that combines Unets with different numbers of layers, and Unets with different numbers of layers extract features of different depths. Figure 6 shows the Unet structure with different depths. Forest element segmentation and lake element segmentation are performed on two remote sensing images using Unets of different depths. The segmentation results are shown in Figure 7. As can be seen from Figure 8, the segmentation performance for forests can be improved by increasing the number of network layers. However, for lakes, the Unet with four layers performs better than the Unet with five layers.
In the recognition of forest elements, the shallow Unets miss a large part of the forest, and the 2-layer and 3-layer Unets also misrecognize non-forest elements. As the number of network layers increases, missed recognition becomes less and less frequent. The 5-layer Unet largely eliminates both missed recognition and misrecognition. In the recognition of lake elements, the 2-layer and 3-layer Unets show missed recognition. The 4-layer Unet recognizes the lake elements completely and accurately. However, with its additional layer, the 5-layer Unet misrecognizes non-lake elements whose contours resemble those of a lake. From the above analysis, it can be seen that remote sensing images contain complex and rich information whose feature depth is not uniform. Networks with different numbers of layers perform differently on features of different depths. An increase in the number of layers of the network does not necessarily improve the recognition performance.

Unet++ Structure
The choice of the number of Unet layers has an important impact on the performance of image segmentation. By combining Unets with different numbers of layers, as shown in Figure 9, different levels of features can be captured. However, because of the lack of connections in the middle region of the network structure, the gradient cannot pass through this region. This implies that backpropagation breaks down there and the network cannot be trained. The problem can be solved by adding connections to the middle units. Connecting all the middle units gives the structure of Unet++, which is shown in Figure 10. As Figure 10 shows, Unet++ is a network structure that combines Unets of 2 to 5 layers. Compared to Unet, Unet++ allows the network to automatically learn features at different depths and determine the depth with optimal performance. Once the optimal number of network layers is determined, the excess layers can be pruned, which greatly reduces the number of trained parameters.
In addition, it can be seen from Figure 10 that the Unet feature extraction parts of each layer are superimposed. That is to say, Unet++ allows Unets of different depths to share a left feature extraction unit, which reduces the amount of training compared with training multiple separate Unets.
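The nesting, sharing, and pruning described above can be illustrated by enumerating the units of the network. The sketch below follows the usual Unet++ indexing, which is an assumption here because Figure 10 is not reproduced: unit (i, j) sits at down-sampling level i and position j along the skip pathway, and a network built from Unets of up to `layers` layers contains every unit with i + j < layers. The function names are illustrative.

```python
def unetpp_nodes(layers: int):
    """All units (i, j) of a Unet++ that nests Unets of up to `layers` layers."""
    return [(i, j) for i in range(layers) for j in range(layers - i)]


def node_inputs(i: int, j: int):
    """Units feeding unit (i, j) under the dense skip-connection scheme."""
    if j == 0:
        # Shared feature extraction column: fed by the level above
        # (level 0 is fed by the input image, so it has no unit inputs).
        return [(i - 1, 0)] if i > 0 else []
    same_level = [(i, k) for k in range(j)]  # all earlier units on this level
    return same_level + [(i + 1, j - 1)]     # plus the up-sampled unit below


def shared_encoder(layers: int):
    """The left feature extraction column shared by every nested Unet."""
    return [(i, 0) for i in range(layers)]


if __name__ == "__main__":
    full = unetpp_nodes(5)
    pruned = unetpp_nodes(3)  # keep only the nested 3-layer Unet
    print(len(full), len(pruned), node_inputs(0, 2))
```

Pruning to a shallower depth simply drops the units outside the smaller triangle. The remaining units do not depend on the removed ones at inference time, which is why the excess layers can be cut once the best depth is known.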

AS-Unet++
AS-Unet++ is based on the Unet++ structure and combines AS-Unets of 2 to 5 layers. The AS-Unets share a common left feature extraction part. Each AS-Unet is a Unet that incorporates the ASPP and SE modules; the structure is shown in Figure 1.
Compared to Unet, owing to the ASPP and SE modules, AS-Unet++ enhances the ability of the network to extract important feature information and capture multi-scale contextual information. AS-Unet++ allows the network to automatically learn features at different depths and determine the depth with optimal performance. Once the optimal number of network layers is fixed, the excess layers can be pruned to reduce the number of parameters to be trained. Because the AS-Unet feature extraction parts are stacked together, the AS-Unets of different depths share a common left-side feature extraction unit, which reduces the amount of training compared with training multiple separate AS-Unets.

Data
The image data used in this paper are taken from high-resolution remote sensing images of Fuyang City, Anhui Province, China. The images are cut to 4000 × 4000 pixels, with 160 images in total, and vector semantic segmentation labels are made using QGIS software (Version 3.10, Gary Sherman, Cedar Rapids, USA). Then, the remote sensing images are cropped into 300 × 300 deep learning samples by the sliding window cropping method. The categories of feature information in the remote sensing images are houses, roads, forests, and lakes.
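Sliding window cropping can be sketched as follows. The text gives the image size (4000 × 4000) and the sample size (300 × 300) but not the stride or the treatment of the image border, so both are assumptions in this sketch: the stride defaults to the window size, and a final window is placed flush with the border so that no pixels are dropped. The function names are hypothetical.

```python
def sliding_window_offsets(length: int, window: int, stride: int):
    """Start positions of windows along one axis, covering the full length."""
    if window > length or stride <= 0:
        raise ValueError("window must fit in the image and stride must be positive")
    offsets = list(range(0, length - window + 1, stride))
    if offsets[-1] != length - window:
        # Assumed border handling: add one window flush with the far edge.
        offsets.append(length - window)
    return offsets


def crop_boxes(height: int, width: int, window: int = 300, stride: int = 300):
    """(top, left, bottom, right) boxes for every window position."""
    return [
        (top, left, top + window, left + window)
        for top in sliding_window_offsets(height, window, stride)
        for left in sliding_window_offsets(width, window, stride)
    ]


if __name__ == "__main__":
    boxes = crop_boxes(4000, 4000)
    print(len(boxes), boxes[0], boxes[-1])
```

A smaller stride yields overlapping samples and therefore more training data from the same image; the value actually used would have to be taken from the experiment configuration.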
All the data are divided into two sets, training set and validation set, and the ratio of training data to validation data is 4:1.The training set images are fed into the network for training, and then the validation set is fed into the trained network for prediction to evaluate the results and performance of the training.
The labels of the images are created using QGIS, and different shades of gray are used to denote the different elements in the image. Pixels of the same element type are labeled with a fixed gray value, and the background is solid black. Figure 11 shows the images of houses, roads, forests, and lakes along with their labels. Randomly flipping the training data horizontally, vertically, and diagonally, together with appropriate linear stretching, enhances the generalization ability of the model and expands the data set. In addition, blurring the images and adding noise prevents the model from learning unnecessary noise and inhibits overfitting. Among the training images, 15% were randomly rotated by 90°, 5% were flipped horizontally, 5% were flipped vertically, and 10% had blur and noise added to them. After cropping and data enhancement, a total of 12116 remote sensing images with a resolution of 300 × 300 were obtained as training data. To simulate the variable domain, the validation dataset is linearly stretched by 0.8%, 1%, 1.5%, or 2% at random.
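The geometric part of this augmentation can be sketched as below. For semantic segmentation the same transform has to be applied to an image and its label so that the two stay aligned. The sketch is illustrative only: images are nested lists, the transforms are chosen mutually exclusively with the stated proportions (an assumption, since the text does not say whether they can be combined), and blur, noise, and linear stretching are omitted.

```python
import random


def rotate90(img):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror a 2D grid top to bottom."""
    return [list(row) for row in img[::-1]]


def augment_pair(image, label, rng):
    """Apply one randomly chosen transform to an image and its label together."""
    u = rng.random()
    if u < 0.15:       # 15%: rotation by 90 degrees
        op = rotate90
    elif u < 0.20:     # 5%: horizontal flip
        op = flip_horizontal
    elif u < 0.25:     # 5%: vertical flip
        op = flip_vertical
    else:              # remaining samples are left unchanged here
        return image, label
    return op(image), op(label)


if __name__ == "__main__":
    rng = random.Random(0)
    img = [[1, 2], [3, 4]]
    print(augment_pair(img, img, rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across runs, which makes comparisons between networks trained on the same data fairer.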

Environment and Parameter Configuration
The environment configuration for AS-Unet++ is shown in Table 1. The training parameters used in the experiment are shown in Table 2.

Evaluation Indicator
The evaluation indicators are Precision, Recall, IoU, and MIoU. The evaluation indicators are calculated on the basis of a confusion matrix, as shown in Table 3.

Table 3. Confusion matrix.
                   Predicted Positive   Predicted Negative
Actual Positive    TP                   FN
Actual Negative    FP                   TN
Precision represents the proportion of pixels predicted as a certain category that are predicted correctly. It is calculated as follows:
$$Precision = \frac{TP}{TP + FP}$$
Recall represents the proportion of all pixels of a certain category that are correctly recognized by the network. It is calculated as follows:
$$Recall = \frac{TP}{TP + FN}$$
IoU is the ratio of the intersection to the union of the predicted result and the true value of each class:
$$IoU = \frac{TP}{TP + FP + FN}$$
MIoU is obtained by summing the IoU of each class and dividing by the number of classes $k$:
$$MIoU = \frac{1}{k} \sum_{i=1}^{k} IoU_i$$
In the recognition of house elements, Unet++ misses some houses because of differences in light and color, and its edge segmentation of houses is poor. In A-Unet++, which adds only ASPP, the edge segmentation of houses is improved, but the missed recognition is not. In S-Unet++, which adds only the SE module, the missed recognition is improved, but the edge segmentation is not. AS-Unet++, which adds both modules, improves both missed recognition and edge segmentation, and the improvement is more obvious than that of A-Unet++ and S-Unet++ with a single module. In remote sensing images, house elements account for a relatively small proportion, and without the SE module the network captures the semantic features of houses poorly, which leads to missed recognition. Without ASPP, the down-sampling in the original Unet structure (strided convolution and pooling) loses target spatial information and detail. This has little effect on house information such as illumination and color differences, but it causes the loss of edge feature information, resulting in unsatisfactory edge segmentation.
In the recognition of road elements, Unet++ misses some roads because of their small widths, and its edge segmentation is also poor. A-Unet++ improves the edge segmentation of roads, but the missed recognition remains. S-Unet++ improves the missed recognition, but the edge segmentation is not improved. AS-Unet++ improves both the missed recognition and the edge segmentation. Similar to the house element, the road element occupies a relatively small proportion of remote sensing images, and the lack of the SE module results in poor capture of road information, which leads to missed recognition. The lack of ASPP results in the loss of edge feature information, which leads to unsatisfactory edge segmentation.
In the recognition of forest elements, Unet++ recognizes the surrounding forest pixels poorly because of the interference of the power line pixels in the lower right corner. S-Unet++ recognizes the power line pixels entirely as forest pixels and shows no significant improvement in recognition performance over Unet++. A-Unet++ separates the power line pixels from the forest pixels better than Unet++, and its recognition effect is closer to that of AS-Unet++. The lack of the SE module does not affect the network's ability to capture forest information because forest elements account for a large proportion of the remote sensing image. The power lines crossing the forest, in contrast, occupy only a small proportion of the image, and the lack of ASPP results in the loss of this interference information, which in turn leads to poor anti-interference ability.
In the recognition of lake elements, the recognition performance of Unet++ and S-Unet++ is similar, and the recognition of the edge part is not satisfactory. The recognition performance of A-Unet++ and AS-Unet++ is similar, and both improve the recognition of edges. Like the forest element, the lake element occupies a relatively large proportion of the remote sensing image, so the lack of the SE module does not affect the network's ability to capture lake information. The lack of ASPP, however, leads to the loss of edge feature information, which in turn leads to unsatisfactory edge segmentation.
The Precision, Recall, and IoU of the various networks for house, road, forest, and lake predictions on the test set are shown in Table 4. The MIoU of AS-Unet++ on the test set was 90.2%. Meanwhile, Unet++ had an MIoU of 83.2%, A-Unet++ had 86.6%, and S-Unet++ had 86.2% on the test set. Compared with Unet++, A-Unet++, and S-Unet++, the MIoU of AS-Unet++ was higher by 7.0, 3.6, and 4.0 percentage points, respectively.
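The four indicators reported above can be computed directly from predicted and true label maps. The sketch below is an illustration under stated assumptions rather than the evaluation code used in the experiments: the label maps are flattened into plain lists of class identifiers, a metric with a zero denominator is reported as 0.0, and the function names are hypothetical.

```python
def class_metrics(pred, truth, cls):
    """Precision, Recall, and IoU of one class from flattened label maps."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, truth))
    fp = sum(p == cls and t != cls for p, t in zip(pred, truth))
    fn = sum(p != cls and t == cls for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, iou


def mean_iou(pred, truth, classes):
    """Average of the per-class IoU values."""
    return sum(class_metrics(pred, truth, c)[2] for c in classes) / len(classes)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 2, 2]
    actual = [1, 0, 0, 0, 2, 1]
    for c in (0, 1, 2):
        print(c, class_metrics(predicted, actual, c))
    print(mean_iou(predicted, actual, [0, 1, 2]))
```

Because IoU also penalizes false positives and false negatives in the same ratio, it is never larger than either Precision or Recall for the same class, which is why it is the stricter of the three per-class indicators.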
In the recognition of house elements and road elements, the metrics of S-Unet++ were higher than those of A-Unet++, which shows that the SE module improves the recognition of elements with smaller pixel occupancy more significantly. In the recognition of forest elements and lake elements, the three metrics of A-Unet++ were higher than those of S-Unet++; ASPP performs better on elements with large pixel occupancy because of better edge segmentation and better resistance to interference from elements with small occupancy.
It can be seen that after training, the MIoU of AS-Unet++ on the verification set reached 88.9%. In contrast, the MIoU of Unet and AS-Unet on the verification set was 80.8% and 85.8%, respectively.
The Precision, Recall, and IoU of the verification sets of each network for road elements, forest elements, and lake elements are shown in Table 5. It can be seen from the above data that the AS-Unet network is superior to the Unet network in all indicators, and the AS-Unet++ network, as a further optimization of the AS-Unet network, has improved in all aspects of accuracy compared with the AS-Unet network.
Compared with the AS-Unet and Unet networks, the MIoU of the AS-Unet++ network increased by 3.1% and 8.1%, respectively. In Figure 13, the difference in overall convergence speed among the three networks was small; only in the training for road element recognition did AS-Unet++ converge slightly faster than the other two networks. In addition, AS-Unet++ oscillated less than the other two networks during training, especially in the training for road, forest, and lake element recognition. In the recognition of house elements, the Precision index increased by 2.7% and 6.4%, the Recall index increased by 3.2% and 7.4%, and the IoU index increased by 3.1% and 7.3%, respectively. In the recognition of road elements, the Precision index increased by 2.5% and 9.3%, Recall increased by 2.1% and 9.6%, and IoU increased by 2.2% and 9.3%, respectively. In the recognition of forest elements, the Precision index increased by 6.8% and 14.5%, Recall increased by 6.8% and 13.8%, and IoU increased by 6.7% and 14.0%, respectively. In the recognition of lake elements, the Precision index increased by 4.5% and 6.0%, Recall increased by 4.6% and 5.8%, and IoU increased by 4.3% and 5.7%, respectively. The improvement in forest recognition accuracy was particularly obvious.
Figure 14 shows a comparison of the predicted segmentation images of houses, roads, forests, and lakes achieved by the three networks. As can be seen from Figure 14, although Unet is able to recognize the corresponding elements, there is still some misrecognition and omission. In the recognition of houses, a small number of roof pixels are incompletely recognized because different surfaces of the roof receive different amounts of light. In the recognition of roads, roads with small widths are missed. In the recognition of forests, the interference of the power lines at the lower right leads to missed recognition of the surrounding pixels. In the recognition of lakes, the curved part of the lake edge is partly missed.
Compared with Unet, AS-Unet significantly improves the recognition and segmentation of various elements. In the recognition of houses, the missed recognition seen with Unet is reduced, but a small number of pixels are still missed where the lighting on a house differs greatly, so not all house pixels are recognized. In the identification of roads, the misidentification of banded wasteland that resembles roads is significantly reduced, but roads with small widths are still missed. In the recognition of forests, missed recognition is clearly reduced, but some grassland is misidentified as forest. In lake recognition, edges with complex shapes are segmented correctly, and the performance is clearly improved.
The segmentation effect of AS-Unet++ is improved compared with both Unet and AS-Unet. In the recognition of houses, AS-Unet++ identifies the houses in the figure more accurately. Moreover, unlike Unet, it does not miss pixels because of lighting differences within a single house. In road identification, roads with small widths are no longer missed, and the banded wasteland that resembles roads is not misidentified. In forest identification, AS-Unet++ resolves the missed recognition caused by the power lines at the lower right, so that the identified area is larger. In the recognition of lakes, edges with complex shapes are also segmented correctly.
The Precision, Recall, and IoU of the test sets of each network for road elements, forest elements, and lake elements are shown in Table 6. It can be seen from the above data that AS-Unet++ is superior to Unet and AS-Unet in each index of the test sets. Compared with AS-Unet and Unet, the MIoU of AS-Unet++ increases by 4.7% and 9.7%, respectively. In the identification of house elements, the Precision index increased by 3.3% and 7.0%, the Recall index increased by 3.5% and 7.9%, and the IoU index increased by 3.4% and 7.5%, respectively. In the recognition of road elements, the Precision index increased by 2.0% and 9.0%, the Recall index increased by 2.6% and 9.5%, and the IoU index increased by 2.6% and 9.8%, respectively. In the recognition of forest elements, the Precision index increased by 7.7% and 14.9%, Recall increased by 7.5% and 14.5%, and IoU increased by 7.4% and 14.9%, respectively. In the recognition of lake elements, the Precision index increased by 5.1% and 6.6%, Recall increased by 5.4% and 6.5%, and IoU increased by 5.3% and 6.6%, respectively.
In the recognition based on CDE, there are cases where pixels that belong to the corresponding element are not recognized. For example, in the house element, some houses are not recognized because the color of different houses varies greatly. In the road element, a road with a small width is not recognized. In the forest element, most of the pixels around the lower right side are not recognized because of the interference of power lines. In the lake element, some waters are not identified because of their color difference.
AS-Unet++ overcomes the misidentification of similar elements seen with CE Loss, such as non-house objects that also appear rectangular in the identification of house elements, other elements with banded distributions in the identification of road elements, grasslands similar to forests in the identification of forest elements, and similar non-lake elements in the identification of lake elements. AS-Unet++ also overcomes the missed recognition seen with CDE: houses with large color differences are recognized in house feature recognition, roads with smaller widths are recognized in road feature recognition, the forest pixels affected by the interference of the power line at the lower right are identified in forest feature recognition, and waters of different colors are not omitted in lake feature recognition. Compared with the other two networks, AS-Unet++ has better performance.
The Precision, Recall, and IoU of the test sets of each network for road elements, forest elements, and lake elements are shown in Table 8.

This paper proposes a semantic segmentation method based on a new AS-Unet++. Compared with the traditional Unet, ASPP and the SE module are added in AS-Unet++, which enhances the neural network's ability to extract important feature information and capture multi-scale context information. The AS-Unet feature extraction parts of each layer are stacked together, which reduces the amount of training for multiple AS-Unets. AS-Unet++ reduces the number of training parameters compared with Unet.
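The multi-scale context attributed to ASPP comes from atrous (dilated) convolutions applied at several rates in parallel. The following 1-D Python sketch shows how a larger rate widens the receptive field of the same three-tap kernel without adding weights. The kernel, the rates (1 and 2), and the function names are illustrative assumptions and not the configuration used in AS-Unet++, which operates on 2-D feature maps.

```python
def receptive_field(kernel_size, rate):
    """Number of input samples spanned by one output of an atrous convolution."""
    return (kernel_size - 1) * rate + 1


def atrous_conv1d_same(signal, kernel, rate):
    """1-D atrous convolution (cross-correlation form) with zero padding,
    so the output has the same length as the input. Assumes an odd kernel size."""
    half = (len(kernel) // 2) * rate
    padded = [0] * half + list(signal) + [0] * half
    return [
        sum(w * padded[i + k * rate] for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def aspp_branches_1d(signal, kernel, rates):
    """Run the same kernel at several rates in parallel, as ASPP does per branch.
    A real ASPP block learns separate weights per branch and fuses the outputs."""
    return [atrous_conv1d_same(signal, kernel, r) for r in rates]


signal = [1, 2, 3, 4, 5]   # toy input (hypothetical)
kernel = [1, 1, 1]         # toy summing kernel (hypothetical)

branches = aspp_branches_1d(signal, kernel, rates=[1, 2])
print(branches[0])  # rate 1: [3, 6, 9, 12, 9]
print(branches[1])  # rate 2: [4, 6, 9, 6, 8]
print(receptive_field(3, 1), receptive_field(3, 2))  # 3 5
```

Each branch keeps the input length, so the outputs can be stacked channel-wise, which is how parallel branches with different receptive fields are usually combined.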
Experimental results show that the overall recognition accuracy of AS-Unet++ is significantly improved compared to Unet. In the predicted segmentation images, the addition of ASPP improves the edge segmentation, and the addition of the SE module makes the network perform better in the segmentation of houses and roads, which are small elements in the image. In addition, AS-Unet++ can effectively reduce the occurrence of misidentification and missed identification.
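The channel reweighting performed by an SE module can be sketched in a few lines of Python. This is a minimal illustration of the usual squeeze, excitation, and scale steps, assuming two fully connected layers without biases. The weights `w1` and `w2` are hypothetical fixed values chosen for the example, whereas in a trained network they are learned.

```python
import math


def se_reweight(feature_maps, w1, w2):
    """Apply squeeze-and-excitation to a list of C channels (each an H x W nested list).

    w1: hidden x C weight rows, w2: C x hidden weight rows (biases omitted).
    Returns the per-channel gates and the rescaled feature maps.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]
    # Excitation: fully connected layer + ReLU, then fully connected layer + sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_maps, gates)
    ]
    return gates, scaled


# Two toy 2 x 2 channels (hypothetical data) with means 1.0 and 3.0.
feature_maps = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w1 = [[0.5, 0.5]]    # hypothetical weights, hidden size 1
w2 = [[0.0], [1.0]]  # hypothetical weights

gates, scaled = se_reweight(feature_maps, w1, w2)
print(gates)  # first gate is 0.5, second is about 0.88
```

In this toy case the second channel receives the larger gate, so its responses are attenuated less than those of the first channel; this is the mechanism by which important channels are emphasized.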
Although the method in this paper improves the segmentation accuracy to some extent, generalization remains a major challenge for complex and variable remote sensing images, such as elements under different illumination conditions or with complex shapes. Future work should focus on improving the generalization ability of the model and on further improving the segmentation accuracy.

Figure 7. Predicted results of Unet with different depths. The MIoU obtained from the results of Figure 7 is shown in Figure 8.

Figure 9. Combination of Unet with different depths.

Figure 11. Remote sensing images and labels.

(2) Comparison of AS-Unet++, Unet, and AS-Unet
Comparing AS-Unet++ with Unet and AS-Unet on the training sets and test sets makes the performance gains of the network visible. The graphs of MIoU for the three networks during the training of houses, roads, forests, and lakes are shown in Figure 13.

Table
Parameter setting.

Table 4. Comparison results of different networks for the ablation experiment.

Table 5. Comparison results of different networks on different verification sets.

Table 6. Comparison results of different networks on different test sets.