Method of Building Detection in Optical Remote Sensing Images Based on SegFormer

An appropriate detection network is required to extract building information from remote sensing images and to relieve the poor detection effects resulting from the deficiency of detailed features. Firstly, based on the SegFormer network, we embed in the decoder a transposed convolution sampling module that fuses multiple normalization activation layers. This step alleviates missing feature semantics by adding holes and fillings, while the cascaded normalization and activation layers hold back over-fitted regularization expression and guarantee steady classification of feature parameters. Secondly, the atrous spatial pyramid pooling decoding module is fused to explore multi-scale contextual information and to overcome issues such as the loss of detailed information on local buildings and the lack of long-distance information. Ablation and comparison experiments are performed on the AISD, MBD, and WHU remote sensing image datasets. The robustness and validity of the improved mechanism are demonstrated by the control groups of the ablation experiments. In comparative experiments with the HRNet, PSPNet, U-Net, and DeepLabv3+ networks and the original detection algorithm, the mIoU on the AISD, MBD, and WHU datasets is enhanced by 17.68%, 30.44%, and 15.26%, respectively. The experimental results show that the method of this paper is superior to comparative methods such as U-Net; furthermore, it detects building edges more completely and reduces the number of missed and false detections.


Introduction
A building is a key artificial ground object with a roof and walls and is the central place of human habitation. Building information is essential for various urban and regional activities such as city planning [1], disaster monitoring [2], traffic management [3], and scientific planning of the eco-environment [4]. Accurate extraction of building information from remote sensing images is difficult due to the complexity of building features and of surrounding ground objects such as other buildings, vegetation, roads, and bare areas. Building features can be diverse due to differences in the material and configuration of ground objects [5]. A great deal of unnecessary, non-building information can be confounded in the detection network when the spatial distribution of surface features is complex, resulting in poor classification and detection effects. Traditional image processing with manual selection of design features is time-consuming and labor-intensive. Consequently, it is difficult to satisfy current real-time requirements and to apply it on a large scale. At present, there is a large gap between research on intelligent judgment and drawing technology and its practical applications.
In recent years, research on the extraction of building information has been conducted using high-resolution remote sensing image classification methods based on the convolutional neural network (CNN) [6]. Among the Transformer-series networks reviewed in these studies, SegFormer stresses robustness and validity: although image interference remains a challenge, it completes the semantic segmentation task efficiently and with higher precision. Issues of building detection in remote sensing images can therefore be solved better through improvements made based on SegFormer.

Improvement Strategy
The improved network structure presented in this paper is shown in Figure 1. The improvement has two main aspects: (1) The structure of the detection network is optimized, and the sampling of semantic and detailed features of buildings is reinforced to address the poor applicability of existing models for building extraction from remote sensing images. The transposed convolutional network is coupled with the sampling module to correlate semantic feature information, fuse high- and low-layer feature information, and cascade inter-layer receptive fields. This can reduce the impact of spatial heterogeneity and intricate ground-object backgrounds and intensify the establishment of a detection network model for building features. Multiple normalization activation layers are fused [24]. The screening efficiency of building semantic information is enhanced by reducing the quantity of parameters, inhibiting the disappearance of the gradient, and speeding up convergence. To efficiently gain the semantic information features of buildings, effective screening and classification of the intra-class feature information are realized by embedding the activation layer [25]. (2) The decoding output part of the model is optimized, and the atrous spatial pyramid pooling (ASPP) decoding module is used to address issues such as excessive interference, loss of semantic and detailed feature information of buildings, and missing and false detection of building features extracted from remote sensing images. Multi-scale contextual information is explored by applying multi-sampling-rate atrous convolution [26] and multi-receptive-field convolution on the input feature map. A sufficient amount of contextual information is captured by pooling operations at different resolutions, and lucid target boundaries are captured by gradually recovering spatial information.
The semantic fusion module is used in the output part to aggregate rough information in the shallow layer and fine information in the deep layer. The interaction of feature information from different layers is achieved by using the channel shuffle [27]. The shuffle is applied to address issues such as the partial loss of building information and the lack of long-distance information. It also helps to efficaciously relieve the loss of feature information resulting from differences in image resolution due to miscellaneous data sources. Consequently, issues such as the loss of edge details of multi-scale buildings and the missing and false classification of tiny buildings can be addressed.

Sampling Module Based on the Transposed Convolutional Network
The information in remote sensing images is intricate, with many interferences. The higher the degree of distinction between detected ground objects and background information, the easier the identification of stable features with strong anti-interference. Therefore, in this paper, we design an enhancement extraction mechanism for interlayer features. This is based on transposed convolution, from the angle of reinforcing the expression of enhancing features and inhibiting futile background information. The learning is intensified in the training iterations to acquire the optimum sampling parameters by establishing an adaptive learning model.

Transposed Convolution
The size of the feature map gradually lessens as the images undergo the convolutional operation several times. Transposed convolution can efficaciously couple the feature information of the high layer and the low layer, and cascade the receptive fields between layers. This addresses the issue of overflow of non-building information due to inconsistent image resolution or different scales in the feature map.
For a convolution operation represented by the matrix C, the convolution output Y for the n-dimensional vector X can be expressed as CX = Y, and its transposed convolution can be expressed as X = C^T Y. The purpose of transposed convolution is to add holes and fillings to restore more semantic information by rearranging the input and kernel. The specific implementation of transposed convolution is as follows: ignoring the number of channels, suppose that the stride is s and the padding is p, the shape of the input tensor is nh × nw, and the shape of the convolution kernel is kh × kw; the transposed operation then yields nh × nw intermediate results. Each intermediate result is a tensor of (nh + kh − 1) × (nw + kw − 1) (initialized to 0). The intermediate tensor is calculated as follows: each element in the input tensor is multiplied by the convolution kernel, yielding a kh × kw tensor that is added into the corresponding part of the intermediate tensor.
When nh = nw = kh = kw = 2, s = 1, and p = 0, with input parameters {0, 1, 2, 3}, the calculation form is shown in Figure 2.
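The intermediate-tensor computation described above can be sketched in plain NumPy (a minimal illustration of the scheme, not the network code; the 2 × 2 kernel values are assumptions for demonstration):

```python
import numpy as np

def transposed_conv2d(X, K):
    """Transposed convolution via the intermediate-tensor scheme:
    each input element is multiplied by the kernel, and the scaled
    kernel is added into an (nh + kh - 1) x (nw + kw - 1) output
    initialized to zero (stride 1, no padding)."""
    nh, nw = X.shape
    kh, kw = K.shape
    Y = np.zeros((nh + kh - 1, nw + kw - 1))
    for i in range(nh):
        for j in range(nw):
            # place the scaled kernel at the position given by (i, j)
            Y[i:i + kh, j:j + kw] += X[i, j] * K
    return Y

X = np.array([[0.0, 1.0], [2.0, 3.0]])  # the {0, 1, 2, 3} input of Figure 2
K = np.array([[0.0, 1.0], [2.0, 3.0]])  # illustrative 2x2 kernel
print(transposed_conv2d(X, K))
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]
```

Note how the 2 × 2 input is upsampled to a 3 × 3 output, the opposite of what the forward convolution would do.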

Based on the aforesaid theory, the transposed convolutional network designed in this paper is shown in Figure 3: multi-layer perception output channels of 4C, two 3 × 3 2D convolution layers, two 4 × 4 2D transposed convolution layers, and one 1 × 1 2D convolution layer are embedded in the subsampled part following the characteristic map. Through the transposed convolution operation, some information can be efficaciously recovered while the fuzzy boundary is retained, thereby outputting more targeted detailed information and semantic information features.

Multiple Normalization Activation Layers
The convolution operation abstracts the local features of the building images; however, it neglects the interrelationship between pixels. This can lead to inter-class inconsistency, dramatically affecting the reliability and integrity of building edge segmentation. Adding extra normalization and activation layers after the convolution operation can yield a stable selection of feature parameters by holding back over-fitted regularization expression while averting the influence of outliers and extremes. Furthermore, the addition of both layers can inhibit noise interference, strengthen the expressive capacity of the network, and increase the focus on feature channels and spatial positions with strong significance and rich information. As a result, the internal correlation of images or features can be acquired more easily; thereby, the issue of precise extraction of inter-class differences between buildings can be solved.
The normalization layer is set based on the introduction of the transposed convolutional network, and the setting mode is shown in Figure 4.
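The normalization-then-activation cascade can be sketched in plain NumPy (illustrative only; the actual module uses the framework's layers, and ReLU stands in here for whichever activation the network adopts):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each channel over the batch and spatial dimensions,
    then rescale; this stabilizes feature statistics and limits the
    influence of outliers and extremes."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    """Activation layer: suppresses negative responses."""
    return np.maximum(x, 0.0)

# Cascade: convolution output -> normalization -> activation
rng = np.random.default_rng(0)
feat = rng.normal(5.0, 3.0, size=(2, 4, 8, 8))   # (N, C, H, W) feature map
out = relu(batch_norm(feat))
print(out.shape)  # (2, 4, 8, 8)
```

After normalization each channel has approximately zero mean and unit variance regardless of the raw feature scale, which is what keeps the downstream parameter selection stable.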

Spatial Pyramid Pooling Decoding Module Fused Atrous Convolution
The fusion approach of absolute skip connection between semantically dissimilar features is unable to address the semantic gap resulting from the lack of multi-scale features in the design of the network. As a consequence, the extraction capacity of the network is severely restrained for large-scale building edges and tiny building objects. Based on this, this paper designs a spatial pyramid pooling decoding module fused with the atrous convolution. Firstly, the deep-layer network features are dilated according to distinct dilation rates, the contextual information of multi-scale features is captured by the atrous spatial pyramid pooling module, and the global average pooling module is introduced as a supplement, thereby refining the contextual information. Secondly, in light of the semantic-guided fusion module, the corresponding relationship between feature pixels at different layers is established, and the deep-layer semantic information is fused into the shallow image details in a bottom-up way. The input feature maps are dilated and divided into groups according to the dilation coefficient, the features at different scales are extracted by adopting distinct dilation kernels, and the shuffle of extracted features is conducted to achieve information interaction across different layers and to strengthen global semantic information expression.

Atrous Convolution (Dilated Convolution)
In contrast to the traditional convolution and pooling layers, atrous convolution possesses the following advantages: (1) after the replacement with atrous convolution, the receptive field is dilated without training new parameters even as the computed extent increases, and the performance can be continuously optimized during training; (2) feature maps at different resolutions can be attained by setting different atrous rates to alter the size of the receptive field, which decreases the loss of position information in the subsampling process.
In this paper, the conventional convolution is replaced by the atrous convolution, and the receptive field in the pooling process is altered by changing the atrous rate of the atrous convolution. The experiments also show that distinct dilation rates have a certain influence on the performance of the model. The atrous convolution mode with a reasonable dilation rate is shown in Figure 5. The authors of [28] show that: (1) the use of atrous convolutions with identical dilation rates can result in the discontinuity of the convolution kernel, leading to the "Gridding Effect", as shown in Figure 6; (2) the dilation rate combinations cannot contain a common divisor greater than 1, and the dilation rate Ri must satisfy Formula (2) below; (3) it is hypothesized that the dilation rates corresponding to n atrous convolutions with a convolution kernel size of k × k are [r1, . . . , ri, . . . , rn], respectively, and R2 ≤ k should be satisfied, where ri represents the dilation rate of the ith atrous convolution, Ri represents the largest dilation rate up to the ith atrous convolution, and the default is Rn = rn. Based on the above conclusions, this paper designs the atrous convolution combination with the dilation rates of 1, 3, 11, 17.

Ri = max[Ri+1 − 2ri, Ri+1 − 2(Ri+1 − ri), ri] (2)
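A minimal 1-D sketch of atrous convolution (unit stride, no padding; the kernel values are illustrative) shows how the dilation rate widens the receptive field without adding parameters:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution: the kernel taps are spaced
    `rate` samples apart, so a kernel of size k covers an effective
    receptive field of k + (k - 1) * (rate - 1) samples."""
    k = len(w)
    span = (k - 1) * rate + 1            # effective receptive field
    n_out = len(x) - span + 1
    return np.array([sum(w[j] * x[i + j * rate] for j in range(k))
                     for i in range(n_out)])

x = np.arange(10.0)
w = np.array([1.0, 1.0, 1.0])            # the same 3 weights in both cases
print(atrous_conv1d(x, w, rate=1))       # dense conv: receptive field 3
print(atrous_conv1d(x, w, rate=3))       # dilated:    receptive field 7
```

With rate 1 each output sums three adjacent samples; with rate 3 the same three weights span seven samples, so fewer (but longer-range) outputs are produced from the same input.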

Atrous Spatial Pyramid Pooling
The atrous spatial pyramid pooling module consists of a series of atrous convolutions and spatial pyramid pooling structures with different dilation rates. Multi-scale information of the image is extracted through the parallel connection of multiple atrous convolutions with distinct dilation rates, and the global information of the image is gained by introducing global average pooling (GAP). Compared with using a single atrous convolution, the ASPP module can overcome the disadvantages of local information loss and long-distance information deficiency due to the Gridding Effect. The feature information of different scales and clear high-resolution building edge information can be acquired without using a pooling layer.
Firstly, the ASPP module dilates the input feature F into i groups, which are denoted as ASPPi (i = 1, 2, 3, 4) (Figure 7). Input calculations of different groups can alter the dense connection mode between the original feature channels and can gain the gridding parameters of different corresponding positions. Secondly, the multi-scale object context information of each grouping feature ASPPi is attained by pooling F, upsampling, and then fusing with the corresponding feature ASPPi'.
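The parallel-branch flow can be sketched as follows (a simplified 1-D stand-in for the module; the shared 3-tap kernel is an assumption, while the dilation rates 1, 3, 11, 17 follow the combination designed above):

```python
import numpy as np

def branch(x, w, rate):
    """One ASPP branch: 'same'-padded 1-D atrous convolution, so every
    branch returns a feature the same length as the input."""
    pad = (len(w) - 1) * rate // 2
    xp = np.pad(x, pad)
    return np.array([sum(w[j] * xp[i + j * rate] for j in range(len(w)))
                     for i in range(len(x))])

def aspp_1d(x, rates=(1, 3, 11, 17)):
    """Parallel atrous branches at distinct dilation rates plus a
    global-average-pooling (GAP) branch, stacked channel-wise."""
    w = np.array([1.0, 1.0, 1.0]) / 3.0            # illustrative kernel
    outs = [branch(x, w, r) for r in rates]
    outs.append(np.full_like(x, x.mean()))         # GAP, broadcast back
    return np.stack(outs)                          # (branches + 1, length)

x = np.sin(np.linspace(0, 4 * np.pi, 64))
feats = aspp_1d(x)
print(feats.shape)   # (5, 64) -- four dilation rates plus GAP
```

Because every branch keeps the input resolution, the stacked output can be fused by a 1 × 1 convolution in the real module without any extra alignment step.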

Semantic-Guided Fusion Module
This paper puts forward the idea of using the Semantic Fusion Module (SFM) (Figure 8). In this module, the detailed information obtained by rough feature extraction in the shallow layer is fused with the fine semantic information in the deep layer gained by the ASPP multi-scale atrous pyramid pooling network operation. Then, the channel shuffle of the fusion information is conducted (Figure 9). The feature channel sequence of the original grouping is disarranged and rearranged by "remodeling-transposition-remodeling." The "semantic flow" information between features with different resolutions is forecasted by autonomous network learning. Rough features are rectified to fine features with higher resolution, and pixels between features with different resolutions are "aligned." The feature layer with abundant multi-scale information is obtained by shuffling information from different channels, reconciling multi-scale information, and intensifying the information exchange between feature channels of different groupings. This effectively transmits semantic information from the deep layer to the shallow layer and enables the efficacious fusion of features with different resolutions.
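The "remodeling-transposition-remodeling" channel shuffle can be sketched as follows (the group count of 2 and the tiny feature map are illustrative):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels by reshaping C into (groups, C // groups),
    transposing the two group axes, and flattening back, so that
    channels from different groups are interleaved."""
    n, c, h, w = x.shape
    return (x.reshape(n, groups, c // groups, h, w)   # remodel
             .transpose(0, 2, 1, 3, 4)                # transpose
             .reshape(n, c, h, w))                    # remodel back

# Tag each channel with its own index to expose the new ordering
x = np.arange(6.0).reshape(1, 6, 1, 1)
print(channel_shuffle(x, groups=2).reshape(-1))
# [0. 3. 1. 4. 2. 5.]
```

Channels 0-2 (group 1) and 3-5 (group 2) end up interleaved, which is exactly the cross-group information exchange the module relies on.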


Experimental Dataset
The effectiveness and generalization of the method presented in this paper are validated by multi-method tests on the Aerial Image Segmentation Dataset (AISD) [29], the Massachusetts Buildings Dataset (MBD) [30], and the Wuhan University (WHU) aerial remote sensing image dataset [31].
(1) The AISD contains four ground object labels: buildings, roads, water systems, and vegetation; only the building labels are used in this analysis. The dataset covers areas of Berlin, Chicago, Paris, Potsdam, and Zurich at a resolution of 0.3 m and includes 1672 images of 3328 × 2816 pixels derived from Google Earth, with pixel-level labels from OpenStreetMap.
(2) The MBD covers the city and suburban areas of Boston and contains 151 images of 1500 × 1500 pixels. Each image covers an area of 2.25 km², at a resolution of 1 m.
(3) The WHU Satellite Dataset (Global Cities) includes 204 images of 512 × 512 pixels covering urban areas around the globe, at a resolution of 0.3 m.
Samples of the different datasets are shown in Figure 10, and the division of the datasets is shown in Table 1:

Experimental Environment and Parameter Setting
The experimental operating system is Windows 11, the CPU is a 12th Gen Intel(R) Core(TM) i9-12900H, the GPU is an NVIDIA GeForce RTX 3080 Ti, and the deep learning framework is PyTorch 1.11.0 + CUDA 11.5. The experiment adopts a transfer learning strategy, with pre-trained weights from the ImageNet-1K dataset used for training. Transfer learning enhances the accuracy of the detection results by exploiting the model's capacity for unsaturated continuous learning. The training process is divided into a freezing stage and an unfreezing stage; the training configuration is shown in Table 2:
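The two-stage schedule can be sketched schematically. The sketch below is framework-agnostic, and the parameter partition (a pre-trained encoder versus newly added decoder modules) is a hypothetical illustration, not the paper's exact configuration.

```python
class Param:
    """Stand-in for a framework tensor with a trainable flag."""
    def __init__(self, name):
        self.name, self.requires_grad = name, True

def set_trainable(params, flag):
    for p in params:
        p.requires_grad = flag

def trainable_names(params):
    return [p.name for p in params if p.requires_grad]

# Hypothetical partition: an encoder carrying ImageNet-1K pre-trained
# weights and a newly added decoder (names invented for illustration).
encoder = [Param(f"encoder.block{i}") for i in range(4)]
decoder = [Param(f"decoder.block{i}") for i in range(2)]

# Freezing stage: only the new decoder is updated, so the transferred
# pre-trained features are preserved early in training.
set_trainable(encoder, False)
print(trainable_names(encoder + decoder))
# ['decoder.block0', 'decoder.block1']

# Unfreezing stage: the whole network is fine-tuned end to end.
set_trainable(encoder, True)
print(len(trainable_names(encoder + decoder)))  # 6
```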

Ablation Experiment
Ablation experiments are performed on the AISD, the MBD, and the WHU dataset with SegFormer as the baseline to validate the improved mechanisms presented in this paper. The experimental parameter settings are consistent with the experimental environment described above, and the results are shown in Table 3.

Ablation Experiment with Different Mechanism Modules
To validate the improved mechanisms, ablation experiments are conducted on the AISD, the MBD, and the WHU dataset, with SegFormer as the baseline. The improvement in model performance is assessed through four evaluation indicators: mean Intersection over Union (mIoU), F1 score, precision, and recall. Here, mIoU is the mean percentage of intersection over union between predicted building pixels and actual building pixels; the F1 score accounts for both precision and recall; precision is the percentage of pixels predicted as building that are actually building; and recall is the percentage of actual building pixels that are correctly predicted. Suppose the predicted building is represented as T, the predicted non-building as F, the actual building as P, and the actual non-building as N, so that TP, FP, TN, and FN follow the standard confusion matrix notation; then Formulas (3)-(6) give the four evaluation indicators. The network training process is visualized, and a comprehensive analysis of the effect of the improved network is conducted.
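Formulas (3)-(6) are not reproduced in this excerpt; with TP, FP, FN, and TN built from the notation above, the four indicators take their standard forms (reconstructed here from the standard definitions):

```latex
\mathrm{mIoU} = \frac{1}{k}\sum_{c=1}^{k}\frac{TP_c}{TP_c + FP_c + FN_c} \qquad (3)

\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (5)

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (6)
```

where k is the number of classes (here, building and background).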
(1) Ablation experiments with different improved mechanisms. The original network model and the network models with the different mechanisms are tested on the three datasets, and the comparison of the evaluation indicators is shown in Table 3. On the AISD, the MBD, and the WHU dataset, the precision evaluation indicators obtained for the improved network are superior to those obtained for the original network (Table 3), suggesting the validity of the improved network for building extraction. The enhancement is greatest on the MBD, where the mIoU, F1, precision, and recall indicators increase by 9.76%, 10.54%, 10.58%, and 10.45%, respectively, whereas the applicability of the original network to this dataset is inferior, with an mIoU of only 66.09%. After the optimization, the improved network shows a stronger capability of feature extraction and fusion, which improves the focus on and recovery of semantic and feature information at different layers and effectively elevates the segmentation precision of the whole model.
(2) Visualization of the training process. The variations of the loss value and segmentation precision of the original and improved SegFormer network models during training on the MBD are shown in Figure 11.
During training, the overall training curves show a gentle tendency, and the fluctuation of the precision curve of the improved network is smaller than that of the original network after adding the transposed convolutional network module (Figure 11a,b). The improved network reaches a higher precision in a shorter time than the original network. As shown by the contrast of the dynamic variation curves of precision and loss, the improved network can dynamically adjust according to contextual information during training and can more efficiently supplement detailed and semantic features, giving it preferable extraction precision, generalization capacity, and universality.

Ablation Experiment of Multiple Normalization Activation Layers
Multiple stacked configurations on the MBD are adopted to analyze the improvement effect of the cascaded normalization activation layers. This clarifies the availability and contribution of the modules added to the established transposed convolutional network, and investigates the effect of cascading multiple normalization activation layers on the expression capacity and precision of the network. The addition method is shown in Figure 4.

The selected evaluation indicators of mIoU, F1, precision, and recall improve with successive superpositions of normalization activation layers (Table 4): mIoU, F1, precision, and recall increase by 3.11%, 1.27%, 0.58%, and 1.81%, respectively. The precision improvement reaches its maximum when four normalization activation layers are embedded in the convolution network. The experiments validate that introducing multiple normalization activation layers enhances the detection capacity of the improved network.

This paper also verifies the effects of different dilation rate combinations on the performance of the ASPP module. The comparative results on the MBD are shown in Table 5 and demonstrate that non-identical consecutive dilation rates are superior: the precision of the combination 1, 3, 6, 9 is 0.33% higher than that of the combination 1, 2, 2, 2, and prime-number combinations perform better with large dilated kernels. With the combination 1, 5, 11, 17, precision is 0.27% higher than with 1, 6, 12, 18, where the second dilation rate does not exceed the maximum size of the convolution kernel. Moreover, with the combination 1, 3, 11, 17, precision is 2.15% higher than with 1, 5, 11, 17. Based on this, the discontinuous atrous convolution combination 1, 3, 11, 17, whose rates have a greatest common divisor of 1 and a large maximum dilation rate, is chosen; during the experiment, its precision is 1.96-3.93% higher than that of the other combinations. A reasonable dilation rate combination reduces the loss of relevant information and advances the capture of contextual information of objects at different scales.
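The preference for rate combinations whose greatest common divisor is 1 can be checked numerically. The sketch below (an illustrative aid under simplified 1-D assumptions, not the paper's code) enumerates which input offsets a stack of dilated 3-tap convolutions can reach and counts the "gridding" holes left inside the receptive field.

```python
from functools import reduce
from math import gcd

def reachable_offsets(dilations, ksize=3):
    """Input offsets (relative to center) reachable by stacking 1-D
    convolutions of size `ksize` with the given dilation rates."""
    offsets = {0}
    for d in dilations:
        taps = [(j - ksize // 2) * d for j in range(ksize)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets

def gridding_holes(dilations):
    """Positions inside the receptive field that no tap ever touches --
    the 'gridding' artifact of poorly chosen rate combinations."""
    reached = reachable_offsets(dilations)
    radius = max(reached)
    return [o for o in range(-radius, radius + 1) if o not in reached]

# Rates sharing a common divisor leave periodic holes:
print(len(gridding_holes([2, 4, 6])))   # 12 -- every odd offset is missed
# Coprime ascending rates cover the receptive field densely:
print(len(gridding_holes([1, 2, 3])))   # 0
# The chosen combination 1, 3, 11, 17 likewise has gcd 1:
print(reduce(gcd, [1, 3, 11, 17]))      # 1
```

Note that a greatest common divisor of 1 is necessary but not by itself sufficient for dense coverage; how fast the rates grow relative to the kernel size also matters.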

Comparison Experiment
The experimental data are obtained from the AISD, the MBD, and the WHU dataset, and the analysis is based on high-precision pixel-level labels, so the scope, distribution, and geometric outline of buildings can be precisely shown on the images. In this experiment, the label information of the original datasets is visualized on the images and compared with the experimental results. The validity of the improved algorithm is demonstrated against several classical mainstream deep learning semantic segmentation algorithms: High-Resolution Net (HRnet), PSPNet, U-Net, DeepLabv3+, and the original SegFormer network are selected for comparison, and the results are presented in Table 6. The overall experimental precision of the improved method is superior to that of the original SegFormer model and the other mainstream methods of recent years (Table 6). On the AISD, mIoU, F1, precision, and recall reach 88.43%, 93.81%, 93.85%, and 93.77%, respectively; on the MBD, the four indicators are 75.85%, 88.24%, 93.07%, and 83.89%; and on the WHU dataset, they reach 81.69%, 91.09%, 92.46%, and 89.76%. Relative to the comparison methods on the three datasets, the maximum improvement of the improved method in mIoU is 17.68%, 30.44%, and 15.26%, and in precision it is 11.17%, 20.02%, and 11.86%.
When the diverse extraction methods are tested on the MBD, the experimental precision of all algorithms is significantly lower than on the other two datasets (Table 6). This is because all images in the MBD are large-scale scene images: as shown in Figure 10, the coverage area of a single image is extremely wide, and the image scale of the MBD differs significantly from that of the other two datasets. Additionally, when extracting buildings in wide-range scenes, the building styles, spectral features, and shadow features of the MBD are more diverse and complicated than those of the other two datasets, so the performance of the segmentation networks on the MBD is not ideal. Nevertheless, the improved network effectively enhances the detection precision on the MBD, with mIoU, F1, precision, and recall increasing by 9.76%, 10.54%, 10.58%, and 10.45%, respectively. The markedly elevated detection precision indicates strengthened applicability to large-scale scene images, better performance on large-scale datasets, and stronger robustness in the face of intricate buildings and variable environments, validating the effectiveness of the improved algorithm.
The model complexity and efficiency of HRnet, PSPNet, U-Net, DeepLabv3+, and the original SegFormer network are compared with the algorithm in this paper, and the results are shown in Table 7. In the experiment, the input size is set to 512 × 512 when calculating GFLOPs, and the training time is the time required for one iteration over the MBD. As can be seen from Table 7, the mIoU of the algorithm in this paper improves greatly while GFLOPs, parameters, and training time increase only slightly, so the algorithm offers high application value at low cost.

Figure 12 shows the experimental prediction results of HRnet, PSPNet, U-Net, DeepLabv3+, the original SegFormer algorithm, and the algorithm presented in this paper. The performance of the algorithms is assessed through the visualization of the detection results: column (a) is the original image, column (h) is the dataset label, columns (b)-(f) are the detection results of the other networks, and column (g) is the detection results of the improved network. According to Figure 12, when testing on identical images, the detection results of the improved network are better than those of the other networks in the following aspects. (1) The integrity of the detection edge is better, the boundary transition is more regular and smoother, and the architectural form of the building is neat. The 1st and 2nd rows in Figure 12 show that the detection of a single independent building by HRnet, PSPNet, and SegFormer is poor, especially in depicting the right-angle sides of the building. U-Net and the improved network can exactly locate the sidelines of the building; nevertheless, in edge detection, the improved network is more intact than U-Net and more reasonable in the transition of edge details.
Relative to the SegFormer network, the improved network better differentiates the background from the building boundary, better revises the regular edges of buildings, and achieves an apparent enhancement in the detection effect. (2) The construction of building feature information is improved, the missed and false detections caused by the inter-class similarity of ground objects are effectively inhibited, the multi-scale generalization capacity is strong, and the detection rate of tiny buildings is enhanced. In the 3rd and 4th rows of Figure 12, when the color of tiny buildings in the original image is similar to that of the surrounding background objects and there is a large scale difference between tiny buildings and the surrounding buildings, missed buildings occur in the PSPNet and DeepLabv3+ experiments. In the experiments with HRnet, U-Net, and SegFormer, some buildings can be located but not validly detected. In contrast, the improved network exactly locates tiny buildings and detects their distribution range and edges. (3) The retention of detailed feature information is abundant and intact, and the capacity to differentiate buildings from their surrounding backgrounds is powerful. According to the 5th and 6th rows in Figure 12, when detecting polygonal buildings, the comparison algorithms exhibit evident edge fitting errors in the transition of edge texture, severe loss of edge details at angles and corners, and erroneous judgment of building distribution at large scales. The improved network restores more intact details than the other networks, and there is little difference in distribution range and outline between its detection results and the label information. (4) A remaining limitation is feature similarity: spectral similarity interferes greatly, and the inhibition of negative features is weak. When backgrounds with similar characteristics are mixed in, building classification performance is mediocre, and the discrimination of detection algorithms still faces great challenges.


Conclusions
To obtain high-precision extraction of buildings from remote sensing images, this paper presents a building detection method based on SegFormer. The method consists of an improved sampling mechanism and a decoding structure, which ameliorate issues such as missed and false detections of buildings, poor edge integrity, and the poor intelligent interpretation of remote sensing images resulting from inter-class similarity and intra-class inconsistency of buildings. The main improvements are the use of a spatial pyramid pooling decoding module fusing atrous convolution and the introduction of a transposed convolutional network to optimize the sampling mode. These improvements allow the algorithm to efficiently screen positive information about buildings while inhibiting and filtering negative information, easing the issues of missing semantic information and detail features. Ablation and comparison experiments are carried out on the AISD, the MBD, and the WHU remote sensing datasets. The experimental results show that the improved algorithm is more accurate in boundary segmentation than the other methods and reduces the false detection of buildings. It can be applied to large-scale scene detection and intelligent judgment and drawing, providing strong support for surveying and mapping production. In future work, we will explore building detection methods that suppress the spectral imaging interference of remote sensing images.