A Novel Squeeze-and-Excitation W-Net for 2D and 3D Building Change Detection with Multi-Source and Multi-Feature Remote Sensing Data

Abstract: Building Change Detection (BCD) is one of the core issues in earth observation and has received extensive attention in recent years. With the rapid development of earth observation technology, the data sources for remote sensing change detection are continuously enriched, which makes it possible to describe the spatial details of ground objects more finely and to characterize ground objects from multiple perspectives and levels. However, due to the different physical mechanisms of multi-source remote sensing data, BCD based on heterogeneous data is a challenge. Previous studies mostly focused on the BCD of homogeneous remote sensing data, while research that uses multi-source remote sensing data and considers multiple features for 2D and 3D BCD is sporadic. In this article, we propose a novel and general squeeze-and-excitation W-Net, which is developed from U-Net and SE-Net. Its unique advantage is that it can not only be used for BCD of homogeneous and heterogeneous remote sensing data respectively, but can also take both homogeneous and heterogeneous remote sensing data as input for 2D or 3D BCD by relying on its bidirectional symmetric end-to-end network architecture. Moreover, from a unique perspective, we use image features that are stable in performance and less affected by radiation differences and temporal changes. We innovatively introduce the squeeze-and-excitation module to explicitly model the interdependence between feature channels so that the responses between feature channels are adaptively recalibrated, improving the information mining ability and detection accuracy of the model. As far as we know, this is the first network architecture that can simultaneously use multi-source and multi-feature remote sensing data for 2D and 3D BCD.
The experimental results on two 2D data sets and two challenging 3D data sets demonstrate that the squeeze-and-excitation W-Net outperforms several traditional and state-of-the-art approaches. Moreover, both visual and quantitative analyses of the experimental results demonstrate the competitive performance of the proposed network. This shows that the proposed network and method are practical, physically justified, and have great potential application value in large-scale 2D and 3D BCD and in qualitative and quantitative research.

Keywords: squeeze-and-excitation W-Net; deep learning; change detection


Introduction
Change detection is a process of qualitatively and quantitatively analyzing and determining changes on the earth's surface across different time dimensions. It is one of the essential technologies in the field of remote sensing applications and has been widely and deeply applied in fields such as land planning, urban change analysis, and disaster monitoring. By accepting input images of arbitrary size and achieving pixel-level labeling, the fully convolutional network (FCN) has achieved better results than traditional CNNs in remote sensing image classification [39,40] and change detection [41-44]. U-Net [45,46], developed from FCN, has been proved to perform better than FCN. It is used in many tasks in the field of remote sensing, such as image classification, change detection, and object extraction (buildings, water bodies, roads, etc.). In many studies, various network variants based on FCN or U-Net have been proposed. These networks achieve their functions according to specific research content and have obtained certain results. For example, the authors in [47] used a region-based FCN (R-FCN), combined with the ResNet-101 network, to try to determine the center pixel of the object before segmenting the image. In related research [48,49], introducing the residual module into U-Net improved model performance and enabled efficient information dissemination. Although these types of networks overcome, to a certain extent, the shortcomings of a single segmentation scale and low information dissemination that exist in FCN or U-Net, they have not considered combining multiple data as input. The various features derived from remote sensing data tend to be stable in nature, little affected by radiation differences, and not easily affected by changes in the time phase of remote sensing images [3]. Using spatial or spectral features to detect changes in the state of objects or regions has therefore become a hot spot for researchers.
In addition, the phenomenon of "the same object with different spectra, and the same spectrum from different objects" appears in large numbers in HR/VHR remote sensing images, making it more difficult to detect small and complex objects such as buildings or roads in cities. Emerging deep learning methods have the potential to extract features of individual buildings in complex scenes. However, the feature extraction approach of deep learning, represented by the convolution operation, only extracts abstract features of the original image through continuously deepening the number of convolution layers and does not consider the use of useful derived features of the ground objects [3,27]. Various kinds of feature information derived from the original image, such as color, texture, and shape, can also be used as input to the network to participate in the process of information mining and abstraction. As far as we know, many existing networks fail to take multiple features as input to participate in task execution.
In this article, based on U-Net, we designed a new type of bilateral end-to-end W-shaped network. It can simultaneously take multi-source/multi-feature homogeneous and heterogeneous data as input and consider the internal relationships of the input data through the squeeze-and-excitation strategy. We named it squeeze-and-excitation W-Net. Although there have been related studies on network transformation based on U-Net [50,51], as far as we know, we are the first to transform U-Net into such a network: it has two-sided input and single output, the independent weights on the two sides can take into account the data on both sides (homogeneous and heterogeneous data), and it can be used for change detection tasks in the field of remote sensing. The main contributions of this article are summarized as follows: (1) The proposed squeeze-and-excitation W-Net is a powerful and universal network structure, which can learn the abstract features contained in homogeneous and heterogeneous data through a structured symmetric system. (2) The form of two-sided input not only satisfies the input of multi-source data but is also suitable for the multiple features derived from multi-source data. We innovatively introduce the squeeze-and-excitation module as a strategy for explicitly modeling the interdependence between channels, which makes the network more directional: it can recalibrate the feature channels, emphasize essential features, and suppress secondary features. Moreover, the squeeze-and-excitation module is embedded between the convolution operations, which can overcome the insufficiency of the convolution operation, which only takes into account the feature information in the local receptive field, and improve the global reception ability of the network. (3) The idea of combining multi-source and multi-feature data as model input integrates information advantages such as spectrum, texture, and structure, which can significantly improve the robustness of the model.
This is especially beneficial for buildings, which present complex spatial patterns, have multi-scale features, and differ greatly between individuals: the model is more targeted at them, and its detection accuracy is significantly higher.
The rest of the article is organized as follows. The second section introduces in detail the construction process of the squeeze-and-excitation W-Net, the production details of the multi-feature input, and the evaluation method of the network. The third section presents the data set information, network settings, experiments, and results. The fourth section is the discussion, and the fifth section summarizes the article.

Methodology
The proposed 2D and 3D change detection method for buildings based on squeeze-and-excitation W-Net includes three parts: (1) Construct the squeeze-and-excitation W-Net to accept the input of homogeneous data, heterogeneous data, and multi-feature combined images. When performing 2D change detection, the left and right sides of the network accept the original image and the feature image, respectively. When performing 3D change detection, the left and right sides of the network accept the original image with its feature image and the height data with its feature image, respectively. (2) Use the color moment, gray level co-occurrence matrix (GLCM), and edge operators to extract the color, texture, and shape features of the image, respectively, and merge the extracted features with the original image as network input. (3) Train the squeeze-and-excitation W-Net, save the model with higher validation accuracy and lower validation loss, and perform change detection in the experimental area. The workflow of the proposed change detection method is shown in Figure 1.
Although the convolutional layer of CNN is a structure suitable for extracting spatial context features and spectral features simultaneously, its receptive field is limited, and its output is a category label corresponding to a fixed-size image; it cannot achieve pixel-level positioning of category labels in visual tasks [45]. However, semantic segmentation can classify each pixel of the image, producing a classification result located at every pixel.
U-Net is an extension of FCN and is currently a widely used semantic segmentation network with good scalability [52]. The excellent characteristics of U-Net make it widely used in remote sensing image classification and change detection, where it has achieved good results [50,52]. However, its number of convolutional layers is small, and it lacks Batch Normalization layers, which causes problems such as low learning efficiency, a learning effect greatly affected by the initial data distribution, and gradient explosion in the backpropagation process [3,47,52]. In addition, U-Net's single-side input and single-side output network structure also limits the comprehensive use of multi-source remote sensing data, restricting it to a single data input and limiting the feature extraction of the data to a small number of convolution operations. Although its skip connection strategy can merge low-dimensional and high-dimensional features, it is challenging to balance the effective extraction of features and the comprehensive use of data in the face of complex data types and diversified data features.
In order to make up for these shortcomings of U-Net, we designed a two-sided input W-shaped network, which contains a contracting path on each side and an expansive path in the middle. The contracting path on each side contains four sets of encoding modules, but each encoding module deepens the number of convolution layers and introduces a Batch Normalization layer. The expansive path contains four sets of decoding modules and also adds Batch Normalization layers. The Batch Normalization layer normalizes the input data of each batch with its mean and variance so that the input of each layer maintains the same distribution, which speeds up model training. In addition, the Batch Normalization layer introduces noise through updating the mean and variance of each batch, thereby increasing the robustness of the model and effectively reducing overfitting. The calculation of the Batch Normalization layer is performed as in Equation (1):

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)]) (1)

where x̂^(k) is the activation value of the k-th neuron after transformation; x^(k) is the activation of the k-th neuron for each batch of training data; E[x^(k)] is the mean of the activations of the k-th neuron over each batch of training data; Var[x^(k)] is the variance of those activations.
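As a minimal NumPy sketch of the per-feature normalization in Equation (1), with a small epsilon added for numerical stability (an assumption; the equation itself omits it):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize each feature of a batch to zero mean and unit variance,
    as in Equation (1): x_hat = (x - E[x]) / sqrt(Var[x] + eps)."""
    mean = x.mean(axis=0)   # E[x^(k)] per feature, over the batch
    var = x.var(axis=0)     # Var[x^(k)] per feature, over the batch
    return (x - mean) / np.sqrt(var + eps)
```

In the actual layer, learnable scale and shift parameters follow this normalization; they are omitted here to keep the sketch aligned with Equation (1).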
In addition, in order to meet the needs of change detection tasks, we use the binary cross-entropy function as the loss function of the W-shaped network. The formula is given in Equation (2):

L = −(1/N) Σ_{n=1}^{N} [y_n log(ŷ_n) + (1 − y_n) log(1 − ŷ_n)] (2)

where N represents the number of predicted values output by the model; y_n is the sample label; ŷ_n is the label predicted by the model for the sample. The optimizer used by the network is Adam. The skip connection strategy of the W-shaped network is extended to both sides. That is, the low-dimensional features at symmetrical positions on both sides are copied to the expansive path, combined with the high-dimensional features, and convolved. This strategy can divide different data sources into two inputs, avoid mutual exclusion between data, better retain the original characteristics of the data, and exploit the most significant advantages of the different data sources. In addition, the network weights of the contracting paths on the two sides are independent of each other. During backpropagation, the loss values obtained from the loss function are transmitted to both sides at the lowest end of the contracting paths, and the network weights on both sides are updated at the same time. In this way, the network can achieve non-linear modeling of multi-source data at the same time.
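A minimal sketch of the binary cross-entropy loss of Equation (2), assuming NumPy and clipping the predictions away from 0 and 1 to avoid taking the log of zero (a standard numerical safeguard, not part of the equation):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Equation (2): L = -(1/N) * sum(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() finite
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```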

Squeeze-and-Excitation W-Net
The W-shaped network improves on U-Net by expanding the paths of data input, deepening the convolution operations, accelerating the training of the model, improving its robustness, and effectively preventing overfitting. However, the convolution operation can only fuse the spatial and channel information within the local receptive field along the data input channels [53]. In addition, when comprehensively considering multi-source data and the multiple features derived from it, it is difficult to model the spatial dependence of the data with an information feature construction method based on the local receptive field. Moreover, repeated convolution operations that do not consider spatial attention are not conducive to the extraction of useful features.
We introduced the attention mechanism [54][55][56] strategy, using global information to explicitly model the dynamic non-linear relationship between channels, which can simplify the learning process and enhance the network representation ability. The main function of the attention mechanism is to assign weights to each channel to enhance important information and suppress secondary information. The main operation can be divided into three parts: squeeze operation F_sq(·), excitation operation F_ex(·, W), and fusion operation F_scale(·, ·). Its operation flow chart is shown in Figure 2.

The squeeze operation uses a global average pooling method to compress features along the spatial dimension, scaling each two-dimensional feature map to a real number that has a global receptive field and can represent global information. When the input is X ∈ R^(H'×W'×C'), the output after the regular convolution operation is U ∈ R^(H×W×C). The squeeze operation is based on U, and the input of size H × W × C is compressed into a 1 × 1 × C feature description.
For a particular feature map, the squeeze calculation is performed as in Equation (3):

z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j) (3)

where u_c is the c-th feature map and i, j represent the pixel positions in the feature map. The squeeze operation only obtains a 1 × 1 global descriptor, which cannot be used as the weight of each feature map. However, the excitation operation, using two fully connected layers and the Sigmoid function, can more comprehensively capture the interdependence between channels. Its calculation is performed as in Equation (4):

s = F_ex(z, W) = σ(W_2 δ(W_1 z)) (4)

where z is the global description; δ is the ReLU activation function; W_1 and W_2 are the weight matrices of the two fully connected layers; σ is the Sigmoid function. The last step is the fusion operation: the channel weight calculated by the excitation operation is fused with the original feature map, as shown in Equation (5):

x̃_c = F_scale(u_c, s_c) = s_c · u_c (5)

where s_c is the c-th channel weight and u_c is the c-th original feature map. We innovatively embed the squeeze-and-excitation module into the left and right contracting paths of the W-shaped network and add a squeeze-and-excitation layer after each convolution to learn the dependency of feature channels and improve the learning ability of the network. This can better deal with the complexity of multi-source and multi-feature data and obtain a network that is more robust and more sensitive to specific features. The structure of squeeze-and-excitation W-Net is shown in Figure 3.
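As an illustrative sketch (not the authors' implementation), the squeeze, excitation, and fusion steps of Equations (3)-(5) can be written in NumPy as follows; the matrices `w1` and `w2` stand in for the weights of the two fully connected layers:

```python
import numpy as np

def squeeze_excitation(u, w1, w2):
    """SE recalibration of a feature map u of shape (H, W, C).
    w1 has shape (C/r, C) and w2 has shape (C, C/r) for reduction ratio r."""
    z = u.mean(axis=(0, 1))                 # Equation (3): global average pool
    h = np.maximum(w1 @ z, 0.0)             # ReLU(W1 z)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))     # Equation (4): sigmoid(W2 ...)
    return u * s                            # Equation (5): channel-wise rescale
```

Each channel of `u` is multiplied by its learned weight `s_c`, so informative channels are emphasized and secondary ones suppressed.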


Multi-Feature Mapping and Information Mining
In this article, in order to provide more detailed, comprehensive, and reliable data for the network, we extract a variety of features from the original image in terms of color, texture, and shape, and combine these features with the original image as the input of the network.

Color Moment for Color Features Extraction
The color feature is the most widely used visual feature of color images, and it is widely used in image retrieval and video retrieval [57]. In addition, it has a small dependence on the size, direction, and angle of the image, stable performance, and strong robustness to image degradation and resolution changes [58]. Since the color information of an image is mostly distributed in its low-order moments, we extract the color features of the image by calculating the first-order moment (mean), second-order moment (variance), and third-order moment (skewness) of the image. The color feature map of the entire image is extracted with a fixed-size sliding window. The calculation equations are shown in Equations (6)-(8):

μ_k = (1/N) Σ_{i,j} p^k_{i,j} (6)

σ_k = [ (1/N) Σ_{i,j} (p^k_{i,j} − μ_k)^2 ]^(1/2) (7)

s_k = [ (1/N) Σ_{i,j} (p^k_{i,j} − μ_k)^3 ]^(1/3) (8)

where p^k_{i,j} is the k-th color component of the (i, j)-th pixel in the image and N is the number of pixels in the image.
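The three color moments of Equations (6)-(8) can be sketched for a single color component as follows (a minimal NumPy illustration; in practice the computation runs inside the fixed-size sliding window):

```python
import numpy as np

def color_moments(channel):
    """First three color moments of one color component:
    mean (6), variance (7), and skewness (8)."""
    mean = channel.mean()                                  # first-order moment
    variance = np.sqrt(((channel - mean) ** 2).mean())     # second-order moment
    skewness = np.cbrt(((channel - mean) ** 3).mean())     # third-order moment
    return mean, variance, skewness
```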

GLCM for Texture Features Extraction
The texture feature is a visual feature that reflects the homogeneity phenomenon in the image, and it can reflect the periodically changing structural organization and arrangement of the surface of the ground object [59]. It can be obtained according to the variation of the gray values of the image pixels within a specific range and is used to better analyze and understand the original image. A local texture feature is represented by the gray distribution of a pixel and its neighborhood, and the global texture feature is the repetition of local features. Therefore, there is a certain gray-scale relationship between two non-adjacent pixels in the image, that is, a gray-scale spatial correlation characteristic. Capturing and quantitatively describing this spatial correlation characteristic helps to analyze the original image at the texture level. GLCM can quantitatively describe the texture characteristics of the image with the gray-scale spatial correlation characteristics in the image [59,60]. GLCM mainly extracts texture through the conditional probability density between gray levels of the image; it is a unique matrix that describes a specific relationship between the gray values of adjacent pixels or of pixels separated by a specific distance. Usually, scalars are used to characterize the GLCM. In this paper, five scalars, namely variance, homogeneity, contrast, dissimilarity, and entropy, were used to describe the image texture characteristics; after normalizing the GLCM to probabilities p(i, j), they are computed as in Equations (9)-(13):

Variance = Σ_i Σ_j (i − μ)^2 p(i, j) (9)

Homogeneity = Σ_i Σ_j p(i, j) / (1 + (i − j)^2) (10)

Contrast = Σ_i Σ_j (i − j)^2 p(i, j) (11)

Dissimilarity = Σ_i Σ_j |i − j| p(i, j) (12)

Entropy = −Σ_i Σ_j p(i, j) log p(i, j) (13)

where μ is the mean of the GLCM; (x, y) is the reference pixel; (x + dx, y + dy) is the shifting pixel; f(x, y) = i means the gray value of the reference pixel is i; f(x + dx, y + dy) = j means the gray value of the shifting pixel is j; δ is the shifting step size; and θ is the shifting angle, with the step fixed at a certain angle.
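The GLCM construction and its scalar descriptors can be sketched as follows; this is a minimal NumPy illustration with a hypothetical offset and gray-level count, not the exact configuration used in the experiments:

```python
import numpy as np

def glcm_features(img, dx=1, dy=0, levels=8):
    """Build a normalized GLCM for offset (dx, dy) on an integer image with
    gray values in [0, levels), then compute the five texture scalars."""
    h, w = img.shape
    p = np.zeros((levels, levels))
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            p[img[y, x], img[y + dy, x + dx]] += 1  # co-occurrence count
    p /= p.sum()                                     # normalize to probabilities
    i, j = np.indices(p.shape)
    mu = (i * p).sum()                               # GLCM mean
    return {
        "variance": (((i - mu) ** 2) * p).sum(),          # Equation (9)
        "homogeneity": (p / (1.0 + (i - j) ** 2)).sum(),  # Equation (10)
        "contrast": (((i - j) ** 2) * p).sum(),           # Equation (11)
        "dissimilarity": (np.abs(i - j) * p).sum(),       # Equation (12)
        "entropy": -(p[p > 0] * np.log(p[p > 0])).sum(),  # Equation (13)
    }
```

In practice, a texture feature map is produced by evaluating these scalars inside a sliding window over the image.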

Edge Detection Operator for Shape Features Extraction
Shape features are important information describing target objects, and they play an important role in the identification and detection of target objects [58]. They can provide clear edge information of objects and retain important structural attributes of the image. When detecting small and complex objects such as buildings, they can make up for the deficiencies of spectral and texture features, which are easily confused and difficult to detect. In this paper, five edge detection operators, Canny, LoG, Prewitt, Roberts, and Sobel, were used to extract the shape features of the objects in the image.
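As one concrete example of the five edge operators, a minimal Sobel gradient-magnitude sketch in NumPy (the other operators follow the same sliding-window convolution pattern, with different kernels or post-processing):

```python
import numpy as np

def sobel_magnitude(img):
    """Sobel gradient magnitude as a simple shape-feature map.
    Border pixels are left at zero for brevity."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T  # vertical-gradient kernel
    h, w = img.shape
    out = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            gx = (patch * kx).sum()       # horizontal gradient
            gy = (patch * ky).sum()       # vertical gradient
            out[y, x] = np.hypot(gx, gy)  # edge strength
    return out
```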

Combine Multiple Features
There are considerable differences between buildings, and it is difficult for conventional remote sensing methods to comprehensively describe their spectral, texture, shape, and other features [61,62]. We extracted the color, texture, and shape features contained in the original data and combined the original data with these derived features to form an "artificial high-dimensional image" containing rich image information. The "artificial high-dimensional image" was used as the input of the squeeze-and-excitation W-Net, that is, deep learning was used to perform abstract learning and deep feature extraction on the original remote sensing data and its features, achieving a detailed, comprehensive, and accurate description of the object to be detected.
The form of feature combination adopts staggered arrangement and grouping. Taking a 2D experiment as an example, the input on the left of the network is a combination of the original image, grayscale image, and color features, and the input on the right is a combination of texture features and shape features. The number of bands (Band-number) corresponding to each is shown in Table 1, and the combination is shown in Figure 4.
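The grouping can be sketched as channel-wise stacking; the band counts below are illustrative stand-ins (the actual Band-numbers are given in Table 1), and random arrays stand in for the real feature maps:

```python
import numpy as np

h, w = 64, 64
rgb = np.random.rand(h, w, 3)      # original image
gray = rgb.mean(axis=2, keepdims=True)
color = np.random.rand(h, w, 3)    # stand-ins for mean/variance/skewness maps
texture = np.random.rand(h, w, 5)  # variance, homogeneity, contrast, dissimilarity, entropy
shape = np.random.rand(h, w, 5)    # Canny, LoG, Prewitt, Roberts, Sobel

# Left input: original image + grayscale + color features.
left_input = np.concatenate([rgb, gray, color], axis=2)   # (64, 64, 7)
# Right input: texture features + shape features.
right_input = np.concatenate([texture, shape], axis=2)    # (64, 64, 10)
```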

Accuracy Assessment
To validate the effectiveness of the proposed squeeze-and-excitation W-Net, this paper evaluates it from two perspectives: (1) calculating the overall accuracy (OA), F1 score, missing alarm (MA), and false alarm (FA) based on reference data, to evaluate the network's ability to detect buildings; and (2) comparing it with several widely used change detection methods, including RCVA, support vector machine (SVM), random forest (RF), deep belief network (DBN), U-Net, SegNet, and DeepLabv3+.

Comparison with Ground Truth Data
The overall accuracy represents the ratio of the number of pixels correctly recognized by the model to the number of all samples, and it represents the ability of the model to classify positive or negative samples correctly. The calculation is as in Equation (14):

OA = (TP + TN) / (TP + TN + FP + FN) (14)

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative pixels, respectively. The F1 score is the harmonic mean of precision and recall, for which Equations (15) and (16) are used:

F1 = 2 × Precision × Recall / (Precision + Recall) (15)

Precision = TP / (TP + FP), Recall = TP / (TP + FN) (16)

The missing alarm is the proportion of positive samples mistakenly classified as negative among all positive samples, and the false alarm is the proportion of negative samples mistakenly classified as positive among all negative samples. They reflect the degree of misjudgment of the samples by the model. For their calculation, Equations (17) and (18) are used:

MA = FN / (TP + FN) (17)

FA = FP / (FP + TN) (18)
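A minimal implementation of the four indices, assuming the standard TP/TN/FP/FN confusion counts with "changed" as the positive class:

```python
def change_metrics(tp, tn, fp, fn):
    """OA, F1, missing alarm, and false alarm from a binary confusion matrix,
    following Equations (14)-(18)."""
    oa = (tp + tn) / (tp + tn + fp + fn)        # Equation (14)
    precision = tp / (tp + fp)                  # Equation (16)
    recall = tp / (tp + fn)                     # Equation (16)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (15)
    ma = fn / (tp + fn)                         # Equation (17): changed pixels missed
    fa = fp / (fp + tn)                         # Equation (18): unchanged pixels flagged
    return oa, f1, ma, fa
```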

Comparison with Other Methods
We adopted seven widely used change detection methods and classified them into traditional methods (RCVA), machine learning methods (SVM and RF), transition methods (DBN) from machine learning to deep learning (hereinafter referred to as transition methods), and deep learning methods (U-Net, SegNet, and DeepLabv3+). We evaluated the performance of squeeze-and-excitation W-Net from these four aspects. Compared with traditional methods, machine learning methods, and transition methods, the evaluation index was mainly used as a reference basis. Compared with the deep learning method, in addition to the evaluation index, the model was evaluated in terms of running time and convergence rate. The description of each method is as follows: (1) RCVA [17] is an effective unsupervised multispectral image change detection method.
Based on the RCVA principle, this paper traverses all the pixels of the two images to obtain the changed area. (2) SVM [63] is a machine learning algorithm based on small-sample statistical theory.
It aims to find the optimal decision hyperplane to separate the sample data when the data points cannot be separated linearly. We used manually selected sample points to extract training feature values and train the SVM classifier. (3) RF [64] is a machine learning algorithm that combines ensemble learning theory and the random subspace method. Since it uses bootstrap resampling to select training samples, we only provided training data and labels for RF. (4) DBN [65,66] developed from biological neural networks and shallow neural networks. It is a probabilistic generative model composed of multi-layer restricted Boltzmann machines (RBMs) and a BP network, and it uses the joint probability distribution to infer the sample data distribution. In this paper, a vector of pixel gray values arranged in a fixed window is used as input to train the multi-layer RBM, and a small number of labels are used to optimize the model. (5) U-Net is a classic semantic segmentation network developed from FCN. We used the original U-Net model for the image segmentation experiments. (6) SegNet [67] is a semantic segmentation network based on deep convolution and a fused encoding-decoding structure. This article used the original network architecture developed from FCN and VGG16. (7) DeepLabv3+ [68] is considered one of the most advanced algorithms for semantic segmentation. It uses an encoding-decoding structure for multi-scale information fusion while retaining dilated convolution and the ASPP layer. Its backbone network uses the Xception model, which improves the robustness and operating rate of semantic segmentation.
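To make the RCVA idea concrete, here is a minimal numpy sketch of its neighborhood-based change magnitude; this is our simplified reading of the principle in [17], not the authors' code. For each pixel, the spectral distance is taken as the minimum over a small window in the other image, which suppresses false alarms caused by small registration errors; thresholding this magnitude then yields the binary change map.

```python
import numpy as np

def rcva_magnitude(img1, img2, win=1):
    """Robust change magnitude between two (H, W, B) images: for each pixel
    of img1, the minimum spectral distance to img2 pixels inside a
    (2*win+1)^2 neighborhood, tolerating small mis-registrations."""
    h, w, _ = img1.shape
    pad = np.pad(img2, ((win, win), (win, win), (0, 0)), mode="edge")
    mag = np.full((h, w), np.inf)
    for dy in range(2 * win + 1):
        for dx in range(2 * win + 1):
            shifted = pad[dy:dy + h, dx:dx + w]          # img2 shifted by (dy, dx)
            d = np.sqrt(((img1 - shifted) ** 2).sum(axis=2))
            mag = np.minimum(mag, d)                     # keep best local match
    return mag
```

A binary change map follows from `mag > threshold`, which is exactly where the threshold-selection subjectivity discussed in Section 4.1 enters.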

Experiments
To validate the effectiveness of the proposed method, we conducted 2D and 3D BCD experiments, each containing two sets of sub-experiments. The original remote sensing data for the 2D experiments were VHR remote sensing images; those for the 3D experiments were VHR remote sensing images and airborne LiDAR point cloud data.

Datasets for 2D Experiments
The data for the first set of sub-experiments came from the building change detection data set of the WHU Building Dataset [69]. The data are aerial images acquired in April 2011 and April 2016, with a resolution of 0.075 m, including red, green, and blue bands. In the area covered by the images, a magnitude 6.3 earthquake occurred in February 2011 that seriously damaged the buildings. When the image of the area was reacquired in 2016, the number of buildings had increased significantly, and the types and shapes of the buildings were rich, so it is a high-quality data set for BCD. In this experiment, an area (red rectangle) of 11,654 × 10,065 pixels containing many building changes was selected from the data set as the experimental area. The image of the experimental area and the reference change map are shown in Figure 5; the reference change map is provided by the data publisher.
The data for the second group of sub-experiments were the Multi-temporal Scene Wu-Han (MtS-WH) Dataset, which includes two large 7200 × 6000 HR remote sensing images obtained by IKONOS sensors, covering Hanyang District, Wuhan, China. The images were obtained in February 2002 and June 2009, respectively, and fused by the GS algorithm, with a resolution of 1 m and four bands (blue, green, red, and near-infrared). Since the MtS-WH data set is mainly used for theoretical research and validation of scene change detection methods, and the original data only provide the category label of each scene, we obtained the reference change map of the building area by comparing the scene categories. The image and label are shown in Figure 6.

Datasets for 3D Experiment
The data of the first set of 3D sub-experiments were the Vaihingen data set provided by ISPRS Commission II, Working Group II/4. The data set was acquired by the German Association of Photogrammetry and Remote Sensing in the Vaihingen area of Stuttgart, Germany. In addition to VHR remote sensing images (near-infrared, red, green) and reference data, DSM and LiDAR data are also provided. The spatial resolution of the VHR remote sensing images and the DSM is 0.09 m. The LiDAR data were acquired by an ALS50 sensor on 21 August 2008.
Since the Vaihingen data set contains only one period of data, we selected a certain number of buildings as an assumed change area in order to verify the effectiveness of our method. In addition, we used the CSF plugin [70] to isolate the ground points in the LiDAR point cloud, generated a DEM from the point cloud data, and resampled it to a spatial resolution of 0.09 m. We took the difference between the DSM and the DEM to obtain the nDSM, which eliminates the influence of terrain and records the height of all ground objects relative to the ground. The Vaihingen data set, simulation data, and reference data are shown in Figure 7.
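The nDSM step above (differencing the DSM and the CSF-derived, resampled DEM) can be sketched as follows; the function name and the negative-height clipping are our illustrative choices, assuming the two rasters are co-registered arrays at the same 0.09 m resolution:

```python
import numpy as np

def normalized_dsm(dsm, dem, clip_negative=True):
    """nDSM = DSM - DEM: height of objects above the bare-earth surface."""
    ndsm = dsm - dem
    if clip_negative:                  # interpolation errors in the DEM can
        ndsm = np.maximum(ndsm, 0.0)   # yield heights slightly below ground
    return ndsm
```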

The data of the second set of 3D sub-experiments were historical Google Earth images and UAV LiDAR point cloud data that we obtained independently. The data cover an area in Changchun City, Jilin Province, China, with an image resolution of 0.13 m and a size of 4332 × 5267 pixels. The two phases of HR images were obtained in May 2009 and May 2019 and include red, green, and blue bands. The UAV LiDAR point cloud data were obtained in May 2019. Since point cloud data corresponding to May 2009 are lacking, we again produced the 2009 point cloud by simulation. The HR remote sensing images, the point cloud and its simulation data, and the reference data are shown in Figure 8.

Network Training and Parameter Selection
We constructed the squeeze-and-excitation W-Net on the TensorFlow framework. The operating environment was an Intel(R) Core(TM) i9-990KF CPU and an NVIDIA GeForce RTX 2080 SUPER GPU (8 GB). In the four sub-experiments, the input image size at the left and right ends of the network was 128 × 128 pixels, and the batch size was 16. To facilitate comparison with other methods and minimize the time expenditure, the number of epochs in each experiment was set to 100, 1000 training images were used, and the reduction ratio in the network was set to 16, as provided in the original article [53].
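For illustration, the channel recalibration performed by each squeeze-and-excitation module with reduction ratio r = 16 can be sketched in plain numpy. Here `w1` and `w2` stand in for the two learned fully connected layers; this is a sketch of the mechanism, not the paper's TensorFlow implementation:

```python
import numpy as np

def se_recalibrate(features, w1, w2):
    """Squeeze-and-excitation on a (H, W, C) feature map: global average
    pooling (squeeze), two fully connected layers with a C // r bottleneck
    (excitation), then channel-wise rescaling of the input.
    Shapes: w1 is (C, C // r), w2 is (C // r, C)."""
    z = features.mean(axis=(0, 1))            # squeeze: (C,)
    s = np.maximum(z @ w1, 0.0)               # FC + ReLU: (C // r,)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))       # FC + sigmoid: (C,)
    return features * s                       # recalibrate each channel
```

With C = 32 channels and r = 16 the bottleneck has only 2 units, which keeps the added parameter cost of each module small.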
For the traditional method, we obtained the binary change map by setting a threshold. For the machine learning methods, we used manually selected sample points to train the classifiers and obtain the detection results. For the transition method, we used the reference change map to select an appropriate number of samples to train the network. For the deep learning methods, we chose 1000 training images, set the number of epochs to 100, used the recommended hyperparameters, and trained the networks to obtain the detection results.

2D Change Detection
The original data in the 2D experiments (experiments 1 and 2) only contained remote sensing images. The input of the first, second, and third categories of methods was the original remote sensing images; the input of the fourth category was the original images and the feature images. The experimental details of each method are shown in Tables 2 and 3, and the detection results are shown in Figures 9 and 10.


3D Change Detection
3D change detection also included two sets of sub-experiments (experiments 3 and 4). Although DSM data were added to the 3D experiments, the image of the first sub-experiment was simulated data, so the comparative experiments with the first, second, and third categories of methods were not performed. The experimental details of each method are shown in Tables 4 and 5, and the detection results are shown in Figures 11 and 12.

According to the formulas in Section 2.3.1, the evaluation indexes corresponding to the four groups of detection results were calculated, as shown in Tables 6 and 7.
Table 6. Accuracy assessment on 2D change detection results.
It can be seen from Tables 6 and 7 that, except for individual cases, the OA and F1 of the detection results obtained by the proposed method were the maximum, and the MA and FA were the minimum. Moreover, except for the generally low detection accuracy of experiment 2, the OA values of the proposed method were all greater than 0.97, and the F1 values were all greater than 0.86. This shows that the squeeze-and-excitation W-Net proposed in this paper, with multi-source and multi-feature data as input, could obtain higher-quality detection results than the other methods. Furthermore, our proposed network not only surpassed the traditional, machine learning, and transition methods but also performed better than the typical semantic segmentation networks among the deep learning methods. It can also be seen from Figures 9-12 that the detection results obtained by our method could accurately and clearly reflect the changed buildings.
This proves that the squeeze-and-excitation W-Net we designed can be successfully applied to the 2D and 3D change detection of buildings.

Discussion
To evaluate the proposed method fully, Section 4.1 provides an intuitive evaluation of the method through comparison with previous studies, Section 4.2 analyzes the performance of the squeeze-and-excitation network, and Section 4.3 discusses the influence of multi-feature input on the model.

Comparison with Previous Studies
The seven methods from previous research used in this article are all representative and fully illustrate the advantages of our method. RCVA eliminates the influence of image registration errors by considering neighborhood information [17]. It thresholds the spectral change intensity of each pixel to obtain the changed area, which limits the detection accuracy to the quality of the threshold selection. However, a linear threshold does not have much physical meaning and is highly subjective. Moreover, the confusion of pixel spectral values in HR remote sensing images is so great that RCVA, which simply takes pixel spectral values as its research object, appears powerless when processing HR remote sensing images. SVM generates the optimal classification hyperplane by solving a convex quadratic programming problem [63] and can perform nonlinear classification tasks. However, training the classifier requires enough training samples; we manually selected a limited number of sample points, and the detection results obtained after training the SVM were not ideal. The indexes in Tables 6 and 7 also show that the hand-selected training sample points in experiment 4 were several times more numerous than in experiments 1 and 2. Compared with experiment 1, the OA and F1 of experiment 4 increased by more than 5% and 3%, respectively, and the FA value was reduced by more than 7%. This shows that the number of training sample points directly affects the detection performance of SVM. The situation is similar for RF: compared with experiment 1, the OA and F1 of experiment 4 increased by more than 18%, and the FA was reduced by more than 23%. However, hyperparameters such as the number of decision trees and the maximum depth of each tree also affect the classification performance of RF [64].
However, with a fixed number of samples, we found by adjusting the hyperparameters that their influence is much smaller than that of the training samples. It can also be observed from Figures 9-12 that, although the detection results of SVM and RF contain a certain degree of error, they reflect the main change areas of the buildings. This shows that machine learning methods have certain advantages over traditional methods. The DBN, belonging to the transition method, achieved higher OA and F1 and relatively lower MA and FA. However, the price of this improvement is the need for a larger number of training samples. We automatically selected 5000 training samples using the reference map, avoiding the time-consuming manual selection of samples. This automatic selection can provide sufficient training samples for DBN, which may explain the better detection effect of the trained model.
On the whole, the detection results of the deep learning methods are of higher quality than those of the other three categories. Panels (e-h) of Figures 9-12 show that the detected building areas were relatively clear and accurate, with almost no large-scale false detection. The indexes in Tables 6 and 7 also show that the detection results of the deep learning methods reached a higher level of accuracy: the values of OA, F1, MA, and FA all improved significantly compared to the other three categories. Among them, the maximum increase of OA was 26.29%, the maximum increase of F1 was 73.42%, the maximum decrease of MA was 60.88%, and the maximum decrease of FA was 18.92%. It is worth mentioning that, in the longitudinal comparison with other methods, the squeeze-and-excitation W-Net achieved the largest increase or decrease in three of the indexes: OA, F1, and FA. This fully demonstrates that the squeeze-and-excitation W-Net we designed has powerful feature extraction, synthesis, and analysis capabilities and can correctly classify buildings and non-buildings. Among the four groups of experiments, U-Net showed relatively good performance, with the best index values appearing in almost every group. In experiment 4, its OA and F1 equaled the maximum values corresponding to the squeeze-and-excitation W-Net, which reflects U-Net's ability to converge quickly and achieve high validation accuracy within the limit of 100 training epochs. In contrast, the detection results of SegNet and DeepLabv3+ were relatively poor. SegNet computes the category probability of each pixel at the end of the network through the Softmax function [67]; ensuring that the model correctly infers pixel categories requires it to be fully trained, which increases the time cost and makes model execution inefficient.
Therefore, within the 100 training epochs, SegNet may not have been well trained, which makes its detection results unsatisfactory. Since its proposal, DeepLabv3+ has been considered one of the most advanced algorithms for semantic segmentation. Its encoder-decoder structure can fuse multi-scale information, and its dilated convolution, ASPP layer, and Xception backbone can improve the robustness and operating rate of semantic segmentation. However, the network did not seem to be very robust when dealing with complex multi-source data: after small-scale training, its detection results were not ideal, and it even produced the highest MA among the four deep learning methods. This also shows that, although an optimized network structure can improve performance, lightweight and fast-converging network models should also be a focus of future research.

Analysis of Network Models
Although the squeeze-and-excitation W-Net obtained good detection results, its convergence rate, operating speed, and feature learning ability have not yet been fully discussed. In Tables 2-5, we recorded the time consumption of the various methods during detection or training. The implementations of the non-deep-learning methods differ, and there is no unified standard for their data usage and calculation, so their time consumption is not comparable. In addition, DeepLabv3+ was implemented neither in the TensorFlow framework nor on the Python platform, so its time consumption is not reported. U-Net, SegNet, and the squeeze-and-excitation W-Net are all constructed in the TensorFlow framework and are similar in data input, network construction, and hyperparameter settings. Therefore, the three networks can be compared in terms of time consumption, convergence rate, and feature learning ability. From Tables 2-5, it can be seen that the training time of the squeeze-and-excitation W-Net in the four sets of experiments is not the largest, which shows that the network we designed obtains higher-quality detection results at little additional time cost.
To analyze the execution performance of the network, we examined the training details of the three networks in the four sets of experiments and visualized the validation loss and validation accuracy during training, as shown in Figure 13. Figure 13a,c,e,g shows the validation loss values of the three models. The curves show that the squeeze-and-excitation W-Net converged faster: its loss value decreased rapidly as training progressed and approached 0. The other two networks performed worse; their loss values failed to drop near 0 within the fixed number of training epochs and in some cases even increased. This shows that the proposed model has a strong ability to extract and mine deep features of the data. The continually declining loss may be attributed to the non-linear modeling effect of the squeeze-and-excitation module. Figure 13b,d,f,h shows the validation accuracy curves. The squeeze-and-excitation W-Net maintained high validation accuracy with a steady upward trend in all four sets of experiments. It is worth mentioning that the reference change map in experiment 2 did not clearly label the objects, and there was slight confusion among them. The training accuracy of the other two models was clearly affected by this: their accuracy curves fluctuated greatly or climbed slowly. The squeeze-and-excitation W-Net still maintained high accuracy with a clear upward trend. We believe this relies on the independent data inputs of the squeeze-and-excitation W-Net at the left and right ends and on its form of training: both ends are down-sampled simultaneously, and the advantages of the low-dimensional features are copied simultaneously at the corresponding layers, so the network is more robust when positive and negative samples are confused.

The Effect of Multi-Feature
The way we propose using multi-source and multi-feature data as the network input has played a non-negligible role in improving the detection accuracy of the deep learning network. In particular, it is difficult to accurately separate objects such as buildings, with their high diversity and complexity, from a highly confusing background using a single feature. To verify the impact of multi-source and multi-feature data on the deep learning network, we conducted comparative experiments on the three networks with multi-feature data and original image data as input. In the multi-feature experiments, we combined the original image and its features in the manner shown in Table 1 and Figure 4 and used this combination to train the networks. In the single-feature experiments, we used only the original image as input. Taking experiments 2 and 4 as examples, we visualized the comparison results, as shown in Figures 14 and 15.
At the same time, we counted the changes in the OA, F1, MA, and FA values. As shown in Figure 16, the histogram gives the index values with a single feature as input, and the range of change represents the increase or decrease of each accuracy index relative to that with multiple features as input. Figures 14 and 15 show that the detection results obtained by the two input methods were clearly different. The results corresponding to the multi-feature input contained fewer misjudged buildings, and the detected building areas were more complete with clearer boundaries. This reflects that using multi-feature data as model input makes the deep learning model more robust for detecting complex objects. In addition, the increase in OA and F1 and the decrease in MA and FA shown in Figure 16 were relatively significant: when the input was converted from original image data to multi-feature data, the OA and F1 values increased except in individual cases, and the MA and FA values decreased.
This further proves, at the data level, that the multi-feature input method trains the model better than the single-feature input method; it contributes more to the improvement of model performance and improves the robustness of the model.
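The multi-feature input construction can be sketched as follows; the channel layout and min-max scaling here are illustrative, not the exact combination specified in Table 1. Derived feature maps (e.g. texture, shape, or nDSM) are scaled and concatenated with the original bands along the channel axis before being fed to the network:

```python
import numpy as np

def stack_multi_feature(image, feature_maps):
    """Concatenate original bands (H, W, B) with derived 2D feature maps
    along the channel axis; each feature is min-max scaled so that no
    single channel dominates the network input."""
    chans = [image.astype(np.float64)]
    for f in feature_maps:
        f = f.astype(np.float64)
        f = (f - f.min()) / (f.max() - f.min() + 1e-12)  # scale to [0, 1]
        chans.append(f[..., None] if f.ndim == 2 else f)
    return np.concatenate(chans, axis=-1)
```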

Conclusions
In this article, we proposed a new bilaterally symmetric end-to-end network architecture called the squeeze-and-excitation W-Net, which can perform 2D and 3D building change detection. The two-sided network input can accommodate the combined use of homogeneous and heterogeneous data. The deepened convolutional layers and the introduced Batch Normalization layers give the network stronger feature extraction ability, a faster training rate, and greater robustness. The W-shaped network structure has skip connections on both sides, which extend the low-dimensional features on both sides into the up-sampling of high-dimensional features and significantly improve the network's image restoration capability and detection accuracy. Furthermore, we carried out thorough feature mining and information extraction on the original data: we obtained the spectral, texture, shape, and other features of the original image and used these features, together with the original image, as input to train the network. Experiments showed that this idea effectively improved the network's detection ability and its ability to extract information from complex features. To make effective use of the multiple features, we embedded the squeeze-and-excitation module after each convolution in the W-Net. The squeeze-and-excitation layer learns the dependencies between feature channels, making the network more sensitive to essential features and better able to process complex multi-source and multi-feature data.
We applied our method to four challenging data sets and selected classic and commonly used methods from four categories, traditional methods, machine learning methods, transition methods, and deep learning methods, for comparative experiments. The qualitative and quantitative analyses of the experimental results showed that, in most cases, our method obtained higher OA and F1 values and lower MA and FA values. While improving the detection accuracy, the squeeze-and-excitation W-Net we designed also has a lower time cost, which shows that the network is highly scalable and can be applied to large-scale change detection tasks. It is worth mentioning that both experiments 3 and 4 used homogeneous and heterogeneous data simultaneously, which challenges the performance of the network; our network can use these two kinds of data together and achieves good detection results with high convergence and execution efficiency. In summary, this paper proposes a change detection method based on a new squeeze-and-excitation W-Net deep learning network. It can effectively perform 2D and 3D building change detection and has strong data mining capability and adaptability. It is a change detection method with strong practical value and promotion significance.