PGA-SiamNet: Pyramid Feature-Based Attention-Guided Siamese Network for Remote Sensing Orthoimagery Building Change Detection

Abstract: In recent years, building change detection has made remarkable progress through the use of deep learning. The core problems of this technique are the need for additional data (e.g.


Background
Remote sensing imagery has found a wide range of applications because it can capture change information occurring around the world, both in densely populated cities and in hard-to-reach areas. Meanwhile, change detection (CD), a hot topic in the field of remote sensing analysis, has been studied for several decades. Because of its unique characteristics, many CD studies have been dedicated to solving large-scale and complicated problems using remote sensing images, for example, the monitoring of forests and urban sprawl, and earthquake assessment over long periods. Many research institutions have carried out intensive studies on CD, such as the seasonal and annual change monitoring (SATChMo) project [1] of Poland, Earth Watching [2] of the European Space Agency (ESA), and the Onera satellite change detection dataset [3] of the IEEE Geoscience and Remote Sensing Society (IEEE GRSS).
Recently, the high-resolution (HR) or very-high-resolution (VHR) images have received a lot of attention because they can reveal more detailed information about the land surface, thereby increasing the possibility of monitoring small but important objects such as buildings. Driven by this, building change detection (BCD) has attracted substantial attention in applications such as urbanization monitoring, illegal or unauthorized building identification, and disaster evaluation. In addition, automatic building change detection for remote sensing images has become a topical issue because carrying out the task manually is time consuming and tedious [4]. Therefore, there is a crucial need to investigate efficient building change detection algorithms for remote sensing images.
In general, the traditional change detection process consists of three major steps: preprocessing, change detection technique selection, and accuracy assessment. Moreover, these methods typically follow one of two approaches [5]: pixel-based or object-based methods [6][7][8][9][10]. Pixel-based methods are mainly used in large-scale change detection with low- or medium-resolution images (e.g., MODIS), while object-based methods are more popular for HR or VHR images (e.g., QuickBird, GeoEye-1, WorldView-1/2, aerial imagery), because the high-frequency components in HR/VHR images cannot be fully represented by pixel-based methods. These methods have been well developed for various scenarios over the past years; however, the features used by such change detection algorithms are almost always hand crafted and are therefore weak in image representation [11]. The features are also sensitive to the preprocessing steps, such as radiometric correction, geometric correction, and image registration. Besides this, the effects of changes in the appearance of objects caused by different photographic angles can be alleviated by orthorectification, but this also brings new problems affecting accurate detection. In high-resolution orthorectified images, the displacement of buildings (especially high-rise buildings) is mainly caused by rectification with a digital elevation model (DEM), and the poor alignment caused by this displacement can lead to many false positive changes. In addition, as the spatial resolution of satellite images increases, the accuracy of image registration tends to worsen [5,8]. Generally, the overall framework of the change detection technique can be summarized as feature extraction and change decision, as depicted in Figure 1.
Considering the facts described above, it is not sufficient to obtain the building changes with the 2D information delivered by satellite images, in which many irrelevant changes are mixed with the desired changes. Recently, a lot of research has focused on improving the precision of building change detection. For example, when the features are extended into 3D space, i.e., with height information, which is free of illumination variations and perspective distortions [12], building extraction and change detection accuracy is improved [13,14]. Moreover, with the expansion of Lidar systems, some approaches using laser scanning data have been making headway [15]. However, when it comes to large and remote areas, such data is either acquired at low frequency, or hard or even impossible to acquire.
In addition, owing to breakthroughs in dense image matching (DIM) techniques, the availability of image-based 3D information [16] has greatly increased; this field is known as DSM (digital surface model)-assisted building change detection [17]. However, the relatively low quality of the DSMs from satellite data, which is strongly dependent on image matching technology, is still a major obstacle for detection accuracy.
Recently, the excitement around deep convolutional neural networks (CNNs) has been tremendous, with successful applications in many real-world vision tasks, such as object detection [18], image classification [19], semantic segmentation [20], as well as change detection [21]. However, remote sensing applications present some new challenges for deep learning regarding multimodal and multisource data [22]. Thanks to learned feature representations, which are more robust to appearance and shape variations, significant performance improvements have been achieved. Research in the field of building change detection can be divided into two categories: (1) post-classification-based methods; and (2) direct-detection-based methods. The first category involves extracting the buildings from images of the same area acquired at different times, and then obtaining the changed buildings by comparing the extracted building maps. Furthermore, when applying post-classification-based methods, the displacement of the objects becomes less important, and changes are found by verifying whether the two images contain the same object or not [23]. However, a high accuracy of building extraction is required, which is a hard task in itself and can lead to accumulated errors. The second category is based on an end-to-end framework and has been successfully used to identify building changes, avoiding some of the weaknesses of classic methods, especially in dense urban areas. Rodrigo trained an end-to-end Siamese architecture from scratch using a fully convolutional network, and the result surpassed the state-of-the-art methods in change detection, both in accuracy and in inference speed, without any post-processing [24]. As mentioned above, the displacement of buildings in orthorectified images remains a major challenge for most end-to-end building change detection methods.
However, most current methods do not take the building displacement into account and ignore the correlation between the image pairs. Lebedev proposed a specially modified generative adversarial network (GAN) architecture based on pix2pix for automatic change detection in season-varying remote sensing images. In this research, object shift was considered, which is crucial for the buildings in orthorectified images [25].
Despite the wide availability of CNNs, large amounts of corresponding change annotations, which are necessary to train a reliable change detector in a supervised manner, are lacking. Focusing on this issue, many recent studies explore alternatives, such as training a weakly supervised network [26][27][28], applying an unsupervised approach [29][30][31], or even focusing on noisy data [28]. However, studies on building change detection mainly concentrate on either two-stage detection accompanied by building detection or one-stage detection that does not take the displacement of the buildings into account.
Recently, weakly supervised approaches have been proposed in an attempt to reduce the dependence on large annotated datasets, such as training with synthetic data obtained by a given geometric transformation [32][33][34]. However, detecting building changes with these methods has some restrictions: either an accurate position of the change cannot be given, or information on individual buildings is required.

Attention Mechanism
So-called attention imitates the mechanism by which humans observe the world [35].
Recently, it has been demonstrated to be a simple but effective tool for improving the representation ability of CNNs through reweighting of the feature maps; this is achieved using spatial attention and channel attention to up-weight meaningful features and suppress useless ones [36][37][38][39][40][41][42][43].
Channelwise attention. The channels of high-level features may number in the thousands (among which there inevitably exists redundant information), and each of them can be regarded as a class-specific response [37]. To exploit the interdependencies between these maps and improve the discriminability of the abstract features, channelwise attention was created to emphasize the channels that are relatively informative and focus on the meaningful input. By obtaining the channel relationship matrix, the self-attention channel weight of the original feature map can be further calculated, helping to boost feature discriminability [36,41,44,45].
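As a minimal illustration of this squeeze-and-excitation idea, the following numpy sketch reweights the channels of a (C, H, W) feature map; the matrices `w1` and `w2` are hypothetical stand-ins for the two learned fully connected layers, with the channel reduction built into their shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """SE-style channelwise attention on a (C, H, W) feature map.
    w1: (C//r, C) reduction layer; w2: (C, C//r) expansion layer."""
    c = feat.shape[0]
    squeeze = feat.reshape(c, -1).mean(axis=1)            # global average pool -> (C,)
    excite = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))  # FC-ReLU-FC-sigmoid -> (C,)
    return feat * excite[:, None, None]                   # rescale each channel
```

With zero weights every channel gate is sigmoid(0) = 0.5, i.e., all channels are uniformly halved; training shapes these gates so that informative channels are emphasized.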
Spatialwise attention. Given that low-level features usually contain a large amount of detail and the receptive field of the convolution layers in a traditional FCN (fully convolutional network) is limited, a network that only establishes pixel relationships in a local neighborhood can easily produce unsatisfactory results. In addition, studies on how to effectively obtain long-range dependencies without deeply stacking convolution layers are attractive. Instead of considering all positions equally, spatial attention is introduced to model the relationships between positions and highlight meaningful regions. To take full advantage of the information in the features, more and more researchers prefer to combine spatialwise and channelwise attention [36][37][38][45].
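A correspondingly minimal spatialwise gate can be sketched in the CBAM style: pool over channels, then apply a sigmoid mask to every position. The scalar mixing weights `a` and `b` are illustrative stand-ins for the small convolution that normally fuses the pooled maps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, a=1.0, b=1.0):
    """Spatialwise attention on a (C, H, W) feature map: pool over channels,
    build an (H, W) sigmoid mask, and reweight every position."""
    mean_map = feat.mean(axis=0)                # average over channels -> (H, W)
    max_map = feat.max(axis=0)                  # max over channels -> (H, W)
    mask = sigmoid(a * mean_map + b * max_map)  # per-position gate in (0, 1)
    return feat * mask[None, :, :]
```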
Co-Attention mechanism. More recently, in order to understand fine-grained relationships and mine the underlying correlations between different modalities, co-attention mechanisms have been widely studied in vision-and-language tasks, such as visual question answering (VQA) [46][47][48][49]. In the computer vision area, Lu was inspired by the above-mentioned works and built a co-attention module to capture the coherence between video frames, outperforming the alternatives of the time [40].

Semantic Correspondence Mechanism
In general, the issue of change detection can be attributed to finding and matching pairs of images [50][51][52]. The underlying idea is that, by determining whether a similar semantic object exists in the second image, unchanged and changed regions can be distinguished. Therefore, finding point correspondences has long been one of the fundamental problems in the fields of computer vision and photogrammetry. Traditional methods have been quite successful; they usually employed hand-crafted descriptors (e.g., Scale Invariant Feature Transform (SIFT) [53], Histogram of Oriented Gradients (HOG) [54], Oriented FAST and Rotated BRIEF (ORB) [55]) to find key point correspondences by minimizing an empirical matching criterion and then rejecting outliers with a geometric model. Afterwards, several studies began to consider trainable descriptors [56,57] with CNNs. However, these methods heavily depended on predefined features of sparse points, and none of the models could be directly used for semantic alignment [32,58]. Given the recent success of end-to-end CNNs in various tasks, many approaches for semantic matching have been proposed with promising results [56,59]. However, these methods also suffer from the same limitations as many other machine learning tasks: to achieve satisfactory results, large-scale and diverse training data are required, which is labor intensive and time consuming.
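For illustration, the classic key-point matching step can be sketched as nearest-neighbour descriptor matching with Lowe's ratio test (the descriptor arrays and the 0.8 threshold here are illustrative assumptions; a real pipeline would obtain the descriptors from SIFT, HOG, or ORB):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Match rows of desc_a (Na, D) to rows of desc_b (Nb, D), keeping a match
    only when the best distance beats the second best by the given ratio."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distances to all candidates
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:     # Lowe's ratio test
            matches.append((i, int(best)))
    return matches
```

In a full pipeline, the surviving matches would then be filtered with a geometric model (e.g., RANSAC on a homography), as described above.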
In this paper, we propose a novel framework for building change detection, using satellite orthorectified images to model the complex change relationships of the buildings with displacement in the scene. The main contributions of this paper can be summarized as follows: (1) We introduce a co-attention module which can deal with the displacement of buildings in orthoimages to enhance the feature representations and further mine the correlations therein. Meanwhile, we fuse the semantic and context information of the feature using a context fusion strategy; (2) We provide a new satellite dataset for building change detection frameworks covering various sensors, and verify its effectiveness by conducting extensive experiments; (3) We propose an effective Siamese building change detection framework and make some improvements. Moreover, we train our model using two different datasets. The proposed method shows superior performance: it can directly obtain pixel-level predictions without any other post-processing techniques.
The structure of this paper is organized as follows: Section 2 describes the datasets and the proposed method of this paper. The experimental results and accuracy assessments are presented in Section 3. Section 4 presents the discussion. Finally, the conclusion of this paper is summarized in Section 5.

Materials and Methods
In this section, we provide the formulation of our method to detect building changes. Firstly, we introduce the datasets used in our study in Section 2.1. The description of our network, pyramid feature-based attention-guided Siamese network (PGA-SiamNet), is presented in Section 2.2. Finally, in Section 2.3, the implementation of the experiment is described in detail.

Datasets
In this paper, in order to train the proposed network and evaluate its performance, we adopted two different building change detection datasets, namely dataset I (DI) and dataset II (DII). The first dataset (DI) is the Wuhan University (WHU) building change detection dataset [60] which covers Christchurch, New Zealand, and contains two scenes acquired at the same location in 2012 and 2016, with the semantic labels of the buildings and the change detection labels. The dataset is made up of aerial imagery data with a 0.075 m spatial resolution.
We named the DII dataset the earth vision-change detection (EV-CD building) dataset. The dataset was labeled by us and is extremely challenging; it will soon be made publicly available. The dataset is much more complex than the DI dataset and is made up of satellite imagery data instead of aerial imagery data. It consists of data from a variety of sensors with spatial resolutions ranging from 0.2 to 2 m and covers several cities in southern China. In addition, there are many high-rise buildings with large displacements. Figure 2 shows part of the DI and DII datasets, with the building changes labeled by vectorized polygons. Figure 2a is from the DI dataset; Figure 2b,c belong to the DII dataset. The zoomed-in images on the right of Figure 2 show that the buildings in dataset DII are more diverse than those in dataset DI. In addition, dataset DII has more high-rise buildings with unavoidable displacement, which is the focus of this work.
For the two datasets, we divided all the images into tiles of 512 × 512 pixels, with an overlap of 200 pixels in both width and height, and then randomly split each dataset into training, validation, and test sets in a ratio of 7:1:2. Table 1 shows the general information of the two datasets used in our experiments, including the ground sample distance (GSD), source, pixel size of the tile, and the number of images in the training, validation, and test sets.
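The tiling and splitting described above can be sketched as follows; `tile_origins` computes the top-left offsets along one axis for 512-pixel tiles with a 200-pixel overlap, and `split_7_1_2` performs the random 7:1:2 split (the fixed random seed is an illustrative assumption):

```python
import random

def tile_origins(size, tile=512, overlap=200):
    """Top-left offsets of overlapping tiles along one image axis."""
    stride = tile - overlap                       # 312-pixel step for a 200-pixel overlap
    origins = list(range(0, max(size - tile, 0) + 1, stride))
    if origins[-1] + tile < size:                 # keep the last tile flush with the border
        origins.append(size - tile)
    return origins

def split_7_1_2(items, seed=0):
    """Randomly split tiles into training, validation, and test sets (7:1:2)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```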

Problem Description
PGA-SiamNet was constructed as a Siamese network with an encoder-decoder structure. The co-attention module placed at the end of the encoder learns the correlation between the deep features of the input image pair; this enables PGA-SiamNet to find objects that are displaced between the two images, which is vital for building change detection. The pyramid change module helps the network discover object changes of various sizes and so gives better results. Specifically, context information is important for objects in complex scenes; thus, aggregating long-range contextual information is useful to improve the feature representation for building change detection.

Architecture Overview
The proposed building change detection network is a Siamese network, following the encoder-decoder architecture shown in Figure 3. In particular, we employed the well-known VGG16 as the backbone to encode the features of the image pairs to be detected, with the two branches sharing weights. We built a network with the change residual (CR) module for the two input features but without any attention mechanism as the baseline; this can be seen in the blue dashed box, while the yellow box is the change residual (CR) module.
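To make the weight-sharing concrete, here is a toy numpy sketch of a Siamese encoder; a single hypothetical 1×1 "convolution" stands in for the VGG16 backbone:

```python
import numpy as np

def encode(img, weights):
    """Toy one-layer encoder: a 1x1 'convolution' (channel mixing) plus ReLU."""
    return np.maximum(np.einsum('oc,chw->ohw', weights['proj'], img), 0.0)

def siamese_features(img_a, img_b, weights):
    """Both branches apply the SAME weights -- the defining property of a Siamese network."""
    return encode(img_a, weights), encode(img_b, weights)
```

Because the two branches are the same function, identical inputs necessarily produce identical features, so any difference between the branch outputs reflects a difference between the images rather than between the encoders.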

Thereafter, we introduced attention modules to enrich the CR module. For example, to increase the receptive field and extract feature information at different scales, we applied an atrous spatial pyramid pooling (ASPP) module to the deepest-level feature of the encoder. We conducted ablation studies for comparison by modifying our network with the proposed modules, as discussed in Section 3.1. To emphasize the useful information of the deep features with 512 channels, channelwise attention is used for layers 4, 5, and 6. Similarly, the shallow features with rich position information are optimized with spatialwise attention.
Finally, a co-layer aggregation (CLA) module was used to aggregate the low-level and high-level features, thus fusing the semantic and the context information.

Co-Attention Module
The first module of PGA-SiamNet is a co-attention block with an elegant differentiable attention based on a correlation network, which takes deep feature representations from an image pair as inputs and outputs a correlation map. If the image pairs contain common objects and therefore belong to the unchanged category, the features at the locations of the shared objects exhibit similar characteristics. Therefore, inspired by the co-attention mechanism, which discriminates objects in video, the co-attention block was added to the proposed network to identify the changes.
The neighborhood consensus module [61] is used to obtain the correlations of the two given features f_a and f_b, because it has already been applied and achieved superior performance in previous research [59]. COSNet [62] proposed another method which uses an affinity matrix to denote co-attention, mining the correlations by adding a weight matrix, and verified the three proposed matrix styles by experiments. In this paper, the co-attention style is exploited to obtain the correlation map as in COSNet, which is shown in Figure 4 with the blue dashed box. The correlation map, referred to as the affinity matrix S ∈ R^((h×w)×(h×w)) between f_a and f_b, is derived from

S = f_b^T W f_a (1)

W = P^(-1) D P (2)

where f_a ∈ R^(C×(h×w)) and f_b ∈ R^(C×(h×w)) are the features of the input image pair obtained by the encoder of the network, W ∈ R^(C×C) is a weight matrix, P is an invertible matrix, D is a diagonal matrix, and h and w indicate the height and width of the input features, respectively. Then, a softmax function is used to normalize S column-wise and row-wise.

S_c = softmax(S), S_r = softmax(S^T) (3)

where softmax(·) normalizes the correlation map S; S_c ∈ R^(h×w) and S_r ∈ R^(h×w) are the column-wise and row-wise normalizations of S, respectively. S_c represents the relevance of each feature in f_a to the features in f_b; similarly, S_r is the relevance of each feature in f_b to the features in f_a. S_i is the i-th feature of S.
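A numpy sketch of the affinity computation and its two normalizations (the matrix shapes follow the definitions above; the random weight matrix is an illustrative stand-in for the learned W):

```python
import numpy as np

def softmax_cols(x):
    """Numerically stable column-wise softmax of a 2-D array."""
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def coattention(fa, fb, W):
    """Affinity S = fb^T W fa between flattened features fa, fb of shape (C, HW),
    plus its column-wise (S_c) and row-wise (S_r) softmax normalizations."""
    S = fb.T @ W @ fa        # (HW, HW) correlation map
    Sc = softmax_cols(S)     # relevance of features in f_a to features in f_b
    Sr = softmax_cols(S.T)   # relevance of features in f_b to features in f_a
    return S, Sc, Sr
```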
Then the weighted features f_a′ and f_b′ can be computed as follows:

f_a′^i = f_a^i ⊗ S_c^i, f_b′^i = f_b^i ⊗ S_r^i (4)

where f_a′^i and f_b′^i denote the i-th column of f_a′ and f_b′, and the operator ⊗ represents elementwise multiplication. Furthermore, an attention gate follows to weight the information of the paired features. The gate is composed of one convolution layer with a kernel size of 1.
In the end, the features are concatenated and fed to a multi-layer perceptron (MLP) to obtain a new representation of the correlation map. To avoid an excessive number of parameters, the MLP is composed of three convolution layers, with kernel sizes of 1, 3, and 1, respectively. After obtaining the common objects from the two inputs of the correlation map by the MLP, a linear transformation is used to compute the change information. In short, the changed feature f_d is calculated as follows:

f_d = MLP(σ(f)) = W_2(W_1(W_0(σ(f)))) (8)

where σ denotes the sigmoid function, W_0 ∈ R^(1×1×C×C/r), W_1 ∈ R^(3×3×C/r×C/r), and W_2 ∈ R^(1×1×C/r×C/r), r is the reduction ratio of the channel and equals two in this paper, and f_d represents the output of the change residual (CR) module. Note that, before being input to the MLP, the feature f is normalized to [0,1] by the sigmoid function. Figure 4 depicts the computation process of the changed feature with the co-attention module; the module is shown in the blue dashed box. Detailed experiments were conducted to compare the effects of the module during the ablation studies.
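A minimal numpy sketch of this 1-3-1 convolutional MLP with reduction ratio r = 2 (the weight tensors are illustrative stand-ins for the learned convolutions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, w):
    """1x1 convolution: x (C_in, H, W), w (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def conv3x3(x, w):
    """3x3 convolution with zero padding 1: x (C_in, H, W), w (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def changed_feature(f, w0, w1, w2):
    """f_d = W2(W1(W0(sigmoid(f)))): sigmoid-normalize, then 1x1 (C -> C/r),
    3x3 (C/r -> C/r), and 1x1 (C/r -> C/r) convolutions."""
    x = conv1x1(sigmoid(f), w0)
    x = conv3x3(x, w1)
    return conv1x1(x, w2)
```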

Co-Layer Aggregation Module
Recent studies have shown that the high-level layers encode abundant contextual and global information but lose fine spatial information, while the low-level layers are the opposite. Therefore, by adopting layer aggregation to merge the different-level features with various details, a good performance may be obtained [36,63,64]. In this paper, we added a co-layer aggregation (CLA) module to the proposed network to weight the low-level features in order to enhance the change information.
The encoder of our model contains six layers; we chose the first three layers, f_l ∈ {f_1, f_2, f_3}, as the low-level features, and the last layer weighted by co-attention, f_h = f_6, as the high-level feature to perform the operation. As shown in Figure 5, to merge both spatial and channelwise information, the SE block [38] was first applied to both the shallow and deep features. Given the transformed features, we forwarded the transformed high-level feature through a global pooling layer and two convolutions to obtain a global attention vector, which is used to enhance the context representation of the low-level feature.
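A minimal numpy sketch of this gating step (the single matrix `w` is an illustrative stand-in for the two convolutions applied after global pooling, and the residual add at the end of the module is included):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def co_layer_aggregation(f_low, f_high, w):
    """Gate a shallow feature (C_low, H, W) with global context pooled from a
    deep feature (C_high, h, w); w maps (C_high,) -> (C_low,)."""
    g = f_high.reshape(f_high.shape[0], -1).mean(axis=1)  # global average pooling
    att = sigmoid(w @ g)                                  # (C_low,) global attention
    return f_low + f_low * att[:, None, None]             # residual enhancement
```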
Remote Sens. 2020, 12, 484 9 of 21 Furthermore, an attention gate was followed to weight the information of the pair features. The gate is composed of one convolution layer with the kernel size being 1.
In the end, the features are concatenated and fed to a multi-layer perceptron (MLP) to obtain a new representation about the correlation map. To avoid parameter excessive, the MLP is composed of three convolution layers, and the kernel size of each layer is 1, 3, and 1, respectively. After obtaining the common object from the two inputs of the correlation map by MLP, a linear transformation is used to compute the change information. In short, the changed feature ′ is calculated as follows: ( ) = ( ( )) = ( 2 ( 1 ( 0 ( )))) (8) where denotes the sigmoid function, 0 1 × 1 × × ⁄ , 1 3 × 3 × ⁄ × ⁄ , and 2 1 × 1 × ⁄ × ⁄ , r is the reduction ratio of the channel and equals two in the paper, represents the output of the change residual (CR) module. Note that, before input to the MLP, the feature should be normalized to [0,1] by a sigmoid function. Figure 4 depicts the computation process of the changed feature with co-attention module and the module is showed in the blue dashed box. Detailed experiments were conducted to compare the effects of the module during the ablation studies.

Co-Layer Aggregation Module
Recent studies have shown that high-level layers encode abundant contextual and global information but lose fine spatial information, while low-level layers behave in the opposite way. Therefore, by adopting layer aggregation to merge features from different levels with various details, a good performance may be obtained [36,63,64]. In this paper, we added a co-layer aggregation (CLA) module to the proposed network to weight the low-level features in order to enhance the change information.
The encoder of our model contains six layers. We chose the first three layers, f_l = {f_1, f_2, f_3}, as the low-level features, and the last layer weighted by co-attention, f_h = {f_6}, as the high-level feature to perform the operation. As shown in Figure 5, to merge both spatial and channelwise information, the SE block [38] was first applied to both the shallow and the deep features. Given the transformed features, we forwarded the transformed high-level feature to a global pooling layer and two convolutions to obtain a global attention, which can be used to enhance the context representation of the low-level feature. Finally, the original low-level feature was added to the enhanced one as a residual block. In this way, the shallow features are refined by correlation when they merge with the changed feature produced by the co-attention module.
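A minimal sketch of this aggregation, assuming the standard SE formulation for the block from [38] and hypothetical channel counts, could look like the following:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block [38], written with 1x1 convolutions."""

    def __init__(self, channels: int, r: int = 2):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # squeeze: global average pooling
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
            nn.Sigmoid(),                             # excitation: per-channel weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)

class CoLayerAggregation(nn.Module):
    """Sketch of the CLA module: the SE-transformed high-level feature yields a
    global attention (global pooling + two convolutions) that re-weights the
    low-level feature, and the original low-level feature is added back as a
    residual. Channel counts and activations are assumptions."""

    def __init__(self, low_c: int, high_c: int, r: int = 2):
        super().__init__()
        self.se_low = SEBlock(low_c, r)
        self.se_high = SEBlock(high_c, r)
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_c, high_c // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(high_c // r, low_c, 1),
            nn.Sigmoid(),
        )

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        att = self.global_att(self.se_high(f_high))  # (N, low_c, 1, 1) global attention
        return f_low + self.se_low(f_low) * att      # residual: original low feature added back
```

Because the global attention is a 1×1 map, it broadcasts over the low-level feature's spatial grid regardless of the resolution gap between the two inputs.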

Pyramid Change Module
To make full use of the effective receptive field at each level, the decoder consists of a pyramid of features {f_1^c, f_2^c, ..., f_N^c}, as shown in Figure 6, which is designed to find the building changes at different scales in the images. At each scale, the feature from the previous scale is upsampled and added to the changed feature f_d generated by the change residual (CR) module. Then, the result is fed into a convolution layer with a kernel size of 1. After performing the same steps for all the scales, the results from each scale are concatenated together and fed into a convolution layer; the output is the change map. Following the classic feature pyramid method, FPN (feature pyramid network) [65], which iteratively merges features in a top-down manner until the resolution of the last layer recovers to that of the original input, we fused the changed features in a top-down pathway in order to capture change information of different sizes.
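The top-down fusion described above can be sketched as follows; the uniform channel count per scale and the bilinear upsampling are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidChangeModule(nn.Module):
    """Sketch of the top-down pyramid decoder: at each scale the result of the
    previous (coarser) scale is upsampled and added to the changed feature from
    the CR module, then passed through a 1x1 convolution; the per-scale results
    are concatenated and fused into the change map."""

    def __init__(self, channels: int, num_scales: int, out_channels: int = 1):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_scales)]
        )
        self.fuse = nn.Conv2d(channels * num_scales, out_channels, kernel_size=1)

    def forward(self, changed_feats):
        # changed_feats: CR outputs ordered coarsest first, e.g. [f_N^c, ..., f_1^c]
        outs, prev = [], None
        for conv, f_d in zip(self.convs, changed_feats):
            if prev is not None:  # upsample the previous scale and add it to f_d
                prev = F.interpolate(prev, size=f_d.shape[-2:],
                                     mode="bilinear", align_corners=False)
                f_d = f_d + prev
            prev = conv(f_d)
            outs.append(prev)
        # bring every scale to the finest resolution, concatenate, and fuse
        size = outs[-1].shape[-2:]
        outs = [F.interpolate(o, size=size, mode="bilinear", align_corners=False)
                for o in outs]
        return self.fuse(torch.cat(outs, dim=1))
```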

The CR module is shown in Figure 7. The objective of the module is to obtain distinctive and discriminative features for the two inputs.
As shown in Figure 7c, the generation starts with the two image features f_a and f_b and learns to produce a difference map for the input features. The module merges the features with two fusion strategies: elementwise difference and elementwise addition. The elementwise difference takes the absolute value of the feature difference (see Figure 7a), while the elementwise addition adds the two input features (see Figure 7b). The CR module learns the addition features (see Figure 7b) as the residual counterparts, which are added to the difference feature (see Figure 7a), making the information refinement task easier.
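A sketch of this fusion, with a hypothetical 3×3 convolution standing in for the learned transform of the addition branch:

```python
import torch
import torch.nn as nn

class ChangeResidual(nn.Module):
    """Sketch of the CR module: the elementwise difference |fa - fb| (Figure 7a)
    carries the change signal, while a learned transform of the elementwise sum
    fa + fb (Figure 7b) is added to it as the residual counterpart (Figure 7c).
    The 3x3 convolution on the sum branch is an assumption."""

    def __init__(self, channels: int):
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        diff = torch.abs(fa - fb)  # elementwise difference branch
        add = self.res(fa + fb)    # learned addition (residual) branch
        return diff + add          # residual fusion
```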

Implementation Details
The proposed PGA-SiamNet was implemented in PyTorch; the training procedure used a single NVIDIA RTX 2080 Ti GPU with 11 GB of memory. We used a mini-batch size of two; the initial learning rate was 10^-4 for both datasets and decreased linearly with the number of iterations. The network was trained with the adaptive moment estimation (Adam) algorithm [66]. We regarded the task as binary segmentation, the final output of the network being change or no-change. To measure the performance of the proposed network, the metrics intersection over union (IoU), F1 score, precision, recall, and overall accuracy (OA) were used. Generally, the most meaningful metric in our research is IoU. The imbalance between the two classes, changed and unchanged, results in a large value of OA: the large number of unchanged pixels makes the calculated value misleadingly high. As the OA values in Section 3 show, OA is inflated by the unchanged pixels, so it is not a good metric to reflect the accuracy of the results. Given our focus, precision, recall, and F1 were calculated only on the changed pixels. The metrics are defined as follows:

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R), IoU = TP / (TP + FP + FN), OA = (TP + TN) / (TP + TN + FP + FN),

where true positive (TP) indicates the number of pixels correctly classified as changed buildings, true negative (TN) denotes the number of pixels correctly classified as unchanged buildings, false positive (FP) represents the number of pixels misclassified as changed buildings, and false negative (FN) is the number of pixels misclassified as unchanged buildings; P and R denote precision and recall, respectively. During the training period, Z-score standardization was first applied to the multitemporal image pairs. The network was trained over multiple iterations to obtain better results. The binary cross-entropy loss function is a popular and effective choice, so we minimized it to optimize the network.
The loss function is calculated as follows:

L = −(1/N) Σ_i [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ]

where y refers to the ground truth and ŷ refers to the predicted result.
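As a concrete illustration, the metrics and the loss above can be computed in plain Python as follows (the function names are ours):

```python
import math

def change_metrics(pred, gt):
    """Pixelwise metrics for binary change maps (1 = changed, 0 = unchanged);
    precision, recall, and F1 are computed on the changed class only."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # correctly changed
    tn = sum(p == 0 and g == 0 for p, g in zip(pred, gt))  # correctly unchanged
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # falsely changed
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # missed changes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "IoU": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        "F1": f1,
        "Precision": precision,
        "Recall": recall,
        "OA": (tp + tn) / len(gt),
    }

def bce_loss(y_hat, y, eps=1e-7):
    """Binary cross-entropy averaged over pixels: y_hat holds predicted change
    probabilities, y the 0/1 ground truth."""
    total = 0.0
    for p, t in zip(y_hat, y):
        p = min(max(p, eps), 1 - eps)  # clamp for numerical stability
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y)
```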

Results
In this section, ablation studies are provided in Section 3.1. Then, the results of the proposed method are compared with those of other methods in Section 3.2. Furthermore, the robustness of the proposed algorithm is demonstrated in Section 3.3.

Ablation Study
In this part, we focus on exploration studies assessing the different components of the network on the two datasets, DI and DII. We trained the proposed baseline network and obtained a satisfactory result, which confirms that the proposed base network is effective. Furthermore, we introduced attention modules to improve the performance. However, it is difficult to balance the performance on the two datasets because of the completely different sensors involved. After extensive experiments, we verified that the described modules improve the performance of the network; the metric results are shown in Table 2. Line 2 of Table 2 shows that, by adding the channelwise and spatialwise attention (CS) and co-layer aggregation (CLA) modules to the base network, we obtained a slight increase in accuracy on the two datasets. After adding the ASPP module to the network, there was a slight decline on dataset DI but a great improvement on dataset DII in all metrics. The reason is that dataset DII consists of imagery from various sensors and contains more variable, multi-scale information, while dataset DI is relatively uniform in scale. Finally, we introduced a co-attention (CoA) module, which is important for the performance when using orthoimages with building displacement; Line 4 in Table 2 further demonstrates its efficacy.

Comparisons with Other Methods
To evaluate the performance of the proposed architecture, we further compared our method with other recent change detection methods: a deep architecture for detecting changes (ChangeNet) [67], a correlated Siamese change detection network (CSCDNet) [68], multiple side-output fusion (MSOF) [69], dual task constrained deep Siamese convolutional network (DTCDSCN) [70], multi-scale fully convolutional early fusion (MSFC-EF) [31], deep Siamese multi-scale fully convolutional network (DSMS-FCN) [31], fully convolutional early fusion (FC-EF) [24], fully convolutional Siamese-difference (FC-Siam-Diff) [24], and fully convolutional Siamese-concatenation (FC-Siam-Conc) [24]. CSCDNet was proposed to train a semantic change detection network on a street-view dataset; by inserting correlation layers into the network, it can overcome the limitation caused by differing camera viewpoints, a major problem for end-to-end building change detection. To validate the effect of this layer, we compare the network with and without the correlation layer, denoted CSCDNet/w and CSCDNet/wo, respectively. The 'FC-' methods are a family of fully convolutional networks whose performance is improved by extracting multi-scale features in the decoder, as in MSFC-EF and DSMS-FCN. For a fair comparison, we trained and tested our PGA-SiamNet and the other methods on the two available datasets mentioned above with the same parameter settings. The results are shown in Table 3. Note that some of the comparison methods also perform a semantic task; owing to the lack of semantic labels, we used only their change detection networks for comparison. The results show that PGA-SiamNet easily outperforms the other approaches. The visualized results of the proposed method and the other methods are also compared in Figure 8, in which the first three image pairs are from the DI dataset and the last four pairs are from the DII dataset.

Robustness of the Method
To prove the robustness of the proposed algorithm, we tested it on other orthoimages with the model trained on the EV-CD building dataset. The image pairs differ from our training samples because they are located in the north of China. As shown in Figure 9, the acceptable results indicate that, with more training samples, the proposed method has great potential for high-resolution remote sensing orthoimages from various sensors.


Discussion
The main goal of the study was to find building changes in high-resolution remote sensing orthoimages automatically. We first trained an end-to-end framework on the two available datasets. Then, the feature representation was enhanced by attention modules that fuse global context features and local semantic features. Meanwhile, the correlation of the image pairs was accounted for in the network by the co-attention module and co-layer aggregation. As a result, the proposed method obtained better results in our studies.


Importance of the Proposed Dataset
Change detection, as a hot topic in the field of remote sensing, has attracted extensive attention, and many related datasets have emerged. Buildings, as important man-made objects, are often in the spotlight. However, at present only the public WHU building dataset provides building changes. As for satellite imagery, there are almost no available datasets for building change detection. A major difficulty of building change detection is the displacement of high-rise buildings. Therefore, we built a building change detection dataset (the EV-CD building dataset) from existing satellite images.
In the WHU building dataset, the buildings are low, with little displacement. Besides this, the buildings in the WHU building dataset are mostly free-standing. However, in complex cities, buildings are often densely distributed. Since the focus of our study is on complex cities, it was necessary to build a relevant dataset to promote the research. The experiments show that the dataset is effective. In addition, we will publish the dataset and enlarge it in the future.

Advantages of the Proposed Baseline
The results of the experiments show that the proposed baseline network is effective and surpasses the other methods on the two datasets. The proposed baseline has several advantages. After many experiments, we found that adjusting the learning rate according to the number of iterations works better, and that the networks perform better with pretrained weights. Given the variety of buildings of different sizes in the datasets, we took pyramid features into consideration to discover changes of various sizes in the proposed baseline network. Inspired by ResNet [71], the proposed change residual (CR) module is also a key part, obtaining a changed feature through a residual structure. The module can fuse features from different sources without degradation. In the decoder, the features from each scale are concatenated together. In this way, our network detects the building areas accurately.

Experimental Results Compared with Other Methods
Some of the networks mentioned above were proposed for street-view change detection, such as ChangeNet and CSCDNet. In ChangeNet, each branch of the Siamese network contains convolutional and deconvolutional neural networks. The weights of the two branches are shared and fixed in the convolutional networks, while in the deconvolutional networks the weights of the layers are not fixed. ChangeNet only concatenates three changed features produced by the Siamese feature extraction network to incorporate both coarse and finer details. However, merely combining some outputs of the decoder does not capture enough of the relevance between the two channels to detect changes in remote sensing images. CSCDNet achieved a better result, with an accuracy slightly lower than our baseline, especially with the correlation layer. The correlation layer is used to deal with differences in camera viewpoint, such as the displacement in remote sensing images, but it is very time consuming. The architecture of our model is somewhat similar to CSCDNet. We obtained more change information through the CR module and applied a co-attention module to capture the correlation of the two input features instead of the time-consuming correlation layer. Then, a co-layer aggregation module was used to fuse the changed features extracted by the co-attention module into the shallow features. The aggregated features further improved the feature representation in the pyramid. DTCDSCN was proposed to perform change detection and semantic segmentation at the same time, while we only employed its change detection subnetwork. That model has shallower convolution layers than the other models and may ignore multi-scale changes. In addition, the proposed improved network increases the receptive field to extract changed features at different scales using the ASPP module, which is helpful for building change detection in complex cities such as those in the EV-CD building dataset.
Overall, in our studies, the models trained with pretrained weights were superior to those without, such as FC-Siam-Diff and DSMS-FCN. Finally, an additional experiment, testing on other orthoimages with the model trained on the EV-CD building dataset, demonstrated that the proposed method has strong robustness.

Conclusions
In this paper, we proposed an end-to-end pyramid feature-based attention-guided Siamese network (PGA-SiamNet). It performed excellently on remote sensing orthoimagery for building change detection and yielded better results in complex urban environments than other methods. By using a co-attention mechanism, the method learns to discriminate feature changes by capturing the correlation between image pairs. To capture long-range dependencies effectively, we adopted an attention-guided method. Our experiments on the two available datasets show that our method gives comparable results to other state-of-the-art techniques. Moreover, the modules added to this framework are independent of each other and can be conveniently adopted for building change detection. Meanwhile, the experimental results on the WHU dataset show a better performance than those on the satellite imagery EV-CD dataset; the complexity and diversity of the scenes contribute to this result, although this type of data is more relevant to our research. Owing to the machine-learning boom, building extraction, which used to be the central problem of traditional building change detection, has become unnecessary. However, the need for large and accurate sample data is still the main concern for deep learning; thus, data-independent research is increasingly important, since most data at hand are inadequate and noisy. As for future studies, on the one hand, we may pay more attention to noisy data and one-shot/few-shot learning; on the other hand, it is possible to involve more diverse information, such as using auxiliary DSM information as an object guide, as well as mining more information from the current data.

Patents
At present, we are applying for a patent based on the research results of this paper; the application material has been submitted to the China National Intellectual Property Administration (patent application number 2020100445918), and we are awaiting the examination and grant of the patent.