1. Introduction
Remote sensing image change detection (CD) identifies changes in phenomena or objects that have occurred in the same geographical area at different times [1]. Because synthetic aperture radar (SAR) images are independent of weather and atmospheric conditions [2,3,4], SAR image CD has attracted increasing attention. Recently, SAR image CD has been widely used in a variety of applications, such as disaster assessment [5], urban research [6], environmental monitoring [7], and forest resource monitoring [8].
Generally, according to whether ground-truth information is used in the SAR image CD process, CD approaches can be divided into three categories: supervised methods [9], semi-supervised methods [10], and unsupervised methods [11]. In principle, supervised and semi-supervised methods can achieve better performance than unsupervised methods, but they require ground-truth information. Obtaining ground truth for SAR images is difficult, labor-intensive, and time-consuming. Therefore, unsupervised methods have attracted increasing attention [11,12,13,14].
Unsupervised CD in SAR images usually includes three steps: (1) image preprocessing, (2) generating a difference image (DI), and (3) analyzing the DI [12]. The preprocessing step mainly involves multi-temporal image co-registration, geometric correction, and noise reduction. In the second step, the two registered images are compared pixel by pixel to generate the DI, which is intended to enhance the discrepancy between unchanged and changed areas. Difference operators and ratio operators are two common ways to obtain the DI [15]. Because SAR images are affected by multiplicative noise, difference operators cannot produce an effective DI. Although a ratio operator can overcome the effect of multiplicative noise, it does not consider the local, edge, and class-conditional distribution information of SAR images. Therefore, the log-ratio (LR) operator [16] and the mean-ratio (MR) detector [17] have been proposed to generate the DI. The LR operator weakens the influence of isolated points in the background of the unchanged class. The MR detector takes spatial neighborhood information into account and can suppress isolated points adequately. Moreover, some improved ratio operators, such as the Gauss-log ratio operator [18], wavelet fusion on the ratio operator [19], and the saliency-extraction-guided ratio operator [20], have been proposed to generate a better DI. Additionally, ratio-based nonlocal information (RNLI) [21] and improved nonlocal patch-based graph (INLPG) methods [22] are also used to generate the DI.
In the third step, the CD task becomes a binary classification task: the generated DI is analyzed and divided into a changed class and an unchanged class to produce the binary change map (CM). Threshold methods and clustering methods are the two typical types of DI analysis approaches. Threshold methods, such as the Kittler and Illingworth (K&I) minimum-error threshold algorithm [23], the expectation maximization (EM) algorithm [24], the generalized KI (GKI) threshold [15], and the locally fitting and semi-EM algorithm [25], have been proposed to divide the DI into the changed and unchanged classes. However, the threshold is difficult to select, and a small change in the threshold can lead to a large error in the final result. Compared with threshold methods, clustering methods do not need modeling and are more flexible. As a classical clustering method, k-means clustering has been used to divide the DI into the changed and unchanged classes [26]. Fuzzy c-means (FCM), another typical algorithm, is also frequently employed to analyze the DI in SAR image CD [27]. Furthermore, some FCM variants, such as the fuzzy local information c-means (FLICM) algorithm [28], the reformulated fuzzy local information c-means (RFLICM) algorithm [27], edge-weighted FCM [29], and spatial FCM (SFCM) [13], have been developed to improve clustering in SAR image CD. Although clustering methods have some advantages over threshold methods, they are sensitive to noise and do not sufficiently consider the spatial information of SAR images, which leads to unsatisfactory clustering results in DI analysis.
Compared with traditional methods, deep learning methods are more resistant to noise and can automatically extract feature representations from the input. Therefore, some deep learning approaches, such as the principal component analysis network (PCANet) [30], convolutional neural network (CNN) [13], convolutional-wavelet neural network (CWNN) [31], siamese adaptive fusion network (SAFNet) [32], restricted CNN [33], dual-domain network (DDNet) [14], and variational autoencoder (VAE) [34], have been employed for SAR image CD. Most of these networks are based on CNNs, which typically use a single small convolution kernel (often 3 × 3). A small kernel has a small receptive field and cannot cover a large region of the input feature. To deal with this problem, CNNs usually stack small-kernel convolution layers with down-sampling layers to gradually decrease the feature size and enlarge the receptive field of the network. Theoretically, the receptive field can then cover a large part, or even the whole, of the input feature. However, the empirical receptive field is much smaller than the theoretical one [35,36]. Because the receptive field is not sufficiently large, the network cannot capture enough context information and other useful detailed information [36], which adversely affects both the learning process and the recognition performance of the network. Therefore, for SAR image CD, a CNN with a single small convolution kernel cannot acquire enough context information and cannot make full use of the detailed information in SAR images. Consequently, the CD results of these methods need to be improved.
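As an illustration of the theoretical receptive field discussed above, the following minimal Python sketch applies the standard recurrence for a stack of convolution and pooling layers. The function name and the example layer lists are illustrative assumptions; they do not describe any network evaluated in this paper, and the sketch says nothing about the (smaller) empirical receptive field.

```python
def theoretical_receptive_field(layers):
    """Theoretical receptive field (in input pixels) of a layer stack.

    `layers` is a list of (kernel_size, stride) pairs, ordered from the
    input to the output. Dilation is not modeled in this sketch.
    """
    rf = 1      # receptive field of a single input pixel
    jump = 1    # distance in input pixels between adjacent features
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical example stacks, chosen only for illustration.
three_small_convs = [(3, 1), (3, 1), (3, 1)]
one_large_conv = [(7, 1)]
conv_pool_conv = [(3, 1), (2, 2), (3, 1)]
```

Under this recurrence, three stacked 3 × 3 convolutions reach the same theoretical receptive field as a single 7 × 7 convolution, which is why small kernels are usually combined with depth or down-sampling.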
To address the abovementioned drawbacks, we propose a pyramidal convolutional block attention network (PCBA-Net) for SAR image CD. The proposed network consists of pyramidal convolution (PyConv) and the convolutional block attention module (CBAM). PyConv comprises different levels of kernels, and each level includes different types of filters with diverse sizes and depths [36]. PyConv not only expands the receptive field of the input features to capture enough context information, but also processes the input features with kernels of increasing size in parallel to acquire multi-scale detail information of SAR images. To the best of our knowledge, few previous studies have considered PyConv for SAR image CD. Additionally, the attention mechanism not only focuses on the region of interest (RoI), but also improves the representation of the features of interest [37]. CBAM combines channel attention and spatial attention [38], and it can emphasize significant features and suppress unnecessary ones along the channel and spatial axes. Therefore, CBAM is introduced in the proposed PCBA-Net to acquire more discriminative image features, which improves the detection of changes and enhances robustness to pseudo-changes. Our objective is to enhance feature representation in SAR image CD and thereby improve CD performance.
The remainder of this paper is organized as follows. The proposed method is described in detail in Section 2. Experimental results are presented in Section 3. The discussion is given in Section 4. Finally, we conclude this study in Section 5.
  2. Methodology
Consider two co-registered SAR images $I_1$ and $I_2$, which are acquired over the same region at two different times $t_1$ and $t_2$. The goal of CD is to generate a binary CM that indicates whether each pixel has changed between the two images.
As precise annotation information is often difficult to acquire in practice, unsupervised CD methods are needed in many applications. Traditional unsupervised SAR image CD is usually composed of three steps: (i) preprocessing of the SAR images, (ii) DI generation, and (iii) analysis of the DI. The performance of CD depends heavily on the quality of the DI. As SAR images are easily affected by speckle noise, it is difficult to obtain a high-quality DI. The proposed PCBA-Net directly extracts feature representations from the original SAR image pairs. It does not need to generate a DI and is not sensitive to speckle noise. Although the training of PCBA-Net needs labels, we generate pseudo labels using an unsupervised hierarchical clustering method [30]. Hence, the whole CD process can be regarded as unsupervised.
In this study, pixel-wise patch-pairs are used to train the proposed approach. Given two multi-temporal co-registered SAR images $I_1$ and $I_2$, the $i$-th pixel-wise patch-pair is denoted as $P_i = (P_i^1, P_i^2)$, where $P_i^1$ and $P_i^2$ are the image patches centered at the $i$-th pixel in $I_1$ and $I_2$, respectively. Its label $y_i$ denotes whether the $i$-th pixel is changed (“1”) or unchanged (“0”).
The proposed method consists of two phases. The first phase is to obtain reliable samples that have a high probability to be changed or unchanged. The second phase is to train the proposed PCBA-Net using the acquired reliable samples.
  2.1. Reliable Samples Generation
To obtain reliable samples for training the PCBA-Net, we need to generate pseudo labels. The process of generating pseudo labels is illustrated in Figure 1. Because SAR images are susceptible to speckle noise, they are first de-noised by the speckle-reducing anisotropic diffusion (SRAD) method [39] in the preprocessing step. The DI is then generated using a log-ratio approach [16], and the log-ratio DI $I_{DI}$ is expressed as

$$I_{DI} = \left| \log \frac{I_2}{I_1} \right|,$$

where $I_1$ and $I_2$ are the two co-registered SAR images.
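The log-ratio DI can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function names, the small constant `eps` that guards against division by zero, and the optional rescaling to [0, 1] are assumptions, and the SRAD de-noising step is not included.

```python
import numpy as np


def log_ratio_di(img1, img2, eps=1e-6):
    """Absolute log-ratio difference image of two co-registered images.

    `eps` (an assumption of this sketch) avoids log(0) and division by
    zero for zero-valued pixels.
    """
    a = np.asarray(img1, dtype=np.float64) + eps
    b = np.asarray(img2, dtype=np.float64) + eps
    return np.abs(np.log(b / a))


def rescale_to_unit(di):
    """Linearly rescale a DI to [0, 1]; a constant DI maps to zeros."""
    lo, hi = di.min(), di.max()
    if hi == lo:
        return np.zeros_like(di)
    return (di - lo) / (hi - lo)
```

Because of the absolute value, swapping the two acquisition dates does not change the DI, and a uniform multiplicative change in intensity maps to a constant offset.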
The next step is to pre-classify the generated DI $I_{DI}$. In theory, any clustering approach can be employed to cluster $I_{DI}$ into the changed and unchanged classes. Nevertheless, to obtain high-probability changed and high-probability unchanged classes, an effective clustering algorithm should achieve high intra-class similarity and high inter-class difference. Because the changed and unchanged classes overlap, a single partitional clustering approach, such as FCM, k-means, or their variants, has limited capability to achieve reliable clustering. To overcome this problem, we use the two-stage hierarchical FCM (HFCM) clustering method [30] to cluster $I_{DI}$ into three classes: the high-probability changed class ($\Omega_c$), the high-probability unchanged class ($\Omega_u$), and the intermediate class ($\Omega_i$). The intermediate class $\Omega_i$ contains the pixels that are difficult for the clustering algorithm to distinguish. The pixels in $\Omega_c$ and $\Omega_u$ are then used as labeled samples in the training process.
The procedure for clustering $I_{DI}$ into three classes using the HFCM algorithm is as follows:
(1) Use the FCM algorithm to cluster $I_{DI}$ into two clusters: the changed cluster ($\Omega_c^1$) and the unchanged cluster ($\Omega_u^1$). The number of pixels in the changed cluster $\Omega_c^1$ is denoted as $T^1$. The upper bound of the changed class is set as $TT = \sigma \cdot T^1$.
(2) Use the FCM algorithm to cluster $I_{DI}$ into five clusters: $\Omega_1^2$, $\Omega_2^2$, $\Omega_3^2$, $\Omega_4^2$, and $\Omega_5^2$. The five clusters are sorted in descending order of their mean values. A cluster with a larger mean value has a higher probability of being changed, and vice versa. The numbers of pixels in the five clusters are denoted as $T_1^2$, $T_2^2$, $T_3^2$, $T_4^2$, and $T_5^2$, respectively. The pixels in $\Omega_1^2$ are assigned to the changed class $\Omega_c$. Set the parameters $t = 1$ and $c = T_1^2$.
(3) Set $t = t + 1$ and $c = c + T_t^2$.
(4) If $c < TT$, assign the pixels in $\Omega_t^2$ to the intermediate class $\Omega_i$. Otherwise, assign the pixels in $\Omega_t^2$ to the unchanged class $\Omega_u$. Go to step (3) and continue until $t = 5$.
In this way, we obtain the pseudo-label set [$\Omega_c$, $\Omega_i$, $\Omega_u$].
The parameter $\sigma$, which scales the upper bound of the changed class in step (1), controls the size of the intermediate class $\Omega_i$. For a given DI, once the changed class has been determined, $\sigma$ determines how the remaining pixels are split between the intermediate class $\Omega_i$ and the unchanged class $\Omega_u$. If $\sigma$ is large, the intermediate class $\Omega_i$ contains many pixels and the unchanged class $\Omega_u$ contains few, and vice versa. If $\sigma$ is set too small, the intermediate class $\Omega_i$ becomes very small, and some pixels that are difficult for the clustering algorithm to distinguish are allocated to the unchanged class $\Omega_u$. As a result, low-probability unchanged pixels may be selected for the training set. Conversely, if $\sigma$ is set too large, the unchanged class $\Omega_u$ becomes small, and we may not be able to select sufficient unchanged samples. To balance the number of pixels in the unchanged class $\Omega_u$ and the intermediate class $\Omega_i$, $\sigma$ is set to 1.25.
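The two-stage pre-classification described above can be sketched as follows. This is a minimal NumPy illustration of our reading of the procedure, not the implementation of [30]: the plain 1-D FCM, its quantile-based initialization, the use of cluster centers in place of cluster means for sorting, and the integer codes (1 = changed, 0 = unchanged, 2 = intermediate) are all assumptions made for this sketch.

```python
import numpy as np


def fcm_1d(x, n_clusters, m=2.0, n_iter=100, tol=1e-6):
    """Plain fuzzy c-means on 1-D data; returns (centers, hard labels)."""
    x = np.asarray(x, dtype=np.float64).ravel()
    # Deterministic initialization (assumption): evenly spaced quantiles.
    centers = np.quantile(x, (np.arange(n_clusters) + 0.5) / n_clusters)
    for _ in range(n_iter):
        dist = np.abs(x[None, :] - centers[:, None]) + 1e-12   # (c, n)
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=0, keepdims=True)               # memberships
        um = u ** m
        new_centers = (um @ x) / um.sum(axis=1)
        converged = np.max(np.abs(new_centers - centers)) < tol
        centers = new_centers
        if converged:
            break
    labels = np.argmin(np.abs(x[None, :] - centers[:, None]), axis=0)
    return centers, labels


def hfcm_pseudo_labels(di, sigma=1.25):
    """Label each DI pixel as 1 (changed), 0 (unchanged), or 2 (intermediate)."""
    x = np.asarray(di, dtype=np.float64).ravel()
    # Stage 1: two clusters; the one with the larger center is "changed".
    c2, l2 = fcm_1d(x, 2)
    upper_bound = sigma * np.sum(l2 == np.argmax(c2))
    # Stage 2: five clusters, visited in descending order of their centers.
    c5, l5 = fcm_1d(x, 5)
    order = np.argsort(-c5)
    counts = np.bincount(l5, minlength=5)
    labels = np.empty(x.shape, dtype=np.int64)
    labels[l5 == order[0]] = 1
    running = counts[order[0]]
    for k in order[1:]:
        running += counts[k]
        labels[l5 == k] = 2 if running < upper_bound else 0
    return labels.reshape(np.shape(di))
```

Only pixels labeled 1 or 0 would be passed on as training candidates; pixels labeled 2 are left for the trained network to classify.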
  2.2. Overview of Pyramidal Convolutional Block Attention Network
The architecture of PCBA-Net is illustrated in Figure 2. It is mainly composed of four pyramidal convolutional block attention (PCBA) modules, each of which consists of a PyConv and a CBAM. First, image patches (with size $r \times r$) centered at pixels in the two SAR images $I_1$ and $I_2$, together with their corresponding pseudo-labels, are randomly selected as the training samples. The number of training samples is 10% of the number of pixels in the whole image, and the ratio of changed samples to unchanged samples is about 1:1. These training samples are fed into a convolution layer with a kernel size of 1 × 1 to generate new feature maps, which are sequentially forwarded through the four PCBA modules to produce representative feature maps. The representative feature maps are then processed by a fully convolutional layer, two linear layers, and a softmax layer to produce the final output. After training, the model is applied to all patch-pairs in the whole image to generate the final binary CM.
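The sample selection described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name, the reflect padding at image borders, the label coding (1 = changed, 0 = unchanged, any other value ignored), and the rule of capping each class at the number of available pixels are choices made for this sketch and are not specified in the text.

```python
import numpy as np


def sample_patch_pairs(img1, img2, labels, r=7, fraction=0.10, seed=0):
    """Draw balanced r x r patch-pairs centered at pseudo-labeled pixels.

    Returns an array of shape (N, 2, r, r) and the N pseudo-labels.
    """
    rng = np.random.default_rng(seed)
    h, w = labels.shape
    n_per_class = int(fraction * h * w) // 2      # roughly 1:1 ratio
    pad = r // 2
    # Reflect padding (assumption) so border pixels get full patches.
    p1 = np.pad(img1, pad, mode="reflect")
    p2 = np.pad(img2, pad, mode="reflect")
    flat = labels.ravel()
    chosen = []
    for cls in (1, 0):
        idx = np.flatnonzero(flat == cls)
        k = min(n_per_class, idx.size)
        chosen.append(rng.choice(idx, size=k, replace=False))
    chosen = np.concatenate(chosen)
    rows, cols = np.unravel_index(chosen, (h, w))
    # Pixel (i, j) sits at (i + pad, j + pad) in the padded image, so the
    # slice [i:i + r, j:j + r] is centered on it for odd r.
    patches = np.stack([
        np.stack([p1[i:i + r, j:j + r], p2[i:i + r, j:j + r]])
        for i, j in zip(rows, cols)
    ])
    return patches.astype(np.float32), flat[chosen]
```

Each returned sample stacks the two patches along the first axis, which corresponds to the $r \times r \times 2$ input described in the text up to axis ordering.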
  2.3. Pyramidal Convolutional Block Attention Module
The PCBA module is the main component of PCBA-Net. The architecture of PCBA is illustrated in Figure 3. It consists of a PyConv and a CBAM. The PyConv and CBAM are described in detail in the following sections.
  2.3.1. The PyConv Block
A traditional CNN generally uses a single small convolution kernel, which has a relatively small receptive field. It cannot acquire enough context information and other useful detailed information. As a result, the network cannot fully extract the feature representation of the input image, which adversely affects its recognition performance.
In contrast to traditional convolution, PyConv consists of different levels of kernels, and each level contains diverse types of filters with various sizes and depths. In the implementation of this study, the PyConv contains three levels of different types of convolution kernels, and the structure of these convolution kernels is an inverted pyramid. The kernel sizes increase from the first level (bottom of the pyramid) to the third level (top of the pyramid). As shown in Figure 3a, the kernel sizes of the three levels are 3 × 3, 5 × 5, and 7 × 7, respectively.
Furthermore, the depth of the kernel decreases from the first level to the third level. To use different kernel depths at each level of PyConv, the input feature maps are divided into different groups, and each group applies independent kernels. This method is called grouped convolution [36]. We use two examples to explain the grouped convolution, each with four input feature maps. The examples are illustrated in Figure 4. Figure 4a describes the standard convolution, which only includes a single group of input feature maps. In this case, the depth of the convolution kernels equals the number of input feature maps, and the number of convolution kernels equals the number of output feature maps. Every output feature map connects to every input feature map. Figure 4b illustrates the case in which the input feature maps are split into two groups (groups = 2 in the example). In this case, each group applies independent kernels, and the depth of the convolution kernels in each group becomes 1/2 of the number of input feature maps. The number of convolution kernels in each group is 1/2 of the number of output feature maps. The number of output feature maps in each group is also 1/2 of the number of output feature maps of the whole convolution. As illustrated in Figure 4, when the number of groups increases, the depth of the kernels decreases. As a result, the computational cost of the convolution and the number of parameters are both reduced by a factor equal to the number of groups. Model parameters and computational costs are described in detail in the next subsection.
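The grouped convolution of Figure 4 can be illustrated with a naive NumPy implementation. This is a didactic sketch, not the code used in the experiments: it uses no padding, stride 1, and a weight layout of (output maps, input maps per group, K, K), which is an assumption borrowed from common deep learning libraries.

```python
import numpy as np


def grouped_conv2d(x, weight, groups=1):
    """Naive grouped 2-D convolution (cross-correlation, no padding).

    x:      input feature maps, shape (C_in, H, W)
    weight: kernels, shape (C_out, C_in // groups, K, K)
    """
    c_in, h, w = x.shape
    c_out, depth, k, _ = weight.shape
    assert c_in % groups == 0 and c_out % groups == 0
    assert depth == c_in // groups, "kernel depth must be C_in / groups"
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for o in range(c_out):
        g = o // out_per_group                     # group of this output map
        xs = x[g * in_per_group:(g + 1) * in_per_group]
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                out[o, i, j] = np.sum(xs[:, i:i + k, j:j + k] * weight[o])
    return out
```

With four input maps and `groups=2`, each kernel has depth 2 instead of 4, so the weight tensor holds half as many values as in the standard case, and each output map depends only on the input maps of its own group.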
In the specific implementation of PyConv in this study, the ratio of the number of input feature maps in three convolution levels is 1:1:2, and the group number of grouped convolution is 1, 4, and 8, respectively. Therefore, the ratio of depth of the kernels in the three levels is 4:1:1.
The different types of convolution kernels with incremental kernel sizes can not only expand the receptive field of input features to acquire enough context information, but they also use multi-scaled convolution kernels to handle the input features in parallel. Kernels of smaller size can concentrate on the detail information of smaller objects or parts of objects, while kernels of larger size can focus on the detail information of larger objects and context information. In this way, the multi-scaled convolution kernels in PyConv can obtain complementary information and enhance the recognition performance of the network.
  2.3.2. Model Parameters and Floating-Point Operations (FLOPs) of PyConv
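Model parameters and floating-point operations (FLOPs) are two important indicators of network model complexity [36]. For the standard convolution, the number of parameters and FLOPs are calculated as follows:

$$\mathrm{Parameters} = K^2 \cdot FM_i \cdot FM_o$$

$$\mathrm{FLOPs} = K^2 \cdot FM_i \cdot FM_o \cdot (W \cdot H)$$

For the grouped convolution, the number of parameters and FLOPs are expressed as:

$$\mathrm{Parameters} = \frac{K^2 \cdot FM_i \cdot FM_o}{G}$$

$$\mathrm{FLOPs} = \frac{K^2 \cdot FM_i \cdot FM_o \cdot (W \cdot H)}{G}$$

where $FM_i$ and $FM_o$ denote the number of input feature maps and the number of output feature maps, respectively. The convolution kernel size is $K \times K$. $H$ and $W$ indicate the height and width of the input feature maps, respectively. $G$ represents the number of groups in grouped convolution.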
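As a quick numerical check of these expressions, the following Python sketch counts the parameters and FLOPs of a standard and a grouped convolution. Bias terms are ignored, one multiply-accumulate is counted as one FLOP, and the output is assumed to have the same height and width as the input; the function names and the example sizes are illustrative assumptions.

```python
def conv_params(fm_in, fm_out, k, groups=1):
    """Weights of a k x k convolution; groups=1 is the standard case."""
    return k * k * (fm_in // groups) * fm_out


def conv_flops(fm_in, fm_out, k, h, w, groups=1):
    """Multiply-accumulates for an h x w output map of the same layer."""
    return conv_params(fm_in, fm_out, k, groups) * h * w


# Hypothetical layer: 64 input maps, 64 output maps, 3 x 3 kernels.
standard = conv_params(64, 64, 3)             # 36864
grouped = conv_params(64, 64, 3, groups=4)    # 9216
```

The grouped layer has exactly one quarter of the weights of the standard layer, in line with the reduction by a factor equal to the number of groups.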
For the proposed PyConv, the number of parameters and FLOPs are expressed as:

$$\mathrm{Parameters} = \sum_{n=1}^{3} \frac{K_n^2 \cdot FM_{i_n} \cdot FM_{o_n}}{G_n}$$

$$\mathrm{FLOPs} = \sum_{n=1}^{3} \frac{K_n^2 \cdot FM_{i_n} \cdot FM_{o_n} \cdot (W \cdot H)}{G_n}$$

$$FM_o = FM_{o_1} + FM_{o_2} + FM_{o_3}$$

where $FM_{i_1}$, $FM_{i_2}$, and $FM_{i_3}$ denote the number of input feature maps in the first level, the second level, and the third level of PyConv, respectively. $FM_{o_1}$, $FM_{o_2}$, and $FM_{o_3}$ represent the number of output feature maps in the first level, the second level, and the third level of PyConv, respectively. $G_1$, $G_2$, and $G_3$ refer to the number of groups into which the input feature maps are divided in the first level, the second level, and the third level of PyConv, respectively. $K_1$, $K_2$, and $K_3$ denote the convolution kernel sizes of the first level, the second level, and the third level of PyConv, respectively. $FM_o$ denotes the number of output feature maps of PyConv. $H$ and $W$ indicate the height and width of the input feature maps, respectively.
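The per-level accounting can be sketched in Python by summing the grouped-convolution cost over the levels. The kernel sizes (3, 5, 7) and group numbers (1, 4, 8) follow the configuration stated in Section 2.3.1; the channel counts (16, 16, 32), the 7 × 7 feature-map size, and the function name are hypothetical values chosen only so that the 1:1:2 ratio holds. Bias terms are ignored.

```python
def pyconv_cost(levels, h, w):
    """Parameters, FLOPs, and output maps of a multi-level PyConv.

    `levels` is a list of (fm_in, fm_out, kernel_size, groups) tuples,
    one per pyramid level.
    """
    params = sum(k * k * (fm_in // g) * fm_out
                 for fm_in, fm_out, k, g in levels)
    flops = params * h * w
    fm_out_total = sum(fm_out for _, fm_out, _, _ in levels)
    return params, flops, fm_out_total


# Hypothetical channel counts with a 1:1:2 ratio across the three levels.
levels = [(16, 16, 3, 1), (16, 16, 5, 4), (32, 32, 7, 8)]
params, flops, fm_out_total = pyconv_cost(levels, h=7, w=7)

# Kernel depth per level is fm_in / groups.
depths = [fm_in // g for fm_in, _, _, g in levels]
```

For these example values the kernel depths are 16, 4, and 4, which reproduces the 4:1:1 depth ratio stated in Section 2.3.1.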
  2.3.3. Convolutional Block Attention Module
CBAM is another crucial component of PCBA. As shown in Figure 3b, CBAM comprises two parts: channel attention and spatial attention [38]. For a given input feature map F with size C × H × W, CBAM sequentially generates a 1D channel attention map $M_c$ with size C × 1 × 1 and a 2D spatial attention map $M_s$ with size 1 × H × W. This process, depicted in Figure 3b, can be represented by the following equations:

$$F' = M_c(F) \otimes F$$

$$F'' = M_s(F') \otimes F'$$

where $\otimes$ denotes element-wise multiplication, $F'$ is the channel-refined feature map, and $F''$ is the final refined output.
In the channel attention module, average pooling and max pooling are first performed on the input feature map F to acquire two feature vectors (i.e., the average-pooled features $F_{avg}^c$ and the max-pooled features $F_{max}^c$), each with size C × 1 × 1. Then, the two feature vectors are each processed by a weight-sharing multi-layer perceptron (MLP) with one hidden layer. After that, the two feature vectors are merged by element-wise summation, and a sigmoid function σ is applied to obtain the channel attention map $M_c(F)$. In brief, channel attention is expressed as:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F_{avg}^c)) + W_1(W_0(F_{max}^c))\big)$$

where σ represents the sigmoid function, $W_0$ and $W_1$ are the MLP weights, and their dimensions are C/r × C and C × C/r, respectively. r denotes the reduction ratio.
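The channel attention computation can be sketched in NumPy for a single feature map. This is an illustrative sketch, not the trained module: the weights are random placeholders, the function names are assumptions, and the ReLU in the hidden layer follows the original CBAM formulation [38] rather than anything stated explicitly in this section.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(f, w0, w1):
    """Channel attention map of shape (C, 1, 1) for f of shape (C, H, W).

    w0 has shape (C // r, C) and w1 has shape (C, C // r); both are shared
    between the average-pooled and max-pooled branches.
    """
    avg_pooled = f.mean(axis=(1, 2))   # (C,)
    max_pooled = f.max(axis=(1, 2))    # (C,)

    def mlp(v):
        # One hidden layer with ReLU (as in the original CBAM paper).
        return w1 @ np.maximum(w0 @ v, 0.0)

    return sigmoid(mlp(avg_pooled) + mlp(max_pooled)).reshape(-1, 1, 1)


# Placeholder example: C = 8 channels and reduction ratio r = 4.
rng = np.random.default_rng(0)
feature = rng.random((8, 5, 5))
w0 = rng.normal(size=(2, 8))
w1 = rng.normal(size=(8, 2))
refined = channel_attention(feature, w0, w1) * feature   # broadcast over H, W
```

The attention map broadcasts over the spatial axes, so the refined feature map keeps the shape of the input.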
In the spatial attention module, average pooling and max pooling are also performed on the feature map F to generate two 2D maps. The two 2D maps are then concatenated and processed by a standard convolution layer. In brief, spatial attention is expressed as:

$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$

where σ indicates the sigmoid function and $f^{7 \times 7}$ represents a convolution operation with the filter size of 7 × 7.
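The spatial attention computation can likewise be sketched in NumPy. This is an illustrative sketch only: the kernel is passed in as an argument of arbitrary odd size, zero padding is assumed so that the attention map keeps the spatial size of the input, and the convolution is written as explicit loops for readability rather than speed.

```python
import numpy as np


def spatial_attention(f, kernel):
    """Spatial attention map of shape (1, H, W) for f of shape (C, H, W).

    `kernel` has shape (2, k, k) with odd k: one slice for the
    channel-wise average map and one for the channel-wise max map.
    """
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])          # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))   # zero padding
    h, w = f.shape[1:]
    response = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            response[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return (1.0 / (1.0 + np.exp(-response)))[None]


# Placeholder example with a random 7 x 7 kernel.
rng = np.random.default_rng(0)
feature = rng.random((8, 9, 9))
kernel = rng.normal(scale=0.1, size=(2, 7, 7))
refined = spatial_attention(feature, kernel) * feature   # broadcast over C
```

In CBAM the spatial map is computed on the channel-refined feature map, so in a full module `feature` here would be the output of the channel attention step.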
  4. Discussion
In this study, PCBA-Net is proposed for SAR image CD. We use PyConv to learn appropriate feature representations, and CBAM is introduced to emphasize the important features. Six real datasets are used to assess the performance of PCBA-Net in the experiments, with FP, FN, OE, PCC, KC, and F1 score as the evaluation metrics. The results on the six real SAR datasets confirm the effectiveness of the proposed method: PCBA-Net outperforms several state-of-the-art methods.
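For reference, the six evaluation metrics can be computed from a predicted binary CM and a reference map as sketched below. The sketch assumes the conventional definitions used in the SAR CD literature (OE = FP + FN, PCC as overall accuracy, KC as Cohen's kappa); the function name and the toy example are illustrative assumptions.

```python
import numpy as np


def cd_metrics(pred, truth):
    """FP, FN, OE, PCC, KC, and F1 for binary change maps (1 = changed)."""
    pred = np.asarray(pred).astype(bool).ravel()
    truth = np.asarray(truth).astype(bool).ravel()
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))    # unchanged pixels detected as changed
    fn = int(np.sum(~pred & truth))    # changed pixels that were missed
    n = tp + tn + fp + fn
    pcc = (tp + tn) / n
    # Expected agreement by chance, used by the kappa coefficient.
    pre = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kc = (pcc - pre) / (1.0 - pre) if pre < 1.0 else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"FP": fp, "FN": fn, "OE": fp + fn, "PCC": pcc, "KC": kc, "F1": f1}


# Toy example: 10 pixels, one false alarm and one missed change.
pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
truth = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = cd_metrics(pred, truth)
```

The toy example shows why PCC alone can be misleading when changed pixels are rare: PCC is 0.8, while KC is only 0.375.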
In our PCBA-Net, PyConv is used to extract the features of multi-temporal SAR images. The PyConv in our module contains three levels of convolution kernels with different sizes and depths. PyConv expands the receptive field of the input features. Moreover, it processes the input features with kernels of increasing size in parallel to acquire multi-scale detailed information. Concretely, smaller kernels have small receptive fields and concentrate on capturing information about smaller objects and parts of objects. Conversely, larger kernels have large receptive fields and focus on acquiring detailed information about larger objects and context information. As a result, PyConv can capture complementary information and improve the recognition performance of the network. Furthermore, grouped convolution is used in the PyConv, which allows PyConv to use kernels with different depths and reduces the computational cost. Compared with standard convolution, PyConv maintains a similar number of model parameters and similar computational resource requirements. The main contributions of this study can be summarized as follows:
(1) PyConv is introduced for SAR image CD. PyConv can not only extend the receptive field of input to capture enough context information, but it also handles the input with incremental kernel sizes in parallel to obtain multi-scale detailed information of SAR images.
(2) CBAM is embedded into the PyConv to emphasize vital features of SAR images. CBAM can enhance crucial features and restrain redundant features in the channel and spatial axes of SAR images.
In the experiments, we compare the proposed PCBA-Net with the PCANet [30], CNN [13], CWNN [31], DDNet [14], and SAFNet [32] methods. Details of the comparison are described in Section 3.4. In PCANet, Gabor wavelets and FCM are used to obtain pre-classified samples as the labeled samples, and PCANet is used for feature extraction and classification. PCANet uses PCA filters as convolutional filters. On five datasets, PCANet has quite high FN values. This means that PCANet misses a large number of changed pixels and loses much detailed information. In the CNN method, a spatial fuzzy clustering algorithm is used to pre-classify the DI to acquire pseudo-labels, and a CNN with two convolution layers and two pooling layers is used for feature extraction and classification. Because the structure of this CNN is extremely simple, its capacity for feature extraction is quite weak, which leads to unsatisfactory CD results. In CWNN, the dual-tree complex wavelet transform (DT-CWT) is introduced into the CNN to alleviate the influence of speckle noise. CWNN has a small FN value, but its FP value is quite large. This means that numerous unchanged pixels are incorrectly detected as changed pixels. In DDNet, a multi-region convolution (MRC) module is proposed, and features in the discrete cosine transform (DCT) domain are integrated into the CNN model. In this way, both spatial and frequency features can be exploited in the DDNet method. In SAFNet, a siamese neural network is used to extract features of multi-temporal SAR images, and an adaptive fusion module is used to fuse multi-scale features from different convolutional layers. A correlation layer is used to exploit the correlation between multi-temporal images. Although DDNet and SAFNet improve the performance of SAR image CD to some extent, their CD results remain unsatisfactory. Compared with the above methods, the proposed PCBA-Net obtains the best CD performance on the six real datasets.
Moreover, we compare the model parameters, FLOPs, training time, and testing time of the proposed PCBA-Net with those of four CNN-based approaches, i.e., CNN, CWNN, DDNet, and SAFNet (as shown in Table 8). Because PCANet is not a CNN-based method, it is inappropriate to compare its parameters and model complexity with those of CNN-based methods. Thus, we do not list these parameters for PCANet. The training and testing times are measured on a computer with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory and a batch size of 1024. According to Table 8, the FLOPs, training time, and testing time of the proposed PCBA-Net are larger than those of the four compared CNN-based methods. Although the proposed PCBA-Net has the most parameters and FLOPs among the compared CNN-based methods, it has only 1.873 M parameters, which is not too large. Furthermore, we notice that CWNN has fewer parameters and FLOPs than DDNet and SAFNet (as shown in Table 8), but the performance of CWNN surpasses that of DDNet and SAFNet on the Ottawa and Sulzberger datasets (as shown in Table 2 and Table 3, respectively). DDNet has fewer parameters and FLOPs than SAFNet (as shown in Table 8), but the performance of DDNet exceeds that of SAFNet on the Ottawa, Yellow River A, and Yellow River B datasets (as shown in Table 2, Table 5, and Table 6, respectively). CNN has fewer parameters and FLOPs than CWNN (as shown in Table 8), but the performance of the CNN method is superior to that of CWNN on the Yellow River B and Yellow River C datasets (as shown in Table 6 and Table 7, respectively). This means that the performance of a network does not increase strictly with the number of model parameters and FLOPs. In addition, we observe that DDNet has fewer parameters and FLOPs than SAFNet, but it has longer training and testing times than SAFNet (as shown in Table 8). This means that the training and testing times of a network also do not increase strictly with the number of model parameters and FLOPs.
In PyConv, the number of levels and the convolution kernel size are important parameters. These parameters may impact feature extraction and affect the performance of CD. Hence, we investigate the effects of these parameters on CD performance. Diverse kernel sizes and levels are designed in the experiments, as shown in the first column of Table 9. PCC is used as the evaluation metric to indicate the performance of CD in the experiments. PCC values with different types of PyConv are listed in Table 9. The best results are shown in bold. According to Table 9, the PyConv with kernels of 3 × 3, 5 × 5, and 7 × 7 obtains the best results on the six real datasets. This confirms that the PyConv with multi-scale convolution kernels can improve the recognition performance of the network.
For the PCBA-Net, the number of PCBA blocks may affect the performance of SAR image CD. Therefore, we investigate the relationship between PCC values and the number of PCBA blocks. We set the number of PCBA blocks to 1, 2, 3, 4, and 5. Figure 17 depicts the PCC values with different numbers of PCBA blocks on the six datasets. According to Figure 17, PCC values first increase with the number of PCBA blocks, reach their maximum when the number of PCBA blocks is 4, and then start to decrease. This may be because the PCBA-Net model is relatively simple and cannot extract the features of SAR images sufficiently when the number of PCBA blocks is less than 4. The feature extraction capability of the proposed PCBA-Net grows with the number of PCBA blocks and is highest when the number of PCBA blocks is equal to 4. With a further increase in the number of PCBA blocks, the parameters and model complexity of PCBA-Net increase, which causes over-fitting of the network. Therefore, we use four PCBA blocks in the PCBA-Net.
Furthermore, spatial context information is related to the size of the input patch-pairs. The size of the input patch-pairs is denoted as r × r × 2. The value of r may influence the CD performance of PCBA-Net. Accordingly, we explore the relationship between r and PCC. We set r = 5, 7, 9, 11, 13, 15, 17, 19, and 21 in the experiments and use sample patch-pairs of these sizes to train the PCBA-Net. Figure 18 shows the PCC values with different values of r on the six datasets. The PCC values increase and then tend to be stable as r increases. When r = 7, PCC values reach the maximum for all six datasets. This indicates that it is difficult to identify the change information of the center pixel using a large patch size. Additionally, a larger patch size increases the complexity and computational cost of the model, resulting in a decrease in model performance. After comprehensive consideration, we set r = 7 in the experiments.
In addition, to explore the effectiveness of the PCBA module, ablation experiments are conducted on the six datasets. CNN refers to a traditional CNN that has the same number of layers and output shape as the PyConv in the PCBA-Net. The performance of CD is evaluated by PCC. The results of the ablation study are listed in Table 10. The best results are shown in bold. The PyConv model outperforms the CNN model (PCC values in the third row vs. PCC values in the first row). This demonstrates that PyConv can enhance the recognition performance of the network. Moreover, when CBAM is introduced into the network, the model performance also improves (PCC values in the fourth row vs. PCC values in the third row). This indicates that both PyConv and CBAM play important roles in improving the results of CD. Removing either the PyConv module or the CBAM module reduces the performance of CD.