Fully Convolutional Networks with Multiscale 3D Filters and Transfer Learning for Change Detection in High Spatial Resolution Satellite Images

Remote sensing images with high spatial resolution are now routinely acquired, and large amounts of data are extracted from their regions of interest. When processing these images, objects of various sizes, from very small neighborhoods to large regions composed of thousands of pixels, should be considered. To this end, this study proposes a change detection method using transfer learning and recurrent fully convolutional networks with multiscale three-dimensional (3D) filters. The initial convolutional layer of the change detection network with multiscale 3D filters was designed to extract spatial and spectral features of materials having different sizes; the layer exploits pre-trained weights and biases of a semantic segmentation network trained on an open benchmark dataset. The 3D filter sizes were defined in a specialized way to extract spatial and spectral information, and the optimal size of the filter was determined using highly accurate semantic segmentation results. To demonstrate the effectiveness of the proposed method, binary change detection was performed on multi-temporal images obtained from the Korea multipurpose satellite-3A. Results revealed that the proposed method outperformed the traditional deep learning-based change detection methods and that the change detection accuracy improved when multiscale 3D filters and transfer learning were used.


Introduction
Change detection is a major research field in remote sensing; change detection methods are used for detecting the areas damaged by natural disasters [1][2][3]; monitoring vegetation [4][5][6]; as well as urban expansion [7][8][9] by analyzing spatial, spectral, and temporal changes in an area [10]. The wide availability of satellites and unmanned aerial vehicles worldwide and the improvements in sensor manufacturing technology have enabled acquiring images with a spatial resolution within 1 m and detecting regions of interest from high spatial resolution images. To use high spatial resolution satellite images for change detection, problems associated with spatial complexity, geometric inconsistency between images, and reflectance variability in each class must be considered [11][12][13].
Pixel- and object-based change detection methods are used for analyzing high spatial resolution satellite imagery [14]. Pixel-based change detection methods, such as image differencing [15], change vector analysis [16], and principal component analysis [17], detect changes based on the pixel, which is the basic unit of image analysis. Although these methods detect differences in detailed spectral characteristics at the pixel level, the spatial context cannot be considered and the detection is easily of the change detection on average and transfer learning using multispectral dataset and several benchmark hyperspectral datasets could solve the limitation of small sample problems in hyperspectral image processing.
However, this was limited by differences in the spatial and spectral resolution of the source and target domain datasets. For example, multispectral images with 0.5 m spatial resolution and 4 bands were used as the source domain dataset in the semantic segmentation network, whereas hyperspectral images with 30 m spatial resolution and 150 bands were employed as the target dataset in the change detection network. To improve on the previous study, transfer learning was performed on a large dataset of aerial images to detect changes in multi-temporal high spatial resolution images using the recurrent FCN. The network is composed of 3D convolutional layers and a convolutional LSTM. The 3D convolutional layers extract the spatial-spectral features of the input images, and the convolutional LSTM analyzes the temporal relationship between the feature maps obtained from the temporal images. Therefore, extracting meaningful feature maps by considering the spatial-spectral features of the input images can improve the results of the change detection network. Herein, multiscale 3D filters were used in the initial convolutional layer of the change detection network; the layer exploits pre-trained weights and biases of a semantic segmentation network trained on an open benchmark dataset. The main contributions of this study are as follows.

1. Specialized 3D filters for spatial and spectral information were utilized to combine optimal multiscale filters while limiting the computational complexity and preventing redundancy in the extracted information. Different surface materials can be detected using high spatial resolution satellite images; therefore, spatial and spectral filters of different sizes can be used to extract meaningful features, with the corresponding feature maps improving the accuracy of the change detection.

2. We attempted to address the training data limitation using the proposed change detection method and the information pre-trained on high spatial resolution aerial images. The spatial and spectral resolutions of these images are similar to those of the satellite images used herein. The trained weights and biases can provide reasonable starting points for the initial layer of the change detection network and help prevent overfitting.

3. To confirm the effectiveness of the multiscale 3D filters and transfer learning for change detection in high spatial resolution satellite images, the accuracies of other deep learning-based change detection methods and of the proposed method with and without transfer learning were compared; then, the conditions for change detection were analyzed.
The remainder of this paper is organized as follows. Section 2 presents the architecture of the proposed method, and the datasets and environmental conditions for the experiments are described in Section 3. Sections 4 and 5 present the results and discussion, respectively, and Section 6 presents the conclusions.

Methods
The proposed change detection method comprises two main steps: (i) training the FCN for semantic segmentation using a large remote sensing dataset as the source domain, and (ii) performing transfer learning from the pre-trained FCN to the recurrent FCN for change detection. The FCNs for semantic segmentation and change detection both include three multiscale 3D filters in the initial convolutional layer to extract various spatial and spectral features from high spatial resolution images. After the layer with 3D filters is trained on the source dataset, the pre-trained filters are transferred and fine-tuned on the target dataset.

Fully Convolutional Network (FCN) for Semantic Segmentation
Remote Sens. 2020, 12, x FOR PEER REVIEW 4 of 17
Figure 1 shows the FCN architecture. First, the 3D FCN performing the semantic segmentation was trained on an aerial image dataset containing images obtained from the International Society for Photogrammetry and Remote Sensing (ISPRS). The network was adapted for 3D convolutions, with downsampling followed by upsampling with transpose convolutions to recover the image dimensions. The 3D convolution is calculated as follows:
$$v_{l,j}^{x,y,z} = \varphi\left( \sum_{n} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \sum_{r=0}^{R-1} w_{ljn}^{hwr}\, o_{(l-1)n}^{(x+h)(y+w)(z+r)} + b \right) \quad (1)$$
where $v_{l,j}^{x,y,z}$ is the pixel value at position $(x, y, z)$ on the $j$th feature map in layer $l$ (the layer of the current operation), and $H$ and $W$ are the height and width of the kernel, respectively. The parameter $R$ is the spectral dimension of the 3D kernel, $w_{ljn}^{hwr}$ is the weight value at position $(h, w, r)$ connected to the $n$th feature in the $(l-1)$th layer, $o_{(l-1)n}^{(x+h)(y+w)(z+r)}$ represents the input at position $(x+h)(y+w)(z+r)$, with $(h, w, r)$ denoting its offset from $(x, y, z)$, $\varphi$ is the activation function, and $b$ is a bias parameter. The ReLU, which rectifies negative values to zero, is used as the activation function.
Figure 1. Illustration of a fully convolutional network containing multiple three-dimensional (3D) filters for semantic segmentation. "Conv3D", "Max pool", and "Deconv" represent the 3D convolutional layer, max pooling layer, and deconvolutional layer, respectively. The numbers on the boxes represent the pixel size of the input images, and the numbers in italics indicate the number of output feature maps of the layers. For example, "30" means that the output size of the layer is 30 × 30 pixels, and "6" means that the number of final feature maps is six.
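The 3D convolution described in this section can be illustrated with a deliberately naive, dependency-free sketch. It is not the Keras implementation used in the experiments; it assumes a stride of 1, no padding, and a single scalar bias per output feature map, and the function name `conv3d_single_map` is hypothetical.

```python
def relu(value):
    """Rectify negative values to zero."""
    return value if value > 0.0 else 0.0


def conv3d_single_map(inputs, kernels, bias):
    """Compute one output feature map of a 3D convolution followed by ReLU.

    inputs  -- list of N input feature maps, each indexed as [x][y][z]
    kernels -- list of N kernels (one per input map), each indexed as [h][w][r]
    bias    -- scalar bias added before the activation
    Assumes stride 1 and no padding, so each output axis shrinks by (kernel size - 1).
    """
    H, W, R = len(kernels[0]), len(kernels[0][0]), len(kernels[0][0][0])
    X, Y, Z = len(inputs[0]), len(inputs[0][0]), len(inputs[0][0][0])
    output = []
    for x in range(X - H + 1):
        plane = []
        for y in range(Y - W + 1):
            row = []
            for z in range(Z - R + 1):
                total = bias
                # Sum over input maps n and kernel offsets (h, w, r).
                for n, kernel in enumerate(kernels):
                    for h in range(H):
                        for w in range(W):
                            for r in range(R):
                                total += kernel[h][w][r] * inputs[n][x + h][y + w][z + r]
                row.append(relu(total))
            plane.append(row)
        output.append(plane)
    return output


if __name__ == "__main__":
    # A 3 x 3 x 4 patch of ones convolved with a 3 x 3 x 3 kernel of ones.
    patch = [[[1.0] * 4 for _ in range(3)] for _ in range(3)]
    kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
    print(conv3d_single_map([patch], [kernel], bias=-20.0))
```

In this toy case each output value sums 27 products, so the spectral axis of length 4 is reduced to 2 output positions; a strongly negative bias drives the pre-activation below zero and the ReLU returns 0.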
Images with metric to sub-metric spatial resolution capture ground objects of different sizes, varying from very small neighborhoods to large regions; therefore, multiscale filters were used to extract different features. Such features have been successfully applied in classification tasks [40][41][42][43]. Multiscale filters also allow observations from broad and micro perspectives [43]. Unlike previous studies, this study considers both the spatial dimensions of the filter and its extent along the spectral bands. Generally, features with smaller spatial scales, such as edges of buildings and roads, respond to small convolutional filters, whereas coarse structures are extracted by large filters [41]. Furthermore, different spectral features are extracted depending on the number of adjacent bands included. As the input image contains red, green, blue (RGB), and near-infrared bands, the spectral information can be used for identifying different materials. For example, materials with similar colors in the RGB bands can be discriminated by considering their near-infrared band characteristics.
Therefore, multiscale 3D spatial and spectral filters with different sizes were used to extract meaningful features and improve the robustness of feature extraction from the high spatial resolution satellite images. Initially, 3D convolutional layers were applied to the input image in parallel. The network uses multiple 3D filters of different sizes in the first convolutional layer. To confirm the effectiveness of the multiscale 3D filters in the semantic segmentation and to determine appropriate shapes for the filters (x × y × z), the pixel-level classification accuracies were compared for four 3D filters, namely, (1 × 1 × 4), (3 × 3 × 3), (5 × 5 × 2), and (7 × 7 × 1). The size of the filter can be determined according to the size of the input image. The width and height of the 3D filter (x, y) were selected as 1, 3, 5, and 7 based on previous studies on the classification of satellite images using multiscale filters [40,42,43]. The spectral size z ranged from 1 to 4, covering the four bands of the ISPRS dataset (three visible bands and a near-infrared band). In particular, to combine optimal multiscale filters while limiting the computational complexity and preventing redundancy in the extracted information, 3D filters specialized for either spatial or spectral information were utilized. Therefore, the sizes of the 3D filters were controlled; for example, if the spatial dimension of a filter was large, its spectral dimension was set to be small.
The three feature maps obtained from the three parallel convolutional layers in the first layer were then combined to create a joint feature map. Feature maps from each convolutional filter have different sizes; therefore, they must be adjusted to a common size before the composite feature map is created. Padding is used so that the feature maps share all dimensions except the channel dimension, which may differ, and all feature maps are then collected in a single tensor.
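The trade-off between spatial and spectral extent, and the need to pad the parallel outputs to a common size, can be checked with simple arithmetic. The sketch below assumes a 30 × 30 × 4 input patch, a stride of 1, and one input channel, and it ignores bias terms; the helper names are hypothetical and the padding scheme is only one plausible reading of the description above.

```python
PATCH_SHAPE = (30, 30, 4)  # width, height, spectral bands of one input patch
FILTER_SHAPES = [(1, 1, 4), (3, 3, 3), (5, 5, 2), (7, 7, 1)]


def weights_per_filter(filter_shape):
    """Number of weights in one 3D filter (bias excluded)."""
    x, y, z = filter_shape
    return x * y * z


def valid_output_shape(patch_shape, filter_shape):
    """Output shape of an unpadded, stride-1 convolution."""
    return tuple(p - f + 1 for p, f in zip(patch_shape, filter_shape))


def total_padding_to_match(filter_shape):
    """Total zero-padding per axis needed so the output keeps the input shape."""
    return tuple(f - 1 for f in filter_shape)


def summarize(patch_shape=PATCH_SHAPE, filter_shapes=FILTER_SHAPES):
    """Collect the three quantities above for every candidate filter."""
    return [
        {
            "filter": shape,
            "weights": weights_per_filter(shape),
            "valid_output": valid_output_shape(patch_shape, shape),
            "padding": total_padding_to_match(shape),
        }
        for shape in filter_shapes
    ]


if __name__ == "__main__":
    for entry in summarize():
        print(entry)
```

Under these assumptions the (3 × 3 × 3), (5 × 5 × 2), and (7 × 7 × 1) filters hold 27, 50, and 49 weights, respectively, so enlarging the spatial window while shrinking the spectral window keeps the filter cost in the same range. Without padding, their outputs would have different shapes, which is why the maps must be brought to a common size before they are stacked.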
The composite feature map was then used as the input for the subsequent convolutional layers. The network mainly comprises nine convolutional layers with 3D filters, two pooling layers, and deconvolutional layers. For downsampling, filters of size 3 (3 × 3 × 3) were used, and each pair of convolutions was followed by pooling with a size and stride of 2. Two convolution operations of identical output dimension followed by a pooling layer form a block of operations, and each block determines the spatial size of its output map. Successive blocks reduced the spatial size, and the downsampling blocks were followed by upsampling blocks to recover the spatial size of the original image. Upsampling was achieved via transpose convolutions; after each transpose convolution, the output map was sliced to match the size of the corresponding output map in the downsampling block, followed by concatenation and convolution operations. This process was repeated until the original spatial size was recovered. The experiment was performed using Keras with TensorFlow as the backend, and the network was trained on an NVIDIA GeForce RTX 2070 GPU with 8 GB of memory; the ISPRS multispectral dataset was used as the source data input. Because the ISPRS images were too large to be processed whole, labeled slices were extracted, separated into batches, and stored, and the 3D FCN was trained using 960,000 sub-images. In the experiment, the structure of the FCNs was identical to that in Figure 1 except for the initial convolutional layer. When a single filter was used, the initial layer comprised two sequential 3D convolutional layers. All patches obtained from the ISPRS dataset were used as training and test samples. The networks were trained with the Adam optimizer using a learning rate of $10^{-3}$, a batch size of 256, and 300 epochs.

Recurrent FCN for Change Detection
The proposed change detection network combines a 3D FCN and a convolutional LSTM, wherein the fully connected layer at the end of the network is replaced with a convolutional layer. The 3D convolutional layers with multiscale 3D filters extract spatial-spectral features from the input images, whereas the convolutional LSTM records and analyzes the change information of the multi-temporal images. The network was developed from the recurrent 3D FCN (Re3FCN) [34] by adding transfer learning and multiscale 3D filters so that it can be applied to high spatial resolution satellite images. Figure 2 shows the architecture of the proposed change detection network.
Creating meaningful feature maps from multi-temporal images improves change detection accuracy because the change detection network detects changes based on the temporal information from the feature maps generated from temporal images. Therefore, the transfer learning resolves the problem of insufficient training samples using many remote sensing images as the source data. As the ultimate goal of transfer learning is to improve the change detection performance, the low-level features learned by deep networks from the source domain are transferred to the target domain. This provides excellent initial configurations in the transfer learning method to quickly initiate meaningful feature extraction from the multi-temporal high spatial resolution satellite images; proper initialization is crucial for network training [44]. The hypothesis is that the lowest layers of the FCN extract general features from the images; therefore, the learned weights are extended to other recognition tasks, as these mostly detect generic features. Concurrently, the topmost layers detect higher level features from the images, and therefore these are specific for the trained network's classification task. Thus, it is hypothesized that initializing a convolutional network with weights from a network pre-trained on another dataset accelerates the training process, and improves performance because the low-level features are generic across different tasks.
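The transfer step can be pictured independently of any deep learning framework. The following minimal sketch treats each network as a dictionary that maps layer names to weight lists and copies only the initial multiscale layers from the source network into the target network; the layer names, the data layout, and the function `transfer_initial_layers` are hypothetical and do not reproduce the Keras code used in the experiments.

```python
import copy


def transfer_initial_layers(source_weights, target_weights, layer_names):
    """Return a copy of target_weights whose listed layers hold the source values.

    source_weights -- dict: layer name -> weights of the pre-trained segmentation network
    target_weights -- dict: layer name -> weights of the change detection network
    layer_names    -- names of the initial layers to transfer (assumed to exist in both)
    Layers that are not listed keep their own initialization and are trained from scratch.
    """
    updated = copy.deepcopy(target_weights)
    for name in layer_names:
        if name not in source_weights:
            raise KeyError(f"layer {name!r} is missing from the source network")
        # Deep copy so that later fine-tuning does not alter the source network.
        updated[name] = copy.deepcopy(source_weights[name])
    return updated


if __name__ == "__main__":
    source = {"conv3d_3x3x3": [0.2, -0.1], "conv3d_5x5x2": [0.4], "classifier": [1.0]}
    target = {"conv3d_3x3x3": [0.0, 0.0], "conv3d_5x5x2": [0.0], "conv_lstm": [0.0]}
    print(transfer_initial_layers(source, target, ["conv3d_3x3x3", "conv3d_5x5x2"]))
```

Only the low-level layers are copied; the task-specific top layers of the source network (here the hypothetical `classifier`) are left behind, which mirrors the hypothesis that low-level features are generic whereas the topmost layers are specific to the original task.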

Figure 2. Architecture of the proposed change detection network, with "Conv2D" representing the 2D convolutional layer. "ig", "fg", "og", "c̃", "c", and "h" represent the input gate, forget gate, output gate, candidate memory cell, memory cell, and hidden state, respectively. The colored layers exploit the pre-trained weights and biases from the semantic segmentation network.
Considering $I_{T_1}$ and $I_{T_2}$ as the images obtained at times $T_1$ and $T_2$, respectively, the 3D patches obtained from each image are passed in parallel through 3D convolutional layers with different 3D filters. Multiscale 3D filters identical to those employed in the classification network were used to create the different feature maps. The weights and biases of the initial layer with multiscale filters were subsequently fine-tuned, and two more 3D convolutional layers were included after the multiscale 3D convolutional layers. The network was kept shallow because small patches were used as input, which naturally limits the network depth, and because the predicted classes (change and no change) are relatively simple compared with those of other classification tasks. For example, the ImageNet classification involves 1000 categories, whereas the PASCAL VOC classification involves 20 classes [33]. A simple network is therefore suitable for detecting changes in high spatial resolution images.
The spectral-spatial feature maps obtained from the 3D convolutional layers were fed into the convolutional LSTM layer. In this phase, the temporal information between the two images was reflected. Let $f_{T_1}$ and $f_{T_2}$ be the spectral-spatial feature maps obtained from $I_{T_1}$ and $I_{T_2}$, respectively. The RNN architecture retains values over arbitrary intervals using a memory cell $c_t$ at time step $t$. The convolutional LSTM involves three gates, namely, the input gate $ig$, output gate $og$, and forget gate $fg$, each of which has learnable weights. $fg_t$ is the gate for forgetting the previous information, and the output range of the sigmoid function $\sigma$ is 0-1. If $\sigma = 0$, the previous state information is forgotten; if $\sigma = 1$, the previous state information is memorized. $ig_t$ is the gate for remembering the current information, and the cell states are regulated by deleting or adding information through the gates. These gates are expressed as follows.
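A standard convolutional LSTM formulation that is consistent with the symbols defined in this section is sketched below. The input symbol $f_t$ (standing for $f_{T_1}$ or $f_{T_2}$ at time step $t$), the weight subscripts for the input terms, and the absence of peephole connections are assumptions made for illustration.

```latex
\begin{aligned}
ig_t &= \sigma\left(W_{f,ig} * f_t + W_{h,ig} * h_{t-1} + b_{ig}\right) \\
fg_t &= \sigma\left(W_{f,fg} * f_t + W_{h,fg} * h_{t-1} + b_{fg}\right) \\
og_t &= \sigma\left(W_{f,og} * f_t + W_{h,og} * h_{t-1} + b_{og}\right) \\
\tilde{c}_t &= \tanh\left(W_{f,c} * f_t + W_{h,c} * h_{t-1} + b_{c}\right) \\
c_t &= fg_t \circ c_{t-1} + ig_t \circ \tilde{c}_t \\
h_t &= og_t \circ \tanh\left(c_t\right)
\end{aligned}
```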
The subscripts associated with the weight matrix $W$ have specific meanings. For example, $W_{h,fg}$ and $b_{fg}$ are the weight matrix between $h_{t-1}$ and $fg$ and the bias of $fg$, respectively. $\tilde{c}_t$ is the candidate cell value; it is combined with $ig_t$ and then added to the memory cell $c_t$ to influence the next state. Finally, the output $h_t$ is determined by multiplying $\tanh(c_t)$ and $og_t$; "$*$" is the convolution operator and "$\circ$" is the element-wise multiplication. The three gates of the convolutional LSTM are represented by 3D tensors, and the convolutional LSTM determines the future state of a cell at a given pixel based on the inputs and past states of its adjacent region using a convolution operator [45]. The outputs of the convolutional LSTM layer were then fed into the prediction layers to generate a score map, in which the number of final feature maps equals the number of classes. Finally, the pixels were assigned to the final classes according to the score map.

Quality Evaluation
To evaluate the utility of the proposed change detection method, the overall accuracy, Kappa coefficient, and per-class F1 scores were calculated. The overall accuracy is the number of correctly classified pixels divided by the total number of sampled pixels. Each pixel is counted as a true positive (TP), true negative (TN), false negative (FN), or false positive (FP), and the overall accuracy is expressed in Equation (8). The F1 score considers the precision (Equation (9)) and recall (Equation (10)) and is given by Equation (11).
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN} \quad (8)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (9)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (10)$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (11)$$
The Kappa coefficient measures how closely the image classified by a specific classifier agrees with the ground truth map. For the Kappa coefficient, values greater than 0.8 imply a strong agreement between the classification result and the ground truth, 0.6-0.8 indicates good accuracy, 0.4-0.6 indicates moderate accuracy, and <0.4 indicates poor accuracy [46]. The Kappa coefficient is defined in Equation (12), and it uses the overall accuracy and random accuracy (Equation (13)). Random accuracy is the sum of the products of the reference likelihood and the result likelihood for each class.
$$\mathrm{Kappa} = \frac{\mathrm{OA} - \mathrm{RA}}{1 - \mathrm{RA}} \quad (12)$$
$$\mathrm{RA} = \frac{(TN + FP)(TN + FN) + (FN + TP)(FP + TP)}{n^{2}} \quad (13)$$
where OA and RA represent the overall accuracy and random accuracy, respectively, and $n$ is the total number of samples.
Herein, the proposed network was compared with other methods, namely, a fully connected LSTM, a 2D CNN-fully connected LSTM (2D CNN-LSTM), and the Re3FCN, which is composed of 3D convolutional layers and a convolutional LSTM [34]. The LSTM deals with temporal information and is used to detect changes between two images [29]. The 2D CNN-LSTM has the same structure as that proposed by Mou et al. [33], and it comprises 2D convolutional layers and fully connected LSTM layers. The Re3FCN from a previous study was designed for extracting spatial and spectral features from hyperspectral images. The difference between the Re3FCN and the proposed network is that the Re3FCN used a sequence of three convolutional layers, whereas the proposed network uses multiple convolutional layers and pre-trained values in the initial phase. To assess the effectiveness of the transfer learning, the change detection networks with and without pre-trained weights and biases were compared. All three cases involved identical training and test samples and experimental parameters such as the learning rate, number of epochs, and optimizer type.
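The quality measures described in this section follow directly from the four confusion-matrix counts. The sketch below is a minimal illustration that treats the changed class as the positive class; the function name `change_detection_metrics` and the example counts are hypothetical, and no guard against empty classes (division by zero) is included.

```python
def change_detection_metrics(tp, tn, fp, fn):
    """Overall accuracy, precision, recall, F1, random accuracy, and Kappa.

    tp, tn, fp, fn -- pixel counts of true positives, true negatives,
                      false positives, and false negatives.
    """
    n = tp + tn + fp + fn
    overall_accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Random accuracy: per class, (share in reference) x (share in result), summed.
    random_accuracy = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (n * n)
    kappa = (overall_accuracy - random_accuracy) / (1 - random_accuracy)
    return {
        "overall_accuracy": overall_accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "random_accuracy": random_accuracy,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Hypothetical counts for 100 sampled pixels.
    for name, value in change_detection_metrics(tp=40, tn=50, fp=5, fn=5).items():
        print(f"{name}: {value:.4f}")
```

For these hypothetical counts the overall accuracy is 0.90, whereas the Kappa coefficient is about 0.80, which illustrates how Kappa discounts the agreement expected by chance.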

The International Society for Photogrammetry and Remote Sensing Dataset
The ISPRS 2D semantic labeling challenge Potsdam dataset is an online open benchmark dataset [47] that provides high spatial resolution airborne images with a spatial resolution of 5 cm. The data contain near-infrared, red, blue, and green orthorectified imagery and corresponding digital surface models. Furthermore, the data include ground truth images that show the impervious surface, buildings, trees, low vegetation, cars, and unidentified features (Figure 3).

Although the dataset contains 38 patches, only the 24 images with ground truth images were used for training and validation; the patch numbers of the labeled data are presented in Table 1. Twenty-four large multispectral images of 6000 × 6000 × 4 pixels in TIFF format were the initial data input sources. Because the ISPRS images were too large to be processed whole, labeled slices of 30 × 30 × 4 pixels were extracted (200 × 200 slices per image, 960,000 sub-images in total), separated into batches, and stored. The classification network was then trained using the sub-images.
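The number of sub-images can be verified with a short tiling sketch. It assumes non-overlapping 30 × 30 tiles, which is consistent with the reported total of 960,000; the function name `tile_origins` is hypothetical and the code only enumerates tile positions rather than reading any image data.

```python
def tile_origins(height, width, tile):
    """Top-left corners of non-overlapping square tiles that fit inside the image."""
    return [
        (row, col)
        for row in range(0, height - tile + 1, tile)
        for col in range(0, width - tile + 1, tile)
    ]


IMAGE_SIZE = 6000   # pixels per side of one labeled ISPRS image
TILE_SIZE = 30      # pixels per side of one sub-image
LABELED_IMAGES = 24

tiles_per_image = len(tile_origins(IMAGE_SIZE, IMAGE_SIZE, TILE_SIZE))
total_tiles = tiles_per_image * LABELED_IMAGES

if __name__ == "__main__":
    print(tiles_per_image, total_tiles)
```

Each 6000 × 6000 image yields 200 × 200 = 40,000 tiles, so the 24 labeled images yield 960,000 sub-images.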

KOMPSAT 3A
This dataset of the Korea multipurpose satellite (KOMPSAT)-3A involves multi-temporal images obtained from Daejeon in South Korea (Figure 4). KOMPSAT-3A is Korea's first earth observation/infrared satellite with two imaging systems on board and was developed by the Korea Aerospace Research Institute (KARI) [48]. It provides high spatial resolution panchromatic and multispectral imagery in the near-infrared, red, blue, and green bands. The spatial resolutions of the KOMPSAT-3A are 0.55 m (panchromatic image) and 2.20 m (multispectral images with five bands). The multi-temporal images were acquired in October 2015 ($T_1$) and July 2018 ($T_2$), with vegetation distribution changes due to seasonal change and changes in urban areas attributed to new construction. To improve the spatial resolution of the KOMPSAT-3A images, a hybrid pan-sharpening method based on the normalized difference vegetation index (NDVI) [49] was applied during preprocessing. Locations of the images with improved spatial resolution are shown in Figure 4. Before the change detection, geometric correction was applied to the multi-temporal images using ground control points.
Figure 4. Locations of the two study sites; the background map was obtained from the ArcGIS world map [50]. The upper images are of site 1 and the lower images are of site 2. The first images were obtained in October 2015, and the second images were obtained in July 2018. Both images highlight the differences due to seasonal changes and newly constructed buildings.

Binary change detection divides the pixels of the sites into changed (Ω_c) and unchanged (Ω_u) classes. Ground truth data were generated using web maps and KOMPSAT images with high spatial resolution. We defined changes as transitions between land cover classes, such as from waterbodies to building areas. The land cover classes include vegetation, bare soil, buildings, waterbodies, and roads. Roofs of various colors, such as blue, brown, and white, are classified as "buildings." The "bare soil" class represents ground without buildings and vegetation, whereas "road" encompasses asphalt roadways. Differences due to relief displacement and shadows are not treated as changes in the ground truth data.

Semantic Segmentation Results
The semantic segmentation results of the FCN for differently sized filters are presented in Table 2. The FCNs with (1 × 1 × 4) and (7 × 7 × 1) filters produced lower overall accuracy (OA) than those with (3 × 3 × 3) and (5 × 5 × 2) filters. In contrast, the (3 × 3 × 3) and (7 × 7 × 1) filters have relatively higher F1 scores than the other 3D filters. In particular, the (3 × 3 × 3) filter shows the highest F1 scores for all classes, and the (7 × 7 × 1) filter classifies the five classes (impervious surface, building, low vegetation, tree, and car) more effectively than the (5 × 5 × 2) and (1 × 1 × 4) filters. The (1 × 1 × 4) filter addresses spectral correlation rather than spatial information, whereas the (7 × 7 × 1) filter addresses local spatial correlation rather than spectral information. The filter that considers only the spectral information could not classify the materials of the five classes. The semantic segmentation results demonstrate that 3D filters, which consider both spatial and spectral features, improve the classification results; further, spatial information influences the classification of the five classes more strongly than spectral information. Therefore, the (3 × 3 × 3), (5 × 5 × 2), and (7 × 7 × 1) filters were selected to create the multiscale 3D filters. When the multiscale 3D filters were used in the initial convolutional layer of the semantic segmentation network, the F1 scores and OA displayed remarkable improvements. For example, the F1 scores of impervious surface, building, low vegetation, tree, and car increased by 0.0303, 0.0143, 0.0656, 0.049, and 0.104, respectively, compared with the highest values obtained with a single filter size. Furthermore, the FCN with multiscale 3D filters delivered an improved OA of 87.17%, which was 2.9% higher than that obtained with the (3 × 3 × 3) filter alone.
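To make the trade-off between spatial and spectral extent concrete, the following minimal sketch counts the weights in a single kernel of each filter size and the number of image-plane pixels it covers. It assumes the filter notation reads as (height × width × spectral depth); the helper names `kernel_weights` and `spatial_footprint` are illustrative and not part of the original implementation.

```python
from math import prod

# Filter sizes from the text, read as (height, width, spectral depth).
# This reading of the notation is an assumption.
FILTER_SIZES = [(1, 1, 4), (3, 3, 3), (5, 5, 2), (7, 7, 1)]


def kernel_weights(size):
    """Number of weights in one 3D kernel, excluding the bias."""
    return prod(size)


def spatial_footprint(size):
    """Number of pixels covered by the kernel in the image plane."""
    height, width, _ = size
    return height * width


for size in FILTER_SIZES:
    print(size, "weights:", kernel_weights(size),
          "pixels covered:", spatial_footprint(size))
```

Under this reading, the (1 × 1 × 4) kernel sees a single pixel across four bands, whereas the (7 × 7 × 1) kernel sees 49 pixels in one band, which is consistent with their description as the purely spectral and purely spatial extremes.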

Change Detection Results
During training, the Adam optimizer with a learning rate of 10⁻³ was used and the number of epochs was set to 500. Training data were randomly generated from the ground truth data, and the numbers of training, validation, and test samples were 40,000, 20,000, and 30,000 pixels, respectively. ReLU served as the activation function of the convolutional layers, whereas softmax served as the activation function of the last convolutional layer. The final output of the change detection network was classified into changed and unchanged classes. Figures 5 and 6 display change detection maps generated using the proposed and other change detection methods for sites 1 and 2. The overall accuracy, Kappa coefficient, and F1 score for all classes from the different methods are presented in Tables 3 and 4. The LSTM network shows the lowest overall accuracy, Kappa coefficient, and F1 score for sites 1 and 2 (for site 1, overall accuracy = 0.9136, Kappa coefficient = 0.6384, and F1 score = 0.6876; for site 2, overall accuracy = 0.8826, Kappa coefficient = 0.5350, and F1 score = 0.6010). In site 1, the LSTM classified pixels that changed from gray bare soil to green vegetation as unchanged (Figure 5b). Similarly, the LSTM did not recognize pixel changes from dark-colored bare soil to buildings with brown roofs. In contrast, the 2D CNN-LSTM and Re3FCN produced relatively higher accuracies than the LSTM and could classify the changed and unchanged pixels according to the training data. The 2D CNN-LSTM yielded an overall accuracy of 0.9597, Kappa coefficient of 0.8443, and F1 score of 0.8680 for site 1, and an overall accuracy of 0.9565, Kappa coefficient of 0.8518, and F1 score of 0.8783 for site 2. In addition, the Re3FCN yielded an overall accuracy of 0.9674, Kappa coefficient of 0.8984, and F1 score of 0.8978 for site 1, and an overall accuracy of 0.9633, Kappa coefficient of 0.8766, and F1 score of 0.8990 for site 2. However, many spot noises and errors are noted at the boundaries of buildings, roads, and trees.
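The random draw of training, validation, and test pixels described above can be sketched as follows. This is a minimal illustration only: the text does not state whether the three sets are disjoint, so disjoint sampling without replacement is an assumption here, and the function name `split_pixels`, the seed, and the pool of 100,000 pixel indices are hypothetical.

```python
import random


def split_pixels(pixel_ids, n_train, n_val, n_test, seed=0):
    """Randomly draw disjoint training, validation, and test pixel sets."""
    total = n_train + n_val + n_test
    if total > len(pixel_ids):
        raise ValueError("not enough labelled pixels for the requested split")
    rng = random.Random(seed)
    chosen = rng.sample(pixel_ids, total)  # sampling without replacement
    train = chosen[:n_train]
    val = chosen[n_train:n_train + n_val]
    test = chosen[n_train + n_val:]
    return train, val, test


# Hypothetical pool of labelled ground truth pixel indices.
pool = list(range(100_000))
train, val, test = split_pixels(pool, 40_000, 20_000, 30_000)
print(len(train), len(val), len(test))
```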
Because the proposed change detection method uses transfer learning, change detection was performed with and without transfer learning to assess its effectiveness. For brevity and to avoid confusion, the proposed change detection method without transfer learning is termed "multiscale Re3FCN without transfer learning" and the proposed change detection method with transfer learning is termed "multiscale Re3FCN with transfer learning". The change detection method with the multiscale 3D filters outperformed the other change detection methods for both study sites. The method without transfer learning produced an overall accuracy of 0.9717, Kappa coefficient of 0.8923, and F1 score of 0.9090 for site 1, and an overall accuracy of 0.9759, Kappa coefficient of 0.9158, and F1 score of 0.9304 for site 2. The multiscale Re3FCN with transfer learning showed the best results among all approaches. It produced an overall accuracy of 0.9790, Kappa coefficient of 0.9201, and F1 score of 0.9326 for site 1, and an overall accuracy of 0.9795, Kappa coefficient of 0.9288, and F1 score of 0.9412 for site 2. The proposed change detection method could detect pixels whose class type changed even though their colors appeared similar in the RGB images. In addition, the spot noises were reduced and the edges of changes were detected clearly.
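As a reference for how overall accuracy, the Kappa coefficient, and the F1 score are commonly computed for a binary change map, the sketch below derives all three from confusion-matrix counts. The counts are hypothetical, and treating the changed class as the positive class for the F1 score is an assumption rather than something stated in the text.

```python
def binary_metrics(tp, fp, fn, tn):
    """Overall accuracy, Cohen's kappa, and F1 score of the positive class.

    tp: changed pixels predicted as changed
    fp: unchanged pixels predicted as changed
    fn: changed pixels predicted as unchanged
    tn: unchanged pixels predicted as unchanged
    """
    total = tp + fp + fn + tn
    oa = (tp + tn) / total
    # Agreement expected by chance, from the prediction and reference marginals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (total * total)
    kappa = (oa - pe) / (1 - pe)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, kappa, f1


# Hypothetical counts for a 200-pixel test set.
oa, kappa, f1 = binary_metrics(tp=40, fp=10, fn=10, tn=140)
print(f"OA={oa:.4f}  Kappa={kappa:.4f}  F1={f1:.4f}")
```

Because unchanged pixels usually dominate a scene, overall accuracy can stay high even when many changed pixels are missed; the Kappa coefficient and F1 score are more sensitive to such errors, which is consistent with the larger spread of those two measures across methods. The sketch does not guard against empty classes, which would cause a division by zero.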

Comparison with Previous Studies
Although the LSTM learns the rules for change detection between temporal data, the images must be flattened for use with the fully connected LSTM network. The LSTM is therefore unsuitable for image analysis because it ignores spatial connectivity, and its large weight matrix increases the computational cost [44]. Consequently, change detection methods using the LSTM, such as LSTM and 2D CNN-LSTM, misclassify changed areas as unchanged more often than FCN-based change detection methods. However, combining the LSTM with a 2D CNN increases the change detection accuracies compared with using the LSTM alone. For example, the improvements in overall accuracy and Kappa coefficient are 4.6% and 0.2057 for site 1 and 7.3% and 0.3168 for site 2. The results show that convolutional layers extract meaningful features from temporal images and that these features improve the change detection accuracies.
When comparing the 2D CNN-LSTM with the Re3FCN, the results show superior performance for the Re3FCN for sites 1 and 2. The two change detection methods differ in that the fully connected LSTM is replaced by the convolutional LSTM and 3D filters are used instead of 2D filters in the convolutional layers. The convolutional LSTM models the temporal dependency of inputs while maintaining the spatial structure, whereas the 3D convolution exploits spatial and spectral information simultaneously [34]. The reflectance pattern across spectral bands is crucial information for high spatial resolution satellite images ranging from the visible to the near-infrared. For example, when the reflected radiation is higher in the near-infrared than in the visible bands, the pixel is likely to contain dense vegetation. Therefore, the spectral information is crucial for change detection using satellite images.
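The near-infrared versus visible contrast mentioned above is what the NDVI, already used in the pansharpening step, quantifies. The following minimal sketch computes it for a single pixel; the reflectance values are hypothetical, and returning 0 for a pixel with no signal is a convention chosen here for illustration.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    denom = nir + red
    if denom == 0:
        return 0.0  # convention chosen here for pixels with no signal
    return (nir - red) / denom


# Hypothetical reflectance values, for illustration only.
print(ndvi(nir=0.50, red=0.08))  # high NIR relative to red: strongly positive
print(ndvi(nir=0.20, red=0.18))  # similar NIR and red: close to zero
```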
The Re3FCN was further developed using multiscale 3D filters and transfer learning. Compared with the original Re3FCN, the improvements in overall accuracy are 1.16% and 1.62% for sites 1 and 2, respectively. Furthermore, the F1 scores for sites 1 and 2 increased by 0.0348 and 0.0422, respectively. The results show that changed pixels are correctly classified as changed by the proposed change detection method. Objects with different shapes and characteristics appear in high spatial resolution images; therefore, multiscale 3D filters assist in extracting meaningful features and improving the change detection results.
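The gains quoted above can be checked directly from the scores reported in the Change Detection Results section. The sketch below recomputes them; the dictionary layout and the function name `gains` are illustrative only.

```python
# Scores reported in the text: (overall accuracy, F1 score) per site.
RE3FCN = {"site 1": (0.9674, 0.8978), "site 2": (0.9633, 0.8990)}
PROPOSED = {"site 1": (0.9790, 0.9326), "site 2": (0.9795, 0.9412)}


def gains(baseline, improved):
    """Return (OA gain in percentage points, F1 gain) for each site."""
    result = {}
    for site, (oa_base, f1_base) in baseline.items():
        oa_new, f1_new = improved[site]
        result[site] = (
            round((oa_new - oa_base) * 100, 2),
            round(f1_new - f1_base, 4),
        )
    return result


print(gains(RE3FCN, PROPOSED))
```

The overall accuracy gains are expressed here as differences in percentage points rather than as relative increases.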

The Effect of Transfer Learning
The multiscale Re3FCN without transfer learning was randomly initialized at the start of training. Conversely, the multiscale Re3FCN with transfer learning used the pre-trained weights and biases of the convolutional layer with multiscale filters from the FCN for semantic segmentation. When the network included the pre-trained convolutional layer with multiscale filters, the change detection results improved slightly. For example, the overall accuracy increased from 0.9717 to 0.9790 and the Kappa coefficient from 0.8923 to 0.9201 for site 1. Furthermore, the F1 scores increased by 0.0236 (site 1) and 0.0108 (site 2). Thus, transfer learning provides more reasonable initial values than random initialization, thereby improving the change detection performance under the same experimental conditions.
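The difference between the two initializations can be sketched in a framework-independent way. In the toy example below, a network is a dictionary of named layers; layers whose name and size match a pre-trained source are copied, and all others are drawn at random. The layer names, sizes, and initialization range are hypothetical, and an actual implementation would rely on the weight-loading mechanism of its deep learning framework.

```python
import copy
import random


def random_layer(n_weights, rng):
    """Randomly initialized layer: small weights and a zero bias."""
    weights = [rng.uniform(-0.05, 0.05) for _ in range(n_weights)]
    return {"weights": weights, "bias": 0.0}


def init_network(layer_sizes, pretrained=None, seed=0):
    """Build a parameter dict, reusing pre-trained layers where they match."""
    rng = random.Random(seed)
    pretrained = pretrained or {}
    params = {}
    for name, n_weights in layer_sizes.items():
        source = pretrained.get(name)
        if source is not None and len(source["weights"]) == n_weights:
            params[name] = copy.deepcopy(source)  # transferred values
        else:
            params[name] = random_layer(n_weights, rng)  # random start
    return params


# Hypothetical layers; only the first (multiscale) layer is shared
# between the segmentation network and the change detection network.
layers = {"multiscale_conv3d": 126, "conv_lstm": 64, "classifier": 2}
source = {"multiscale_conv3d": {"weights": [0.1] * 126, "bias": 0.5}}

scratch = init_network(layers)                         # without transfer learning
transferred = init_network(layers, pretrained=source)  # with transfer learning
print(transferred["multiscale_conv3d"]["bias"], scratch["multiscale_conv3d"]["bias"])
```

Only the initial values differ between the two networks; both are then trained in the same way, which is why the comparison isolates the effect of the starting point.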

Conclusions
In this study, change detection was conducted using an FCN with multiscale 3D filters and convolutional LSTM. As the proposed change detection network detects changes by analyzing the temporal information of feature maps obtained from temporal images, extracting meaningful features can improve the change detection results. Therefore, multiscale 3D filters were used in the initial layer of the change detection network to extract various spatial and spectral features from high spatial resolution images. Furthermore, the filters used values pre-trained on the ISPRS dataset to overcome the lack of training samples. The appropriate combination of 3D filters was determined by analyzing the accuracy by class, and the classification and change detection performance were improved using multiscale 3D filters. The change detection results on the KOMPSAT-3A images were compared with those of the LSTM, 2D CNN-LSTM (with a fully connected LSTM), and Re3FCN; the results revealed that the proposed change detection method outperformed the others. In particular, the change detection results improved when pre-trained values were used.
However, several problems are associated with the proposed change detection method. For example, since it applies multiple 3D filters to temporal images in parallel, the computing cost may increase depending on the learning environment. Furthermore, differences in spatial resolution and class types between the source domain (ISPRS dataset) and the target domain (KOMPSAT-3A) were not considered. To address these problems, future work will develop a transfer learning technique suitable for a broad range of applications. This is expected to improve the use of the large amounts of data extracted for detecting changes in different high spatial resolution satellite images.