Uncertainty Analysis for Object-Based Change Detection in Very High-Resolution Satellite Images Using Deep Learning Network

Object-based image analysis (OBIA) outperforms pixel-based image analysis for change detection (CD) in very high-resolution (VHR) remote sensing images. Although the effectiveness of deep learning approaches has recently been proved, few studies have investigated OBIA and deep learning together for CD. Previously proposed methods use the object information obtained from the preprocessing and postprocessing phases of deep learning. In general, they use the dominant or most frequent label among all the pixels inside an object, without any quantitative criterion for integrating the deep learning network and the object information. In this study, we developed an object-based CD method for VHR satellite images that uses a deep learning network to quantify the uncertainty associated with an object and effectively detect changes in an area without ground truth data. The proposed method defines the uncertainty associated with an object and mainly comprises two phases. Initially, CD objects were generated by unsupervised CD methods, and these objects were used to train a CD network comprising three-dimensional convolutional layers and convolutional long short-term memory layers. The CD objects were updated according to the uncertainty level after the learning process was completed. The updated CD objects were then used as the training data for the CD network. This process was repeated until the entire area was classified into two classes, i.e., change and no-change, in object units, or until a defined number of epochs was reached. Experiments conducted using two different VHR satellite images confirmed that the proposed method achieved the best performance compared with traditional CD approaches. The method was less affected by salt-and-pepper noise and could effectively extract the region of change in object units without ground truth data. Furthermore, the proposed method offers the advantages of unsupervised CD methods and of a CD network subjected to postprocessing by effectively utilizing the deep learning technique and the object information.


Introduction
Object-based image analysis (OBIA) involves the segmentation of an image based on clusters of similar neighboring pixels exhibiting common properties such as spectral, textural, spatial, or topological properties [1]. The objective of OBIA in remote sensing is to provide adequate methods for analyzing very high-resolution (VHR) images with a spatial resolution within 1 m [2]. OBIA methods are often superior to pixel-based image analysis for classification and change detection (CD) in VHR remote sensing images.

Figure 1 depicts the architecture of the proposed method. Two images I t1 and I t2 were acquired from the same region at times t1 and t2, respectively. First, various unsupervised CD methods were applied to I t1 and I t2 to obtain the initial CD map. Then, objects were generated from the two images as the segmentation results of the principal components of these images. The initial CD map comprised three classes: "change", "no-change", and "no-value". The map was reconstructed with respect to the units of objects (CD objects) according to the percentage of each class. The detailed method for obtaining the initial CD map and CD objects is explained in Section 2.1.
Remote Sens. 2020, 12, x FOR PEER REVIEW

Figure 1. Framework of the proposed method. The objects were obtained from principal components (PC) of two temporal images, and the CD objects are used as label data for the deep learning network. The result map of the deep learning network is integrated with the object boundary to update the CD objects, which involve change or no-change classes over a specific percentage for the entire region. This process is repeatedly performed until all the objects of the entire image are classified into change or no-change classes.
The CD objects were fed into the CD network, and the objects were classified as change, no-change, and no-value. Only the change and no-change classes were used as label data. The pixels in the no-value class were masked during training and thus not used to train the network. After the training phase, the network generated a binary map containing the change and no-change classes for the entire image. The CD result map was reconstructed using objects. The reliable objects, in which the percentage of the change or no-change class was greater than a specific value, were selected and added to the initial CD objects. The detailed method of updating the CD objects is explained in Section 2.2. Because the initial CD objects included no-values, the network could not learn from the pixel locations with no-values when trained with the initial data alone. The updated CD objects were therefore used as the training data so that the CD network could be trained over the entire image region. This process was iteratively performed until the percentage of change and no-change classes in all objects exceeded a specific value.
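The iterative update described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `train_network` and `predict_map` are hypothetical stand-ins for the actual CD network, and the string class labels are an illustrative encoding.

```python
import numpy as np

def iterative_object_update(train_network, predict_map, objects, cd_objects,
                            threshold=0.6, max_epochs=10):
    """Iteratively retrain the CD network and update object labels.

    objects:    dict mapping object id -> array of (row, col) pixel coordinates
    cd_objects: dict mapping object id -> 'change' / 'no-change' / 'no-value'
    """
    for _ in range(max_epochs):
        labelled = {k: v for k, v in cd_objects.items() if v != 'no-value'}
        train_network(labelled)            # train on change/no-change objects only
        binary_map = predict_map()         # dense 0/1 prediction for whole image
        for obj_id, coords in objects.items():
            if cd_objects[obj_id] != 'no-value':
                continue                   # reliable objects keep their class
            frac_change = binary_map[coords[:, 0], coords[:, 1]].mean()
            if frac_change >= threshold:
                cd_objects[obj_id] = 'change'
            elif frac_change <= 1.0 - threshold:
                cd_objects[obj_id] = 'no-change'
        if all(v != 'no-value' for v in cd_objects.values()):
            break                          # every object is now classified
    return cd_objects
```

Objects whose predicted class fraction falls between the two thresholds remain no-value and are retried in the next iteration, mirroring the uncertainty criterion of the paper.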

Generating CD Objects
Training data were necessary to learn the change and no-change classes in the temporal images because the CD network in this work performed supervised learning. The quality of the training data considerably influenced the accuracy of the CD result. Figure 2 presents the generation of CD objects as the label data. Although unsupervised CD methods are easily affected by spot noise, they can be applied to images without prior knowledge to perform quantitative analysis. In this study, five pixel-based unsupervised CD methods, which extract changed pixels by measuring the spectral difference between two images, were used to obtain an initial CD map. Different unsupervised CD methods can extract different pixels as changed/unchanged areas because their calculation methods and criteria for judging changes differ. Therefore, to retain the pixels on which the majority of the algorithms agreed, we selected the pixels identified as belonging to the same class by four or more of the five methods as the final initial CD map. In addition, threshold percentages related to the level of uncertainty were determined to reconstruct the initial CD map in units of objects. If the pixels within an object had the same class for more than the determined threshold percentage (in this study, 50%, 60%, and 70% were used as the threshold percentages), the object was classified as belonging to the class of the majority of its pixels. Thus, all the pixels within the object take the class value occupied by the majority of the pixels when the pixels in an object are classified as the same item based on a threshold percentage. Furthermore, an object was defined as no-value when the pixels within it did not share the same class over the threshold percentage. Finally, the CD objects comprised three classes, i.e., change, no-change, and no-value, with all pixels in an object sharing the same value.

Figure 2. Generation of CD objects. The pixels classified as belonging to the same class by four or more of the five unsupervised CD methods were selected to constitute the initial CD map, which was reconstructed in object units with threshold percentages, i.e., specific percentages of pixels in an object; in this study, 50%, 60%, and 70% were used to generate the CD objects.


Initial CD Map Generated from Unsupervised CD Methods
The unsupervised pixel-based CD methods employ a pixel as the basic unit of analysis and can extract changes based on spectral information [4]. Traditional pixel-based CD methods achieve remarkable performance for low- and moderate-resolution satellite images. However, they are often unsuitable for VHR images: they frequently produce salt-and-pepper noise because spot noise is detected as change [28]. Despite these limitations, they are widely used for VHR images because they can be applied easily without prior information on the study site [29,30]. Generally, difference images (DIs), which highlight the spectral difference between two images acquired over the same region at different time points, are generated first. Decision functions, such as thresholding and clustering algorithms, are then used to separate change from no-change in most unsupervised CD methods. Clustering algorithms, such as k-means clustering, have been widely used because selecting appropriate threshold values is difficult, especially when ground truth data are unavailable. In the case of binary CD, the clustering algorithm divides all the pixels of the DI into two classes.
Herein, five traditional methods, namely image differencing, image regression, change vector analysis (CVA), iterative reweighted multivariate alteration detection (IR-MAD), and PCA, were used to generate the DI, and k-means clustering was exploited to differentiate the change class from the no-change class.
Image differencing is an easy method for interpreting CD results. The temporal images I t1 and I t2 are directly subtracted in a pixel-by-pixel manner [31]:

ImageDiff(x, y) = |I t1 (x, y) − I t2 (x, y)|
where ImageDiff is the difference image generated by image differencing and (x, y) are the pixel coordinates. The difference image has the same number of bands as the input images; for example, it will have four bands if the input images have four spectral bands. Image differencing calculates the absolute values of the difference between corresponding pixels in the temporal images, and large values in the difference image represent changed pixels. Because the output is an absolute value, the same value may carry different meanings; thus, a preprocessing step of atmospheric calibration is required.

In image regression, the relation between I t1 and I t2 is established, and the pixel values of I t2 are estimated using a regression function such as least squares regression [32]. The pixels from I t2 are assumed to be a linear function of the pixels from I t1 [31]. In other words, I t1 is the reference image and I t2 is the subject image, and I t2 is adjusted to match the radiometric conditions of I t1 . If Î t2 is the predicted value obtained from the regression line, the difference image can be defined as

ImageRegr(x, y) = |Î t2 (x, y) − I t2 (x, y)|

where ImageRegr represents the difference image generated by image regression. This method can reduce the effect of atmospheric and environmental conditions but requires an accurate regression function [32].

In CVA, the magnitude of change between I t1 and I t2 is calculated. The pixel values are considered as a vector of the spectral bands, and the change vector is calculated by subtracting the vectors of the two dates for all pixels [33]. The magnitude represents the degree of change and can thus be used to distinguish between the change and no-change classes [34]. To apply CVA, preprocessing steps such as data transformation are required.
For example, principal component (PC), tasseled cap, and spectral index transformations can be utilized to generate spectral features from each image pair. CVA can handle any number of spectral bands and produce detailed CD information; however, it is difficult to identify land-cover change trajectories [32]. The change magnitude of CVA was calculated as follows:

CVA(x, y) = sqrt( Σ_{k=1..N} ( I t1,k (x, y) − I t2,k (x, y) )² )

where I t1,k and I t2,k are the kth bands of I t1 and I t2 , respectively, and N is the number of spectral bands of both images.
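The first three difference images can be sketched in NumPy as follows. This is an illustrative sketch, not the paper's code: the regression step here is a simple per-band least-squares fit, and the function names are invented for this example.

```python
import numpy as np

def image_differencing(i_t1, i_t2):
    """Absolute per-band difference image; shape (rows, cols, bands)."""
    return np.abs(i_t1.astype(float) - i_t2.astype(float))

def image_regression(i_t1, i_t2):
    """Predict each band of I_t2 from I_t1 by least squares, then difference."""
    out = np.empty(i_t2.shape, dtype=float)
    for b in range(i_t1.shape[-1]):
        x = i_t1[..., b].ravel().astype(float)
        y = i_t2[..., b].ravel().astype(float)
        slope, intercept = np.polyfit(x, y, 1)      # regression line per band
        pred = slope * i_t1[..., b] + intercept     # predicted I_t2 band
        out[..., b] = np.abs(pred - i_t2[..., b])
    return out

def cva_magnitude(i_t1, i_t2):
    """Change-vector magnitude: Euclidean norm over the spectral bands."""
    diff = i_t1.astype(float) - i_t2.astype(float)
    return np.sqrt(np.sum(diff ** 2, axis=-1))      # shape (rows, cols)
```

Note that image differencing and CVA produce per-band and single-band outputs, respectively, which matches the descriptions above.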
Multivariate data are transformed via PCA to obtain a new set of components and reduce data redundancy. PCA produces new components based on an eigenvector analysis of the covariance matrix, and most of the variance of the original variables is captured by the first few components. For CD of remote sensing data, I t1 and I t2 with α and β bands are stacked into one image with (α + β) bands; the stacked image is then transformed into (α + β) PCs. High correlations between the temporal images are observed in unchanged areas and low correlations in changed areas. In general, the first four components contain the change information [35]. PCA can effectively reduce data redundancy and reveal different change information; however, obtaining a suitable interpretation across different datasets is difficult because the components are scene dependent.
IR-MAD is a regularized, iteratively reweighted MAD method based on canonical correlation analysis. In this method, the coupling vectors exhibiting the highest correlation between two sets of multivariate variables are estimated [36]. MAD finds the differences between linear combinations of the spectral bands of I t1 and I t2 . Thus, a set of N change maps is obtained, where N is the maximum number of bands and each change map is orthogonal to the remaining change maps. The uncorrelated difference images can be sequentially extracted, where each new image shows maximum change under the constraint of being uncorrelated with the previous images [37]. The intensity image of IR-MAD was calculated as follows:

Z(x, y) = Σ_{i=1..N} ( MAD_i (x, y) / σ_MAD_i )²

where MAD_i is the ith MAD variate and σ_MAD_i is its standard deviation.

All pixels in the DI were clustered using the k-means algorithm to produce a binary CD map with two classes, change and no-change. The pixel value "0" denoted the no-change class, whereas the pixel value "1" denoted the change class. k-means clustering iteratively partitions the pixels into two groups, where each pixel belongs to only one class. The pixels are assigned to a cluster based on the sum of the squared distances between the cluster centroids and the pixels. The initial centroids of the k classes were randomly selected, and this process was repeated until the centroids no longer changed.
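The two-class k-means step can be sketched in plain NumPy as follows. This is a minimal illustration rather than the paper's implementation: for determinism, the centroids are initialized at the data extremes instead of randomly.

```python
import numpy as np

def kmeans_binarize(di, iters=50):
    """Two-class k-means on difference-image values; returns 1 = change."""
    x = di.ravel().astype(float)
    c = np.array([x.min(), x.max()])              # deterministic initial centroids
    for _ in range(iters):
        # assign each value to its nearest centroid
        assign = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        new_c = np.array([x[assign == k].mean() if np.any(assign == k) else c[k]
                          for k in range(2)])
        if np.allclose(new_c, c):
            break                                 # centroids stopped moving
        c = new_c
    change = int(np.argmax(c))                    # larger centroid = change class
    return (assign == change).astype(np.uint8).reshape(di.shape)
```

The cluster with the larger centroid is labeled "1" (change), consistent with larger DI values indicating change.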
The traditional methods have unique advantages and disadvantages because each algorithm is based on different principles. Their performance may vary with the input images because the quality of the difference images depends on the input image characteristics and the environmental conditions during image acquisition. Hence, it is difficult to decide on the best method for all cases. Therefore, we integrated the CD results obtained using the various methods and defined as changed those pixels classified as changed by most of the algorithms. CM sum (x, y) is the number of change maps, extracted by k-means clustering of the various methods, in which the pixel at position (x, y) is classified as change. Because there are five methods, the range is 0 ≤ CM sum ≤ 5. The initial change map can be defined using the decision function (Equation (7)):

Initial CD map(x, y) = change, if CM sum (x, y) ≥ 4; no-change, if CM sum (x, y) ≤ 1; no-value, otherwise. (7)

The pixels that were not included in the change and no-change classes have no-value in the initial CD map.
The pixels at (x, y) with CM sum ≥ 4 were classified as the change class, whereas those with CM sum ≤ 1 were classified as the no-change class. In addition, the pixels with 2 ≤ CM sum ≤ 3 were classified as the no-value class, indicating that only two or three of the algorithms classified the pixel as change. Thus, the changes at these pixels are difficult to distinguish.
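The fusion rule above can be written compactly with NumPy. The integer codes 1 / 0 / −1 for change / no-change / no-value are an illustrative encoding chosen for this sketch.

```python
import numpy as np

def fuse_change_maps(change_maps):
    """Majority fusion of five binary change maps into a three-class map.

    Returns 1 = change (CM_sum >= 4), 0 = no-change (CM_sum <= 1),
    and -1 = no-value (2 <= CM_sum <= 3).
    """
    cm_sum = np.sum(np.stack(change_maps, axis=0), axis=0)
    initial = np.full(cm_sum.shape, -1, dtype=np.int8)   # default: no-value
    initial[cm_sum >= 4] = 1                             # agreed change
    initial[cm_sum <= 1] = 0                             # agreed no-change
    return initial
```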

Segmentation of Temporal Images
Image segmentation is the process of partitioning an image into multiple segments. In other words, image segmentation assigns labels to all pixels in an image such that pixels with the same label share certain characteristics. The image objects are treated as superpixels, which can be defined as groups of pixels exhibiting common characteristics. A graph-based segmentation algorithm can be used to effectively generate superpixels [38]. Felzenszwalb and Huttenlocher [39] developed an efficient segmentation algorithm based on graph theory. This algorithm calculates the gradient between two adjacent pixels, weighted according to the pixel properties [40]. It minimizes the differences between the gradients within a segment while maximizing the differences between adjacent segments. The algorithm has three parameters: k, m, and σ. k sets the observation scale, and larger segments are obtained with increasing k. m is the minimum size of the components; a small component is retained only when there is a major difference between neighboring components. σ is the diameter of a Gaussian kernel used to slightly smooth the image prior to segmentation.
In this study, objects were obtained from the PC images generated by PCA using the open-source image-processing toolkit scikit-image in Python. The newly obtained PC images were used as the input of the segmentation algorithm to effectively reflect the change information between I t1 and I t2 simultaneously. σ was always set to 0.8, the default value, which does not visually change the image but helps to eliminate image artifacts [39]. The optimal scale size of the input images can differ with spatial resolution and material type. To analyze the effect of the k value, experiments were conducted with k values of 30, 50, 100, 200, and 300. m is also related to the size of the materials in the input images and was determined empirically. For example, m was set to 60 for the study site where changes occurred in small objects, such as buildings, and to 100 for changes in land cover without any building material.

Reflection of the Uncertainty in an Object Unit
The CD objects can be generated from the initial CD map and the segmentation results. The initial CD map has three pixel-level classes: change, no-change, and no-value. Several thresholds, defined as percentages of the pixel labels in an object, were determined to reconstruct the initial CD map in object units. If the pixels included in an object belong to the three classes in similar proportions, the uncertainty associated with defining the object as one class increases. Therefore, only objects in which more than 50% of the pixels shared the same label were selected to generate CD objects. In addition, when no class of pixels within an object occupies more than the threshold percentage, the object is defined as no-value. If the threshold percentage is large, reliable objects are selected as CD objects because most of the pixels in each object share the same class; however, the number of objects satisfying this condition decreases. To obtain the optimal threshold, the CD objects were generated using percentages from 50% to 70% at an interval of 10%, and these conditions were categorized as uncertainty Levels 3, 2, and 1, respectively. The higher the level of uncertainty, the lower the percentage of pixels in the object sharing the same class. However, the number of samples and the quality of the data are equally important because the CD objects are used as training labels for the CD network. Therefore, we did not consider percentages of 80% or more because insufficient CD objects were available at such thresholds.
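The object-level reconstruction can be sketched as follows. The helper name is hypothetical, and the codes 1 / 0 / −1 again stand for change / no-change / no-value.

```python
import numpy as np

def objects_from_map(initial_map, segments, threshold=0.6):
    """Assign each segment its majority pixel class if the majority fraction
    reaches `threshold`; otherwise mark the whole object as no-value (-1).

    initial_map: per-pixel classes, 1 = change, 0 = no-change, -1 = no-value
    segments:    integer label image from the segmentation step
    """
    cd_objects = np.full_like(initial_map, -1)
    for obj_id in np.unique(segments):
        mask = segments == obj_id
        pixels = initial_map[mask]
        classes, counts = np.unique(pixels, return_counts=True)
        best = counts.argmax()
        if counts[best] / pixels.size >= threshold and classes[best] != -1:
            cd_objects[mask] = classes[best]   # reliable object: majority class
    return cd_objects
```

Raising `threshold` yields fewer but more reliable CD objects, which is the trade-off between uncertainty levels discussed above.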

Updating CD Object
CD objects were used as label data for training the CD network. Figure 3 shows the architecture of the CD network, which comprises 3D and 2D convolutional layers and convolutional long short-term memory (LSTM) layers. The 3D convolutional layers can extract spatial and spectral feature maps from hyperspectral or multispectral images, and the convolutional LSTM can analyze the temporal relation between two images [27,41]. Convolutional LSTM is a modification of the conventional LSTM in which the matrix multiplication operators are replaced with convolution operators [42]. Convolutional LSTM is more suitable than conventional LSTM for remote sensing images because, in the conventional LSTM, the size of the weight matrix increases the computational cost and spatial connectivity is ignored [41].
Figure 3. The CD network architecture. The network comprises 3D convolutional layers to extract spatial and spectral features and convolutional LSTM layers to analyze the temporal relation between two features. Finally, two more 2D convolutional layers are considered to calculate the score map. Two temporal images and the CD objects were used as the input data. After the training step, the network produces a binary CD map. w, h, and λ represent the width, height, and number of spectral bands, respectively.
Ω c and Ω u are the change and no-change classes, respectively. ω c , ω u , and ω n are the change, no-change, and no-value classes of the initial CD map, respectively.
The training samples were randomly extracted from the temporal VHR images and the CD objects. Each training sample was a 3D patch with dimensions of w × h × λ, where w and h are the lengths of the column and row, respectively, and λ is the number of spectral bands. The size of the 3D patches was empirically set to 10 × 10 × λ in this study. The central points of the 3D patches were randomly extracted only at the locations of the change and no-change classes of the CD objects. We selected 40,000 pixels as training data, 20,000 pixels as validation data, and 30,000 pixels as testing data. Because the convolutional layers exploited information from neighboring pixels and the training and validation pixels were extracted from the same CD objects, their features were likely to overlap owing to the shared source of information [43]. Overlap between the training and validation data can introduce an intrinsic positive bias into the CD result. However, because the images of the study areas in this work contain relatively few pixels (e.g., 1200 × 1200 pixels), extracting patches without overlap would have reduced the number of training patches. Therefore, the data for network training were randomly extracted to increase the amount of training data.
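The patch sampling step can be sketched as follows. This is an illustrative helper, not the paper's code; the class codes 1 / 0 / −1 again denote change / no-change / no-value, and patches are drawn with replacement as in the overlapping-extraction scheme described above.

```python
import numpy as np

def sample_patches(image_t1, image_t2, cd_objects, n_samples, size=10, seed=0):
    """Randomly draw size x size x bands patch pairs centred on labelled pixels."""
    rng = np.random.default_rng(seed)
    half = size // 2
    rows, cols = np.where(cd_objects >= 0)        # change/no-change pixels only
    # keep centres whose patch lies fully inside the image
    keep = ((rows >= half) & (rows < image_t1.shape[0] - half) &
            (cols >= half) & (cols < image_t1.shape[1] - half))
    rows, cols = rows[keep], cols[keep]
    idx = rng.choice(rows.size, size=n_samples, replace=True)
    patches_t1, patches_t2, labels = [], [], []
    for r, c in zip(rows[idx], cols[idx]):
        patches_t1.append(image_t1[r - half:r + half, c - half:c + half])
        patches_t2.append(image_t2[r - half:r + half, c - half:c + half])
        labels.append(cd_objects[r, c])
    return np.stack(patches_t1), np.stack(patches_t2), np.array(labels)
```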
After the training samples were extracted, two patches captured from the same location in the two temporal images were separately fed into the 3D convolutional layers in parallel. The filter size of these convolutional layers was 3 × 3 × 3, which is the optimal choice for 3D convolution in spatiotemporal feature learning [44]. Next, the spatial-spectral feature maps were fed into the convolutional LSTM layers to encode the temporal information and the change rules. The outputs from the convolutional LSTM layers were passed through 2D convolutional layers with 3 × 3 filters to generate a score map. The final number of feature maps was equal to the number of classes. The binary cross entropy L was used as the loss function of the network and can be defined as

L = −(1/n) Σ_{i=1..n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]

where n is the number of samples, y_i is the ground truth value, and ŷ_i is the predicted value. Finally, the pixels were classified into the change or no-change class according to the score map.

The binary CD result map obtained using the CD network was integrated with the segmentation results. The CD objects representing meaningful classes, i.e., change or no-change, were used as training samples for the CD network, and these objects retained their existing properties. On the contrary, the objects having no-values must be updated. The process of updating the CD objects is described in Figure 4. The binary CD result map was integrated with the object units, and the CD objects with no-values were selected as the candidate area to be updated. The CD result map was reconstructed according to the class percentage of the pixels within an object to consider the uncertainty of an object when updating the CD objects. The process of reconstructing the object class follows the same rule and threshold as those used for the previously generated CD objects.
For example, if the previous CD objects were generated with a threshold of 60%, only the objects in which more than 60% of the pixels belong to the change or no-change class are selected. The objects that do not satisfy this condition are assigned no-values. The CD objects were updated by combining the previous CD objects and the newly generated objects. The updated CD objects were also fed into the CD network, and the network was trained using the randomly extracted samples of the CD objects. This process was repeatedly performed until all the objects in the area were classified into the change or no-change class.
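The per-object reconstruction rule above can be sketched as follows. This is a minimal sketch assuming NumPy arrays, where `segments` holds segment IDs, `cd_map` is the binary CD result from the network (1 = change, 0 = no-change), and -1 encodes the no-value class; the function name is an illustrative assumption.

```python
import numpy as np

def update_cd_objects(segments, cd_map, threshold=0.6):
    """Assign each segment the change (1) or no-change (0) class when the
    fraction of its pixels with that class in the binary CD map exceeds
    `threshold`; otherwise the segment keeps the no-value label (-1)."""
    objects = np.full(segments.shape, -1, dtype=int)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        frac_change = np.mean(cd_map[mask] == 1)
        if frac_change > threshold:
            objects[mask] = 1
        elif (1.0 - frac_change) > threshold:
            objects[mask] = 0
    return objects
```

With threshold = 0.6, a segment whose pixels are evenly split between change and no-change stays at -1 and becomes a candidate for the next update round, which is exactly the behavior the iteration relies on.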
Figure 4. The process of updating CD objects. CD objects are fed into the CD network, and the network produces a binary CD map, which can be divided into "noncandidate objects for updating" and "candidate objects for updating". After uncertainty analysis, the selected objects were added to the previous CD object. Ω c and Ω u are the change and no-change classes, respectively. ω c , ω u , and ω n are the change, no-change, and no-value classes of the initial CD map, respectively.


Performance Evaluation
The classification performance can be assessed using the confusion matrix, also known as the error matrix. Binary CD can be interpreted as a classification task involving two classes. The confusion matrix is a 2 × 2 table that contains the four outcomes produced by a binary classifier. To evaluate the performance of the CD methods, various performance measures showing how accurately the classifier identifies the objects, such as the overall accuracy (OA), precision, recall, and F1 score, can be calculated from the confusion matrix. The test data were used to calculate the accuracy of the CD methods. OA represents the proportion of accurately classified predictions with respect to the observations, and it can be described in terms of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP).

OA is a simple methodology for evaluating the classification accuracy and works well when FP and FN exhibit similar costs. However, when the class distribution is dissimilar, FP and FN are considerably different, and OA is inappropriate for showing the effectiveness of the result. In this case, the F1 score is a better way to evaluate the results. The F1 score is the harmonic mean of precision and recall (Equation (10)). Precision can be obtained by dividing the total number of accurately classified positive pixels by the total number of predicted positive pixels, whereas recall is the ratio of the total number of accurately classified positive pixels to the total number of positive pixels (Equations (11) and (12)). In addition, the negative predictive value (NPV) represents how many of the predicted negative pixels are true negatives (Equation (13)).
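These measures follow directly from the four confusion-matrix counts; a minimal sketch (the equation numbers in the comments refer to the equations cited in the text above, and the function name is illustrative):

```python
def cd_metrics(tp, tn, fp, fn):
    """Standard measures from the 2 x 2 confusion matrix of a binary CD result."""
    oa = (tp + tn) / (tp + tn + fp + fn)                 # overall accuracy
    precision = tp / (tp + fp)                           # Equation (11)
    recall = tp / (tp + fn)                              # Equation (12)
    f1 = 2 * precision * recall / (precision + recall)   # Equation (10)
    npv = tn / (tn + fn)                                 # Equation (13)
    return oa, precision, recall, f1, npv
```

For instance, with TP = 40, TN = 50, FP = 5, and FN = 5, the OA is 0.90 while the F1 score is about 0.89, illustrating how the two measures can diverge under class imbalance.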

Dataset
Multispectral VHR images of two sites were used for CD ( Figure 5) [21]. The temporal images of Site 1 were acquired from WorldView-3 with a spatial resolution of 1.24 m and 8 bands. The images were acquired from Gwangju city in South Korea. This city includes industrial areas, residences, agricultural lands, rivers, and changed regions because of large-scale urban development. The temporal multispectral images (I t1 and I t2 ) were acquired on 26 May 2017, and 4 May 2018, respectively. The Site 2 images were acquired from KOMPSAT-3 multispectral sensor images having a spatial resolution of 2.8 m and 4 bands. I t1 and I t2 were acquired on 16 November 2013, and 26 February 2019, respectively, from the area located over Sejong city in South Korea. This area is an administrative city in South Korea and has been developing since 2007; a central administrative agency has also relocated to this area. Large-scale high-rise buildings and complexes have been constructed in a short period, considerably changing the image pair.
To perform effective and reliable CD, accurate geometric preprocessing, such as orthorectification, should be performed on the multitemporal VHR images to minimize geometric misalignment [45,46]. Image I t2 was coregistered to the coordinates of image I t1 by applying the phase-based correlation method [47] with an improved piecewise linear transformation warping [48]. We did not perform pansharpening because the spatial resolution of the images was sufficient to describe the scenes in detail.
The ground truth data were manually obtained based on various VHR web maps, the values of the normalized difference vegetation index, and field surveys. We defined changes as locations where the land cover class had changed, such as from vegetation to bare soil. The land cover classes were defined as vegetation, bare soil, buildings, water, and roads. Vegetation was defined as crop land and trees with high vegetation vitality. Structures having height were defined as "buildings". "Bare soil" represented ground without buildings and vegetation (or areas with very low vegetation vitality), and "roads" encompassed asphalt roadways. Changes owing to relief displacement and shadows were not considered as changes in the ground truth data. In particular, the greenhouses of Site 1 appear in different colors in the temporal images depending on the influence of light and internal materials. Such differences do not denote a changed area; instead, the areas in which a greenhouse was newly constructed on bare soil were selected as changed areas in the ground truth map. Moreover, slight differences in vegetation vitality owing to seasonal differences were not considered as changes.


Results
The experiments were conducted on Sites 1 and 2 having different VHR satellite images. Initially, the CD objects were generated using various unsupervised CD methods and segmented objects. Further, we compared the CD results for various threshold percentages, including 50%, 60%, and 70%. After the CD objects were generated, the CD networks were trained using the CD objects. In this study, the final epoch was set to 200 in the case of the Adam optimizer with a learning rate of 10 −4 and a batch size of 256. The CD objects were iteratively updated after every 50 epochs. The pixel-based CD results generated from the CD network were compared to show the effectiveness of the proposed method.

Generation of CD Objects
The CD objects were generated by integrating the initial CD map with the segmented objects. The initial CD maps were obtained from the unsupervised CD results using the decision function (Equation (7)). The pixels equally determined to denote a changed or unchanged area in more than four methods were extracted as the initial CD maps. Figures 6 and 7 show the CD results generated using the various unsupervised CD methods, and Table 1 shows the accuracies of the results. Image regression denotes the lowest accuracies at both sites. PCA and CVA have the highest accuracies at Sites 1 and 2, respectively. The materials and properties of the changes at the study sites can affect the CD results. PCA can detect the changes from vegetation to bare soil, and image differencing can detect newly constructed buildings. Because the unsupervised CD methods are pixel-based methods, they only consider the spectral difference between two images. Therefore, the salt-and-pepper noise and the differences caused by shadows and light were detected as the changed class. For example, in the case of Site 1, the greenhouses, whose color appears different depending on the light and inner materials, were also extracted as the changed area. Furthermore, in the case of Site 2, most CD methods classified the shadows around buildings as the changed class.

Scale parameter k values of 30-300 were applied to the input images, and Figure 8 shows the segmentation results with the optimal parameter value. Because Site 2 involves changes in buildings, a low value was more effective in distinguishing between building objects. The segmented results of I t1 (Figure 8a,e) cannot reflect the object information at the time of change. For example, newly built buildings and bare soil areas were not reflected in the earlier acquired images. On the contrary, the segmented results of I t2 (Figure 8b,f) contain current information, such as newly constructed materials, but do not reflect the previous state of the area.
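The decision rule above, in which a pixel enters the initial CD map only when at least four of the five unsupervised methods agree, can be sketched as follows. The function name and the class encoding (1 = change, 0 = no-change, -1 = no-value) are illustrative assumptions.

```python
import numpy as np

def initial_cd_map(cd_maps, min_agree=4):
    """Build the initial CD map from several binary unsupervised CD results
    stacked along axis 0. A pixel becomes change (1) or no-change (0) only
    when at least `min_agree` methods agree; otherwise it is no-value (-1)."""
    votes = np.sum(cd_maps == 1, axis=0)   # number of methods voting "change"
    n = cd_maps.shape[0]
    out = np.full(cd_maps.shape[1:], -1, dtype=int)
    out[votes >= min_agree] = 1
    out[(n - votes) >= min_agree] = 0
    return out
```

Pixels on which the methods disagree remain no-value, so they contribute no (possibly noisy) labels to the CD network until a later update assigns them a class.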
Figure 8c,g shows the colored composite of PCs with the segmented results of Sites 1 and 2, respectively. The advantage of using the PCs extracted from the stacked images is that they can consider the temporal image information. Therefore, we subjected the combination of PCs to segmentation and used the segmented results to generate CD objects.

We reconstructed the initial CD map using the segmented results to effectively generate CD objects. Figure 9 shows the CD objects with different uncertainty levels. Tables 2 and 3 represent the number of pixels in the three classes at each uncertainty level and the accuracy of the CD objects. The uncertainty Levels 1, 2, and 3 indicate the threshold percentages of 70%, 60%, and 50%, respectively. With increasing uncertainty levels, the number of CD objects representing the ω c or ω u classes increases. On the contrary, the number of objects classified as ω n increases with decreasing uncertainty levels. In addition, the accuracies of ω c and ω u were improved with decreasing uncertainty levels.

CD Result of Traditional Approaches Using the CD Network
Generally, object information can be obtained from the preprocessing and postprocessing phases in which the input images are modified to obtain object information or refine the output with object units. We compared the CD results of the traditional approaches to confirm the effectiveness of the proposed CD method.
We defined four different cases using the CD network; the detailed experimental conditions are described as follows:
• Case 1: The original multitemporal images were used as the input data, and the initial pixel-level CD map was used as the label data for the CD network. After training, the network produced a pixel-level CD map. In this case, object information could not be obtained.
• Case 2: The original multitemporal images were used as the input data, and the CD objects generated in Section 4.1 were used as the label data to train the CD network. In other words, the object information was reflected in the preprocessing phase, and the network produced a pixel-level CD map.
• Case 3: The segmentation image, in which each object has a unique value, was added to the original images. Thus, a band containing object information was stacked onto the existing bands. The new images with one additional band were used as the input data, and the initial pixel-level CD map was used as the label data for the CD network.
• Case 4: The result of Case 1 was subjected to postprocessing. In this case, each object was reclassified as the most dominant class of the pixels within the object.
The experimental settings of the CD network and input materials were similar to those in the proposed method. Figure 10 shows the CD results, and Table 4 describes the accuracies. Comparing Cases 1 and 2, the CD accuracies at both sites were improved by adding object information via initial CD map reconstruction. In addition, when stacking the segmentation band onto the original images as one image (Case 3), the accuracies decreased because the salt-and-pepper noise increased; these noisy pixel changes can be attributed to shadows and light differences. The CD accuracies were the highest in Case 4, in which the noise within the image was eliminated. However, at Site 2, most of the building objects were also removed. This is because the shape of the buildings was underestimated in Case 1; therefore, in the postprocessing step, these objects were reclassified as unchanged objects. Consequently, the OAs of Sites 1 and 2 are similar, but the F1 score of Site 2 is lower than that of Site 1.

Figure 10. CD results of the four cases. Ω c and Ω u are the change and no-change classes, respectively.

CD Result of the Proposed Method
In the proposed method, CD objects were used to train the CD network; further, the binary CD map generated from the network was used to update CD objects. In this step, the results of CD objects were obtained depending on the uncertainty level. To obtain the optimal uncertainty level, we compared the CD results obtained under different conditions. The CD network was trained until epoch 200, and the CD objects were updated after every 50 epochs. e 0 , e 1 , e 2 , e 3 , and e 4 are the updating points and represent epochs 0, 50, 100, 150, and 200, respectively. When all the CD objects were assigned changed or unchanged classes before e 4 , the learning process was completed. Moreover, we set the learning process as finished even when the CD objects still contained a no-value class after conducting the learning process until e 4 .
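The iterative scheme above (train for 50 epochs, update the CD objects from the resulting binary CD map, and stop early once no no-value objects remain or e 4 is reached) can be sketched as follows. Here `train_fn` and `update_fn` are hypothetical placeholders for the CD network training step and the object-updating step, not functions from the paper.

```python
import numpy as np

def iterative_object_cd(cd_objects, train_fn, update_fn,
                        total_epochs=200, interval=50):
    """Iterative training/updating scheme: train the CD network on the
    current CD objects for `interval` epochs, update the objects from the
    resulting binary CD map, and stop early once no no-value (-1) labels
    remain. `train_fn(objects, epochs)` returns a binary CD map;
    `update_fn(objects, cd_map)` returns the updated CD objects."""
    for _ in range(0, total_epochs, interval):
        cd_map = train_fn(cd_objects, interval)
        cd_objects = update_fn(cd_objects, cd_map)
        if (cd_objects != -1).all():
            break  # every object assigned change or no-change
    return cd_objects
```

If no-value objects remain after the final interval (epoch 200 in the paper), the loop simply ends, matching the stopping condition described above.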
To analyze the effects of the uncertainty level in various cases, we present three conditions: (1) maintaining the same level during every update phase; (2) increasing the level as the update progresses; and (3) decreasing the level during the update phase. Figures 11-16 show the CD results obtained at Sites 1 and 2, and Table 5 shows the accuracies of all the cases. When maintaining the same level during the updating process (Figures 11 and 12), the CD results with uncertainty Level 3, in which an object is updated when more than 50% of its pixels have the same class, were fully generated within e 1 . On the contrary, the CD result could not cover the whole area when the proposed method was used with uncertainty Levels 2 and 1. This is because, as the threshold percentage increases (i.e., at lower uncertainty levels), a larger proportion of the pixels in an object must have the same class to update the CD objects; thus, it is difficult to update the objects. In addition, because the CD network was trained with the updated CD objects, the CD result maps generated from the CD network would not have been considerably different if there were only slight differences between the updated and previous objects. However, the accuracies of the CD results with Levels 2 and 1 were higher than that of the CD result with Level 3 (Table 5) for both sites. For example, the CD results with Level 2 resulted in an OA of 0.9174 and an F1 score of 0.8542 for Site 1 and an OA of 0.9006 and an F1 score of 0.7353 for Site 2.
Although the CD results obtained with uncertainty Level 3 can produce CD objects via one updating phase, the accuracies of the CD results were lower than those of the remaining cases. On the contrary, the CD results with Levels 1 and 2 were not valid for the CD objects in the entire area; however, the generated CD objects could achieve increased CD accuracies. Therefore, the uncertainty level was changed during the update process.

Figures 13 and 14 show the CD results obtained using an increased uncertainty level during the update. Thus, the proportion of pixels having the same class that can define CD objects is gradually reduced. In this case, the uncertainty of the objects increases; however, all the objects in the input image can be classified into the change or no-change class. The CD results with the uncertainty increased from Level 1 to Level 3 show OA = 0.8611 and F1 score = 0.7218 at Site 1 and OA = 0.8859 and F1 score = 0.6770 at Site 2. Further, when the uncertainty changed from Level 2 to Level 3, the accuracies became the highest, e.g., OA = 0.9299 and F1 score = 0.8745 at Site 1 and OA = 0.9012 and F1 score = 0.7347 at Site 2 (Table 5).

Figures 15 and 16 show the CD results obtained using a decreased uncertainty level during the updating process. Thus, the proportion of pixels having the same class to define CD objects is gradually increased. In particular, several CD objects could not be updated when the uncertainty level was decreased to Level 1 (Figures 15b and 16). In addition, the CD objects generated at e 1 -e 4 were similar, indicating that only some CD objects were newly added during the updating process. The CD results with the uncertainty decreasing from Level 3 to Level 1 resulted in OA = 0.8735 and F1 score = 0.7498 at Site 1 and OA = 0.8987 and F1 score = 0.7348 at Site 2. In addition, when the uncertainty changed from Level 2 to Level 1, OA = 0.9185 and F1 score = 0.8558 at Site 1 and OA = 0.8957 and F1 score = 0.7778 at Site 2 (Table 5).

Comparison with Traditional CD Approaches
The changed pixels can be extracted using the unsupervised CD methods, without any training data, based on the spectral difference between the temporal images. However, these methods tend to perceive shadows and changes in the color of a substance caused by atmospheric effects as real changes. Therefore, spots and salt-and-pepper noise could be observed in the CD result maps. Depending on the objective of the study, unsupervised CD methods can be appropriate for finding changes; however, they may be unsuitable for extracting only the changes of the land cover materials. At Site 1, PCA showed the highest accuracies (OA = 0.8471 and F1 score = 0.7620), whereas CVA had the highest accuracies at Site 2 (OA = 0.8716 and F1 score = 0.7285).
CD was performed using the CD network in the four different cases. Compared with the unsupervised CD methods, the CD result maps produced by the CD network contained only a little salt-and-pepper noise, and the shadows around the buildings were not extracted as the changed area, because the CD network used as label data the initial CD map or the generated CD objects, in which shadows were not classified as the changed class. However, small objects such as buildings were underestimated. In particular, small changed objects could be reclassified as no-change objects during postprocessing. The accuracies with postprocessing were the highest at both sites (OA = 0.8874 and F1 score = 0.7921 at Site 1 and OA = 0.8831 and F1 score = 0.6560 at Site 2).
The proposed method denotes the advantages of unsupervised CD methods and CD network with postprocessing. This method can generate training data for a deep learning network by generating label data even in areas in which prior information about changes is not available. Unlike the unsupervised CD methods, the proposed method does not produce salt-and-pepper noise and the shadows around trees and buildings were not extracted as the changed class. Furthermore, compared with the CD network using postprocessing, the proposed method can appropriately extract the shape of a building. The accuracies of the proposed method were OA = 0.9299 and F1 score = 0.8745 at Site 1 and OA = 0.9012 and F1 score = 0.7347 at Site 2.

The Effect of Uncertainty Level
When generating CD objects, a high uncertainty level indicates that the objects require a low percentage of same-class pixels within an object. Thus, the higher the level, the more easily the objects can be assigned to the change or no-change class, even if the pixels within the object have different classes. Therefore, the number of CD objects available to train the CD network increases, and many objects are updated at the defined epoch. Conversely, the required percentage of same-class pixels in the object increases if the uncertainty level is low. Therefore, the reliability of an object representing change or no-change increases. However, in this case, the number of objects to be updated is small; therefore, the CD network results were hardly changed. Consequently, not all the objects in the image were classified into the change or no-change class even when increasing the epoch at uncertainty Levels 1 and 2.
The experimental results showed that uncertainty Level 2 was appropriate for the two sites, and gradually increasing the uncertainty level (i.e., relaxing the threshold percentage) during the update was considerably effective; in this case, the accuracies were the highest. In addition, a large number of CD objects could be used as training data when the CD network was trained, because the objects were not updated significantly when the training data changed little.

The Effect of Segmentation Scale
The scale parameter is the most important factor during the segmentation process. In this study, different scale factors were applied to each experimental site because the optimal values may vary depending on the shape and size of the material in the image. The proposed method with different scales k was applied to analyze the effect of the scale factors. In the experiments, the uncertainty level was set based on the highest accuracies among the results in Section 4.3.
Figure 17 shows the CD result obtained using different scales overlapping with the segment boundaries, and Table 6 gives the accuracies of the CD result maps. According to the results, the scale value can affect the accuracies of the proposed CD method. Optimal scale values can be observed at both the sites (Figure 18). k = 200 was the most effective value at Site 1, where bare soil changes were dominant, and k = 50 was the most effective at Site 2, where building changes were dominant. The optimal value k was related to the minimum size of the object of change to be extracted from the study site. If k was set to small regardless of the minimum size of changes, the CD objects would be inaccurate because the pixels in a small object can easily have the same class. Furthermore, if k becomes too large, the places at which change and no-change occurred can be considered as one object. Therefore, it is important to select an appropriate k based on the size of the object to be changed.
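The trade-off for large k can be illustrated with a toy majority-vote example: when a segment is larger than the smallest changed region, the change pixels are outvoted by the surrounding no-change pixels and disappear from the object-level map. The two segmentations below are hand-built stand-ins, not outputs of the actual segmentation algorithm.

```python
import numpy as np

# Toy 4x4 ground-truth change mask containing one small 2x2 change.
truth = np.zeros((4, 4), dtype=int)
truth[:2, :2] = 1

def majority_label(mask, segments):
    """Label every segment by the majority class of its pixels."""
    out = np.empty_like(mask)
    for s in np.unique(segments):
        m = segments == s
        out[m] = int(mask[m].mean() >= 0.5)
    return out

# Small k (fine segments): the 2x2 change region is its own segment,
# so majority voting preserves it.
fine = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 3, 3],
                 [2, 2, 3, 3]])

# Large k (one coarse segment covering the whole image): the change
# pixels are a minority (4 of 16) and are voted away.
coarse = np.zeros((4, 4), dtype=int)
```

In this toy case, `majority_label(truth, fine)` retains all four change pixels, whereas `majority_label(truth, coarse)` loses the change entirely, mirroring why k must be chosen relative to the minimum size of the changed objects.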

Limitations and Future Work
In the proposed method, the initial CD map was generated by various unsupervised CD methods. Although only the pixels assigned the same class by at least four of the five methods were used, to reduce the effect of noise and shadow, the initial CD map was still dependent on the nature of the unsupervised CD methods. For example, regions that changed from low vegetation to bare soil could not be extracted as the change class in the initial CD map because the spectral difference between the two materials is minor. In addition, buildings with dark roofs similar to the cement ground surface were not classified as the change class in the initial map because their spectral differences from the ground surface were not significant. Since the accuracy of the CD objects can affect the final CD results, it is important to produce well-qualified CD objects. To address this issue, it might be helpful to add improved unsupervised CD methods that can handle fine spectral differences when constructing the initial CD map.
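The at-least-four-of-five voting rule for building the initial CD map can be sketched as below. The five binary maps are synthetic placeholders standing in for the outputs of the actual unsupervised CD methods; pixels without sufficient agreement remain no-value, which is what later makes the uncertainty-driven update necessary.

```python
import numpy as np

# Five synthetic binary change maps (1 = change, 0 = no-change),
# stacked along the first axis, standing in for the five methods.
maps = np.array([
    [[1, 0], [1, 0]],
    [[1, 0], [1, 0]],
    [[1, 0], [1, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
])

votes = maps.sum(axis=0)               # number of methods voting "change"
initial = np.full(maps.shape[1:], -1)  # -1 = no-value (insufficient agreement)
initial[votes >= 4] = 1                # >= 4 of 5 methods agree on change
initial[votes <= 1] = 0                # >= 4 of 5 methods agree on no-change
```

Here the pixel with only three "change" votes stays no-value, illustrating how disagreement among the unsupervised methods propagates into unlabeled regions of the initial CD map.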
Furthermore, because the initial CD objects can contain no-value regions, the deep learning network could not achieve the performance it would have attained if trained on the entire region. Therefore, we plan to develop a method that provides appropriate initial values when learning with limited data, based on transfer learning that exploits information learned from similar tasks; the CD objects constructed at uncertainty Level 1 will then become usable. Finally, a method to automatically find the optimal scale k, which considerably affects the performance of the proposed method, would reduce the difficulty of determining this value empirically and allow the proposed method to be applied to a wider range of study sites.

Conclusions
A novel object-based CD method is proposed to detect changes in VHR satellite images using deep learning networks without requiring ground truth data. The proposed method generated a pixel-based initial CD map using various unsupervised CD methods; this map was then reconstructed to produce CD objects with three classes: change, no-change, and no-value. To update the no-value objects, only the two labeled classes (change and no-change) of the CD objects were used as training data for the CD network. Objects were then defined and updated according to the uncertainty level, and the updated CD objects were used as training data for the next iteration of the CD network. This process was conducted iteratively until the entire area was classified into two classes in object units or the defined epoch was reached. The experiments on Worldview-3 and KOMPSAT-3 datasets confirmed that the proposed method achieved the best performance when compared with the traditional CD approaches. In particular, uncertainty Level 2 was appropriate, and the changes at both sites could be detected by decreasing the uncertainty level during the updating process. However, the performance of the proposed method can depend on the scale size; therefore, the optimal value should be established by considering the minimum size of the changed objects to be extracted from the study site. Future work on automatically detecting the optimal scale size and on transfer learning is being conducted to overcome the limitation of insufficient training data caused by no-values in the CD objects.