Unsupervised Change Detection from Remotely Sensed Images Based on Multi-Scale Visual Saliency Coarse-to-Fine Fusion

Abstract: Unsupervised change detection (CD) from remotely sensed images is a fundamental challenge when the ground truth for supervised learning is not easily available. Inspired by the visual attention mechanism and the multi-level sensation capacity of human vision, we propose a novel multi-scale analysis framework based on multi-scale visual saliency coarse-to-fine fusion (MVSF) for unsupervised CD. As a preface to MVSF, we generalize the connotations of scale in the field of remote sensing (RS) into four classes covering the RS process from imaging to image processing: intrinsic scale, observation scale, analysis scale and modeling scale. In MVSF, superpixels are considered as the primitives for analysing the difference image (DI) obtained by the change vector analysis method. Then, multi-scale saliency maps at the superpixel level are generated according to the global contrast of each superpixel. Finally, a weighted fusion strategy is designed to incorporate multi-scale saliency at the pixel level. The fusion weight for a pixel at each scale is obtained adaptively by considering the heterogeneity of the superpixel it belongs to and the spectral distance between the pixel and the superpixel. The experimental study was conducted on three bi-temporal remotely sensed image pairs, and the effectiveness of the proposed MVSF was verified qualitatively and quantitatively. The results suggest that it is not entirely true that a finer scale brings a better CD result, and that fusing multi-scale superpixel based saliency at the pixel level achieved a higher F1 score in all three experiments. MVSF is capable of maintaining the detailed changed areas while resisting image noise in the final change map. Analysis of the scale factors in MVSF implied that the performance of MVSF is not sensitive to the manually selected scales.


Introduction
As the fast development of imaging technology gives rise to easy access to image data, dealing with bi-temporal or multi-temporal images has attracted great attention. Change detection (CD) aims to detect changes from multiple images covering the same scene at different time phases. Two main branches of the scholarly community have been working on this problem: computer vision (CV) and remote sensing (RS). The former analyses the changes among multiple natural images or video frames for further applications such as object tracking, visual surveillance and smart environments [1]. By contrast, the latter is engaged in obtaining the spatiotemporal changes of geographical phenomena or objects on the earth's surface, for applications such as land cover/use change analysis, disaster monitoring and ecological environment monitoring. CD based on RS images usually faces more difficulties than CD on natural images because of the intrinsic characteristics of the various RS data sources, including multi/hyper-spectral (M/HS) images, high spatial resolution (HSR) images and synthetic aperture radar (SAR) images.
In this paper, we take a fresh look at the CD procedure. We consider it as a visual process when people conduct CD manually from remotely sensed images. Accordingly, we built a new unsupervised CD framework inspired by the characteristics of the human visual mechanism. When people delineate a CD map manually, they are able to focus on the changed areas quickly by watching both images repeatedly or in a flicker manner. Then, detailed changes can be found and delineated if attention is put on the changed areas of various sizes. We attribute this sophisticated visual procedure to the visual attention mechanism and the multi-level sensation capacity of the human visual system.
Visual attention refers to the cognitive operations that allow people to select relevant information and filter out irrelevant information from cluttered visual scenes efficiently [17]. For an image, the visual attention mechanism helps people focus on the regions of interest efficiently while suppressing the unimportant parts of the image scene. The multi-level sensation capacity helps people incorporate multi-level information and sense the multi-scale objects in the real world [18]. Inspired by this, we propose a novel unsupervised change detection method based on multi-scale visual saliency coarse-to-fine fusion (MVSF), aiming to develop an effective visual saliency based multi-scale analysis framework for unsupervised change detection. The main contributions of this paper are as follows.
• We generalized the connotations of scale in remote sensing as four classes, including intrinsic scale, observation scale, analysis scale and modeling scale, which cover the remote sensing process from imaging to image processing.
• We designed multi-scale superpixel based saliency detection to imitate the visual attention mechanism and multi-level sensation capacity of the human vision system.
• We proposed a coarse-to-fine weighted fusion strategy that incorporates multi-scale saliency information at the pixel level for noise elimination and detail keeping in the final change map.
The remainder of this paper is organized as follows. We elaborate on the background and motivation of the proposed framework in Section 2. Section 3 introduces the technical process and mathematical description of the proposed MVSF in detail. Section 4 exhibits the experimental study and the results. Discussion is presented in Section 5. In the end, the concluding remarks are drawn in Section 6.

Multi-Scale Analysis in Remote Sensing
Scale is one of the most ambiguous concepts. It has various scientific connotations in different disciplines. The underlying meaning, however, is that scale is the degree of detail at which we observe an object or analyse a problem. In terms of remote sensing, we generalize the connotations of scale as four classes in this paper, including intrinsic scale, observation scale, analysis scale and modeling scale (see Table 1). These four classes are of a progressive relationship and cover the remote sensing process from imaging to image processing. In line with the intention of this paper, we put emphasis on the analysis scale and the modeling scale.
Analysis scale means the size of the unit we use when analysing RS images and detecting image information. It has been considered a significant factor for the performance of CD [19]. The evolution of the image analysis unit comes from the promotion of the spatial resolution of RS images and the development of image analysis technology. The individual pixel has been the basic unit for statistics-based image analysis since the early use of RS images. Then, spatial context information was further considered by using a window of neighboring pixels, named a kernel. These two kinds of unit have been widely used in most of the traditional CD frameworks based on image algebra, image transformation, image classification and machine learning [12]. Specifically, some multi-scale mathematical analysis theories have been applied in the field of remote sensing, such as scale space theory [20] for image registration [21,22] and object detection [23,24]; wavelets for multi-scale feature extraction [25] and image fusion [26]; and fractal geometry for multi-scale image segmentation [27]. Using image objects as the units comes from the development of the object-based image analysis (OBIA) technique. Owing to the various intrinsic scales of geographic entities in the world, it is difficult to define a single, most appropriate scale parameter for segmentation to create optimized objects with accurate boundaries, which means the subsequent analysis usually suffers from over-segmentation or under-segmentation. Thus, multi-scale segmentation with varying scale parameters for multi-scale representation has become more and more popular with the development of OBIA [28,29]. A superpixel is a perceptually meaningful atomic region composed of a group of pixels with homogeneous features. Superpixel segmentation can normally obtain regular and compact regions with well-adhering boundaries compared with conventional image segmentation approaches, which greatly reduces the complexity of subsequent image analysis [30]. Compared with the pixel and the image object, the superpixel has been considered the most proper primitive for CD [19].

Superpixel Segmentation
There are many approaches to generate superpixels. They can be broadly categorized as either graph-based or gradient-ascent methods [30]. In Ref. [30], the authors performed an empirical comparison of five state-of-the-art algorithms concentrating on their boundary adherence, segmentation speed and performance when used as a pre-processing step in a segmentation framework. In addition, they proposed a new method for generating superpixels based on K-means clustering, called Simple Linear Iterative Clustering (SLIC). SLIC has been shown to outperform existing superpixel methods in nearly every respect. Specifically, the zero-parameter version of SLIC (SLICO) adaptively chooses the compactness parameter for each superpixel. SLICO generates regularly shaped superpixels in both textured and non-textured regions, and this improvement comes with hardly any compromise on computational efficiency. The only parameter required by SLICO is the manually set number of superpixels, which can be considered the scale parameter under the interpretation of scale we made in Section 2.1. A detailed discussion of SLIC can be found in [30].
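As a sketch of how SLICO can be invoked in practice with scikit-image (which exposes the zero-parameter variant via `slic_zero=True`; the synthetic image and the chosen number of superpixels here are illustrative, not the paper's data):

```python
import numpy as np
from skimage.segmentation import slic

# Synthetic 3-band image standing in for a difference image (DI).
rng = np.random.default_rng(0)
image = rng.random((120, 160, 3))

# SLICO: the compactness parameter is chosen adaptively per superpixel,
# so the only free parameter is the desired number of superpixels
# (the scale parameter in the sense of Section 2.1).
segments = slic(image, n_segments=500, slic_zero=True, start_label=0)

print(segments.shape)        # (120, 160): one integer label per pixel
print(segments.max() + 1)    # number of superpixels actually generated
```

Note that the number of superpixels actually generated only approximates `n_segments`, which matches the superpixel counts reported in the experiments below (e.g. 484 at scale 500).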

Visual Saliency for Change Detection
Visual saliency was originally studied in the field of neurobiology, aiming to understand the mechanism of visual behavior. It was then applied in computer vision for detecting the salient regions of an image. The visual saliency of an image reveals the obvious areas where a target of interest initially attracts the human eye. This mechanism has spawned many bottom-up visual saliency detection methods [31,32]. Normally, they are designed by following the basic principles of human visual attention, such as contrast consideration and the center prior, which means the salient region usually has a sharp contrast with the background and is located near the center of the image.
The strong visual contrast of changed areas in the DI makes visual saliency suitable for guiding the CD procedure [33]. In Table 2, we state the differences and similarities between saliency detection from natural images and CD from RS images. The direct relation between them is that both need the consideration of contrast and integrity. Here, the integrity consideration means that all the salient objects (changed areas) in the given image scene need to be discovered and all parts that belong to a certain salient object (changed area) should be highlighted [34]. However, CD does not need to consider the center prior, because the changed areas may appear anywhere in an image. Even so, many previous works [33,35,36] have proven contrast based saliency detection to be efficient for locating the changed areas quickly while suppressing the interfering information in unchanged areas.

Table 2. Differences and similarities between change detection and saliency detection.

                             Change Detection        Saliency Detection
Input                        Bi-temporal RS images   Single scene image
Result                       Changed areas           Salient areas
Center prior consideration   No                      Yes
Contrast consideration       Yes                     Yes
Integrity consideration      Yes                     Yes

As described above, this paper introduces multi-scale superpixel based saliency detection for multi-scale analysis of the DI. The generated multi-scale saliency maps are finally integrated by a proposed weighted fusion strategy at the pixel level. The mathematical description of MVSF is presented in detail in the following section.

Figure 1 gives an illustration of the proposed MVSF. First, the DI is generated with the CVA method. Second, we consider superpixels as the basic units to decrease the effects of image noise and keep the changed areas structured with good boundary adherence [30]; specifically, the scale parameter is set to different values to obtain multi-scale segmentation results during superpixel segmentation. Third, global contrast based saliency detection at the superpixel level is conducted to highlight the changed areas while suppressing the unchanged areas. Finally, a coarse-to-fine multi-scale saliency fusion strategy keeps the changed areas of varying sizes well structured while retaining detailed changes. In the end, the final change map is generated by a simple clustering or thresholding method.

Difference Image Generation and Multi-Scale Superpixel Segmentation
The DI is normally obtained by subtracting the corresponding components of the bi-temporal images with an L2 norm compression for each pixel. This procedure can also be applied to feature maps after feature detection from the bi-temporal images. Given two co-registered bi-temporal images X_1 and X_2 with the same size R × C × B, let [F_1^1, F_2^1, ..., F_N^1] and [F_1^2, F_2^2, ..., F_N^2] be the feature maps with size R × C × N after image feature detection from X_1 and X_2. The DI, denoted as X_D, can be generated by

X_D = \sqrt{\sum_{n=1}^{N} (F_n^1 - F_n^2)^2},

where the summation over n = 1, ..., N is an element-wise operation on the change vector of each pixel. X_D preserves the main properties of the changes despite the compression of the change vector from size [R, C, N] to [R, C, 1] [37]. Pixel values of X_D indicate the magnitude of the changes. After DI generation, we chose SLICO to segment the DI into superpixels because of its high operational efficiency and the high quality of its results [30]. Specifically, the only parameter of SLICO is the desired number of approximately equally sized superpixels. As a remote sensing image is a reflection of the multi-scale objects on the earth, the DI also conveys changed areas of various sizes. It is not reasonable to use the segmentation result of the DI at a single scale, because it cannot describe the multi-scale objects in the image accurately and breaks the rules of the multi-level visual sensation of human vision. Thus, we use multi-scale superpixel segmentation of the DI to detect multi-scale change information. The CD results are finally promoted by the proposed coarse-to-fine fusion process, which is introduced in detail in Section 3.3.
Denote the multi-scale superpixel maps generated from the DI by SLICO as [S_1, S_2, ..., S_N], in which N is the number of scales adopted. For the superpixel map at the i-th scale, S_i (i ∈ [1, N]), the approximate size of each superpixel is R × C / K_i pixels, where K_i is the scale parameter set for generating S_i.
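The DI generation step above can be sketched in NumPy as follows (a minimal sketch; the function name and the synthetic data are illustrative, not the paper's implementation):

```python
import numpy as np

def change_vector_analysis(x1, x2):
    """Difference image X_D: L2 norm of the per-pixel change vector.

    x1, x2: co-registered images/feature maps of shape (R, C, N).
    Returns an (R, C) magnitude-of-change image.
    """
    diff = x1.astype(np.float64) - x2.astype(np.float64)
    return np.sqrt(np.sum(diff ** 2, axis=-1))

# Toy bi-temporal pair: identical except for one simulated changed area.
rng = np.random.default_rng(1)
x1 = rng.random((50, 60, 4))
x2 = x1.copy()
x2[10:20, 10:20] += 0.8   # simulate a changed area

di = change_vector_analysis(x1, x2)
print(di.shape)     # (50, 60): compressed from [R, C, N] to [R, C, 1]
print(di[0, 0])     # 0.0: an unchanged pixel has zero change magnitude
```

The resulting `di` is what would then be fed to multi-scale SLICO segmentation.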

Saliency Detection at Superpixel Level
In terms of superpixel based saliency detection, global contrast and a prior on the spatial distribution of the salient region are commonly considered [32,38,39]. As there is no prior information on the spatial distribution of changed areas, we simply use global contrast to obtain the saliency map for each scale regardless of spatial location. The saliency of each superpixel is calculated by averaging its global contrast with all the other superpixels:

C_i(s_i^j) = \frac{1}{K_i - 1} \sum_{k \neq j} d(s_i^j, s_i^k),

in which j, k ∈ [1, K_i], j ≠ k, and d(s_i^j, s_i^k) is the spectral distance between superpixels s_i^j and s_i^k. The multi-scale saliency maps, denoted as [C_1, C_2, ..., C_N], can be obtained by calculating the saliency of all the superpixels at all scales. The pixels in a certain superpixel share the same saliency value as the superpixel they belong to.
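A minimal sketch of this global-contrast computation (the function name is illustrative, and the mean DI value of each superpixel stands in for its spectral feature):

```python
import numpy as np

def superpixel_saliency(di, labels):
    """Global-contrast saliency at the superpixel level.

    Each superpixel's saliency is its mean DI value's average absolute
    distance to the mean DI value of every other superpixel; all pixels
    of a superpixel share its saliency value.
    """
    ids = np.unique(labels)
    means = np.array([di[labels == j].mean() for j in ids])
    k = len(ids)
    # Pairwise |mean_j - mean_k|, averaged over the K_i - 1 other superpixels.
    contrast = np.abs(means[:, None] - means[None, :]).sum(axis=1) / (k - 1)
    sal = np.zeros_like(di, dtype=np.float64)
    for j, c in zip(ids, contrast):
        sal[labels == j] = c
    return sal

# Toy DI with two superpixels: a dark half and a bright half.
di = np.array([[0.0, 0.0, 1.0, 1.0],
               [0.0, 0.0, 1.0, 1.0]])
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
sal = superpixel_saliency(di, labels)
print(sal)   # both superpixels contrast equally: a map of 1.0 everywhere
```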

Multi-Scale Saliency Coarse-to-Fine Fusion
The generated saliency maps reveal the possibility of change of a superpixel at different scales. However, as the size of changed areas varies over the image, over-segmentation and under-segmentation may happen at a single scale when the superpixel size is smaller or larger than the changed areas. This leads to false alarms or omissions in the CD result. To maintain the details of changed areas while suppressing the noise, a weighted fusion strategy was designed to incorporate multi-scale saliency at the pixel level. Denote the saliency map at the i-th scale as C_i, where C_i(r, c) is the saliency value of pixel (r, c) in C_i and equals the saliency value of the superpixel it belongs to. Then, the proposed multi-scale weighted saliency fusion at the pixel level can be formulated as

C_{fused}(r, c) = \frac{1}{W(r, c)} \sum_{i=1}^{N} w_i(r, c) \, C_i(r, c),

in which C_fused is the fused saliency map from the multi-scale saliency maps [C_1, C_2, ..., C_N], w_i(r, c) is the fusion weight of pixel (r, c) at the i-th scale, and W(r, c) = \sum_{i=1}^{N} w_i(r, c) is the normalization factor. Specifically, the fusion weights for pixels at different scales were designed out of the following concerns. First, the weight should be decreased if the superpixel the pixel belongs to has a higher heterogeneity. High heterogeneity normally means under-segmentation has happened, so we give a penalty to the pixel for the uncertainty of sharing the same saliency with the superpixel it belongs to. In such cases, we further decrease the weight if the spectral distance between the pixel and the superpixel (the average spectral value of the pixels in the same superpixel) is larger. The weight term w_i(r, c) can finally be formulated as

w_i(r, c) = \exp\left(-\left(v_i^j + d(x_{(r,c)}, \bar{x}_i^j)\right)\right),

in which v_i^j is the spectral variance of all the pixels that belong to the same superpixel s_i^j, representing its degree of heterogeneity, and d(x_{(r,c)}, \bar{x}_i^j) is the spectral distance between pixel x_{(r,c)} and the mean of the superpixel it belongs to. With the fused saliency map, the changed areas are much more highlighted and the unchanged areas are suppressed. Then the final change map can be generated by a simple thresholding or clustering method.
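The fusion step can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the exponential form of the heterogeneity/distance penalty is an assumption consistent with the qualitative description (the weight decreases with superpixel variance and with pixel-to-superpixel distance), and all function names and toy inputs are illustrative.

```python
import numpy as np

def fuse_multiscale_saliency(saliency_maps, label_maps, di):
    """Coarse-to-fine weighted fusion of multi-scale saliency at the pixel level.

    Each pixel's weight at a scale is penalized by (i) the variance of the
    superpixel it belongs to (heterogeneity) and (ii) the pixel's distance
    to that superpixel's mean; exp(-(v + d)) is an assumed penalty form.
    """
    num = np.zeros_like(di, dtype=np.float64)
    den = np.zeros_like(di, dtype=np.float64)
    for sal, labels in zip(saliency_maps, label_maps):
        w = np.empty_like(di, dtype=np.float64)
        for j in np.unique(labels):
            mask = labels == j
            v = di[mask].var()                       # superpixel heterogeneity
            d = np.abs(di[mask] - di[mask].mean())   # pixel-to-mean distance
            w[mask] = np.exp(-(v + d))
        num += w * sal
        den += w
    return num / den   # normalized weighted average over scales

# Toy example: an 8x8 DI with a coarse (4-superpixel) and a fine
# (16-superpixel) label map, and a random saliency map per scale.
rng = np.random.default_rng(2)
di = rng.random((8, 8))
labels_coarse = np.repeat(np.repeat(np.arange(4).reshape(2, 2), 4, axis=0), 4, axis=1)
labels_fine = np.repeat(np.repeat(np.arange(16).reshape(4, 4), 2, axis=0), 2, axis=1)
sal_coarse = rng.random((8, 8))
sal_fine = rng.random((8, 8))

fused = fuse_multiscale_saliency([sal_coarse, sal_fine],
                                 [labels_coarse, labels_fine], di)
print(fused.shape)   # (8, 8)
```

Because the weights are positive and normalized, the fused value at each pixel is a convex combination of that pixel's saliency values across scales.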

Datasets
Three datasets from multiple sensors were employed in our experimental study. Figure 2c,f,i shows the corresponding reference change maps of the three datasets; they were all generated by manual visual interpretation of the changes.

Implementation Details and Evaluation Criteria
In the experimental study, the implementation of the algorithms and the accuracy evaluation were conducted with Python 3.7 in an integrated development environment called PyCharm. For each dataset, the DI was generated by utilizing the original spectral features for simplicity. The superpixel segmentation was implemented using the image processing Python package scikit-image [40]. The scale parameter K was set manually to K = 500, 1000, 2000 for multi-scale analysis in the MVSF framework, which are simply referred to as scale 500, scale 1000 and scale 2000 in the following text for convenience. We adopted widely used evaluation criteria, comparing the CD result with the reference change map, in order to analyse the performance of the proposed method quantitatively.
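The evaluation criteria (precision, recall and the F1 score, as reported in the tables below) can be sketched as a minimal NumPy implementation (the function name and the toy maps are illustrative):

```python
import numpy as np

def cd_metrics(pred, ref):
    """Precision, recall and F1 for a binary change map against a reference.

    pred, ref: boolean arrays where True marks a changed pixel.
    """
    tp = np.logical_and(pred, ref).sum()     # changed pixels correctly detected
    fp = np.logical_and(pred, ~ref).sum()    # false alarms
    fn = np.logical_and(~pred, ref).sum()    # missed detections
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

pred = np.array([[1, 1, 0], [0, 1, 0]], dtype=bool)
ref  = np.array([[1, 0, 0], [0, 1, 1]], dtype=bool)
p, r, f = cd_metrics(pred, ref)
print(p, r, f)   # 2 TP, 1 FP, 1 FN -> precision = recall = F1 = 2/3
```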

Effects of Single Scale of Superpixel Segmentation
The scale parameter in the proposed MVSF is the manually set number of superpixels, which determines the size of a superpixel. To verify the performance of superpixel segmentation of RS images at different scales, we took the Mexico dataset as an example. For convenience of illustration, Figure 3 presents subsets of the superpixel segment boundaries overlaid on the DI at scales 500, 1000 and 2000, in which the boundaries between adjacent superpixels are delineated with red lines. As we can see, the superpixels generated by SLICO at each single scale are of similar size and generally have well-adhering boundaries, which facilitates the subsequent CD process. Segments at a finer scale perform better at describing tiny changed areas with accurate boundaries (see the areas in the yellow circles of Figure 3a-c). However, CD at a finer scale not only brings a higher time cost but also risks introducing noise.

Evaluation of the Performance of MVSF
To verify the effectiveness of the proposed MVSF, we conducted three experiments on the Mexico dataset, the Sardinia dataset and the Ottawa dataset, respectively. For each experiment, we illustrate the multi-scale fused saliency map and the CD result by MVSF. For comparison, we also present the superpixel saliency maps at each scale and the corresponding change maps obtained by applying K-means. Besides, EM-Bayes thresholding [4] and K-means clustering on the DI were carried out and the results are presented. All the CD results were evaluated quantitatively by the foregoing criteria.

Experiment on Mexico Dataset
Figure 4a-c shows the superpixel based saliency maps at scales 500, 1000 and 2000, respectively. The numbers of generated superpixels are 484, 1024 and 2181 for the corresponding scales. It is obvious that superpixel based saliency inherits the advantages of SLICO superpixel segmentation. The salient region, which indicates a high possibility of changed areas, keeps good boundaries. It also helps to highlight the changed areas and suppress the noise in the unchanged background. Moreover, the scale effects we discussed before are embodied in the saliency maps at different scales. Figure 4d is the fused saliency map obtained by the proposed weighted fusion strategy from Figure 4a-c. The fused saliency map shows a finer description of saliency, giving more details while suppressing the noise because of the weighted integration of multi-scale saliency information at the pixel level. This is further supported by Figure 4e-h, which shows the change maps by K-means on Figure 4a-d. Figure 4i,j shows the change maps by K-means and EM-Bayes on the DI, respectively. As we can see, they both suffer seriously from noise. By comparison, Figure 4h, the change map by MVSF, gives the best CD result thanks to the superpixel based saliency and the proposed weighted multi-scale fusion strategy. Table 3 presents the quantitative comparison analysis of MVSF.
In terms of the change maps generated from the superpixel based saliency maps at scales 500, 1000 and 2000, the highest F1 score (0.884) was obtained at scale 1000, compared with the other two scales (0.866 at scale 500 and 0.865 at scale 2000). This suggests that it is not entirely true that a finer scale brings higher CD accuracy. The reason may be that the saliency map at a finer scale supplies more details but may also mix in noise. By contrast, the proposed MVSF performed best with the highest F1 score of 0.890, exceeding the F1 scores of K-means and EM-Bayes as well. Moreover, MVSF obtained the highest precision of 0.971, because K-means and EM-Bayes produce many false alarms, as we can see from Figure 4i,j.

Experiment on Sardinia Dataset
The experimental results on the Sardinia dataset are given in Figure 6. Figure 6a-c shows the superpixel based saliency maps at scales 500, 1000 and 2000, in which the numbers of generated superpixels are 494, 998 and 1955, respectively. Figure 6d is the saliency map after multi-scale saliency weighted fusion of Figure 6a-c. Figure 6e-h shows the change maps by K-means on Figure 6a-d. Figure 6i,j shows the change maps by K-means and EM-Bayes on the DI, respectively. In terms of the qualitative analysis, K-means and EM-Bayes both failed to overcome the serious noise problem. By contrast, superpixel based saliency helps to detect the changed areas with a robust ability to eliminate noise. However, the utility of superpixels recedes as the superpixels are scaled down for the sake of detecting more detailed information. As we can see from Figure 6g, noise starts to emerge at scale 2000. Among all the change maps, MVSF performed best, although it has obvious missed detections compared with the reference change map. This is because of a compromise between detail keeping and noise elimination: if we fused more, finer scales of the saliency maps, noise would gradually pollute the final change map. Table 4 presents the quantitative comparison analysis of MVSF. For the performance of the single scales, scale 1000 obtained a higher F1 score (0.844) than the other two scales (0.809 at scale 500 and 0.828 at scale 2000). Meanwhile, MVSF obtained the highest F1 score of 0.855. The F1 scores of K-means and EM-Bayes are only 0.739 and 0.559, respectively, because their change maps contain a large number of false detections, which pulled their precision down drastically to 0.637 and 0.399, respectively.

Experiment on Ottawa Dataset
The experimental results on the Ottawa dataset are given in Figure 5. Figure 5a-c shows the superpixel based saliency maps at different scales. Figure 5d is the final saliency map obtained by fusing the weighted multi-scale saliency maps (Figure 5a-c). Figure 5e-h presents the change maps by K-means on Figure 5a-d. Figure 5i,j shows the change maps by K-means and EM-Bayes on the DI, respectively. Table 5 presents the quantitative comparison analysis of MVSF. The Ottawa dataset is composed of two SAR images with heavy speckle noise interference. As a result, the change maps by K-means and EM-Bayes on the DI are severely contaminated by noise (see Figure 5i,j), and the corresponding F1 scores are only 0.669 and 0.485, respectively. By contrast, the change map by MVSF outperformed the others qualitatively and quantitatively with an F1 score of 0.739. This suggests that MVSF has a robust ability to suppress noise.

Analysis of the Scale Factors
For multi-scale analysis in remote sensing, it is normally difficult to determine the optimal scales. The selection of scales for multi-scale saliency fusion is therefore worth analysing in the MVSF framework. In the foregoing experimental study, we segmented the DI at three scales, namely scale = 500, 1000, 2000, and the obtained change maps of the three datasets showed better performance after the designed pixel level multi-scale saliency fusion. In this subsection, we further analyse the effects of the scale factors on the change detection results.
We consider five scales for analysis, scale = 500, 1000, 2000, 4000, 8000, noted as 0.5, 1, 2, 4, 8 (×10^3) for convenience. Then, the F1 scores of the CD results in different fusion cases were explored. Table 6 presents the different fusion cases and scale permutations applied in the experiment. Figure 7a-c illustrates the corresponding F1 scores of the change maps obtained by fusing two scales, three scales and four scales, respectively. The F1 scores of the change maps obtained by fusing all five scales for the Mexico, Sardinia and Ottawa datasets are 0.874, 0.851 and 0.753, respectively. As seen from Figure 7a-c, the F1 score curves of the change maps for the three datasets fluctuate within a very small range for each fusion case. The maximum fluctuation range is 0.035, appearing in the F1 score curve of fusing two scales for the Ottawa dataset (the green curve in Figure 7a). Besides, the F1 score curves become more stable as more scales are fused, according to the curves for fusing two, three and four scales. Figure 8 presents the average F1 scores of fusing various numbers of scales. It shows that the average F1 scores of the three fusion cases are almost identical for all three datasets: around 0.875 for the Mexico dataset, 0.850 for the Sardinia dataset and 0.752 for the Ottawa dataset. Overall, the analysis implies that the accuracy of the CD result is not sensitive to the scale factors, which resolves the confusion of choosing scales when applying the MVSF framework for CD. We believe it is most appropriate to choose three scales for multi-scale saliency fusion, balancing performance stability and calculation cost.

Discussion
For traditional unsupervised CD, balancing noise removal against maintaining multi-level change details has been a headache for most CD algorithms. For example, pixel based CD, which considers pixels as the basic analysis units, has to develop novel feature descriptors to cope with the noise problem. Object based CD is based on the comparison of the segmented objects between different image phases; the results usually depend on the quality of the image segmentation, and it is also difficult to determine the optimal segmentation scale. Therefore, multi-scale analysis of remote sensing images is of great importance for CD. However, as far as we know, there is a lack of a clear generalization of the concept of 'scale' in remote sensing, and novel multi-scale analysis frameworks for CD still need to be developed.
In this paper, we generalized the connotations of scale in the field of remote sensing as intrinsic scale, observation scale, analysis scale and modeling scale, which cover the remote sensing process from imaging to image processing. From the views of the analysis scale and the modeling scale, we further proposed a novel unsupervised CD framework based on multi-scale visual saliency coarse-to-fine fusion, inspired by the visual attention mechanism and the multi-level sensation capacity of human vision. Specifically, superpixels were considered as the primitives for generating the multi-scale superpixel based saliency maps. A coarse-to-fine weighted fusion strategy was also designed to incorporate multi-scale saliency information at the pixel level.
The effectiveness of the proposed MVSF was comprehensively examined by the experimental study on three remote sensing datasets from different sensors. The MVSF showed its superiority through qualitative and quantitative analysis against the popular K-means and EM-Bayes methods. On the one hand, MVSF demonstrated a robust ability to suppress noise, although no high-level features were used in the experiments. One reason is that generating superpixels can be recognised as a denoising process; the other is that visual saliency itself has the ability to suppress background information. On the other hand, the proposed multi-scale saliency weighted fusion at the pixel level can incorporate multi-level change information and maintain the change details well. Overall, the superiority of MVSF makes it applicable to images with multiple changes of various sizes and noise interference. In addition, we also analysed the scale factors in the MVSF framework, and the results implied that the accuracy of the CD results by MVSF is not sensitive to the manually chosen scales; the performance of MVSF is stable against the scale factors.
It should be noted that the MVSF framework has potential limitations. First, of the four scale connotations we generalized, we have already incorporated the analysis scale and the modeling scale into the multi-scale analysis framework. However, an excellent multi-scale analysis framework for CD should also deal with RS images with multiple observation scales, namely spatial or spectral resolutions; that is what we will work on in the future. Second, as mentioned before, this work was inspired by the visual attention mechanism and the multi-level visual sensation capacity of human vision. As the understanding of the human vision mechanism develops, advanced visual attention algorithms and multi-scale fusion methods could be applied in the MVSF framework to improve the CD performance.

Conclusions
This paper proposed a novel multi-scale analysis framework for unsupervised CD based on multi-scale visual saliency coarse-to-fine fusion. To imitate the visual attention mechanism and multi-level sensation capacity of human vision, the proposed MVSF produces multi-scale visual saliency at the superpixel level, which is subsequently incorporated by a weighted coarse-to-fine fusion strategy at the pixel level. The performance of MVSF was examined qualitatively and quantitatively by an experimental study on three remotely sensed image pairs from different sensors. The results indicated that MVSF is capable of maintaining the detailed changed areas while resisting image noise in the final change map. In addition, the performance of MVSF is robust to the manually chosen scale factors in the multi-scale analysis framework. In the future, we will incorporate the observation scale into the multi-scale analysis framework and exploit the potential of MVSF for dealing with RS images of various spatial or spectral resolutions. The MVSF framework could also be further improved by more advanced visual attention algorithms and multi-scale description methods as the understanding of human vision progresses.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
