1. Introduction
With the rapid development of imaging technology making image data easily accessible, the analysis of bi-temporal and multi-temporal images has attracted great attention. Change detection (CD) aims to detect changes between multiple images covering the same scene at different times. Two main research communities have been working on this problem: computer vision (CV) and remote sensing (RS). The former analyses changes among natural images or video frames for further applications such as object tracking, visual surveillance, and smart environments [1]. By contrast, the latter seeks to obtain the spatiotemporal changes of geographical phenomena or objects on the earth's surface, for tasks such as land cover/use change analysis, disaster monitoring and ecological environment monitoring. CD based on RS images usually faces more difficulties than CD on natural images because of the intrinsic characteristics of the various RS data sources, including multi/hyper-spectral (M/HS) images, high spatial resolution (HSR) images, synthetic aperture radar (SAR) images and multi-source images. For example, M/HS images contain detailed spectral information, which makes it possible to detect various changes of geographical objects with particular spectral characteristic curves. Nevertheless, this benefit comes at the cost of increased CD complexity due to the high dimensionality of M/HS data [2]. CD from HSR images can benefit from the fine representation of the geographical world at high spatial resolution, so more accurate CD results can be obtained. However, the consequent decrease of intra-class homogeneity and the large amount of noise have to be considered seriously. SAR imaging is an active earth observation system with day-and-night imaging capability regardless of weather conditions. Since the inherent speckle noise degrades the quality of SAR images, CD from SAR images also needs to withstand the effects of noise [3].
Generally, CD from remotely sensed images aims to obtain a change map (CM), a binary image indicating the changed areas. CD methods can be broadly classified into two groups: supervised CD and unsupervised CD. To learn an accurate underlying model of ‘change’ and ‘no-change’, supervised CD needs a large amount of high-quality ground truth samples, which are usually time-consuming and labor-intensive to collect. In contrast, unsupervised CD has become more popular because it avoids the labelling of training data. This paper accordingly focuses on unsupervised CD.
Unsupervised CD makes labeling decisions based on the data itself rather than being driven by labeled samples. Statistical machine learning has been widely applied to CD problems. In a classic work, Bruzzone L. and Prieto D.F. [4] proposed two CD methods for automatically analysing the difference image (DI) of the bi-temporal images under a change vector analysis (CVA) framework. The two methods differ in whether the samples of the DI are assumed to be spatially independent. One is an automatic thresholding method based on a Gaussian mixture distribution model and Bayes inference (denoted EM-Bayes thresholding). The other is a post-processing method that considers contextual information modeled by a Markov random field. Subsequently, many studies have been devoted to finding a more appropriate threshold, such as histogram thresholding [5] and adaptive thresholding [6], or to modeling more accurate contextual information, for example with fuzzy hidden Markov chains [7]. Clustering is another commonly used technique in CD. Many clustering algorithms incorporating fuzziness [8,9] or spatial context information [10,11] have been proposed to alleviate the uncertainty problems in remote sensing and obtain more accurate clusters representing ‘change’ and ‘no-change’.
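To make the EM-Bayes thresholding idea concrete, the following sketch (our own illustration, not the implementation of [4]) fits a two-component one-dimensional Gaussian mixture to the difference-image values with EM and returns the Bayes decision boundary between the ‘no-change’ and ‘change’ components; all function and parameter names are our assumptions.

```python
import numpy as np

def em_bayes_threshold(di, n_iter=100, tol=1e-6):
    """Fit a 2-component Gaussian mixture to difference-image values with EM
    and return a threshold separating 'no-change' from 'change' pixels."""
    x = np.asarray(di, dtype=float).ravel()
    # Initialise from the sample median: low half -> no-change, high half -> change.
    m = np.median(x)
    mu = np.array([x[x <= m].mean(), x[x > m].mean()])
    var = np.array([x.var(), x.var()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each pixel.
        pdf = (w / np.sqrt(2 * np.pi * var)) * \
              np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        r = pdf / pdf.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means and variances.
        nk = r.sum(axis=0)
        w_new = nk / len(x)
        mu_new = (r * x[:, None]).sum(axis=0) / nk
        var_new = (r * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
        converged = np.abs(mu_new - mu).max() < tol
        mu, var, w = mu_new, var_new, w_new
        if converged:
            break
    # Bayes decision: between the two means, find where the weighted
    # class-conditional densities cross; that point minimises the error.
    lo, hi = np.sort(mu)
    t = np.linspace(lo, hi, 1000)
    p = (w / np.sqrt(2 * np.pi * var)) * \
        np.exp(-(t[:, None] - mu) ** 2 / (2 * var))
    return t[np.argmin(np.abs(p[:, 0] - p[:, 1]))]
```

Pixels whose DI magnitude exceeds the returned threshold would then be labeled ‘change’.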
Since around 2000, object-based image analysis (OBIA) for remote sensing has grown rapidly, and object-based change detection (OBCD) has accordingly been studied widely [12]. OBCD takes image segments as the analysis units in order to obtain object-oriented features, such as texture, shape and topological features, which can be utilized when comparing objects. Compared with pixel-based CD, OBCD performs better on high-resolution images [13]. However, how to obtain high-quality image segments, and how to make segments from an image pair correspond one-to-one and remain comparable, have attracted more attention than the CD problem itself [12].
Seen from the development of CD techniques, image feature representation is one of the key factors driving technological progress. In particular, over the last decade, deep learning has been widely applied in remote sensing research. Deep features with high-level semantic information obtained by deep neural networks (DNN) have become an outstanding supplement to manually designed features [14]. Zhang P. et al. [15] combined deep-architecture-based unsupervised feature learning with mapping-based feature change analysis for CD, in which stacked denoising autoencoders learn local, high-level representations from the local neighborhood of a given pixel in an unsupervised fashion. In the work by Gong M. et al. [16], an autoencoder, convolutional neural networks (CNN) and unsupervised clustering are combined to solve the ternary change detection problem without supervision. More information about deep-learning-based CD can be found in the review article [3].
In this paper, we take a fresh look at the CD procedure by considering it as the visual process that people perform when conducting CD manually from remotely sensed images. Accordingly, we build a new unsupervised CD framework inspired by the characteristics of the human visual mechanism. When drawing a CD map manually, people are able to focus on the changed areas quickly by viewing both images repeatedly or in a flicker fashion. Detailed changes can then be found and delineated once attention is directed to the changed areas of various sizes. We attribute this sophisticated visual procedure to the visual attention mechanism and the multi-level sensation capacity of the human visual system. Visual attention refers to the cognitive operations that allow people to efficiently select relevant information and filter out irrelevant information from cluttered visual scenes [17]. For an image, the visual attention mechanism helps people focus on the regions of interest efficiently while suppressing the unimportant parts of the scene. The multi-level sensation capacity helps people incorporate multi-level information and perceive the multi-scale objects in the real world [18]. Inspired by this, we propose a novel unsupervised change detection method based on multi-scale visual saliency coarse-to-fine fusion (MVSF), aiming to develop an effective visual-saliency-based multi-scale analysis framework for unsupervised change detection. The main contributions of this paper are as follows.
We generalized the connotations of scale in remote sensing into four classes, namely intrinsic scale, observation scale, analysis scale and modeling scale, which cover the remote sensing process from imaging to image processing.
We designed a multi-scale superpixel-based saliency detection to imitate the visual attention mechanism and the multi-level sensation capacity of the human visual system.
We proposed a coarse-to-fine weighted fusion strategy to incorporate multi-scale saliency information at the pixel level, eliminating noise while preserving details in the final change map.
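As an illustration of the fusion idea (a minimal sketch under our own assumptions, not the exact MVSF formulation), the following code fuses saliency maps computed at several superpixel scales into a single pixel-level map with per-scale weights, then thresholds it into a binary change map; the linearly increasing weights and the threshold value are assumptions for the example.

```python
import numpy as np

def fuse_saliency_maps(maps, weights=None, threshold=0.5):
    """Fuse multi-scale saliency maps at the pixel level.

    maps    : list of 2-D arrays in [0, 1], ordered from coarse to fine scale.
    weights : per-scale weights; by default finer scales get larger weights so
              details are kept, while responses present only at one scale are
              down-weighted, suppressing isolated noise.
    Returns the fused saliency map and the binary change map.
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in maps])
    if weights is None:
        # Linearly increasing weights from coarse to fine (an assumption).
        weights = np.arange(1, len(maps) + 1, dtype=float)
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    fused = np.tensordot(w, stack, axes=1)      # pixel-wise weighted average
    change_map = (fused >= threshold).astype(np.uint8)
    return fused, change_map
```

In this toy setting, a pixel that is salient at all scales survives the fusion, while a response present only in the coarsest map falls below the threshold and is discarded as noise.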
The remainder of this paper is organized as follows. We elaborate on the background and motivation of the proposed framework in Section 2. Section 3 introduces the technical process and mathematical description of the proposed MVSF in detail. Section 4 presents the experimental study and results. A discussion is provided in Section 5, and concluding remarks are drawn in Section 6.
5. Discussion
For traditional unsupervised CD, balancing noise removal against the preservation of multi-level change details has been a persistent difficulty for most CD algorithms. For example, pixel-based CD, which takes pixels as the basic analysis units, has to develop novel feature descriptors to cope with noise. Object-based CD relies on the comparison of segmented objects between the two image dates; its results usually depend on the quality of the image segmentation, and determining the optimal segmentation scale is also difficult. Multi-scale analysis of remote sensing images is therefore of great importance for CD. However, as far as we know, a clear generalization of the concept of ‘scale’ in remote sensing is still lacking, and novel multi-scale analysis frameworks for CD remain to be developed.
In this paper, we generalized the connotations of scale in the field of remote sensing as intrinsic scale, observation scale, analysis scale and modeling scale, which cover the remote sensing process from imaging to image processing. From the viewpoints of analysis scale and modeling scale, we further proposed a novel unsupervised CD framework based on multi-scale visual saliency coarse-to-fine fusion, inspired by the visual attention mechanism and the multi-level sensation capacity of human vision. Specifically, superpixels were taken as the primitives for generating the multi-scale superpixel-based saliency maps, and a coarse-to-fine weighted fusion strategy was designed to incorporate multi-scale saliency information at the pixel level.
The effectiveness of the proposed MVSF was examined comprehensively through experiments on three remote sensing datasets from different sensors. MVSF showed its superiority in both qualitative and quantitative comparisons against the popular K-means and EM-Bayes methods. On one hand, MVSF demonstrated a robust ability to suppress noise, even though no high-level features were used in the experiments. One reason is that generating superpixels can be regarded as a denoising process; another is that visual saliency itself suppresses background information. On the other hand, the proposed multi-scale saliency weighted fusion at the pixel level can incorporate multi-level change information and preserve change details well. Overall, these strengths make MVSF applicable to images containing multiple changes of various sizes under noise interference. In addition, we analysed the scale factors in the MVSF framework, and the results implied that the accuracy of the CD results produced by MVSF is not sensitive to the manually chosen scales; that is, the performance of MVSF is stable with respect to the scale factors.
It should be noted that the MVSF framework has potential limitations. First, of the four scale connotations we generalized, we have so far incorporated only the analysis scale and the modeling scale into the multi-scale analysis framework. An ideal multi-scale analysis framework for CD would also handle RS images with multiple observation scales, namely different spatial or spectral resolutions; this is what we will work on in the future. Second, as mentioned above, this work was inspired by the visual attention mechanism and the multi-level sensation capacity of human vision. As the understanding of the human visual mechanism develops, more advanced visual attention algorithms and multi-scale fusion methods could be applied within the MVSF framework to improve CD performance.