Predicting Relevant Change in High Resolution Satellite Imagery

With the ever increasing volume of remote sensing imagery collected by satellite constellations and aerial platforms, the use of automated techniques for change detection has grown in importance, such that changes in features can be quickly identified. However, the amount of data collected surpasses the capacity of imagery analysts. In order to improve the effectiveness and efficiency of imagery analysts performing data maintenance activities, we propose a method to predict relevant changes in high resolution satellite imagery based on human annotations on selected regions of an image. We study a variety of classifiers in order to determine which is most accurate. Additionally, we experiment with a variety of ways in which a diverse set of training data can be constructed to improve the quality of predictions. The proposed method aids in the analysis of change detection results by using various classifiers to develop a relevant change model that can be used to predict the likelihood of other analyzed areas containing a relevant change or not. These predictions of relevant change are useful to analysts, because they speed the interrogation of automated change detection results by leveraging their observations of areas already analyzed. A comparison of four classifiers shows that the random forest technique slightly outperforms other approaches.


Introduction
With the proliferation of readily accessible high resolution satellite imagery, many researchers have focused their efforts on multi-temporal imagery analysis. Bhatt and Wallgrun observe that the temporal aspect of spatial data has become an increasingly important component for analysis applications [1], including image-to-image change detection.
One example of an image-to-image change detection system is the Geospatial Change Detection and Exploitation System (GeoCDX), a fully-automated system for large-scale change detection in high resolution imagery [2], which was recently published in a Special Issue on multi-temporal analysis of remote sensing data [3]. Other approaches for high resolution change detection include using neural networks [4], hierarchical clustering [5,6], expectation maximization level sets [7], morphological attribute profiles [8] and segmentation [9]. However, in many cases, simply identifying the changes that have occurred is not sufficient.
Several change detection approaches focus on the identification of change for very specific purposes. A sampling of these includes mapping land cover patterns for urban growth modeling [10], identifying areas in need of vegetation cover rehabilitation [11] and estimating seismic risk [12]. In this manuscript, we are interested in permanent, anthropogenic changes; a more detailed description is provided in Section 3.1.
We propose an approach for automatically identifying relevant change using a classifier that has been trained with user-identified examples of relevant change. If a user views exemplar regions of a pair of multi-temporal images and provides an assessment of whether or not a relevant change occurred, then we should be able to train a system to classify the other regions within the image.
In previous work, we developed a query-by-example (QBE) system for content-based image retrieval (CBIR) [13][14][15] that could identify imagery in a database that matched a given query image. In [16], Barb and Kilicay-Ergin developed semantic models using genetic optimization of low-level image features. Other examples of applying data mining algorithms to remote sensing imagery include mining temporal-spatial information [17] and using association rules to extract information from the gaze patterns of individuals viewing satellite imagery [18].
In this manuscript, Section 2 presents a high-level overview of the GeoCDX change detection system, as this serves as the source of the imagery features and change annotations used in the prediction of relevant change. Our definition of relevant change is given in Section 3 along with a description of the classification algorithms used. Section 4 describes the experiments performed to evaluate the change prediction algorithms and discusses the meaning of the results. Finally, Section 5 provides a conclusion and a brief description of future directions of research.

Change Detection with GeoCDX
The Geospatial Change Detection and Exploitation System (GeoCDX) is a sensor-agnostic change detection system for high resolution remote sensing imagery [2]. GeoCDX automatically ingests imagery from a variety of sensors, including IKONOS, QuickBird, GeoEye-1 and WorldView-2. Once imagery is ingested into the system's catalog, a data-specific processing plan is developed based on the characteristics of the imagery. The first stage of this processing may involve steps such as geometric correction and conversion to top-of-atmosphere (TOA) reflectance, if they are appropriate for the imagery. When new imagery is ingested, the system automatically determines which temporal pairs of images can be created and builds a processing plan for each pair. This fully-automated plan includes image-to-image coregistration, radiometric balancing, high-level feature extraction, differencing of the extracted features and fusion of the difference images into a single change confidence image. A summary of these processing steps can be found in Figure 1.
Each tile of a pair is assigned a change score. As defined in [2], this per-tile change score is a function of the per-pixel change scores $s_{ij}$ for the pixels $(i, j) \in T$, where $T$ is the set of pixels that compose a tile and $s_{ij}$ is a per-pixel change score calculated using a non-linear stack filter algorithm that accounts for the intensity and morphological characteristics of the change present at each pixel. This per-pixel change score is described in detail in Section IV.B. of [2].
Tiles are then ranked using this change score from most change to least change and presented in rank order in a web interface. Users are then free to exploit the tiles that have been determined to have the most change and stop their analysis when they no longer find a relevant change in the results. An example of highly-ranked change results in the GeoCDX web interface can be seen in Figure 2.
The GeoCDX system also uses the per-tile feature signature and per-tile change signature to cluster tiles based on their content and their amount of change. Complete details on the competitive agglomerative clustering algorithm used for this task can be found in [19]. This algorithm produces a dynamic (but bounded) number of clusters based on the degree of variance in the types of change present in a given pair. Each cluster produced represents a distinct type of transition between land-cover/land-use types. For example, one cluster may represent grassland that has changed to residential housing, while another may contain examples of new buildings appearing in urban areas. Figure 3 shows several examples of members of clusters produced by the GeoCDX system.
For the work presented herein, the GeoCDX system was used to perform change detection on imagery from a variety of geographic areas. One of the results (and the only one considered in this work) of the automated GeoCDX change detection processing is a prioritized list of tiles, each 256 × 256 meters in size, ordered from most change to least change. In typical usage, an analyst would interrogate these results in rank order, making a change versus no change assessment for each tile. As a user progresses through the list of tiles, we seek to leverage knowledge from the tiles that have already been assessed to predict whether or not the remaining tiles in the list contain change.
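The rank-ordered triage described above can be illustrated with a minimal sketch. The `Tile` type, its field names and the annotation dictionary are hypothetical and are not part of GeoCDX; the sketch only shows how assessed tiles can be separated from the tiles that still await a prediction.

```python
from typing import NamedTuple


class Tile(NamedTuple):
    tile_id: str         # hypothetical identifier for a tile
    change_score: float  # per-tile change score (larger means more change)


def rank_tiles(tiles):
    """Return the tiles ordered from most change to least change."""
    return sorted(tiles, key=lambda t: t.change_score, reverse=True)


def split_by_annotation(ranked, annotations):
    """Split ranked tiles into assessed (tile, label) pairs and unassessed tiles.

    `annotations` maps a tile_id to True (relevant change) or False (no change).
    The rank order is preserved in both output lists.
    """
    assessed = [(t, annotations[t.tile_id]) for t in ranked if t.tile_id in annotations]
    remaining = [t for t in ranked if t.tile_id not in annotations]
    return assessed, remaining


tiles = [Tile("a", 0.2), Tile("b", 0.9), Tile("c", 0.5)]
ranked = rank_tiles(tiles)
assessed, remaining = split_by_annotation(ranked, {"b": True})
```

The assessed pairs would serve as training data, while the remaining tiles are the ones for which relevant change is to be predicted.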

Using Classifiers to Predict Relevant Change
Using the same per-tile feature signatures that the GeoCDX system uses to organize image tiles into clusters [19], we propose methods for predicting areas of relevant change based on prior, manual classification of a subset of the tiles in a pair. A high-level flow chart of the proposed change prediction methodology can be seen in Figure 4. A user begins by inspecting tiles in the GeoCDX user interface and performing change analysis to determine if a relevant change has occurred within the tile. These change/no-change annotations are recorded on a per-tile basis in the system database. If change occurs, but is not relevant, it is to be marked as no-change by the analyst. This information can then be used in conjunction with the per-tile features used for change clustering that were described in Section 2 and explained in detail in Section III.A. of [19]. These features are 16-bin histograms that encode information about the 14 pixel-level features used by the GeoCDX system. As was the case in [19], we concatenate these histograms together to construct a single feature vector that represents the signature for each tile. We use these signatures along with accumulated change/no-change annotation data for a pair to produce a classifier (i.e., the relevant change model) that can be used to predict relevant change for the remaining tiles within a pair.
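The construction of a tile signature can be sketched as follows. This is an illustration only: the value range of each feature, the normalization of the histograms and the function names are assumptions, since the exact procedure is specified in [19] rather than here. With 14 features and 16 bins each, the concatenated signature has 224 elements.

```python
NUM_BINS = 16  # bins per histogram, as stated in the text


def histogram(values, lo, hi, bins=NUM_BINS):
    """Normalized histogram of `values` over [lo, hi]; out-of-range values are clamped.

    Normalizing the counts to sum to one is an assumption made for this sketch.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp into the valid bin range
        counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def tile_signature(feature_maps, lo=0.0, hi=1.0):
    """Concatenate one histogram per pixel-level feature into a single vector.

    `feature_maps` holds one flat list of per-pixel values for each feature;
    the common [lo, hi] range is a hypothetical simplification.
    """
    signature = []
    for values in feature_maps:
        signature.extend(histogram(values, lo, hi))
    return signature


# Fourteen toy feature maps yield a 14 * 16 = 224 element signature.
sig = tile_signature([[0.0, 0.5, 1.0]] * 14)
```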

Definition of Relevant Change
There are many applications that call for the use of automated change detection using remotely-sensed imagery. Each application has its own set of criteria that define the types of changes that are relevant and not relevant. For example, following a natural disaster, emergency management authorities are likely only interested in identifying areas that have been damaged or destroyed. Additionally, insurance companies have an interest in knowing about changes to properties for which they underwrite policies (e.g., expansions of existing structures, new outbuildings being constructed, etc.). Bruzzone and Bovolo propose a taxonomy of the causes of changes in [20]. In this paper, we propose a scenario in which relevant change is the subset of anthropogenic changes that may require features on a map to be updated; all other changes are not considered relevant. Within this definition of relevant change, we include any new building or an extension to an existing building that is at least 200 square meters in area (i.e., approximately the size of a small residential house). Additionally, we consider any new road, parking lot or other impervious surface to be a relevant change. The demolition of any existing building or road is also considered to be a relevant change. Finally, disturbed earth that has been cleared for non-agricultural purposes (e.g., construction, deforestation, etc.) is considered to be relevant change.
Conversely, seasonal or transitory changes are not considered to be relevant changes for this particular experiment. For example, vehicles in parking lots or on roads, although a common sight, are not considered to be a relevant change. Changes to road surfaces, such as repaving, do not constitute a relevant change, because they would not require an update of features on a map. Agricultural changes (including planting crops, plowing fields, etc.) and seasonal water body fluctuations are not relevant for this experiment. Finally, ephemeral changes, such as shadows or building glint (streaking caused by over-saturation of the sensor), are not relevant changes.

Classification Algorithms
In this manuscript, we present change classification results from four different algorithms in order to determine the relative efficacy of each. The input to each classification model is a real-valued feature vector along with a binary (change/no-change) classification for each training data point. For classification purposes, we use the same feature vectors used to cluster the tiles that were described in Section 2. From this training dataset, a classification model is built for nearest neighbor, SVM, decision tree (CART) and random forest classification.

k-Nearest Neighbors
The simplest algorithm employed in this manuscript is the k-nearest neighbors classification algorithm [21], used here in its nearest neighbor form. For a given tile to be classified, its feature vector, x, is compared to those of all training tiles in the set, and the class of the nearest tile in the feature space is assigned.
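A minimal sketch of this nearest neighbor rule is shown below. The Euclidean distance is an assumption made for illustration; the text does not state which distance measure was used.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_label(x, training):
    """Return the label of the training tile closest to x in feature space.

    `training` is a list of (feature_vector, label) pairs.
    """
    _, label = min(training, key=lambda pair: euclidean(x, pair[0]))
    return label


training = [([0.0, 0.0], "no-change"), ([1.0, 1.0], "change")]
prediction = nearest_neighbor_label([0.9, 0.8], training)
```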

Support Vector Machines
The use of support vector machines (SVMs) has been widely discussed as a means of performing nonlinear two-class classification. Originally developed by Cortes and Vapnik [22], SVMs are capable of performing efficient classification of data not otherwise linearly separable by employing the "kernel trick" to project data into a high-dimensional feature space.
For each tile in the training set, we can define a feature vector $x_i \in \mathbb{R}^d$ and assign class membership to it based on whether it was marked as being relevant change (i.e., let $y_i = 1$) or either not-relevant change or no change (i.e., let $y_i = -1$). If we let $w$ represent the vector normal to the hyperplane that divides the two sets and $b$ its offset, then we can solve the classification problem using quadratic programming. In the standard maximum-margin formulation, we must optimize:

$$\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^2 \tag{2}$$

subject to:

$$y_i \left( w \cdot x_i - b \right) \geq 1 \tag{3}$$

for all values of $i$. The resulting classifier is the one which maximizes the margin, or separation between the two classes, in the high dimensional space used by the chosen kernel.
Classification is then performed by projecting each new data point into the same high dimensional space and determining on which side of the hyperplane it falls.
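As an illustration of the classification step, the sketch below evaluates which side of a hyperplane a point falls on. It assumes a linear kernel and a hyperplane written as w · x − b = 0, where the offset b and all function names are introduced here for illustration; training the SVM (i.e., solving the quadratic program) is not shown.

```python
def decision_value(w, b, x):
    """Signed value w . x - b for a hyperplane with normal w and offset b."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def classify(w, b, x):
    """Return +1 (relevant change) or -1 (no relevant change) for point x."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin(w, b, samples):
    """Check the hard-margin constraint y * (w . x - b) >= 1 for every sample.

    `samples` is a list of (feature_vector, y) pairs with y in {+1, -1}.
    """
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)


w, b = [1.0, 0.0], 0.0
samples = [([2.0, 0.0], 1), ([-1.5, 3.0], -1)]
margin_ok = satisfies_margin(w, b, samples)
label = classify(w, b, [0.5, 7.0])
```

With a nonlinear kernel, the dot product would be replaced by kernel evaluations against the support vectors, but the sign test on the decision value is the same.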

Decision Tree Classification
Additionally, we employ the CART decision tree classification algorithm originally proposed by Breiman et al. [23]. Decision trees are a non-parametric technique in which a tree is built by choosing, at each node, how to split the dataset in a way that balances the data points and yields the greatest predictive accuracy. These splits continue recursively until each node contains data points belonging to a single class or some predetermined node size has been reached.
Classification can be performed by starting at the root node and walking the decision tree until a leaf node is reached. A class label is then assigned based on the label of the training data points in that leaf node.
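The tree walk described above can be sketched as follows. The `Node` structure and the threshold-based splits are illustrative assumptions, and the Gini impurity shown is the split criterion commonly associated with CART; the text does not state which criterion was used in the experiments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: Optional[int] = None      # set only on leaf nodes
    feature: int = 0                 # index of the feature tested at this node
    threshold: float = 0.0           # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def predict(node, x):
    """Walk from the root to a leaf and return the leaf's class label."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


# A one-split tree: feature 0 at or below 0.5 means no-change (0), else change (1).
root = Node(feature=0, threshold=0.5, left=Node(label=0), right=Node(label=1))
```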

Random Forest
The decision tree concept was extended by Breiman to create the ensemble classification method of random forests [24]. This technique utilizes "bagging" to sample the training dataset to produce multiple decision trees. During the classification stage, these trees are then used in concert to produce several classification results. Each tree casts a vote for the classification of the data point, and the consensus class (i.e., the one with the plurality of votes) is assigned.
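The two ingredients named above, bootstrap sampling ("bagging") and plurality voting, can be sketched in a few lines. Trees are represented here as plain callables, which is a simplification for illustration; growing each tree on its bootstrap sample is not shown.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (one bagging sample)."""
    return [rng.choice(data) for _ in data]


def plurality_vote(votes):
    """Return the class that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, x):
    """Let every tree vote on x and return the plurality class."""
    return plurality_vote([tree(x) for tree in trees])


# Three toy "trees", each thresholding a different feature.
trees = [
    lambda x: 1 if x[0] > 0.5 else 0,
    lambda x: 1 if x[1] > 0.5 else 0,
    lambda x: 1 if x[0] + x[1] > 1.0 else 0,
]
sample = bootstrap_sample(list(range(10)), random.Random(0))
```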

Results
In this section, we will describe the experiments performed to test the predictions of relevant change made by various algorithms. These experiments involved data from the three areas shown in Table 1. The regions used were varied in their landscape. Columbia, Missouri, USA, contains a mix of urban areas and rural farmland, both of which showed moderate amounts of change. The Las Vegas, Nevada, USA, imagery was highly urbanized and contained significant amounts of change. Finally, a sparsely-populated, mountainous area near Natanz, Iran, was used, which underwent very little change during the time period between the two images.
In order to generalize well, a classifier should be built with a training dataset that matches the natural distribution of the entire dataset [25]. This is particularly challenging with imbalanced datasets in which there are relatively few samples from one class. Methods to address the challenge of imbalanced datasets can be grouped into three categories [26]: adapting existing algorithms, pre-processing the datasets through sampling techniques or post-processing the classification model. While the Columbia and Las Vegas datasets are split roughly 3:1 between no change and change tiles, the Natanz dataset is split 50:1 between no change and change tiles. Given the potential challenges of our dataset, we will investigate dataset sampling techniques, such as those proposed in [27], to improve our classification results. The following sections describe sampling methods that employ knowledge of the dataset to ensure that a variety of types of data points are included in the training set. Table 2 shows the number of training and testing samples used for each dataset as well as the percent of change and no change tiles contained within each testing dataset.

Predictions Using High-Change Tiles
The first experiment involves using tiles that the GeoCDX system identified as being high change tiles. Using these high change tiles, each of the four classifiers is trained with the data corresponding to a fixed percentage of tiles. While the selected tiles are all high-change tiles, not all of the change captured by them is necessarily relevant change. We produced three different datasets with high change tiles; they include the highest ranked 5%, 15% and 25% of the dataset. Table 3 shows the chosen percentages and the corresponding number of tiles used from each pair.

Predictions Using High- and Low-Change Tiles
Next, we expand the training set by including an equal number of high and low change tiles. As shown in Table 4, we select a fixed percentage of high and low change tiles, which doubles the number of tiles compared to those selected in the previous section. This will ideally balance the number of tiles with change and those without relevant change, allowing the training process to create a more discriminative classifier instead of one that has been over-fitted to the high change data.
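The selection of training tiles from a ranked list can be sketched as below. The function names and the rounding of the tile count are assumptions for illustration; the actual tile counts used in the experiments are those reported in Tables 3 and 4.

```python
def tile_count(total, fraction):
    """Number of tiles for a given fraction of the list (at least one tile)."""
    return max(1, round(total * fraction))


def top_fraction(ranked, fraction):
    """Highest-ranked tiles only (training on high-change tiles)."""
    return ranked[:tile_count(len(ranked), fraction)]


def high_and_low(ranked, fraction):
    """Equal numbers of the highest- and lowest-ranked tiles.

    Assumes fraction <= 0.5 so that the two selections do not overlap.
    """
    n = tile_count(len(ranked), fraction)
    return ranked[:n] + ranked[-n:]


# Twenty tile identifiers, already ordered from most change to least change.
ranked = [f"tile{i:02d}" for i in range(20)]
high_only = top_fraction(ranked, 0.25)
balanced = high_and_low(ranked, 0.25)
```

Selecting from both ends of the ranking doubles the training set size relative to the high-change-only selection at the same percentage.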

Predictions Using Cluster Members
Recall that Section 2 described the clustering of change detection results. In an effort to train the classifier with a more diverse training dataset, we can use these clusters to produce our training samples. As was mentioned above, the number of clusters varies by pair, as does the number of members in each cluster. We began by producing a training dataset for each pair that contained the most representative member of each cluster in the pair. Then, we produced expanded training datasets for each pair by including the second and then third most representative member in each cluster. Table 5 shows a summary of the number of tiles used for each dataset.
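One way to pick representative members is sketched below. It assumes that "most representative" means closest to the cluster centroid in feature space, which is an assumption made for illustration; the clustering algorithm in [19] may define representativeness differently (for example, through cluster membership values).

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def most_representative(cluster, k=1):
    """Return the ids of the k members closest to the cluster centroid.

    `cluster` is a list of (tile_id, feature_vector) pairs.
    """
    center = centroid([vector for _, vector in cluster])
    ranked = sorted(cluster, key=lambda member: math.dist(member[1], center))
    return [tile_id for tile_id, _ in ranked[:k]]


cluster = [("a", [0.0, 0.0]), ("b", [1.0, 1.0]), ("c", [5.0, 5.0])]
first = most_representative(cluster, k=1)
first_two = most_representative(cluster, k=2)
```

Repeating this for every cluster with k = 1, 2 and 3 would yield training sets of increasing size, analogous to the Series C datasets.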

Predictions Using Cluster Members in Addition to High- and Low-Change Tiles
Finally, we also produce datasets that combine tiles that have very high GeoCDX change scores, very low GeoCDX change scores and representative exemplars from each GeoCDX change cluster.
Tables 6 and 7 show a summary of the composition of these training datasets. Ideally, these tiles depict the wide variety of land cover and land use types present in each pair to address the sampling concerns described in the introduction to Section 4.

Prediction Results
Using all of the training datasets described in the previous subsections, we construct a nearest neighbor, support vector machine, decision tree and random forest classifier for each dataset. We use each classifier to label all of the remaining data (i.e., the test data) and compare the results to ground truth change/no-change labels applied by an experienced imagery analyst. Based on these classification results, we catalog the following:
• true positive results: relevant change occurred, and it was classified as such;
• false positive results: no change occurred or a change that was not relevant occurred, but was classified by the algorithm as change;
• false negative results: a relevant change occurred, but was not correctly classified;
• true negative results: no change occurred or a change occurred that was not relevant, and the classifier correctly indicated this condition.
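Based on these four factors, we can calculate traditional assessment metrics of precision and recall as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

where $TP$, $FP$ and $FN$ denote the numbers of true positive, false positive and false negative results. An accuracy metric can also be calculated to measure the overall performance of each algorithm, as shown in Equation (6), where $TN$ denotes the number of true negative results.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{6}$$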
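These metrics can be computed directly from the four counts, as in the sketch below. The counts in the example are hypothetical and are not taken from the experiments; they mimic a 50:1 class imbalance to show why accuracy alone can be misleading when almost all tiles contain no relevant change.

```python
def precision(tp, fp):
    """Fraction of predicted-change tiles that truly contain relevant change."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of relevant-change tiles that the classifier found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all tiles that were labeled correctly."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts: 20 change tiles and 1000 no-change tiles, with a
# classifier that labels every tile as no-change.
tp, fp, fn, tn = 0, 0, 20, 1000
acc = accuracy(tp, fp, fn, tn)  # high, although no change tile was found
rec = recall(tp, fn)            # 0.0
```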
We present four tables that illustrate the precision, recall and accuracy values for each of the types of classifiers described in Section 3. Table 8 provides results for the nearest neighbor classifier, Table 9 for the support vector machine, Table 10 for the decision tree and Table 11 for the random forest classifier. Each row in each table corresponds to one of the training datasets described in Tables 3-7.
Table 8 shows the results of change prediction using a nearest neighbor classifier. Recall rates are highest when only using the high change tiles as training data (Datasets A1-A3); this holds true for all three test sites. However, when using this training data, overall accuracy clearly suffers. Generally, the highest combinations of precision, recall and accuracy values come from the training datasets that combine high change tiles and members from each of the change clusters (the F and G series datasets). However, overall, the results of using a nearest neighbor (NN) classifier are not compelling.
Results that quantify the performance of support vector machine (SVM) classification can be found in Table 9. Overall, these results show a marked improvement over those of the NN classifier. Again, the use of training sets that combine high change tiles with representative cluster members continues to produce the best classification results. Precision and recall values hover around the 50% mark, while overall accuracy is between 70% and 80% for the Columbia and Las Vegas datasets. The Natanz dataset represents an anomaly, because of the relatively small amount of actual change in the dataset; precision and recall values typically top out no higher than 30%, but overall accuracy is in the mid-90% range. This occurs because the classifier is able to accurately predict a large number of true negative tiles within this pair.
Generally, the results of the CART decision tree classification shown in Table 10 indicate a slight decrease in precision, recall and accuracy compared to those of the SVM. The notable exception was that the D and E series datasets showed a slight improvement using the CART decision tree compared to SVM.
Finally, the change prediction results using the random forest classifier, shown in Table 11, are the best among the four algorithms presented.

Analysis of Results Using a Generalized F-Score
The F-score is a commonly-used assessment of a classification algorithm's performance that takes into account both precision and recall. The traditional $F_1$ score is calculated as the harmonic mean of the precision and recall values and is bounded between zero and one. $F_1$ can be calculated as follows:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{7}$$

where $TP$ represents the number of true positive outcomes, $FP$ the number of false positive outcomes and $FN$ the number of false negative outcomes.
A more generalized version of the F-score can be calculated by introducing a variable $\beta$ that allows more emphasis to be placed on the precision or recall component. This generalized F-score, $F_\beta$, is calculated as:

$$F_\beta = (1 + \beta^2)\,\frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} = \frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + \beta^2\,FN + FP} \tag{8}$$

where a larger value of $\beta$ weights recall more highly than precision and a smaller value of $\beta$ emphasizes precision at the expense of recall.
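The effect of β can be seen in a short sketch that computes the generalized F-score from precision and recall. The precision and recall values in the example are hypothetical and chosen only to show how the score moves toward precision for small β and toward recall for large β.

```python
def f_beta(precision, recall, beta=1.0):
    """Generalized F-score: (1 + b^2) * P * R / (b^2 * P + R).

    Returns 0.0 when both precision and recall are zero.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Hypothetical classifier with high precision but low recall.
p, r = 0.9, 0.1
f_small = f_beta(p, r, beta=0.1)   # dominated by precision
f_one = f_beta(p, r, beta=1.0)     # harmonic mean of p and r
f_large = f_beta(p, r, beta=10.0)  # dominated by recall
```

For such a classifier, the score with β = 0.1 lies close to the precision and the score with β = 10 lies close to the recall, with the balanced score in between.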
Table 12 shows values of $F_{0.1}$, $F_1$ and $F_{10}$ calculated using the precision and recall values from Table 8. The $F_{0.1}$ score represents a measure in which precision is 10 times more important than recall. The $F_{10}$ measure weights precision and recall in a way that makes recall 10 times more important than precision. Finally, $F_1$ is the balanced weighting of the precision and recall values. Based on these three measures across the various datasets, we can see that, in general, the nearest neighbor classifier can produce satisfactory results if recall is preferred over precision, but it does not do so consistently. In particular, the Las Vegas and Natanz datasets very rarely have values greater than 0.5.
Values for the three F-scores using the SVM classifier are shown in Table 13. In this table, we begin to see higher scores due to the improved quality of the classification results compared to the NN classifier. Scores for the Columbia dataset are typically greater than 0.5 for all three measures. Many F-score values for Las Vegas pass that threshold, as well. However, the results for Natanz show little improvement using the SVM classifier.
As we noted in Section 4.5, the results for the CART classifier generally seem to be slightly worse than those of SVM. We see this same trend when examining the various F-score values for the CART classifier shown in Table 14.

Table 14. Analysis of decision tree (CART) change prediction results using a generalized F-Score.

Finally, the clearest improvement appears when we examine the F-score values for the random forest classifier shown in Table 15. In general, all training datasets for the Columbia tiles provide balanced results that favor neither precision nor recall. The Las Vegas training datasets produce results that favor precision over recall, as can be seen by their relatively high $F_{0.1}$ scores and their relatively low $F_{10}$ scores. It is interesting to note that the G series datasets for Las Vegas (i.e., G1, G2 and G3) produce low $F_{0.1}$ scores. Referring to Table 7, we can see that those datasets employ a very large percent of the pair's tiles. We believe that over-fitting is occurring, which prevents the classifier from generalizing well. The F-score values for the Natanz datasets show an interesting trend. Data Series A, C, D and E all produce low values for the three reported F-scores. Meanwhile, Data Series B, F and G report high scores for $F_{0.1}$, which means that the precision is relatively high. Recall that Table 2 showed that the Natanz pair was filled with an overwhelming number of no-change tiles. Only Data Series B, F and G include significant numbers of no-change tiles that allow the random forest classifier to produce an effective model of the training data that generalizes to the test data.

Discussion and Conclusions
This manuscript presents a method for predicting areas of relevant change within the GeoCDX system [2]. This system combines automated change detection processing with human-in-the-loop rapid triage of change detection results. While the GeoCDX system is agnostic to the type of change detected, human judgment is used to conclude whether a tile should be tagged as containing "relevant" change depending on the analyst's task. As a user interrogates change detection results presented by GeoCDX, we showed that we were able to use the change/no-change annotations of the imagery analyst to help predict whether subsequent tiles contained relevant change or not. These predictions are intended to decrease analysis time for the user.
Four different classification algorithms were used to perform the prediction; in general, the random forest classification algorithm performed the best. We also explored various schemes to construct a well-diversified training dataset that included areas of change and areas without change to ensure that the makeup of the training dataset reflects that of the entire dataset [25]. Generally, training datasets that included samples from all of the GeoCDX change clusters produced the best classifiers. We demonstrated that with an appropriate training dataset, we can produce a random forest classifier that can typically predict relevant change with an accuracy of greater than 70% and even up to 97%. The classifiers that are produced generally favor precision over recall, meaning that there will be relatively few false positive change indications.
In future work, we plan to investigate using more granular features extracted from the imagery to predict changes at a finer scale. We recognize the limitations of using the features extracted from the 256 by 256-meter tiles used by GeoCDX, but were generally pleased with the results that could be achieved with those features. Additionally, we plan to incorporate gaze tracking information gathered from system users [18] to better identify precisely which portions of the image are important for making decisions about relevant change versus irrelevant change versus no change. We expect that using this eye tracking information along with more fine-grained image features will improve future change predictions. Finally, additional experiments should be performed to gauge the improvement in analyst performance offered by using change/no-change annotations from an imagery analyst to predict the existence of relevant change in other, unseen portions of the image. We anticipate significant efficiency improvements by using our semi-automated approach to suggest whether relevant change has occurred or not; however, this should be verified experimentally.

Figure 1. A high-level overview of the Geospatial Change Detection and Exploitation System (GeoCDX) processing flow.

Figure 2. In the GeoCDX web user interface, the far left-hand side contains the navigation menu for the GeoCDX software. Immediately to the right of that are clickable links to sets of change detection results in batches of twenty tiles (i.e., 1-20, 21-40, etc.). Further to the right are three images in each row representing the before image, the after image and the corresponding change map that highlights changed regions. Finally, on the far right side of each row, there is a UI element that allows an analyst to tag a tile as "change" (the button with the red text) or "no change" (the button with the green text).

Figure 4. Overall workflow for using binary classification to predict relevant change.

Table 1. Information about the three image pairs used during the experiments presented herein.

Table 2. Size of training dataset versus testing dataset.

Table 3. In the Series A datasets, we selected only tiles that were highly ranked. Each row represents a dataset with a different fraction of tiles selected.

Table 4. The Series B datasets utilize a selection of tiles that were found to have the highest and lowest amounts of change for training. Each row represents a dataset with a different percent of records selected for training.

Table 5. In the Series C datasets, we utilize the most prototypical cluster members from each pair for training data. Each row in this table shows the number of training tiles used as the number of cluster members is increased.

Table 6. The D, E, F and G series datasets combine tiles with high and low amounts of change in them with prototypical cluster members to form diverse training datasets.

Table 7. This table provides more detailed information on the composition of the training datasets introduced in Table 6.

Table 8. Nearest neighbor change prediction results.

Table 9. Support vector machine change prediction results.

Table 10. Decision tree (CART) change prediction results.

Table 11. Random forest change prediction results.

Table 12. Analysis of nearest neighbor change prediction results using a generalized F-Score.

Table 13. Analysis of support vector machine change prediction results using a generalized F-Score.

Table 15. Analysis of random forest change prediction results using a generalized F-Score.