Manifold Learning Co-Location Decision Tree for Remotely Sensed Imagery Classification

Because traditional decision tree (DT) induction methods cannot efficiently take advantage of geospatial knowledge in the classification of remotely sensed imagery, several researchers have presented a co-location decision tree (CL-DT) method that combines the co-location technique with the traditional DT method. However, the CL-DT method only considers the Euclidean distance of neighborhood events, which cannot truly reflect the co-location relationship between instances for which there is a nonlinear distribution in a high-dimensional space. For this reason, this paper develops the theory and method for a maximum variance unfolding (MVU)-based CL-DT method (known as MVU-based CL-DT), which includes unfolding input data, unfolded distance calculations, MVU-based co-location rule generation, and MVU-based CL-DT generation. The proposed method has been validated by classifying remotely sensed imagery and is compared with four other types of methods, i.e., CL-DT, classification and regression tree (CART), random forests (RFs), and stacked auto-encoders (SAE), whose classification results are taken as “true values.” The experimental results demonstrate that: (1) the relative classification accuracies of the proposed method in three test areas are higher than CL-DT and CART, and are at the same level compared to RFs; and (2) the total number of nodes, the number of leaf nodes, and the number of levels are significantly decreased by the proposed method. The time taken for the data processing, decision tree generation, drawing of the tree, and generation of the rules are also shortened by the proposed method compared to CL-DT, CART, and RFs.


Introduction
In the past several decades, a number of methods for classifying remotely sensed imagery have been proposed and improved [1][2][3][4][5].Decision tree (DT) is one of the most prevalent methods [6][7][8][9].The DT method's application to remotely sensed image classification was first presented in 1975 by Wu et al. [10].Since then, many researchers have endeavored to develop new theories and methods for constructing optimal DTs.The typical algorithms include, but are not limited to, classification and regression tree (CART) [11], iterative dichotomiser 3 (ID3) [12], C4.5 decision tree (C4.5) [13], and C5.0 decision tree (C5.0) [14].Two primary issues in DT induction methods are: (1) the growth of the DT, where the split criteria are used to increase the number of nodes; and (2) the pruning phase of the DT, where the superfluous nodes and breaches are removed to improve the classification accuracy.To construct a high-quality DT, many methods have been proposed in the literature, such as [15][16][17][18].A number of methods for pruning have also been proposed in the literature, such as [19][20][21][22][23][24].They have aimed to improve the classification accuracy by removing superfluous nodes and branches.
In recent years, many advanced/improved DT methods have been widely applied in image classification, especially for those areas with different objects but equivalent/similar spectral characteristics in the imagery.For example, Li et al. [25] combined the Dempster-Shafer (D-S) evidence theory with CART to propose a called D-S evidence theory DT method.This method can overcome the defect of traditional DTs in that they cannot address the uncertainty of remote sensing data and scene classification.This method can achieve very high accuracy in the classification of Landsat-5 TM image data.In addition, "hybrid" DT methods have been proposed by several researchers.For example, Farid et al. [26] proposed a "hybrid" DT algorithm that employs a naive Bayes classifier to remove the noisy troublesome instances from the training set before the DT induction.The experimental results demonstrated that the proposed method produced impressive results in the classification of challenging multi-class problems.In addition, to improve the classification accuracy, Feras Al-Obeidat et al. [27] proposed an improved decision tree algorithm based on a fuzzy method to classify Landsat satellite images that yielded better efficiency results and an interpretable model.
Traditional DT methods, however, only consider spectral information without efficiently taking advantage of the geospatial relationship between instances.Consequently, these methods unavoidably result in misclassifications, which are much worse in regions with different objects but equivalent/similar spectra.To overcome this type of shortcoming, Zhou et al. [19,28] presented a co-location-based decision tree (CL-DT) method, which has been successfully applied in road surface classification.Compared with the traditional DT, CL-DT can yield better classification results because the CL-DT method considers both geospatial attributes and non-geospatial attributes and employs co-location mining technology to first classify the co-location attributes of instances that have similar characteristics within a certain geospatial distance.
However, CL-DT only considers the Euclidean distance (ED), i.e., distance (i, j), as a geometric constraint condition.Although the ED can be used to effectively determine the co-location relationship between instances when they belong to a linear distribution in three-dimensional (3D) space, it is not capable of effectively determining their co-location relationship when they are a nonlinear distribution in 3D or higher-dimensional space [29].For example, Figure 1a shows the data set of a Swiss roll in a 3D space, in which the ED AB between instances A and B is shorter than the distance ACB, which is along the surface of the roll and effectively reflects whether there is a co-location relationship between A and B. This can be demonstrated by the fact that, if the Swiss roll is unfolded as in Figure 1b using the maximum variance unfolding (MVU) method based on the notion of local isometry, the MVU method can preserve the original neighborhood relationship of A and B, including the lengths of edges and the angles between edges at the same node [30].After unfolding the input data, the unfolded distance (see Figure 1b) can effectively reflect the relationship between instances.Furthermore, in addition to the X/Y coordinates, the land surface temperature (LST), vegetation coverage (VC), and other factors are each able to be regarded as one of the dimensions of a high-dimensional space in which the instances belong to non-linear distributions.
For the problems mentioned above, this paper presents the maximum variance unfolding (MVU)-based co-location decision tree (MVU-based CL-DT) method.The contributions of this paper include: (1) an "unfolded distance" between instances is proposed for effectively reflecting the real co-location relationship between instances for which there is a nonlinear distribution in a high-dimension space; and (2) the proposed MVU-based CL-DT, induced by using the real co-location relationship between instances, can efficiently improve the classification accuracy of different objects with equivalent/similar spectral responds.In this paper, the classification results of SAE are considered as the "true value".However, since SAE requires learning many parameters, the efficiency will be affected.

Brief Review of MVU
The MVU algorithm was first proposed by [30], whose goal was to detect and discover faithful low-dimensional structures in high-dimensional data [31].The details of the MVU method can be found in [31].Briefly, the MVU algorithm can faithfully preserve distances and angles among nearby input data instances by computing a low-dimensional representation of the high-dimensional data set [31].After MVU processing, the neighboring instances of the input and output are invariant about the translation and rotation.Consequently, the unfolded distances between instances will be calculated to determine the co-location relationship between instances.
Over the past few decades, many efforts have been made to further develop the MVU method.Weinberger et al. [30,31] put forward an MVU algorithm that added a kernel matrix to the algorithm and used the positive semi definite constraint of the kernel matrix to realize the convex optimization of data.Shao et al. [32] discussed the deficiencies of kernel principal component analysis for monitoring nonlinear processes and proposed a new process monitoring technique based on maximum variance unfolding projections (MVUP).Liu et al. [33] analyzed the shortcomings of MVU for process monitoring and proposed an extended maximum variance unfolding (EMVU) method for nonlinear process monitoring.Ery et al. [34] analyzed and discussed the convergence of maximum variance unfolding, and they proved that it is consistent when the underlying sub manifold is isometric to a convex subset.

Basic Idea of the Proposed MVU-Based CL-DT
The details of the proposed method consist of three parts (see Figure 2).
1. Unfold the input data using the MVU method.To obtain the unfolded distance between instances, the original input dataset X, which is a nonlinear distribution in higher-dimensional space, is preprocessed using the MVU method.A new dataset X′, which is regarded as a linear distribution, is obtained and employed to: (1) calculate the unfolded distance; and (2) become the root node of the DT. 2. Mine co-location rules.Co-location rules are mined through a co-location algorithm with the unfolded distances, which are calculated from the dataset X′. 3. Generate MVU-based CL-DT.The dataset X′ is taken as the input data of the root node of the DT.At the beginning, all attributes of dataset X′ are "accepted" by the root node, and one "best" attribute, such as BA1, is selected from the input dataset X′; then, a splitting criterion is used to determine whether the root node will be split using a binary decision (Yes or No in Figure 2) into two intermediate nodes, noted by wi and wj (w = A, i = 1, and j = 2 in Figure 2).For each of the intermediate nodes (wi and wj), the splitting criterion will be applied to determine whether the node should be further split.If No, this node is considered a leaf node, and one of the class

Brief Review of MVU
The MVU algorithm was first proposed by [30], whose goal was to detect and discover faithful low-dimensional structures in high-dimensional data [31].The details of the MVU method can be found in [31].Briefly, the MVU algorithm can faithfully preserve distances and angles among nearby input data instances by computing a low-dimensional representation of the high-dimensional data set [31].After MVU processing, the neighboring instances of the input and output are invariant about the translation and rotation.Consequently, the unfolded distances between instances will be calculated to determine the co-location relationship between instances.
Over the past few decades, many efforts have been made to further develop the MVU method.Weinberger et al. [30,31] put forward an MVU algorithm that added a kernel matrix to the algorithm and used the positive semi definite constraint of the kernel matrix to realize the convex optimization of data.Shao et al. [32] discussed the deficiencies of kernel principal component analysis for monitoring nonlinear processes and proposed a new process monitoring technique based on maximum variance unfolding projections (MVUP).Liu et al. [33] analyzed the shortcomings of MVU for process monitoring and proposed an extended maximum variance unfolding (EMVU) method for nonlinear process monitoring.Ery et al. [34] analyzed and discussed the convergence of maximum variance unfolding, and they proved that it is consistent when the underlying sub manifold is isometric to a convex subset.

Basic Idea of the Proposed MVU-Based CL-DT
The details of the proposed method consist of three parts (see Figure 2).
1. Unfold the input data using the MVU method.To obtain the unfolded distance between instances, the original input dataset X, which is a nonlinear distribution in higher-dimensional space, is preprocessed using the MVU method.A new dataset X , which is regarded as a linear distribution, is obtained and employed to: (1) calculate the unfolded distance; and (2) become the root node of the DT. 2. Mine co-location rules.Co-location rules are mined through a co-location algorithm with the unfolded distances, which are calculated from the dataset X .3. Generate MVU-based CL-DT.The dataset X is taken as the input data of the root node of the DT.At the beginning, all attributes of dataset X are "accepted" by the root node, and one "best" attribute, such as BA 1 , is selected from the input dataset X ; then, a splitting criterion is used to determine whether the root node will be split using a binary decision (Yes or No in Figure 2) into two intermediate nodes, noted by w i and w j (w = A, i = 1, and j = 2 in Figure 2).For each of the intermediate nodes (w i and w j ), the splitting criterion will be applied to determine whether the node should be further split.If No, this node is considered a leaf node, and one of the class labels is assigned to this leaf node.If Yes, this node will be split by selecting another "best" attribute, BA 2 .Meanwhile, the MVU-based co-location criterion will be used to judge whether the "best" attribute BA 2 is co-located with BA 1 at the same layer.If Yes, this node will be "merged" into the node and named a co-location node.For example, nodes A1 and A2 are merged into a new node A1 in Figure 2.After that, one new "best" attribute will again be selected, and whether the selected "best" attribute co-occurs with the last "best" attribute is re-judged.If No, the node will be further split into a sub-set by repeating the above operation.These above processes are repeated until all data have been processed.labels is assigned to this leaf node.If Yes, this node will be split by selecting another "best" attribute, BA2.Meanwhile, the MVU-based co-location criterion will be used to judge whether the "best" attribute BA2 is co-located with BA1 at the same layer.If Yes, this node will be "merged" into the node and named a co-location node.For example, nodes A1 and A2 are merged into a new node A1′ in Figure 2.After that, one new "best" attribute will again be selected, and whether the selected "best" attribute co-occurs with the last "best" attribute is re-judged.If No, the node will be further split into a sub-set by repeating the above operation.These above processes are repeated until all data have been processed.

•
Neighbor Relationship Matrix Reservation The mapping function of an MVU method should preserve the neighborhood relationship of instances, including the angles and distances of instances.For this reason, the matrices denote the input dataset and output dataset, respectively, which have an exactly one-to-one correspondence to the setup to preserve the neighborhood relationship.We define a matrix δ to preserve the neighborhood relation among instances for X (and similarly, y for Y).

• MVU Function Establishment and Solution
The MVU algorithm "unfolds" the input data by maximizing the sum of pairwise squared distances among the output data.Therefore, the objective function can be expressed by: 0. where Here, the distances between nearby inputs are enforced to

• Neighbor Relationship Matrix Reservation
The mapping function of an MVU method should preserve the neighborhood relationship of instances, including the angles and distances of instances.For this reason, the matrices denote the input dataset and output dataset, respectively, which have an exactly one-to-one correspondence to the setup to preserve the neighborhood relationship.We define a matrix δ to preserve the neighborhood relation among instances for X (and similarly, y for Y).If the input instances X i and X k (i = k) are k-nearest neighbors, δ ik = 1; alternately, if X i and X k (i = k) are not k-nearest neighbors but another input instance X j is embedded between X i and X k , which makes X i and X k k-nearest neighbors, δ ik = 1 as well; otherwise, δ ik = 0 (and similarly for Y i and Y k ).

• MVU Function Establishment and Solution
The MVU algorithm "unfolds" the input data by maximizing the sum of pairwise squared distances among the output data.Therefore, the objective function can be expressed by: are the squared distances of the input instances ( X i , X k ) and output instances ( Y i , Y k ), respectively.Here, the distances between nearby inputs are enforced to match distances between nearby outputs by the first constraint.Meanwhile, a unique solution (up to rotation) is yielded by the second constraint, centering the outputs on the origin.
To finesse the apparent intractability of the above quadratic program, defining an inner product matrix L (L ik = Y i • Y k ), the objective function can be rewritten as follows (see [35,36] for details): The third constraint requires the inner product matrix L to be positive semidefinite.After solving (see [35,36] for details), the inner product matrix L can be obtained, and then the output instance Y i can be obtained by utilizing the method of matrix diagonalization.This algorithm can be described as follows.
Let U βi represent the β-th element of the i-th eigenvector, with eigenvalue λ β ; then, the inner product matrix L can be rewritten as: • Calculation of the Unfolded Distance between Instances Based on the preserved neighbor relation matrix, we can use the established MVU model to unfold input data, and then the unfolded distances between instances are calculated as follows.If Y i and Y j are k-nearest neighbors, the unfolded distance, noted as D U (Y i , Y j ), can be replaced by the Euclidean distance of Y i and Y j , noted as D E (Y i , Y j ).Otherwise, if they are not k-nearest neighbors, D U (Y i , Y j ) is represented by the shortest cumulative Euclidean distance of a number of neighbors.This criterion can be mathematically described by: To further explain the unfolded distance between instances, an example is given in Figure 3.As observed from Figure 3, k-nearest neighbors are connected by lines, such as Y 1 , Y 2 , and Y 3 .In Figure 3, the following two cases are considered: (1) the calculation of the unfolded distance between k-nearest neighbors; and (2) the calculation of the unfolded distance between instances that are not k-nearest neighbors.For k-nearest neighbors, such as Y 1 and Y 3 , the unfolded distance is D E (Y 1 , Y 3 ).However, for Y 10 and Y 1 , Y 10 is not one of the k-nearest neighbors of Y 1 .Hence, on the basis of Equation (5) k-nearest neighbors.For k-nearest neighbors, such as Y1 and Y3, the unfolded distance is DE(Y1, Y3).However, for Y10 and Y1, Y10 is not one of the k-nearest neighbors of Y1.Hence, on the basis of Equation ( 5), DU(Y1, Y10) is the shortest cumulative Euclidean distance of a number of neighbors (such as Y1, Y3, Y 5 , Y6, Y9, and Y10).Because the output data of MVU can be viewed as a connected graph and two arbitrary instances can be connected by a number of intermediate k-nearest neighbors, the shortest path can always be found.With the above analysis, the MVU unfolded distance (MUD) algorithm can be summarized as Algorithm 1, which calculates the unfolded distance after unfolding input data using MVU [30].If X i and X k satisfy K-NN Then δ ik = 1; else δ ik = 0; 6.

7.
Compute the maximum variance unfolding Gram matrix L of sample pairwise points, which is centered on the origin.8.
Perform generalized eigenvalue decomposition for the Gram matrix L obtained by Step 2. The eigenvectors, corresponding to the front d greater eigenvalues, are the result of embedding.

11.
Y Step 4: Calculate the unfolded distance of instances Y i and Y j 13.

Determination of MVU-based Co-Locations
The successful completion of the above steps only obtains the unfolded distance between instances.With the unfolded distances, an R-relationship (RRS) between instances, which describes the neighborhood relationship between geospatial instances [6,27], can be created by the method proposed below.
If and only if the unfolded distance between Y i and Y j is less than or equal to the distance threshold, there exists an RRS between them, and it is mathematically expressed by: where D θ is the threshold of unfolded distance, R(Y i , Y j ) represents the RRS of Y i and Y j , and is the unfolded distance of Y i and Y j .After the determination of the RRS, those that satisfy the RRS condition are the candidates of MVU-based co-location.After the process above, only the 2nd-order candidates of MVU-based co-location are obtained.To determine MVU-based co-locations, different attributes (e.g., SSM, LST, and VC) in the image classification of every instance are utilized.To this end, a density ratio (DR) is defined to determine whether instances are MVU-based co-location patterns, and it is expressed by: Because those instances with MVU-based co-location patterns have similar characteristics, the phenomenon of different objects having the same spectrum in remote sensing images will be eliminated.Thus, the classification accuracy can be improved.Because those instances with MVU-based co-location patterns have similar characteristics, the phenomenon of different objects having the same spectrum in remote sensing images will be eliminated.Thus, the classification accuracy can be improved.

Determination of Distinct Event-Types
These above processes only mine the MVU-based co-location instances and cannot ensure that these instances are distinct event types.Therefore, a constraint condition must be set up as below.
Let M = {m1, m2,…, mc} be a set of corresponding cluster centers of attributes a1, a2, …, ac such that the distinct event-type can be expressed by: where ||fi−mk|| denotes the Euclidean distance between fi and mk; fi is the value of the i-th instance in the k-th attribute ak of one class; is the cluster center of the k-th attribute ak of one class, where N is the number of instance of one class; i Ψ is a squared error clustering criterion; C is the number of attributes; and S is the number of instances.Therefore, if i Ψ is greater than a given threshold θ Ψ , the i-th instance is regarded as a distinct event.

Generation of MVU-Based Co-Location Rules
Accompanying the generation of MVU-based co-locations, the MVU-based co-location rules can be created from the candidates of MVU-based co-location.With the methods proposed above, the algorithm and pseudo code for determination of MVU-based co-locations (MCL) are developed in this paper and can be described via Algorithm 2, which uses unfolded distance to replace Euclidean distance of [19].

Determination of Distinct Event-Types
These above processes only mine the MVU-based co-location instances and cannot ensure that these instances are distinct event types.Therefore, a constraint condition must be set up as below.
Let M = {m 1 , m 2 , . . ., m c } be a set of corresponding cluster centers of attributes a 1 , a 2 , . . ., a c such that the distinct event-type can be expressed by: where f i −m k denotes the Euclidean distance between f i and m k ; f i is the value of the i-th instance in the k-th attribute a k of one class; N is the cluster center of the k-th attribute a k of one class, where N is the number of instance of one class; Ψ i is a squared error clustering criterion; C is the number of attributes; and S is the number of instances.Therefore, if Ψ i is greater than a given threshold Ψ θ , the i-th instance is regarded as a distinct event.

Generation of MVU-Based Co-Location Rules
Accompanying the generation of MVU-based co-locations, the MVU-based co-location rules can be created from the candidates of MVU-based co-location.With the methods proposed above, the algorithm and pseudo code for determination of MVU-based co-locations (MCL) are developed in this paper and can be described via Algorithm 2, which uses unfolded distance to replace Euclidean distance of [19].If Step 3: 10.

Pruning
In a DT induction algorithm, different pruning algorithms are employed to prune the nodes.In this paper, a pruning algorithm proposed by Zhou et al. [19] is employed.The pruning algorithm is briefly described as follows.
For a geospatial data set SP, let FT = ( f t 1 , f t 2 , f t 3 , ..., f t h ) be a set of geospatial attributes.Let Φ = (ϕ 1 , ϕ 2 , ϕ 3 , ..., ϕ n ) be a set of n instances in SP, where each instance is a vector containing the instance-id, geospatial attributes, and location.The geospatial attribute of instance i is denoted by ft i .It is assumed that the geospatial attributes of an instance are from the geospatial attribute set FT such that the location is within the geospatial framework of a geospatial database and there is an RRS in SP.Additionally, let O = (o 1 , o 2 , o 3 , ..., o k ) be a set of corresponding clusters centered in the data set SP, where k is the number of clusters of geospatial attributes.To capture the concept of "nearby," the criterion of co-occurrence is defined as: where ϕ q − o i is the Euclidean distance between x q and o i ; Ξ m is the squared error clustering criterion; and V = {v iq }, i = 1, 2, . . ., h, q = 1, 2, . . ., N, is a matrix that satisfies the following constraint conditions: Thus, if Ξ m is less than or equal to a given threshold, the two nodes are considered to be a co-occurrence and thus should be merged.

Inducting Decision Rules
After the MVU-based CL-DT is generated above, decision rules will be created by translating the decision tree into semantic expressions.Because the MVU-based CL-DT algorithm partitions a data space into several distinct disjoint regions via axis-parallel surfaces, the top-down search method will be employed to translate individual nodes into rules in this paper.
As a summary, through introducing the MVU used to obtain unfolded distance into CL-DT [19], the entire proposed MVU-based CL-DT algorithm is described in Algorithm 3. Step 5: Starting with a single node, the root, which includes all rules and attributes; 9. Step 6: Judge whether each non-leaf node will be further split, e.g., w i ; 10.
• Perform a label assignment test to determine if there are any labels that can be assigned; 11.
• An attribute is selected according to the splitting criterion to split w i further and judge whether the stop criterion is met; 12.
• If the selected attribute meets the splitting criterion, the node will be parted into subsets; 13.
• If the stop criterion is satisfied, stop splitting and assign w i as a leaf node; 14.Step 7: Apply the MVU-based co-location algorithm for each of the two non-leaf nodes in the same layer, e.g., w i and w j , to test whether the two nodes satisfy the co-location criterions.If yes, merge the two neighbor nodes; if no, go to Step 6.

Experiments and Analysis
In this section, we describe the test dataset and experimental environment and present the results of remotely sensed imagery classification.In addition to the proposed method, SAE (a deep neural network), CL-DT, CART, and random forests (RFs) are also compared to evaluate our method.The details are described as follows.(3) The third test area and dataset: The third test area is located from 29.718 • to 29.724 • north latitude and from 95.346 • to 95.354 • west longitude.The data were acquired on 23 June 2012 by the NSF-funded Center for Airborne Laser Mapping (NCALM) with a compact airborne spectrographic imager (CASI) over the University of Houston campus.A 311 × 278-pixel subset was used for classification in this paper.The data consists of 144 spectral bands in the 380-nm to 1050-nm region with 2.5-m spatial resolution and had been calibrated to at-sensor spectral radiance units.

Non-Spatial Attribute and Spatial Attribute Selection
In addition to the original TM imagery data, as a rule of thumb, the data of five non-spatial attributes, including SSM, LST, VC, the components of PCA, and texture (TEX), are considered to classify the land covers.SSM data can be generated by the method of [41].LST is retrieved from Landsat TM data by a mono-window algorithm [42].VC can be obtained by the vegetation index method [43].By applying the PCA function of ENVI 4.8 software for the pre-processed TM imagery data, the components of PCA can be acquired, whose i-th component can be represented by PCA i .Based on the co-occurrence measures, the texture (noted as TEX) data can be produced.Finally, there are 30,217,776 instances for each attribute in the first test area and 4,954,112 instances in the second test area.
For further processing, the instances are considered as five classes, i.e. water (WT), vegetation (VG), exposed carbonatite (EC), habitation (HB), and cultivated land (CL), in the first two experiments.
In addition to the non-geospatial attributes above, spatial attributes, including X and Y coordinates, are considered as well.The metadata for the Geospatial data are listed in Table 1.Furthermore, these TM imagery data (SSM, LST, VC, the components of PCA, TEX, and X/Y coordinates) are managed by a database.Each of these coordinates is regarded as one of the dimensions of high-dimensional space in which instances belong to a non-linear distribution.Therefore, the coordinate of every instance can be viewed as a vector expressed by (att 11 , att 12 , . . ., att 1n ), where att ij represents the value of the i-th instance of the j-th attribute.

Experiments
The flowchart of the experimental procedure is depicted in Figure 5. First, the MVU algorithm is utilized to "unfold" input data, and the MVU-unfolded distances between instances are calculated.Second, the unfolded distance is merged with the co-location mining algorithm to establish the exact RRS between instances, and then the MVU-based co-location rules are mined.Finally, the MVU-based co-location rules are used to induce the generation of the decision tree, and MVU-based co-location decision rules are obtained.

Experiments
The flowchart of the experimental procedure is depicted in Figure 5. First, the MVU algorithm is utilized to "unfold" input data, and the MVU-unfolded distances between instances are calculated.Second, the unfolded distance is merged with the co-location mining algorithm to establish the exact RRS between instances, and then the MVU-based co-location rules are mined.Finally, the MVU-based co-location rules are used to induce the generation of the decision tree, and MVU-based co-location decision rules are obtained.The experiment was conducted using a computer with an Intel Xeon E5645 six-core 2400-MHz CPU (with a 12-MB cache) and 4 GB of RAM.We implemented the proposed algorithm in ENVI + IDL 4.8.For all experiments, we have used 10-fold cross-validation.By connecting with Google Earth, we used polygons, which covered 1 to 4 sample points of the same class in a polygon, to manually collect the samples of five classes (WT, VG, EC, HB, and CL) in Landsat TM images.The number of collected samples containing training samples and testing samples are 10,000 for each class in the 1st and 2nd areas.The Global Moran's I method [44][45][46][47] is used to analyze the spatial autocorrelation of training samples (see Table 2).As shown in Table 2, the training samples are not spatially auto-correlated, which does not affect the classification results.Ten-fold cross-validation breaks datasets of the collected samples into 10 subsets of size N/10.Nine subsets of the samples of five classes are used to train the proposed classifier, and the one remaining subset is used to test/validate the proposed classifier.

Data input:
The MVU-based co-location database, which has completed the "pre-processing" using the MVU-based co-location mining method, is loaded.Additionally, a subset of data is applied to build the model, and the rest is utilized to validate the performance of the model.
Parameters input: It is necessary to input the parameters below to optimize the process of DT induction.These parameters chosen using cross-validation include the following: 1. Minimum node size: The minimum valid node size can be set in the range of 0-100.The experiment was conducted using a computer with an Intel Xeon E5645 six-core 2400-MHz CPU (with a 12-MB cache) and 4 GB of RAM.We implemented the proposed algorithm in ENVI + IDL 4.8.For all experiments, we have used 10-fold cross-validation.By connecting with Google Earth, we used polygons, which covered 1 to 4 sample points of the same class in a polygon, to manually collect the samples of five classes (WT, VG, EC, HB, and CL) in Landsat TM images.The number of collected samples containing training samples and testing samples are 10,000 for each class in the 1st and 2nd areas.The Global Moran's I method [44][45][46][47] is used to analyze the spatial autocorrelation of training samples (see Table 2).As shown in Table 2, the training samples are not spatially auto-correlated, which does not affect the classification results.Ten-fold cross-validation breaks datasets of the collected samples into 10 subsets of size N/10.Nine subsets of the samples of five classes are used to train the proposed classifier, and the one remaining subset is used to test/validate the proposed classifier.

Data input:
The MVU-based co-location database, which has completed the "pre-processing" using the MVU-based co-location mining method, is loaded.Additionally, a subset of data is applied to build the model, and the rest is utilized to validate the performance of the model.
Parameters input: It is necessary to input the parameters below to optimize the process of DT induction.These parameters chosen using cross-validation include the following: 1. Minimum node size: The minimum valid node size can be set in the range of 0-100.2. Maximum purity: If the purity of a node is 95% or higher, the algorithm stops splitting it.
Additionally, if its number of records is 1% or less of the total number of records, the algorithm stops splitting it.3. Maximum depth: The maximum valid depth is 20-30.

• Calculation of unfolded distances
There is a high correlation among different bands of remotely sensed imagery, and most of the data are redundant.The principle component analysis (PCA) is able to make these principal component images unrelated with each other.To save computational time in the MVU-based CL-DT generation, this paper proposes applying the PCA method to delete those instances that are apparently not neighbors after eliminating the correlation between images.This paper takes one attribute, PCA 1 , which represents the first components of PCA in image processing, as an example to explain the initial processing.
Different thresholds are set up for the instances of different classes to determinate the initial candidates of the MVU-based co-location.For example, the thresholds of PCA 1 are set as r θ1 and r θ2 (r θ1 < r θ2 ) for the exposed carbonatite.If the values of PCA 1 are within r θ1 and r θ2 , the instances with eligible values are the initial candidate instances.This can be mathematically expressed by: where r θ1 and r θ2 are the given minimum and maximum thresholds of PCA 1 , respectively; P is the set of PCA 1 (i, j); and i and j represent the i-th row and j-th column, respectively.With the implementation of this step, those non-neighbored instances can be deleted, and the other instances are called "initial candidate instances".This means that the non-neighbored instances are excluded in the following computation.As a result, computational time is significantly conserved.
From Equation ( 12), the components of PCA are also selected to determine the initial candidates of the MVU-based co-location.For example, the threshold of PCA 3 , as for the exposed carbonatite, is less than or equal to −5, i.e., When the above step is successfully finished, the MVU algorithm is utilized to unfold these data using Equation (2).After unfolding these data, the unfolded distances between instances are calculated using Equation (7).After calculation, a sparse matrix D m×m of unfolded distances among instances can be obtained.
where m is 30,217,776 and the units are meters.

• Determination of the MVU-based co-location
After obtaining the unfolded distances between instances, the RRSs between instances are determined using Equation (6), in which the threshold of unfolded distance is set as 60, i.e., D θ = 60 (m), because the resolution of the TM images is 30 m, except for the thermal infrared band.Therefore, we have With the above determination of the RRS, those instances with RRS are the candidates for the MVU-based co-location.A sparse matrix CD m×m of the candidates for MVU-based co-location is generated.
On the basis of Equation ( 7) and Equation ( 16), the PCA components, VC, SSM, LST, and TEX, of each candidate pair of MVU-based co-location instances are employed to determine the MVU-based co-locations.We take the EC as an example to describe the method here.As mentioned above, if the DRs of candidates of MVU-based co-location of the exposed carbonatite are greater than or equal to 4/5, they are MVU-based co-locations.If not, they will be deleted from the candidates.With the application of this algorithm, the kth-order MVU-based co-locations can be obtained, as shown in Figure 6.
Remote Sens. 2016, 8, 855 13 of 31 With the above determination of the RRS, those instances with RRS are the candidates for the MVU-based co-location.A sparse matrix CDm×m of the candidates for MVU-based co-location is generated.
On the basis of Equation ( 7) and Equation ( 16), the PCA components, VC, SSM, LST, and TEX, of each candidate pair of MVU-based co-location instances are employed to determine the MVU-based co-locations.We take the EC as an example to describe the method here.As mentioned above, if the DRs of candidates of MVU-based co-location of the exposed carbonatite are greater than or equal to 4/5, they are MVU-based co-locations.If not, they will be deleted from the candidates.With the application of this algorithm, the kth-order MVU-based co-locations can be obtained, as shown in Figure 6.With the calculation above, 8,122,155 instances are preserved.Figure 6c only shows 25 instances, and the 1st-order (ignored) through 6th-order table instances of MVU-based co-location are presented.Table 3 uses four examples to explain the generation processes of the determination of MVU-based co-location.Figure 6c only shows the DRs of the candidate patterns (3,4) and (11,12,16), which are greater than 4/5.For pattern (2, 3), although there is an RRS between them, the TEX is greater than 50, the SSM is not in the range of 5 to 8, and the VC is not less than 0.25 (and similarly so for pattern (10,15)), so they are deleted from the candidate set.

Determination of distinct-type events
On the basis of Equation ( 8), for EC, the cluster center value of attributes (such as VC) can be obtained from the average value of samples.Thus, we have  With the calculation above, 8,122,155 instances are preserved.Figure 6c only shows 25 instances, and the 1st-order (ignored) through 6th-order table instances of MVU-based co-location are presented.Table 3 uses four examples to explain the generation processes of the determination of MVU-based co-location.Figure 6c only shows the DRs of the candidate patterns (3,4) and (11,12,16), which are greater than 4/5.For pattern (2, 3), although there is an RRS between them, the TEX is greater than 50, the SSM is not in the range of 5 to 8, and the VC is not less than 0.25 (and similarly so for pattern (10,15)), so they are deleted from the candidate set.• Determination of distinct-type events On the basis of Equation ( 8), for EC, the cluster center value of attributes (such as VC) can be obtained from the average value of samples.Thus, we have With the calculation, the threshold is set to 0.04.If Ψ i is greater than or equal to 0.04, then the i-th instance is a distinct event.

Experimental Results
A DT is induced by the method proposed above.With the post-processing of the DT, five decision rules are induced (Figure 7).
With the calculation, the threshold is set to 0.04.If i Ψ is greater than or equal to 0.04, then the i-th instance is a distinct event.

Experimental Results
A DT is induced by the method proposed above.With the post-processing of the DT, five decision rules are induced (Figure 7).The feature/attribute importance can be measured by the importance rate of attribute (IRA) [48].Suppose n attributes, (att1, att2,…, attn), are used to build the DT.The total number, num_atti, of training samples classified using the i-th attribute atti can be obtained by: where m is the time of classification participation of the i-th attribute and splitnum(j) is the number of training samples classified using atti at the j-th time.Thus, the importance rate of the i-th attribute, IRAi, can be calculated by: After calculation, the importance rates of attributes' classification participation are 0.375, 0.208, 0.129, 0.117, 0.083, 0.051, and 0.037 for VC, PCA1, TEX, PCA3, SSM, LST, and PCA2, respectively.
The classification results in the first test area using the proposed method are presented in Figure 8a.For comparison analysis, the classified results using CL-DT, SAE [1,49], CART and RFs [50] methods are depicted in Figure 8b-e, respectively.Furthermore, their differences are marked by black circles in Figure 8.In addition, the distribution of exposed carbonatite is shown in Figure 9.The feature/attribute importance can be measured by the importance rate of attribute (IRA) [48].Suppose n attributes, (att 1 , att 2 , . . ., att n ), are used to build the DT.The total number, num_att i , of training samples classified using the i-th attribute att i can be obtained by: where m is the time of classification participation of the i-th attribute and splitnum(j) is the number of training samples classified using att i at the j-th time.Thus, the importance rate of the i-th attribute, IRA i , can be calculated by: After calculation, the importance rates of attributes' classification participation are 0.375, 0.208, 0.129, 0.117, 0.083, 0.051, and 0.037 for VC, PCA1, TEX, PCA3, SSM, LST, and PCA2, respectively.
The classification results in the first test area using the proposed method are presented in Figure 8a.For comparison analysis, the classified results using CL-DT, SAE [1,49], CART and RFs [50] methods are depicted in Figure 8b-e, respectively.Furthermore, their differences are marked by black circles in Figure 8.In addition, the distribution of exposed carbonatite is shown in Figure 9.

Experiments on the Second Test Area
Similarly, the classification results in the second test area using the proposed method and the CL-DT, SAE, DT, and RFs methods are shown in Figure 10a-e, respectively.Furthermore, their differences are marked with black circles in Figure 10.In addition, the distribution of exposed carbonatite in the second test area is shown in Figure 11.

Experiments on the Second Test Area
Similarly, the classification results in the second test area using the proposed method and the CL-DT, SAE, DT, and RFs methods are shown in Figure 10a-e, respectively.Furthermore, their differences are marked with black circles in Figure 10.In addition, the distribution of exposed carbonatite in the second test area is shown in Figure 11.

Experiments on the Third Test Area
To obtain a similar amount of bands with Landsat-5, the blue channel was simulated by the 21st-34th bands of the hyperspectral dataset, the green channel was simulated by the 34th-42nd bands, the red channel was simulated by the 57th-70th bands, and the near-infrared channel was simulated by the 84th-113th bands (see Figure 12).Through the methods mentioned in Section 3.2, VC, TEX, and PCA can be obtained.The spatial attribute is also described in X and Y coordinates.
According to the training samples given by NCALM, the instances are considered as the following five classes: vegetation (VG), building (BD), soil (SD), parking lot (PL), and road (RD).The number of samples, containing training and validation samples, is 300 for each class.

Experiments on the Third Test Area
To obtain a similar amount of bands with Landsat-5, the blue channel was simulated by the 21st-34th bands of the hyperspectral dataset, the green channel was simulated by the 34th-42nd bands, the red channel was simulated by the 57th-70th bands, and the near-infrared channel was simulated by the 84th-113th bands (see Figure 12).Through the methods mentioned in Section 3.2, VC, TEX, and PCA can be obtained.The spatial attribute is also described in X and Y coordinates.According to the training samples given by NCALM, the instances are considered as the following five classes: vegetation (VG), building (BD), soil (SD), parking lot (PL), and road (RD).The number of samples, containing training and validation samples, is 300 for each class.
Similarly, the threshold of distance Dθ and density ratio RDθ are 5 m and 6/7, respectively.The final rules are shown in Figure 13.The classification results in the third test area using the MVU-based CL-DT, CL-DT, SAE, DT, and RFs methods are shown in Figure 14a-e, respectively.

Classification Accuracy Comparison
To analyze the classification accuracy of the proposed MVU-based CL-DT method compared to other methods, such as SAE, CL-DT, CART, and RFs, this paper takes the classification results from the SAE method as "true values" and conducts statistical analysis using ENVI 4.8 software.Compared to the "true values," the relative difference (RD) of classification results and relative accuracy (RA) for test areas 1, 2, and 3 are shown in Tables 4-6, respectively.To further analyze the classification accuracy, the proportions of the five classes are statistically analyzed and shown in Figure 15.As observed from Figure 15a, the proportions of each class classified by SAE and MVU-based CL-DT are very close.For example, their smallest difference is approximately 0.2% for WT, and their largest difference is 1.1% for EC in the first test area.
As shown in Tables 4-6 and Figure 15, the proposed MVU-based CL-DT has the highest classification accuracy relative to the "true value" and that the CART has the worst accuracy.
To further explain the impact of the geospatial relationship for classification accuracy, spectral curves and magnified windows of classification results of MVU-based CL-DT, CL-DT, CART, and RFs are employed for visual verification (see Figures 16 and 17).As observed from Figure 16, although the walnut, pine, and grass families are three different features, their spectral curves are very similar.One of them, pine (masson pine), is the main natural vegetation in the first test area, which is represented as "VG" in the classification results.The grass family (for example, rice) and walnut are the main crops, which are represented as "CL" in classification results.Because their spectral characteristics are so close, the CART method misclassifies "VG" and "CL."For example, as observed from Figure 17  The experimental results above have demonstrated that the proposed MVU-based CL-DT method is capable of effectively decreasing misclassification, which is especially caused by images with different objects but the same spectra, and has a higher classification accuracy than CART and CL-DT.The experimental results above have demonstrated that the proposed MVU-based CL-DT method is capable of effectively decreasing misclassification, which is especially caused by images with different objects but the same spectra, and has a higher classification accuracy than CART and CL-DT.The experimental results above have demonstrated that the proposed MVU-based CL-DT method is capable of effectively decreasing misclassification, which is especially caused by images with different objects but the same spectra, and has a higher classification accuracy than CART and CL-DT.In addition, to further demonstrate the advantages and accuracy improvement in classification for the proposed method, independent ground truth samples from Google Earth and confusion matrixes (Tables 7-21) are employed to create the producer accuracy (PA), user accuracy (UA), over-accuracy (OA), average accuracy (AA), and Kappa coefficient (KC).The comparison analyses of five types of methods are presented in Table 22.In addition, to further demonstrate the advantages and accuracy improvement in classification for the proposed method, independent ground truth samples from Google Earth and confusion matrixes (Tables 7-21) are employed to create the producer accuracy (PA), user accuracy (UA), over-accuracy (OA), average accuracy (AA), and Kappa coefficient (KC).The comparison analyses of five types of methods are presented in Table 22.As mention in Section 1, the major shortcoming of the CL-DT method is that it only considers the Euclidean distance of neighborhood instances, which cannot truly reflect the real co-location relationship between instances for which there is a nonlinear distribution in a high-dimensional space.It has been demonstrated that MVU can preserve the original neighborhood relationship of instances which are with nonlinear distribution in a high-dimensional space [30].Obviously, the proposed MVU-based CL-DT method takes advantage of the characteristics of MVU to obtain the unfolded distance of instances.Therefore, the proposed method is capable of in fact ensuring that the co-location relationship of instances is real.As a consequence, the proposed method overcomes the shortcoming of CL-DT method, and, meanwhile, inherits all advantages of the CL-DT method.After using accuracy, which is assessed by testing the induced DT with new data sets and then comparing the predicted classes with the real classes, to evaluate the quality of DT, it is shown that the proposed MVU-based CL-DT can create a higher accuracy of DT than both the CL-DT does and traditional DT does because it uses unfolded distances to determine the co-locations.

Comparison Analysis for Parameters and Time-Consumption
To further evaluate the advantages of the proposed method, the induced DT parameters and computational time are compared to those from MVU-based CL-DT, CL-DT, CART, RFs, and SAE below.

• Comparison of the induced decision tree parameters
The DT parameters induced by the proposed method, CL-DT method, CART, and RFs are compared, and the results are listed in Table 23.As shown in Table 23, the total number of nodes, number of leaf nodes, and number of levels from the proposed method, when compared to CL-DT, decrease by 48%, 45%, and 25%, respectively.The total number of nodes, number of leaf nodes, and number of levels of the proposed method, when compared to CART, decrease by 56%, 54%, and 33%, respectively.The number of trees to RFs is 100, while to MVU-base CL-DT is 1.This means that the proposed MVU-based CL-DT algorithm is able to construct a simpler DT than the others.• Comparison of the computational time The running time of the proposed method consists of two parts, i.e., the pre-processing time (unfolding data sets) and the time of the DT construction.In the pre-processing phase, several minutes are taken to unfold the original dataset (i.e., large images).However, in the DT construction phase, the time can be largely decreased, which has been reported in Table 24.As observed in Table 24, the time taken for rule generation decreases by 43% and 59%, respectively, when compared to CL-DT and the CART.Moreover, the total running time of SAE is 113 s.Additionally, RFs needs 33 s to do parameterization and more than 300 s to classify images.This result demonstrates that the proposed method has faster computational speed in constructing decision trees than the other methods.
Through the comparison of complexity indicated to the shape and size of a DT, it is demonstrated that the proposed MVU-based CL-DT can create a simpler and better DT than both the CL-DT does and traditional DT does.

Classification Accuracy Validation in the Field
For the first test area, we conducted a field validation of the classification accuracy in Guanyang County, Guilin, China.We use a Magellan 210 GPS to collect 20 field validation samples of exposed carbonatite, including their longitudes and latitudes (Table 25).These field validation samples are used to validate the classification results obtained by the proposed method.Only two samples, samples 1 and 2, do not match the classified results, yielding an error proportion of 10%, i.e., a 90% correctness ratio.

Conclusions
The primary contribution of this research is the proposal of the maximum variance unfolding (MVU)-based co-location decision tree (MVU-based CL-DT) algorithm.The algorithm overcomes the deficiency of the traditional CL-DT method, in which the Euclidean distance of instances that are nonlinear distributions in high-dimensional space cannot accurately reflect the co-location relationship between instances.With the experimental results and comparison analysis in Section 4, it can be concluded that the proposed method is capable of effectively and successfully overcoming the classification problems, i.e., "the different objects but equivalent/similar spectrum" issue, that the remote sensing community has traditionally encountered.
This paper has provided detailed descriptions and steps for the algorithms.First, the MVU algorithm is utilized to unfold the input data, and the unfolded distances between instances in the unfolded data are calculated.Second, according to the unfolded distances, the R-relationship (RRS) between instances is determined.Third, MVU-based co-location instances are found on the basis of the RRS, the distinct events are determined, and the MVU-based co-location rules are generated.Finally, the MVU-based co-location rules are merged into the DT to generate the MVU-based co-location decision rules.
The proposed method has been used to classify remotely sensed imagery in three test areas.When compared to stacked auto-encoders (SAE), whose classification results are taken as "true values," CL-DT, traditional DT (classification and regression tree, i.e., CART), and random forests (RFs), the results herein have demonstrated that: (1) the proposed method has better classification accuracy, with relative accuracies of 91.07%, 91.54%, and 96.39% in the three test areas (with the CL-DT method achieving 83.62%, 85.28% and 92.23%, the CART yielding only 74.68%, 74.97% and 84.11% and RFs obtaining 90.57%, 90.92% and 95.38%); and (2) the proposed method can produce a better tree because the total number of nodes, the number of leaf nodes, and the number of levels of the proposed method decrease by 48%, 45%, and 25%, respectively, compared to the CL-DT method and decrease by 56%, 54%, and 33% compared to the CART method.The computational time taken for data processing, decision tree construction, and rule generation decreases by 25%, 38%, 44%, and 43%, respectively, compared to CL-DT and decreases by 50%, 55%, 55%, and 60% compared to the CART.
With the calculation results of the confusion matrix using the validation samples, it can be concluded that: (1) the over-accuracies (OAs) for MVU-based CL-DT reach 91.4%, 90.1% and 87.34% and the Kappa coefficients are 0.89, 0.87 and 0.84 in the three test areas; (2) the OAs for SAE achieve 96.24%, 95.86% and 90.66% and the Kappa coefficients are 0.95, 0.94 and 0.88 in the three test areas; (3) the OAs for CL-DT achieve 82.84%, 81.80% and 82.0% and the Kappa coefficients are 0.79, 0.77 and 0.78 in the three test areas; (4) the OAs for DT reach 80.04%, 82.04% and 76.01% and the Kappa coefficients are 0.75, 0.78and 0.70 in the three test areas; and (5) the OAs for RFs achieve 90.18%, 90.24% and 86.67% and the Kappa coefficients are 0.88, 0.88 and 0.83 in the three test areas.
The proposed method combines the MVU algorithm, co-location mining algorithm, and decision tree algorithm.On the one hand, the proposed method can produce a better tree than traditional DT and obtain satisfactory classification results; on the other hand, it appears that the proposed algorithm is somewhat complicated.People are required to have professional knowledge in the three fields to understand the details of the proposed algorithm.

Figure 1 .
Figure 1.(a) Data of a Swiss roll; and (b) the result of mapping the Swiss roll data by MVU.(Note: Figure 1 is based on the original figure of Weinberger and Saul [31])

Figure 1 .
Figure 1.(a) Data of a Swiss roll; and (b) the result of mapping the Swiss roll data by MVU.(Note: Figure 1 is based on the original figure of Weinberger and Saul [31]).

2. 3 . 1 .
MVU Unfolded Distance Algorithm The MVU unfolded distance algorithm consists of the following three steps: (a) neighbor relation matrix reservation; (b) MVU function establishment and solution; and (c) calculation of the unfolded distance between instances.They are detailed as follows.

2. 3 .
MVU-Based Co-Location Mining Rules 2.3.1.MVU Unfolded Distance Algorithm The MVU unfolded distance algorithm consists of the following three steps: (a) neighbor relation matrix reservation; (b) MVU function establishment and solution; and (c) calculation of the unfolded distance between instances.They are detailed as follows.

Figure 3 .
Figure 3. Determination of the unfolded distance between instances.Figure 3. Determination of the unfolded distance between instances.

Figure 3 .
Figure 3. Determination of the unfolded distance between instances.Figure 3. Determination of the unfolded distance between instances.

Algorithm 1 .
The Algorithm of MUD 1. Input: The number of nearest neighbors k; Original data set X; An N × N zero matrix δ 2. Output: Low-dimensional representation Y; An N × N binary matrix δ ; An Unfolded distance matrix U 3. Process: 4. Step 1: Construct the neighbor graph. 5.
is the i-th instance, similar AttrN (Y i , Y j ) is the number of attributes for which the values of Y i and Y j satisfy the same threshold, and Total AttrN is the total number of attributes used in this paper.If DR is greater than or equal to the threshold DR θ , (Y i , Y j ) is a 2nd-order MVU-based co-location pattern.When the 2nd-order MVU-based co-location patterns are determined as above, the k (k ≥ 3)-order MVU-based co-location patterns should be constructed.The method for constructing the k (k ≥ 3)-order MVU-based co-location patterns is as follows.Let the (k-1)th (k ≥ 3)-order MVU-based co-location pattern C k-1 connect the (k-2)th-order C k-2 to generate the k-th-order candidate of the MVU-based co-location pattern C k , and if C k satisfies the threshold of DR, it is regarded as the k-order MVU-based co-location pattern C k .Take the constructed process of the 3rd-order MVU-based co-location (A 1 , A 6 , A 11 ) in Figure4as an example.First, (A 1 , A 6 ) connects (A 6 , A 11 ) using the previous k-2 order, and then, the candidate of the MVU-based co-location pattern (A 1 , A 6 , A 11 ) is obtained.If the DR of (A 1 , A 6 , A 11 ) satisfies the threshold DR θ , (A 1 , A 6 , A 11 ) is an MVU-based co-location pattern.By parity of reasoning, all MVU-based co-location patterns are obtained.
k-1)th (k ≥ 3)-order MVU-based co-location pattern Ck-1 connect the (k-2)th-order Ck-2 to generate the k-th-order candidate of the MVU-based co-location pattern C′k, and if C′k satisfies the threshold of DR, it is regarded as the k-order MVU-based co-location pattern Ck.Take the constructed process of the 3rd-order MVU-based co-location (A1, A6, A11) in Figure4as an example.First, (A1, A6) connects (A6, A11) using the previous k-2 order, and then, the candidate of the MVU-based co-location pattern (A1, A6, A11) is obtained.If the DR of (A1, A6, A11) satisfies the threshold DRθ, (A1, A6, A11) is an MVU-based co-location pattern.By parity of reasoning, all MVU-based co-location patterns are obtained.

Figure 4 .
Figure 4. Determination of MVU-based co-locations (Note: Attribute-k represents the k-th attribute, and Ai represents the i-th instance of object A. In each of the attributes, the location of instance Ai is the same).

Figure 4 .
Figure 4. Determination of MVU-based co-locations (Note: Attribute-k represents the k-th attribute, and A i represents the i-th instance of object A. In each of the attributes, the location of instance A i is the same).

Algorithm 2 .
the Algorithm of the determination of MCL 1. Input: Unfolded distance matrix D U ; Unfolded dataset Y; Thresholds of distance D θ , and thresholds of density radio θ ; 2. Valuable: ζ: the order of a co-location pattern; R: the set of R-relationship between instances; C_ζ: the set of candidates of a co-location pattern whose size is ζ; Tab_ζ: the set of co-location patterns; Rul_ζ: the ζ-order co-location rule.3. Output: ζ-order co-location pattern; Rul_ζ.4. Process: 5. Step 1: 6.

Algorithm 3 .
the Algorithm of MVU-based CL-DT 1. Input: Training dataset TD; The threshold of unfolded distance D θ ; The thresholds of density radio θ ; Splitting criterion; The threshold of the terminal node; 2. Output: An MVU-based CL-DT with multiple condition attributes.3. Process: 4. Step 1: The algorithm of MUD: function MUD; 5. Step 2: The algorithm of the determination of MCL: function MCL; 6. Step 3: Judge whether co-locations are distinct event types; 7. Step 4: Build an initial tree; 8.

15 .
Step 8: Apply the algorithm recursively to each of the not-yet-stopped nodes; 16.Step 9: Generate decision rules by collecting decisions driven in individual nodes; 17.Step 10: The decision rules generated in Step 7 are used as the initialization of co-location mining rules.Apply the algorithm of the co-location mining rule to generate new associate rules.18. Step 11: Re-organize the input data set and repeat Step 2 through 19.Step 8 until the classified results by the colocation mining rule and decision tree (rules) are consistent.

( 1 )
The first test area and dataset: The first test area is located from 24.25 • to 26.38 • north latitude and 109.60 • thru 111.48 • east longitude, covering the entire city of Guilin, Guangxi Province, China.The area covers approximately 27,200 km 2 .The test area is a typical karst plain landform in which there are many exposed carbonate rocks.The images of Landsat-5 TM were acquired at local time on 21 September 2006.The spatial resolution of three visible bands, one near infrared band, one middle infrared band, and one far infrared are 30 m, and the spatial resolution of one thermal infrared band is 120 m.The Landsat-5, with the main detector TM, is operated at a sun-synchronous, 705-km orbit height with a 16-day revisit cycle.(2)The second test area and dataset: The second test area is located from 23.78 • to 24.58 • north latitude and 107.85 • thru 108.5 • east longitude, covering Du'an County of the City of Hechi, Guangxi Province, China.The area covers nearly 4459 km2 .The test area is a typical karst rocky desertification area in which exposed carbonate rocks are widespread.The images of Landsat-5 TM were acquired at local time on 30 January 2009, and were ordered from the website http://www.gscloud.cn/.

Figure 5 .
Figure 5.The flowchart of the experiment.

Figure 5 .
Figure 5.The flowchart of the experiment.

Figure 6 .
Figure 6.(a) The results of the determination of table instances; (b) the results are amplified two times; and (c) the results are amplified 40 times.

Figure 6 .
Figure 6.(a) The results of the determination of table instances; (b) the results are amplified two times; and (c) the results are amplified 40 times.

Figure 7 .
Figure 7.The final rules after verification and post-processing.

Figure 7 .
Figure 7.The final rules after verification and post-processing.

Figure 8 .
Figure 8.The results of classification in the first test area by: (a) the proposed method; (b) co-location decision tree (CL-DT); (c) stacked auto-encoders (SAE); (d) classification and regression tree (CART); and (e) random forests (RFs).

Figure 8 .
Figure 8.The results of classification in the first test area by: (a) the proposed method; (b) co-location decision tree (CL-DT); (c) stacked auto-encoders (SAE); (d) classification and regression tree (CART); and (e) random forests (RFs).

Figure 9 .
Figure 9.The extraction results of exposed carbonatite in the first test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.Figure 9.The extraction results of exposed carbonatite in the first test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.

Figure 9 .
Figure 9.The extraction results of exposed carbonatite in the first test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.Figure 9.The extraction results of exposed carbonatite in the first test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.

Figure 10 .
Figure 10.The results of classification in the second test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.

Figure 10 .Figure 11 .
Figure 10.The results of classification in the second test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.

Figure 11 .
Figure 11.The extraction results of exposed carbonatite in the second test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.

Figure 12 .
Figure 12.False color image of the 3rd test area.

Figure 13 .
Figure 13.The final rules of the third test area (R: NIR channel; G: red channel; B: green channel).

Figure 12 .
Figure 12.False color image of the 3rd test area.

Figure 12 .
Figure 12.False color image of the 3rd test area.

Figure 13 .
Figure 13.The final rules of the third test area (R: NIR channel; G: red channel; B: green channel).Figure 13.The final rules of the third test area (R: NIR channel; G: red channel; B: green channel).

Figure 13 .Figure 14 .
Figure 13.The final rules of the third test area (R: NIR channel; G: red channel; B: green channel).Figure 13.The final rules of the third test area (R: NIR channel; G: red channel; B: green channel).

Figure 14 .
Figure 14.The results of classification in the third test area by: (a) the proposed method; (b) CL-DT; (c) SAE; (d) CART; and (e) RFs.
, the classification results of SAE and MVU-based CL-DT are the closest to the ground truth.The classification results of the CART method are significantly different from the reality in the remotely sensed imagery.

Figure 15 .
Figure 15.Comparison of the proportion of classes retrieved by different methods in: the first test area (a); second test area (b); and third test area (c).

Figure 16 .
Figure 16.Spectral curves of features (note: these spectral curves of features are from the spectral libraries of ENVI 4.8).

Figure 15 .
Figure 15.Comparison of the proportion of classes retrieved by different methods in: the first test area (a); second test area (b); and third test area (c).

Figure 15 .
Figure 15.Comparison of the proportion of classes retrieved by different methods in: the first test area (a); second test area (b); and third test area (c).

Figure 16 .
Figure 16.Spectral curves of features (note: these spectral curves of features are from the spectral libraries of ENVI 4.8).

Figure 16 .
Figure 16.Spectral curves of features (note: these spectral curves of features are from the spectral libraries of ENVI 4.8).

Figure 17 .
Figure 17.(a-m) Magnified windows of classification results in the first test area: (g) the classification results of SAE; (b,h) ground truth; and magnified windows of the classification results of: (a,i) MVU-based CL-DT; (d,j) CL-DT; (c,k) SAE; (f,l) CART; and (e,m) RFs.

Figure 17 .
Figure 17.(a-m) Magnified windows of classification results in the first test area: (g) the classification results of SAE; (b,h) ground truth; and magnified windows of the classification results of: (a,i) MVU-based CL-DT; (d,j) CL-DT; (c,k) SAE; (f,l) CART; and (e,m) RFs. ) , D U (Y 1 , Y 10 ) is the shortest cumulative Euclidean distance of a number of neighbors (such as Y 1 , Y 3 , Y 5 , Y 6 , Y 9 , and Y 10 ).Because the output data of MVU can be viewed as a connected graph and two arbitrary instances can be connected by a number of intermediate k-nearest neighbors, the shortest path can always be found.

Table 2 .
The global Moran's I of training samples under the neighborhood search distance of 20,000 m by ArcMap.

Table 2 .
The global Moran's I of training samples under the neighborhood search distance of 20,000 m by ArcMap.

Table 3 .
Determination of table instances.

Table 3 .
Determination of table instances.

Table 4 .
The relative difference (RD) of classification results and relative accuracy (RA) in the 1st test area.

Table 5 .
The RD of classification results and RA in the 2nd test area.

Table 6 .
The RD of classification results and RA in the 3rd test area.

Table 7 .
Confusion matrix of the proposed method in the first area.

Table 8 .
Confusion matrix of the proposed method in the second area.

Table 9 .
Confusion matrix of the proposed method in the third area.

Table 7 .
Confusion matrix of the proposed method in the first area.

Table 8 .
Confusion matrix of the proposed method in the second area.

Table 9 .
Confusion matrix of the proposed method in the third area.

Table 10 .
Confusion matrix of the CL-DT in the first area.

Table 11 .
Confusion matrix of the CL-DT in the second area.

Table 12 .
Confusion matrix of the CL-DT in the third area.

Table 13 .
Confusion matrix of the CART in the first area.

Table 14 .
Confusion matrix of the CART in the second area.

Table 15 .
Confusion matrix of the CART in the third area.

Table 16 .
Confusion matrix of the SAE in the first area.

Table 17 .
Confusion matrix of the SAE in the second area.

Table 18 .
Confusion matrix of the SAE in the third area.

Table 19 .
Confusion matrix of the RFs in the first area.

Table 20 .
Confusion matrix of the RFs in the second area.

Table 21 .
Confusion matrix of the RFs in the third area.

Table 22 .
Comparison of over accuracy (OA) and Kappa coefficient (KC) in the three test areas.

Table 23 .
Comparison of the induced decision tree parameters.

Table 24 .
Comparison of the computational time of constructing the DT.