Refinement of Hyperspectral Image Classification with Segment-Tree Filtering

This paper proposes a novel segment-tree filtering method to improve the classification accuracy of hyperspectral images (HSIs). Segment-tree filtering is a versatile method that incorporates spatial information and has been widely applied in image preprocessing. However, to use this powerful framework in HSI classification, we must reduce the original feature dimensionality to avoid the Hughes phenomenon; otherwise, the computational costs are high and the classification accuracy obtained from the original HSI bands is unsatisfactory. Therefore, feature extraction is adopted to produce new salient features. In this paper, the Semi-supervised Local Fisher (SELF) discriminant analysis method is used to reduce HSI dimensionality. Then, a tree-structure filter that adaptively incorporates contextual information is constructed. Additionally, an initial classification map is generated using multi-class support vector machines (SVMs), and segment-tree filtering is conducted using this map. Finally, a simple Winner-Take-All (WTA) rule is applied to determine the class of each pixel in an HSI based on the maximum probability. The experimental results demonstrate that the proposed method can improve HSI classification accuracy significantly. Furthermore, a comparison between the proposed method and current state-of-the-art methods, such as Extended Morphological Profiles (EMPs), Guided Filtering (GF), and Markov Random Fields (MRFs), suggests that our method is both competitive and robust.


Introduction
Hyperspectral image (HSI) classification is important for urban land use monitoring, crop growth monitoring, environmental assessment, etc. Various machine learning algorithms that process high-dimensional data can be employed in pixel-wise classification, such as Support Vector Machines (SVMs) [1], Logistic Regression [2,3], and Artificial Neural Networks (ANNs) [4]. However, these conventional approaches do not consider spatial HSI information between neighboring pixels, which can lead to noisy classification output. Including the spatial relationships between pixels can enhance the classification accuracy. For example, there is a high probability that a pixel shares the same class as its neighboring pixels if the similarity measure between them is high. Otherwise, if the similarity measure is low, this probability decreases. Therefore, HSI classification could be improved further by combining spatial and spectral features.
Many spatial-spectral methods have been proposed to incorporate spatial or contextual information. For example, spatial information was represented using Markov Random Fields (MRFs) in [5,6], and classification has been performed using α-Expansion [7] and Belief Propagation [8], which are commonly used algorithms in MRF optimization. Another method, Extended Morphological Profiles (EMPs), is presented in [9]. After the first two principal components of the HSI are computed using Principal Component Analysis (PCA), spatial features are extracted by morphological operations and concatenated with the spectral information for HSI classification. As morphological operations such as opening and closing involve neighboring pixel calculations, contextual information is naturally utilized in this manner. Another way of employing contextual information is texture analysis. In [10], a Gray Level Co-occurrence Matrix (GLCM) is used to extract this type of contextual information, which is then concatenated with the spectral features used for classification. In [11], segmentation based on a minimum spanning tree method is employed to represent spatial information, and majority voting is used to assign a class label to each region. Similar to [11], methods based on segmentation [12-15] have attracted increased attention because they produce satisfactory results. However, algorithms based on hard segmentation have some drawbacks. For example, they assume that all pixels in the same region are homogeneous. After segmentation is completed, the relationship between pixels in different regions is fully disconnected; thus, if the segmentation is incorrect, the accuracy decreases dramatically. Although an over-segmentation approach is applied in [11,15] to improve the similarity within a region, the computational complexity increases considerably. Therefore, these algorithms are not efficient because of the complex voting processes in thousands of regions. In [16], super-pixel segmentation is applied to feature extraction, and classification is then conducted in a novel framework via multiple kernels, which avoids voting; however, the super-pixel method still requires over-segmentation.
To make classification more efficient, Edge-Aware Filtering and Edge-Preserving Filtering (EAF and EPF) methods [17,18] can be applied. These methods have been adopted successfully in many computer vision applications, such as stereo matching [19], optical flow [20], and image fusion [21]. Unlike image segmentation, the most prominent merit of the EAF method is that it generates smooth output in homogeneous image areas while adaptively preserving boundaries in inhomogeneous image areas, even in challenging situations. In this paper, we implement a tree-structure EAF for HSIs that is based on the segment-tree algorithm [22,23] and combines the advantages of segmentation and EAF. Unlike other EAF methods, this scheme does not require a window size to be set. It is difficult to establish a proper window size for bilateral and guided filters [17], largely because objects of interest display their most prominent features at different scales. Another merit of this scheme is that the segment-tree filter is more efficient than other EAFs because of its tree structure [24]. By traversing the tree in two sequential passes, from the leaves to the root and from the root to the leaves, every pixel in the HSI can be filtered and labeled.
However, the Segment-Tree Filter cannot be applied to the original HSI directly, because the computational cost is extremely high and the Hughes phenomenon makes the classification accuracy unsatisfactory. In addition, hyperspectral bands that are contaminated by noise may destroy the true connection between neighboring pixels. To avoid these problems, several studies describe how to select bands from the original HSI [25,26] or how to produce new salient features. In this paper, the Semi-supervised Local Fisher (SELF) discriminant analysis method [27] is employed to reduce dimensionality. SELF is used for feature reduction because it retains both prior knowledge from the training set and the statistical distribution of clusters, unlike methods such as PCA, Linear Discriminant Analysis (LDA)/Fisher Discriminant Analysis (FDA) [28], and Local FDA (LFDA) [29]. In practice, segmentation based on the SELF features is less sensitive to the number of training samples. Additionally, we can construct the Segment-Tree Filter using a limited number of bands, which reduces the computational cost of segmentation. Because the extracted SELF bands reduce the effect of noise, a limited number of bands can represent the inherent spatial structure of the image well, which can lead to better output.
The remainder of this paper is organized as follows. In Section 1, we discuss some related methods and processes, including initial classification, SELF, graph-based Segment-Tree Filter construction, and filtering. In Section 2, the proposed HSI classification scheme is described in detail. The experimental results are presented in Section 3. Finally, we draw our conclusions and present our outlook for future research in Section 4.

HSI Classification Refinement Using Segment-Tree Filtering
A schematic diagram of the proposed method is shown in Figure 1.

Step 1: Construct the Segment-Tree Filter, which involves feature extraction using the SELF method followed by building a tree-structure filter for the HSI from the dimensionality-reduced features.
Step 2: Use a Multi-class SVM method to obtain the initial classification map.
Step 3: Perform Segment-Tree Filtering on the pixel-based initial classification map produced by the Multi-class SVM. By combining this initial classification map and the Segment-Tree Filter, we can incorporate spatial information and spectral features adaptively. Finally, the HSI classification map is derived from the result of Segment-Tree Filtering.

Initial Classification
In this paper, a Multi-class SVM classifier is adopted in the initial classification step. SVMs are widely used classifiers in remote sensing image classification. They are supervised learning models, formulated for binary problems, that can be used for classification and regression. We utilize the Multi-class SVM from the LIBSVM library with a radial basis function (RBF) kernel. In this method, a "one against one" strategy [30] is employed to extend the binary SVM to multi-class cases. The penalty parameter C and the kernel spread gamma are determined by cross-validation.
In most cases, the output of the initial classification is a probability map, which can be represented as a tensor M of size H × W × S [18] (Equation (1)), where (i, j) is the position of sample p in the image; H and W are the height and width of the image, respectively; k is the label of the sample; S is the total number of classes in the classification; and $M_p^k$ is the probability that sample p belongs to the kth class. With the Multi-class SVM classifier used here, which assigns a single label to each sample, $M_p^k$ is either 0 or 1 depending on whether the sample belongs to the kth class.
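As a minimal sketch of this representation, the following Python snippet builds the H × W × S tensor from a hard label map. The function name and the 0-based class indices are illustrative assumptions; they are not taken from the authors' C++ implementation.

```python
def one_hot_probability_map(label_map, num_classes):
    """Turn an H x W map of hard labels into an H x W x S tensor of 0/1 values.

    label_map[i][j] is the (0-based, assumed) class index predicted for the
    pixel at row i, column j; num_classes is S.
    """
    return [
        [
            [1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in row
        ]
        for row in label_map
    ]


# Example: a 2 x 2 label map with S = 3 classes.
M = one_hot_probability_map([[0, 2], [1, 1]], num_classes=3)
```

Each pixel's vector sums to 1, so the tensor can be filtered in the same way as a soft probability map.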

Semi-Supervised Local Fisher Discriminant Analysis
In this section, we briefly review SELF. The SELF method seeks an embedding transformation such that the local between-class scatter is maximized and the local within-class scatter is minimized in both the training and test sets. We assume that the HSI X has m hyperspectral bands and n samples, and that the training set X′ has n′ samples. Then, the test set has n − n′ samples. There are three pre-defined input parameters: the trade-off parameter β, the dimensionality of the reconstruction space r, and the KNN parameter K. The five steps in this process are as follows:
1. The local scaling coefficient $\sigma_i$ is pre-computed for each sample in the training set; it is equal to the Euclidean distance between the sample $x_i$ and its Kth nearest neighbor $x_i^K$ among all samples in both the training and test sets.
2. The local between-class weight matrix $W^{lb}$ and the local within-class weight matrix $W^{lw}$ are computed as in Equations (4) and (5), respectively. In this step, if two samples have the same label in the training set, $\sigma_i$ is used to scale the local geometric structure with heat kernel weighting.
3. The local between-class scatter matrix $S^{lb}$ and the local within-class scatter matrix $S^{lw}$ are calculated by Equations (6) and (7), in which $1_n$ is an n × 1 column vector of ones. Steps 1-3 are the same as those used in the LFDA procedure. In our procedure, only samples in the training set have been used at this point, and the statistical distribution of clusters has not been assessed or applied.
4. The covariance matrix $S_t$ is computed from all samples in both the training and test sets, using the mean $\bar{x}$ of all samples. Then, the regularized local between-class scatter matrix $S^{rlb}$ and the regularized local within-class scatter matrix $S^{rlw}$ are derived by Equations (9) and (10), respectively. β is the trade-off parameter that balances prior knowledge from the training set against the statistical distribution of clusters; therefore, SELF maintains the advantages of both LFDA and PCA. Note that $I_m$ is an m × m identity matrix.
5. The transformation matrix T is computed by generalized eigenvalue decomposition. T consists of weighted eigenvectors corresponding to the r largest eigenvalues. After T is determined, all the pixels in the HSI can be reprojected to a new low-dimensional space.
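For reference, the five steps can be written out using the standard SELF and LFDA formulation from the literature cited as [27,29]. This is a sketch under that assumption: the expressions below are not recovered from this paper, and their correspondence to Equations (4)-(10), as well as the symbols $A_{ij}$, $n'_{y_i}$, $\varphi$, and $\lambda$, are assumptions introduced here.

```latex
% Local scaling and heat-kernel affinity (steps 1-2)
\sigma_i = \lVert x_i - x_i^{K} \rVert, \qquad
A_{ij} = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^{2}}{\sigma_i \sigma_j} \right)

% Local between-class and within-class weights (step 2);
% n'_{y_i} is the number of training samples in the class of x_i
W^{lb}_{ij} =
\begin{cases}
A_{ij}\left( \dfrac{1}{n'} - \dfrac{1}{n'_{y_i}} \right) & \text{if } y_i = y_j \\[2ex]
\dfrac{1}{n'} & \text{if } y_i \neq y_j
\end{cases}
\qquad
W^{lw}_{ij} =
\begin{cases}
\dfrac{A_{ij}}{n'_{y_i}} & \text{if } y_i = y_j \\[2ex]
0 & \text{if } y_i \neq y_j
\end{cases}

% Local scatter matrices over the training samples (step 3)
S^{lb} = \frac{1}{2} \sum_{i,j=1}^{n'} W^{lb}_{ij} (x_i - x_j)(x_i - x_j)^{\top}, \qquad
S^{lw} = \frac{1}{2} \sum_{i,j=1}^{n'} W^{lw}_{ij} (x_i - x_j)(x_i - x_j)^{\top}

% Total scatter over all samples and regularization (step 4)
S_t = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}, \qquad
S^{rlb} = (1 - \beta)\, S^{lb} + \beta\, S_t, \qquad
S^{rlw} = (1 - \beta)\, S^{lw} + \beta\, I_m

% Generalized eigenvalue problem and transformation matrix (step 5)
S^{rlb} \varphi_k = \lambda_k\, S^{rlw} \varphi_k, \quad
\lambda_1 \ge \dots \ge \lambda_m, \qquad
T = \left( \sqrt{\lambda_1}\, \varphi_1 \;\middle|\; \cdots \;\middle|\; \sqrt{\lambda_r}\, \varphi_r \right)
```

The pairwise sums for the scatter matrices are equivalent to the matrix form $X'(\mathrm{diag}(W 1_{n'}) - W)X'^{\top}$, which is presumably where the vector of ones in step 3 enters.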
The parameter β plays an important role in the algorithm. When it is relatively small (e.g., β = 0.01), SELF is nearly identical to LFDA, and SELF becomes increasingly similar to PCA as β approaches 1. β balances prior knowledge regarding the labels in the training set and the statistical distribution of clusters in the test set. In our experiment, we set β to 0.6, as we found that this value fully utilizes the advantages of the algorithm.
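The two limits can be illustrated with a short sketch. It assumes the regularization used in the standard SELF formulation, in which the local between-class scatter is blended with the total scatter and the local within-class scatter is blended with the identity matrix; the function names and the list-of-lists matrix representation are hypothetical.

```python
def blend(a, b, beta):
    """Element-wise (1 - beta) * a + beta * b for two equally sized matrices."""
    return [
        [(1.0 - beta) * x + beta * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def regularize_scatter(s_lb, s_lw, s_t, beta):
    """Return the regularized between-class and within-class scatter matrices.

    Assumed form: S_rlb = (1 - beta) * S_lb + beta * S_t and
                  S_rlw = (1 - beta) * S_lw + beta * I.
    """
    m = len(s_t)
    identity = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    return blend(s_lb, s_t, beta), blend(s_lw, identity, beta)
```

With beta = 0 the function returns the LFDA matrices unchanged, and with beta = 1 it returns the total scatter and the identity matrix, for which the generalized eigenproblem reduces to the ordinary eigenproblem of PCA.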
As a Semi-Supervised Learning (SSL) method, SELF can adapt based on the number of training samples. When the number of samples in a training set is small and prior knowledge about labels is limited and/or noisy, SELF can use the statistical distribution of clusters in the test set to offset these issues. We demonstrate this advantage in Figures 2 and 3. Figure 2 illustrates the reconstructed image of the Indian Pines dataset using PCA, LDA, LFDA, and SELF when the training samples account for only 1% of all samples. Figure 3 shows the same reconstructed image when the training percentage increases to 20%. In both figures, the color images (R, G, and B) are composed of the first three bands extracted using the corresponding feature-transformation methods. In Figure 2a, because the reconstruction based on LDA relies only on a small number of training samples, the image exhibits considerable noise and error. When the number of training samples increases, the LDA reconstruction improves (e.g., there are fewer fragments and more homogeneous areas in Figure 3a than in Figure 2a). Additionally, LFDA is more robust than LDA, even if the number of training samples is limited. However, as the number of samples increases, numerous false edges and fragments can be observed in the reconstructed image based on LFDA, as shown in Figures 2b and 3b. The boundary of the purple trapezoid in the bottom-left portion of Figure 2b was correctly extracted; however, it was incorrectly extracted in Figure 3b. As shown in Figures 2d and 3d (see the red rectangular region), SELF is robust regardless of the number of samples in the training set. Conversely, PCA does not depend on the training set; therefore, Figures 2c and 3c display the same reconstruction result. However, the comparison between PCA and SELF in the rectangular region in Figure 3d shows that PCA creates more segments and fragments because it aligns with the boundaries of objects rather than classes, which causes more errors in subsequent processing steps compared to using SELF. In our experiment, the best reconstructed images based on SELF had 10 bands. Thus, the spectral dimension of the original HSI was dramatically reduced, but the most discriminatory information within the spectral bands was retained.

Segment-Tree Filter Construction
The image transformed using SELF is used as input for the graph-based Segment-Tree Filter. The implementation of Segment-Tree Filtering is based on the methodology presented in [22,23], which uses Kruskal's algorithm to construct a Minimum Spanning Tree (MST). The general workflow is summarized as follows. First, a graph G = {V, E} is constructed for an image (m × n pixels), where V is the set of vertices and each pixel is a vertex. E is the set of edges, each of which links a pixel to one of its four neighbors, and there are m(n − 1) + n(m − 1) edges in total. A weight $w_e$ is assigned to each edge in E to represent the dissimilarity between the linked vertices. Several dissimilarity measures, such as the L1-norm, L2-norm, L∞-norm, and Spectral Angle Mapper (SAM), have been proposed in the literature [11]. In our experiment, SAM is used as the dissimilarity measure.
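The graph construction can be sketched as follows. This is an illustration only: the function names, the row-major vertex numbering, and the handling of zero vectors are assumptions, and SAM is taken in its usual form as the angle between two feature vectors.

```python
import math


def spectral_angle(a, b):
    """Spectral Angle Mapper: the angle (in radians) between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # the angle is undefined for a zero vector; 0.0 is an arbitrary choice
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return math.acos(cosine)


def grid_edges(image):
    """Build the 4-neighbor graph of an image given as image[row][col] = feature vector.

    Returns a list of (weight, u, v) tuples, where u and v are row-major pixel indices.
    """
    rows, cols = len(image), len(image[0])
    edges = []
    for i in range(rows):
        for j in range(cols):
            u = i * cols + j
            if j + 1 < cols:  # horizontal neighbor
                edges.append((spectral_angle(image[i][j], image[i][j + 1]), u, u + 1))
            if i + 1 < rows:  # vertical neighbor
                edges.append((spectral_angle(image[i][j], image[i + 1][j]), u, u + cols))
    return edges
```

Linking each pixel only to its right and lower neighbors visits every 4-neighbor pair exactly once, which yields the edge count stated above.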

1. All the edges are sorted in ascending order according to their weights. This step can be performed efficiently using a quicksort algorithm [31], even if the number of edges is very large.
2. For each vertex, we initialize a tree $T_i(V_i, E_i)$.
3. A subtree is then built for each segment. Then, subtrees are merged based on the order of the sorted edges. Segment-Tree Filtering is a variant of the conventional MST approach that considers an extra criterion to merge trees [22,32], as shown in Equation (11), where $w_e$ is the weight of the edge between subtrees $T_p$ and $T_q$, $|T_p|$ is the number of vertices in the subtree $T_p$, and k is a constant. In our experiments, k is set to five times the standard deviation of all weights in the graph. If the criterion in Equation (11) is satisfied, subtrees $T_p$ and $T_q$ are merged. This criterion establishes a trade-off between the edge weights and the numbers of pixels in the subtrees. Initially, merging subtrees is easy because the number of pixels in each subtree is small. As the number of pixels increases, the criterion becomes increasingly rigorous; therefore, it is adaptive.
4. All the remaining edges that are not part of any subtree are sorted again. If the number of vertices in a subtree is smaller than a threshold $T_0$, then the subtrees linked by the edge are merged. In our experiment, $T_0 = 6$. This processing step is based on the improvement presented in [23] to omit small fragments caused by noise. The obtained subtrees are illustrated in Figure 4, in which each color represents an obtained subtree. As shown, the constructed subtrees can be used to segment HSIs adaptively.
5. Finally, subtrees are merged until all vertices are included in a single tree. Within each tree, connected vertices exhibit the highest similarity and are within the shortest possible distance.
As shown in Figure 5b, the edges of the final tree minimally cross the boundaries between two regions.
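The five steps above can be sketched with a union-find structure. The merge test used here is an assumption: it follows the graph-based segmentation criterion commonly used with segment trees, in which an edge of weight w merges two subtrees if w does not exceed the smaller of (largest internal edge weight + k / size) over the two subtrees. The class and function names are hypothetical.

```python
class DisjointSet:
    """Union-find that tracks the size and largest internal edge of each subtree."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n
        self.max_w = [0.0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        """Merge the subtrees with roots a and b through an edge of weight w."""
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.max_w[a] = max(self.max_w[a], self.max_w[b], w)


def build_segment_tree(num_vertices, edges, k, min_size):
    """Return (tree_edges, segments) for a graph given as (weight, u, v) tuples.

    segments[v] identifies the subtree of vertex v after step 4,
    before the subtrees are linked into one tree in step 5.
    """
    ds = DisjointSet(num_vertices)
    tree = []

    def sweep(candidates, may_merge):
        """Visit edges in order; merge where allowed and return the edges left over."""
        left = []
        for w, u, v in candidates:
            a, b = ds.find(u), ds.find(v)
            if a == b:
                continue  # the edge would close a cycle
            if may_merge(w, a, b):
                ds.union(a, b, w)
                tree.append((w, u, v))
            else:
                left.append((w, u, v))
        return left

    # Steps 1-3: sorted edges, one tree per vertex, adaptive merge criterion (assumed form).
    left = sweep(sorted(edges), lambda w, a, b: w <= min(
        ds.max_w[a] + k / ds.size[a], ds.max_w[b] + k / ds.size[b]))
    # Step 4: absorb subtrees with fewer than min_size vertices (edges are still sorted).
    left = sweep(left, lambda w, a, b: min(ds.size[a], ds.size[b]) < min_size)
    segments = [ds.find(v) for v in range(num_vertices)]
    # Step 5: link the remaining subtrees until a single tree spans all vertices.
    sweep(left, lambda w, a, b: True)
    return tree, segments
```

Following the settings reported in the text, k would be five times the standard deviation of all edge weights and min_size would be 6. For a connected graph, the returned tree has one edge fewer than the number of vertices.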

Segment-Tree Filtering
The final step is to filter the initial probability maps using the tree-structure filter. The objective of the filtering process is to compute the aggregated probabilities. In the proposed approach, all vertices contribute to the aggregated probabilities, unlike in local-neighborhood methods. The non-local aggregated probabilities are defined in Equation (12) in terms of $M_p^d$, the initial probability defined in Equation (1), and S(p, q), a weighting function that denotes the weight contribution of pixel q to p. S(p, q) depends on $w_{e_i}$, the weights of the edges in the tree structure connecting p and q, and on a constant parameter γ.
Due to the tree structure, all the aggregated probabilities of class d in the image can be computed efficiently by traversing the tree in two sequential passes. In the first pass, forward filtering proceeds from the leaf pixels to the root (Equation (16)), where c(p) represents all the children of vertex p. In the second pass, backward filtering proceeds from the root to the leaf pixels (Equation (17)), where pa(p) represents the parent of vertex p.
As Figure 6 shows, vertex $V_4$ aggregates the probabilities of $V_5$, $V_6$, $V_7$, and itself during the forward filtering step using Equation (16). During the backward filtering step, the probabilities of $V_1$, $V_2$, and $V_3$ contribute to $V_4$ based on Equation (17). After only two passes, the aggregated probabilities of all vertices are computed, which reflects an extremely low computational complexity. Finally, the classification map is obtained using a simple Winner-Take-All (WTA) rule.
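The two-pass scheme can be sketched as follows, under assumptions taken from the non-local tree filtering literature rather than from this paper: the weight between two vertices is S(p, q) = exp(−D(p, q)/γ), where D(p, q) is the sum of edge weights on the tree path; the forward pass accumulates each child's partial sum into its parent; and the backward pass combines the parent's result with a (1 − S²) correction of the vertex's own partial sum. Function names are hypothetical.

```python
import math
from collections import deque


def tree_filter(num_vertices, tree_edges, probs, gamma, root=0):
    """Aggregate probs[p][d] over a tree given as (weight, u, v) edges."""
    adjacency = [[] for _ in range(num_vertices)]
    for w, u, v in tree_edges:
        s = math.exp(-w / gamma)  # assumed similarity of two linked vertices
        adjacency[u].append((v, s))
        adjacency[v].append((u, s))

    # Breadth-first order from the root: every parent precedes its children.
    parent = [-1] * num_vertices
    sim = [1.0] * num_vertices  # sim[p] = S(pa(p), p)
    order, seen = [], [False] * num_vertices
    seen[root] = True
    queue = deque([root])
    while queue:
        p = queue.popleft()
        order.append(p)
        for q, s in adjacency[p]:
            if not seen[q]:
                seen[q], parent[q], sim[q] = True, p, s
                queue.append(q)

    # Forward pass (leaves to root): each vertex adds its subtree sum to its parent.
    up = [list(row) for row in probs]
    for p in reversed(order):
        if parent[p] >= 0:
            for d in range(len(up[p])):
                up[parent[p]][d] += sim[p] * up[p][d]

    # Backward pass (root to leaves): bring in the contribution of the rest of the tree.
    agg = [list(row) for row in up]
    for p in order:
        if parent[p] >= 0:
            for d in range(len(agg[p])):
                agg[p][d] = sim[p] * agg[parent[p]][d] + (1.0 - sim[p] ** 2) * up[p][d]
    return agg


def winner_take_all(agg):
    """Assign each vertex the class with the largest aggregated probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in agg]
```

Each vertex is visited once per pass, so the cost grows linearly with the number of pixels for each class, and the result matches the direct sum over all vertex pairs under the stated assumptions.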
Remote Sens. 2017, 9, 69

Experiments and Results
The proposed method was implemented in C++ with the OpenCV and LAPACK libraries. The code is available by contacting the authors. Evaluations were performed using the following three hyperspectral benchmark datasets:
1. The first HSI is a 2 × 2 mile portion of an agricultural area over the Indian Pines region in Northwest Indiana, acquired by NASA's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. This scene, with a size of 145 × 145 pixels, comprises 202 spectral bands in the wavelength range from 0.4 to 2.5 µm, with a spatial resolution of 20 m. The ground truth of the scene (see Figure 7a) contains 16 classes of interest and 10,366 samples in total. Due to the imbalanced number of available labeled pixels per class and a large number of mixed pixels, this dataset poses a challenge in HSI classification.
2. The second HSI is a 103-band image acquired by the Reflective Optics System Imaging Spectrometer (ROSIS-03) sensor over the urban area of the University of Pavia, Italy. The spatial resolution is 1.3 m, and the scene contains 610 × 340 pixels and nine classes. The number of samples is 42,776 in total. The ground truth of the scene is shown in Figure 8a.
3. The third HSI was also acquired by the AVIRIS sensor, over Salinas Valley, California. This scene, with a size of 512 × 217 pixels and 204 spectral bands, is used for classification. There are 16 classes in the ground truth image, which is shown in Figure 9a.
The overall accuracy (OA), average accuracy (AA), Kappa coefficient, and producer accuracy (PA) are used to assess the classification accuracy.
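For concreteness, these four measures can be computed from a confusion matrix as in the sketch below. It uses the usual definitions (PA as the per-class fraction of reference samples classified correctly, AA as the mean of the PAs, and Cohen's Kappa); the function name and the convention that rows hold the reference classes are assumptions.

```python
def accuracy_metrics(confusion):
    """Compute (OA, AA, Kappa, PA) from a square confusion matrix.

    confusion[i][j] is the number of samples of reference class i
    that were assigned to class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = [confusion[i][i] for i in range(n)]
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    # Producer accuracy per class; classes without reference samples get 0.0.
    pa = [correct[i] / row_totals[i] if row_totals[i] else 0.0 for i in range(n)]
    oa = sum(correct) / total
    aa = sum(pa) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected) if expected < 1.0 else 0.0
    return oa, aa, kappa, pa
```

OA weights every sample equally, whereas AA weights every class equally, which is why the two can differ noticeably on a dataset with imbalanced classes such as Indian Pines.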

Influence of Different Parameters
Some parameters in our proposed method, such as β, K, and r, may affect the classification accuracy. Therefore, the Indian Pines dataset is used to test the importance of these parameters to the classification. In this case, the training samples account for 15% of all samples, regardless of their class. When one of the parameters is varied, the other parameters are fixed. Five-fold cross-validation is used to tune all parameters. The influences of β, K, and r on the classification accuracy are shown in Figures 10-12, respectively. The influences of β and r are less than 1%, while the influence of K is greater than 1%. Figures 10-12 thus illustrate that the classification accuracy is most sensitive to parameter K. Different dissimilarity measures are adopted during graph-based Segment-Tree Filter construction in our experiments, including the Minkowski distance (orders 1 to 6 and infinity) and SAM, as shown in Figure 13. Overall, the parameters affect the classification accuracy by approximately 1% to 2%, and the SAM dissimilarity measure yields the highest accuracy in our Segment-Tree Filtering approach.

Classification Accuracy Analysis
In this experiment, the training samples account for 15% of all the available samples, regardless of their class. The parameter settings are summarized as follows. In the initial classification step, the parameters were based on observed values, as discussed in Section 2.1. In the SELF transformation step, β = 0.6, K = 7, and r = 10. In the graph-based Segment-Tree construction step, SAM is used as the dissimilarity measure, and in the Segment-Tree Filtering step, γ is set to three times the standard deviation of w_e.
All experiments were repeated five times with differently sampled training sets, and the average of the five classification accuracies is recorded. The results of our proposed method for the Indian Pines, Pavia University, and Salinas datasets are shown in Tables 1-3, respectively. The visual results of the initial classification of each HSI are shown in Figures 7b, 8b, and 9b, respectively. Although SVMs are powerful classifiers, the results of the initial classification based solely on spectral features contain substantial noise. In contrast, Segment-Tree Filtering greatly improves the classification accuracy, as shown in Figures 7j, 8j, and 9j. The OA increases by 8.56% for the Indian Pines dataset, 4.64% for the Pavia University dataset, and 1.22% for the Salinas dataset. The largest PA increase is 32.60% for the Indian Pines dataset, followed by 22.71% for the Pavia University dataset and 3.41% for the Salinas dataset.
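For reference, the accuracy measures reported in Tables 1-3 can be computed from a confusion matrix as in the following sketch. It assumes that PA denotes the per-class (producer's) accuracy, that AA is the mean of the PAs, and that every class occurs at least once in the reference labels; the function name and the toy labels are illustrative only.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Return (OA, PA per class, AA, Kappa) from integer reference and predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)            # rows: reference class, columns: predicted class
    total = cm.sum()
    oa = np.trace(cm) / total                     # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)             # per-class accuracy (assumes no empty class)
    aa = pa.mean()                                # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, pa, aa, kappa

# Toy labels (hypothetical), not taken from any of the three datasets.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 0])
oa, pa, aa, kappa = accuracy_metrics(y_true, y_pred, n_classes=3)
print(oa, pa, aa, kappa)
```

When experiments are repeated over several training sets, as above, each measure would be averaged over the runs.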

Influences of Different Techniques for Dimensionality Reduction
In the above section, we illustrated that incorporating spatial information can improve the classification accuracy. In this section, we evaluate how different techniques for dimensionality reduction affect the classification accuracy. First, we assess the classification accuracy with and without dimensionality reduction. If no dimensionality reduction method is used, the Segment-Tree Filter is constructed using the original HSI. Figure 7c, Figure 8c, and Figure 9c show the results of classification using this scheme. Because redundant bands negatively affect segmentation, the classification accuracy using the original bands in the three HSI datasets is lower than that produced using the proposed method. The fourth column in Tables 1-3 illustrates that reducing the dimensionality is necessary to increase the classification accuracy, and the average OA increased by approximately 0.53%. In addition, dimensionality reduction can considerably improve the computational speed.
Next, we examine how different methods of dimensionality reduction affect the classification accuracy. Figures 7-9 show the classification results for various methods, including the PCA, LDA, and LFDA methods; however, the classification accuracy produced by SELF is better than the accuracies of those methods.
Figure 7d-f, Figure 8d-f, and Figure 9d-f show the classification results for PCA, LDA, and LFDA, respectively. As expected, the OA and Kappa coefficient of SELF are the highest among these methods, and the AA is the highest for the Indian Pines dataset and the second highest for the Pavia University and Salinas datasets. Although PCA performed better than SELF for some PAs, PCA with Segment-Tree Filtering is not a robust algorithm and can easily over-smooth spatial information, as illustrated in Figure 9d. The region in the red rectangle exhibits considerable classification errors for the PCA-based result, and the PA of Fallow_r_p decreases from 99.35% to 60.04% in the fourth row of Table 3.
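As an illustration of the simplest of the compared reduction schemes, the following sketch projects a hyperspectral cube onto its first r principal components. It covers only the PCA baseline, not SELF, LDA, or LFDA; the function name and the synthetic cube are assumptions made for this example.

```python
import numpy as np

def pca_reduce(cube, r):
    """Project an (H, W, M) hyperspectral cube onto its first r principal components."""
    h, w, m = cube.shape
    pixels = cube.reshape(-1, m).astype(float)    # one row per pixel (astype returns a copy)
    pixels -= pixels.mean(axis=0)                 # center each band
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    return (pixels @ vt[:r].T).reshape(h, w, r)

rng = np.random.default_rng(0)
cube = rng.normal(size=(20, 20, 50))              # synthetic stand-in, not a real HSI
reduced = pca_reduce(cube, r=10)
print(reduced.shape)                              # (20, 20, 10)
```

The reduced cube, rather than the original bands, would then serve as the input for constructing the filter, which is the setting compared in Figure 7d, Figure 8d, and Figure 9d.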

Comparison to Other Methods of Spectral-Spatial Classification
As discussed in Section 1, spectral-spatial classification is a powerful method of combining contextual information. Therefore, we compared the proposed method to other common methods of spectral-spatial classification. We implemented the following spectral-spatial classification algorithms in our analysis.

1. The first algorithm is based on EMPs [9]. In [9], a neural network classifier was applied; however, an SVM is used here instead of a back-propagation neural network to ensure a fair comparison. The results of the EMP-based method are shown in Figure 7g, Figure 8g, and Figure 9g.

2. The second approach uses MRFs [6] with Multi-class SVM, which is, to the best of our knowledge, the state-of-the-art method for spatial-spectral classification of remote sensing images. Multi-class SVM is used as the initial classifier, and the spatial optimization is performed using max-flow/min-cut algorithms. In our experiment, α-expansion is adopted, and the regularization coefficient is fixed to 0.5. The results of the SVM with MRFs are shown in Figure 7h, Figure 8h, and Figure 9h.

3. The third approach is based on a Guided Filter with PCA [18]. The size and blur degree in Guided Filtering are tuned adaptively by cross-validation. The results of this classification are shown in Figure 7i, Figure 8i, and Figure 9i.
The proposed method produced results that were more accurate than those of the EMP-based method for the Indian Pines dataset; however, the results were similar for the other datasets. This result suggests that the proposed method is suitable for a wider range of datasets than the EMP-based approach.
The classification accuracies of the proposed method and the SVM method with MRFs were nearly equal. This result indicates that the proposed method achieved an accuracy comparable to that of a state-of-the-art method for spatial-spectral HSI classification. Furthermore, when the number of training samples within a class is very small, e.g., the "Oats" class in Table 1, the PA of the proposed approach is perfect, while the SVM with MRFs method fails. This occurs because MRFs over-smooth spatial features; thus, the regularization parameter requires complex tuning steps. Therefore, our proposed approach is more robust than the SVM with MRFs method.
The classification accuracy of the proposed method is slightly higher than that of the Guided Filter based on the OAs of the three datasets.
In addition, we compared the computational times of GF with PCA [18] and our method for the three HSIs. We assume that N pixels, M bands, D classes, and a local window of size R are used for the reconstruction of the HSIs. The complexity of Segment-Tree Filtering is O(ND), and that of Guided Filtering is O(NDM). For the Indian Pines dataset, GF needed 2.6138 s to process 10 bands in the compressed dataset, while the proposed method required only 0.0536 s (all programs were executed using an Intel(R) Xeon(R) CPU E5-2620 with 24 GB of RAM). Guided Filtering is slower because it computes the inverse covariance matrix for each sample. In extreme cases, Guided Filtering using an original HSI as the guide image can be time consuming and ineffective.
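As a rough worked check of the figures above, the measured speedup and the ratio of the stated complexities can be written as follows; the symbols t_GF and t_STF are introduced here only for illustration.

```latex
\[
\frac{t_{\mathrm{GF}}}{t_{\mathrm{STF}}}
  = \frac{2.6138\ \mathrm{s}}{0.0536\ \mathrm{s}} \approx 48.8,
\qquad
\frac{O(NDM)}{O(ND)} = O(M).
\]
```

With M = 10 bands, the asymptotic ratio alone suggests a gap on the order of ten; the larger measured gap is plausibly attributable to constant factors, such as the per-sample covariance matrix inversion in Guided Filtering, although this was not measured separately.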

Effect of the Training Set on Classification
In this section, we assess how the number of training samples affects the classification accuracy of the proposed method. To this end, we varied the training sample size from 1% to 20% of all samples. We found that when the training sample size increases, the classification accuracy also increases. Because this trend is consistent, we only illustrate how the number of training samples affects the classification accuracy for the Indian Pines dataset, as shown in Figure 14. The classification accuracy improves considerably until the number of pixels in the training set reaches 5% of the total number of pixels. Then, the accuracy continues to improve but at a lower rate. We also evaluate the effects of the training sample size on the classification accuracies of the Guided Filter [18] and Multi-class SVM methods. As shown in Figure 14, the proposed method and the Guided Filter approach improve the classification accuracy regardless of the size of the training set, and the proposed method yields better results. Furthermore, the advantage is larger when the size of the training set is small.
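The sampling protocol can be sketched as follows, assuming that the same fraction of labeled pixels is drawn at random from every class. The function name, the rule of keeping at least one training sample per class, and the synthetic labels are assumptions made for this illustration.

```python
import numpy as np

def stratified_split(labels, fraction, rng):
    """Return (train_idx, test_idx), drawing `fraction` of the samples from every class."""
    labels = np.asarray(labels)
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Assumption: keep at least one training sample so that very small classes are not dropped.
        n_train = max(1, int(round(fraction * idx.size)))
        train.append(rng.choice(idx, size=n_train, replace=False))
    train_idx = np.sort(np.concatenate(train))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], [200, 100, 20])     # synthetic labels with one small class
for fraction in (0.01, 0.05, 0.10, 0.15, 0.20):
    train_idx, test_idx = stratified_split(labels, fraction, rng)
    print(fraction, train_idx.size, test_idx.size)
```

Repeating the split with different random seeds gives the differently sampled training sets over which the accuracies are averaged.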

Conclusions
A novel and efficient approach based on a Segment-Tree Filter has been proposed for hyperspectral image classification. Our proposed approach is based on spatial-spectral filtering, which is a special EAF. The filter construction utilizes both spectral features, obtained using a SELF transformation, and spatial information, obtained using a Segment-Tree algorithm. After an initial classification map is generated by Multi-class SVM, the map is filtered using the Segment-Tree Filter. One advantage of our proposed approach is that the classification accuracy is improved dramatically. Experimental results show that the proposed method produced a high classification accuracy for hyperspectral image benchmark sets, including 93.34% for the Indian Pines dataset, 93.89% for the Pavia University dataset, and 92.78% for the Salinas dataset. Compared to other spatial-spectral methods, another advantage of the proposed method is that it provides a more robust classification approach for different datasets and training sets of different sizes.
In the future, two major aspects of our approach could be improved. First, the dimensionality reduction in SELF could be performed with nonlinear projections to reconstruct HSIs. Second, other classifiers, including fuzzy classifiers, could be applied to improve the classification accuracy.

2. Step 2: Use a Multi-class SVM method to obtain the initial classification map.
3. Step 3: Perform Segment-Tree Filtering based on the pixel-based initial classification map produced by the Multi-class SVM. By combining this initial classification map and the Segment-Tree Filter, we can incorporate spatial information and spectral features adaptively. Finally, the HSI classification map can be derived from the result of Segment-Tree Filtering.

Figure 1. Workflow of Segment-Tree Filtering for HSI classification.

Figure 2. Indian Pines reconstructed using different methods of dimensionality reduction: (a) LDA; (b) LFDA; (c) PCA; (d) SELF. The number of samples in the training set accounts for only 1% of all samples.

Figure 3. Indian Pines reconstructed using different methods of dimensionality reduction: (a) LDA; (b) LFDA; (c) PCA; (d) SELF. The number of samples in the training set accounts for 20% of all samples.

Figure 5. Tree structure of the Segment-Tree Filter for the Indian Pines dataset: (a) Image of the segment tree; (b) Close-up of the red rectangular region in (a).

Figure 6. Segment-Tree Filtering in two sequential passes: (a) Forward filtering from the leaves to the root; (b) Backward filtering from the root to the leaves.

Figure 10. Influence of parameter β on the classification accuracy.

Figure 11. Influence of parameter K on the classification accuracy.

Figure 12. Influence of parameter r on the classification accuracy.

Figure 13. Influences of different dissimilarity measures on the classification accuracy.

Figure 14. Classification accuracy based on the number of training samples for the Indian Pines dataset.

Table 1. Number of training and test samples from the Indian Pines dataset and the classification accuracies (in percentages) of different methods (the bolded item in each row indicates the best accuracy).

Table 2. Number of training and test samples from the Pavia University dataset and the classification accuracies (in percentages) of different methods (the bolded item in each row indicates the best accuracy).

Table 3. Number of training and test samples from the Salinas dataset and the classification accuracies (in percentages) of different methods (the bolded item in each row indicates the best accuracy).