Hyperspectral Image Denoising and Classification Using Multi-Scale Weighted EMAPs and Extreme Learning Machine

Recently, extended multi-attribute profiles (EMAPs) have attracted much attention due to their good performance when applied to remote sensing image feature extraction and classification. Since the EMAPs concatenate multiple attribute features without considering the pixel-based nature of Hyperspectral Image (HSI) classification, homogeneous regions may become unsmooth due to the noise introduced. To tackle this problem, we propose the weighted EMAPs (WEMAPs) to reduce the noise and smoothen the homogeneous regions based on the weighted mean filter (WMF). We then construct multiscale WEMAPs (MWEMAPs) to produce multiscale features in order to extract different spatial structures of the HSI and produce better classification results. Finally, a new joint decision fusion and feature fusion (JDFFF) framework is proposed, which combines decision fusion (DF) and the MWEMAPs based on the extreme learning machine (ELM) classifier. That is, the classification results from various scales, each obtained with the ELM, are combined into a final one to perform the HSI classification. Experimental results show that the proposed algorithm significantly outperforms many state-of-the-art HSI classification algorithms.


Introduction
Hyperspectral images (HSIs) have been successfully applied in a wide range of applications, such as land and ocean mapping, geological analysis, brain cancer detection, mining, and precision agriculture [1][2][3]. As an HSI contains rich spectral and spatial information, each pixel in the scene can be classified [4]. However, the supervised classification of HSIs remains a challenging task due to the unbalanced ratio between the limited training samples and the large number of spectral bands, i.e., the Hughes phenomenon [5]. To this end, many advanced techniques have been proposed for feature extraction and dimensionality reduction [6,7], such as the principal component analysis (PCA) and its variations [8,9], the attribute profiles (APs) [10,11], the linear discriminant analysis (LDA) [12], the nonparametric weighted feature extraction (NWFE) [13], and the singular spectrum analysis (SSA) [14]. In addition, a number of state-of-the-art technologies have been successfully applied to HSI data classification, including the support vector machine (SVM) [15], the extreme learning machine (ELM) [16], and the sparse multinomial logistic regression (SMLR) and its variants. The proposed approach will be discussed in Section 3, including an overall flowchart in Figure 1. The main contributions of this work can be highlighted as follows: (1) The WEMAPs are proposed for integrating different EMAP attributes to reduce the noise and smoothen the homogeneous regions. (2) The JDFFF framework is proposed to capture different spatial structures and better model the discriminant information of the HSIs. (3) The classification results of different scales are combined into a final one using the MV within the proposed JDFFF for the HSI classification.
The rest of this paper is organized as follows: Some related algorithms are reviewed in Section 2 and the proposed framework of the JDFFF is presented in Section 3; the extended experiment results and the analysis are presented in Section 4; Section 5 concludes our works with some remarks.


Normalization
As an important preprocessing step for HSI, there are a number of normalization approaches [26]. We chose the Max method for normalization for simplicity and consistency, since it is a widely used normalization method [26]. Given HSI data X = [x_1^*, x_2^*, ..., x_N^*] ∈ R^{d×N}, where N denotes the number of pixels in the HSI and d is the number of bands, the Max normalization method can be expressed as:

x_ij = x_ij^* / max(X)    (1)

where max(X) and x_ij^* are the largest value of the HSI and any pixel value of the HSI, respectively.
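As a quick illustration, Equation (1) amounts to a single division by the global maximum of the cube; a minimal NumPy sketch (with a toy matrix standing in for a real HSI) is:

```python
import numpy as np

def max_normalize(X):
    """Max normalization (Eq. 1): divide every pixel value by the
    global maximum of the HSI, mapping the data into [0, 1]."""
    return X / X.max()

# Toy d x N matrix standing in for a flattened HSI (d bands, N pixels).
X = np.array([[2.0, 4.0],
              [1.0, 8.0]])
X_norm = max_normalize(X)   # every entry divided by 8.0
```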

Attribute Profiles (APs)
The mathematical morphology is a powerful framework for analyzing the spatial information of the HSIs [22,32,33]. Let γ_R^i and φ_R^i be the morphological opening and closing by reconstruction, whose operations on a grey level image f can be defined as [10]:

γ_R^i(f) = R_f^δ(ε_i(f)),  φ_R^i(f) = R_f^ε(δ_i(f))

where δ_i and ε_i are the dilation and erosion with a given structure element of size i (i = 1, ..., n), and R_f^δ and R_f^ε are the geodesic reconstruction by the dilation and the erosion, respectively [32]. Then, the MPs can be defined as follows:

MP(f) = {φ_R^n(f), ..., φ_R^1(f), f, γ_R^1(f), ..., γ_R^n(f)}

Analogous to the MP, for a given sequence of thresholds λ_1, ..., λ_n, the APs can be represented as a concatenation of a series of attribute thinning and attribute thickening operations [10,24] as follows:

AP(f) = {φ̃^{λ_n}(f), ..., φ̃^{λ_1}(f), f, γ̃^{λ_1}(f), ..., γ̃^{λ_n}(f)}

where γ̃ and φ̃ denote the thinning and thickening transformations, respectively.
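The opening by reconstruction γ_R used above can be sketched in pure NumPy: erode the image, then iterate geodesic dilation under the original image until stability. This is a naive illustrative implementation (flat square structuring element; the function names and toy image are ours, not the paper's):

```python
import numpy as np

def dilate(img, size=3):
    # Flat (size x size) greyscale dilation via edge-padded local maxima.
    p = size // 2
    padded = np.pad(img, p, mode='edge')
    out = np.empty_like(img)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + size, j:j + size].max()
    return out

def erode(img, size=3):
    # Greyscale erosion by duality with dilation.
    return -dilate(-img, size)

def reconstruct_by_dilation(marker, mask):
    # Geodesic reconstruction R^delta: repeat "dilate, then clip to the
    # mask" until the result no longer changes.
    prev = marker
    while True:
        cur = np.minimum(dilate(prev), mask)
        if np.array_equal(cur, prev):
            return cur
        prev = cur

def opening_by_reconstruction(f, size=3):
    # gamma_R(f): erosion followed by geodesic reconstruction by dilation.
    return reconstruct_by_dilation(erode(f, size), f)

# A 3 x 3 bright square survives the opening by reconstruction intact,
# while an isolated bright pixel is removed entirely.
f = np.zeros((7, 7))
f[1, 1] = 1.0
f[3:6, 3:6] = 1.0
g = opening_by_reconstruction(f)
```

Unlike a plain opening, structures that survive the erosion are restored to their full original shape, which is why the reconstruction-based operators preserve object geometry.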

Extreme Learning Machine (ELM)
The ELM [34,35] was originally proposed for the single hidden layer feedforward neural network [36] with one linear output layer and one hidden layer [26]. Let X = [x_1, x_2, ..., x_n] ∈ R^{d×n} and Y = [y_1, y_2, ..., y_n] ∈ R^{m×n} be n training samples and their corresponding labels with m classes. The output function of the ELM with L hidden neurons can be represented by:

f(x) = Σ_{j=1}^{L} β_j h_j(x) = h(x)β    (6)

where β_j is the output weight between the j-th hidden neuron and the output layer, and h_j is the activation function of the j-th hidden neuron. The above n equations can be rewritten in a matrix form as:

Hβ = Y^T    (7)

where β = [β_1, ..., β_L]^T ∈ R^{L×m} and H ∈ R^{n×L} is the hidden layer output matrix. The solution of Equation (7) of the original ELM can be expressed by:

β = H†Y^T

where H† is the Moore–Penrose generalized inverse of the matrix H [37]. That is, H† = H^T(HH^T)^{−1} or H† = (H^TH)^{−1}H^T. A positive value I/C is normally added to every diagonal element of HH^T or H^TH in order to improve the stability and the generalization of the inverse operator [26], where I is an identity matrix, which yields:

β = (I/C + H^TH)^{−1}H^TY^T    (10)  or  β = H^T(I/C + HH^T)^{−1}Y^T    (11)

It should be noted that the sizes of I in Equations (10) and (11) are different, which depend on the dimensions of H^TH and HH^T, respectively. Similar to the SVM [38,39], the output function of the ELM using the RBF kernel can be represented as follows:

f(x) = [K(x, x_1), ..., K(x, x_n)](I/C + Ω_ELM)^{−1}Y^T    (12)

where Ω_ELM = HH^T and Ω_ELM(i, j) = h(x_i)h(x_j)^T = K(x_i, x_j). Two well-known constrained optimization models of the improved ELM are widely used. One defines β as in Equation (10) or Equation (11) without a kernel, which introduces a regularization term to the ELM against overfitting and is called the generalized ELM (GELM). The other introduces kernels as in Equation (12), which applies the kernel function to the GELM for kernel data representation and is called the kernel ELM (KELM).
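The regularized (GELM-style) solution of Equation (10) is a closed-form ridge regression over random hidden-layer features. The following is a minimal NumPy sketch on synthetic data; the layer size, sigmoid activation, and toy two-class problem are illustrative choices, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Y, L=64, C=100.0):
    """Regularized ELM: random hidden layer, then the closed-form ridge
    solution beta = (I/C + H^T H)^{-1} H^T Y^T of Eq. (10)."""
    d = X.shape[0]
    W = rng.standard_normal((L, d))              # random input weights
    b = rng.standard_normal((L, 1))              # random biases
    H = (1.0 / (1.0 + np.exp(-(W @ X + b)))).T   # n x L sigmoid activations
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y.T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = (1.0 / (1.0 + np.exp(-(W @ X + b)))).T
    return (H @ beta).T.argmax(axis=0)           # class index per sample

# Toy 2-class problem: two well-separated Gaussian clouds in 2-D.
n = 100
X = np.hstack([rng.normal(-2, 0.5, (2, n)), rng.normal(2, 0.5, (2, n))])
labels = np.array([0] * n + [1] * n)
Y = np.eye(2)[:, labels] * 2 - 1                 # one-hot targets in {-1, +1}
W, b, beta = elm_train(X, Y)
acc = (elm_predict(X, W, b, beta) == labels).mean()
```

Since the hidden weights are never trained, the only fitted quantity is β, which is why the ELM is so much faster than iteratively trained networks or the SVM.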

Weighted Mean Filters (WMFs)
For n training samples X = [x_1, x_2, ..., x_n] ∈ R^{d×n} and the corresponding class labels Y = [y_1, y_2, ..., y_n] ∈ R^{m×n}, let (p_i, q_i) be the spatial coordinates of the i-th training sample x_i. The local pixel neighborhood centered at x_i can be expressed by [3]:

N(x_i) = {x(p, q) | p ∈ [p_i − c, p_i + c], q ∈ [q_i − c, q_i + c]}

where c = (w_0 − 1)/2, and w_0 is the odd width (scale) of the neighborhood window. Let s = w_0² − 1 denote the total number of neighboring pixels of x_i, and denote the pixels in its spatial neighborhood N(x_i) by x_i, x_{i1}, x_{i2}, ..., x_{is}. The spatial WMF of a labeled pixel x_i can be represented by:

x̄_i = (x_i + Σ_{k=1}^{s} v_k x_{ik}) / (1 + Σ_{k=1}^{s} v_k)

where the weight v_k = exp(−γ‖x_i − x_{ik}‖²) is determined by the spectral distance between the center pixel and its k-th neighboring pixel. According to [3], the degree of filtering γ is set to be 0.2 in this paper.
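A direct, unoptimized sketch of the WMF of a single pixel, assuming the weighting form above (the toy cube and the function name are ours):

```python
import numpy as np

def weighted_mean_filter(cube, i, j, w0=3, gamma=0.2):
    """Spatial WMF of the pixel at (i, j) in an H x W x d HSI cube.
    Each neighbor's weight decays with its spectral distance to the
    centre pixel: v_k = exp(-gamma * ||x_i - x_ik||^2)."""
    c = (w0 - 1) // 2
    H, W, _ = cube.shape
    centre = cube[i, j]
    num = centre.copy()          # the centre pixel itself has weight 1
    den = 1.0
    for p in range(max(0, i - c), min(H, i + c + 1)):
        for q in range(max(0, j - c), min(W, j + c + 1)):
            if (p, q) == (i, j):
                continue
            v = np.exp(-gamma * np.sum((centre - cube[p, q]) ** 2))
            num = num + v * cube[p, q]
            den += v
    return num / den

# Toy 3 x 3 x 2 cube: the filtered centre pixel is pulled towards its
# spectrally similar neighbours, while a spectral outlier in the window
# receives a near-zero weight and barely affects the result.
cube = np.ones((3, 3, 2))
cube[2, 2] = 100.0
y = weighted_mean_filter(cube, 1, 1)
```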

The Proposed Weighted Extended Multi-Attribute Profiles (WEMAPs)
The MPs [21] and the APs [10,11] have been successfully applied for combining both the spectral and spatial information for HSI classification [24]. The APs are acquired by applying a sequence of the attribute filters to a gray level image [10]. Since the original APs were employed to process only one spectral band, the dimensionalities of the APs will be very large if the full spectral bands of the HSIs are utilized to extract all the APs [22]. To address this problem, the first several principal components of the HSIs were used for extracting the APs to reduce the dimensionalities [11,22].
The EAPs constructed from the first p principal components of the HSIs can be expressed as follows:

EAP = {AP(PC_1), AP(PC_2), ..., AP(PC_p)}

The EMAPs are composed of v EAPs with different attributes {1, 2, ..., v}:

EMAPs = {EAP_1, EAP_2, ..., EAP_v}

Although the EMAPs lead to an increase in the feature dimensionality, they also increase the capability of extracting the spatial information compared to a single EAP [10]. However, there exists a key drawback in the EMAPs, i.e., they combine multiple attribute features without considering the nature of pixel-based HSI classification. This introduces noise into the HSIs and leads to non-smoothness of homogeneous regions. To tackle these problems, the WEMAPs are proposed in this paper for the HSI classification, which will be introduced in detail as follows.
Let X_EMAPs denote the feature matrix extracted by the EMAPs. In the proposed WEMAPs, the WMF is applied to X_EMAPs, i.e., each EMAP feature vector of a pixel is replaced by the weighted mean of the feature vectors within its spatial neighborhood, with the weights defined as in the WMF. Comparing the EMAPs with the proposed WEMAPs, it can be seen that the proposed WEMAPs consider the ensemble situation. The multiple attribute features used here can not only reduce the noise but also smoothen the homogeneous regions of the HSIs.

Feature Fusion (FF)
Since different features such as the WMFs and the WEMAPs may represent certain characteristics and reflect various properties [26], their combination becomes a natural choice [40]. It is then straightforward to stack different features into a composite one. In the proposed FF, the WMF features and the proposed WEMAP features of each pixel are stacked into a single composite feature vector.
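In code, the FF step is plain feature stacking along the feature axis; a sketch with placeholder random matrices standing in for the real WMF and WEMAP outputs (the dimensions are arbitrary, illustrative values):

```python
import numpy as np

# d1 x n WMF features and d2 x n WEMAP features for the same n pixels;
# here both are random placeholders, not real extracted features.
n = 50
wmf_features = np.random.rand(20, n)
wemap_features = np.random.rand(36, n)

# FF: concatenate along the feature axis -> a (d1 + d2) x n composite.
fused = np.vstack([wmf_features, wemap_features])
```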

Joint Decision and Feature Fusion (JDFFF)
Different from the FFs, the objective of the DFs [27] is to reach a joint decision based on multiple classification results [26]. Based on both the FF and the DF, a JDFFF framework is proposed for performing the HSI classification. The main steps of the proposed JDFFF framework can be summarized as follows. First, the discriminant features extracted by the WMFs and the proposed WEMAPs are combined to form the multiscale features. Second, the classification results of the corresponding scales obtained using the ELMs (the GELM or the KELM) are combined into the final one by the MV within the proposed JDFFF.
For multiscale features, the widths (scales) of the neighborhood windows w 0 can be set to be 3, 5, 7, etc. Figure 1 illustrates the flowchart of the proposed JDFFF framework.
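The MV step of the JDFFF can be sketched as follows: given the per-scale label predictions, each pixel receives its most frequent predicted label (array shapes and the toy predictions are illustrative):

```python
import numpy as np

def majority_vote(label_maps):
    """Decision fusion by majority vote (MV): label_maps is a list of
    per-scale predictions (one length-n label array per scale); the
    fused label of each pixel is its most frequent prediction."""
    stacked = np.stack(label_maps)          # n_scales x n
    fused = np.empty(stacked.shape[1], dtype=int)
    for i in range(stacked.shape[1]):
        vals, counts = np.unique(stacked[:, i], return_counts=True)
        fused[i] = vals[counts.argmax()]    # ties resolved to lowest label
    return fused

# Toy predictions of four scales (w0 = 3, 5, 7, 9) for five pixels.
preds = [np.array([0, 1, 2, 1, 0]),
         np.array([0, 1, 2, 2, 0]),
         np.array([0, 1, 1, 1, 3]),
         np.array([0, 2, 2, 1, 0])]
fused = majority_vote(preds)
```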

Experimental Results and Discussion
In this section, the proposed JDFFF framework will be evaluated on three well-known hyperspectral datasets, which are detailed below.
Indian Pines dataset: This dataset [18] was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor of NASA in June 1992. It has 145 × 145 pixels with 220 bands between 400 nm and 2450 nm, covering the visible and infrared spectral regions. The spatial resolution of this dataset is 20 m. After removing 20 water absorption bands, 200 bands remain in this image. In total, 10,366 pixels are labeled in 16 classes for classification.
Pavia University dataset: This dataset [18] was collected by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor in 2002. The image contains 610 × 340 pixels with 103 valid bands after removing 12 noisy and water absorption bands. The dataset has 42,776 sample pixels labeled in 9 classes.

Kennedy Space Center (KSC) dataset: This dataset [41] was obtained by the NASA AVIRIS instrument at the Kennedy Space Center in Florida. AVIRIS collected data in 224 bands of 10 nm width with center wavelengths between 400 nm and 2500 nm. The image has 512 × 614 pixels, and the spatial resolution of the KSC data, acquired from a height of approximately 20 km, is 18 m. After removing the water absorption and low-SNR bands, 176 bands were used for analysis. For classification, 13 categories are defined for the site, representing the various land cover types that occur in this environment.
Relevant results are summarized and discussed as follows. In addition, all the abbreviations used in this article are listed in Table 1.

Evaluation Criteria and Parameter Settings
The parameter settings and evaluation criteria used in our experiments are discussed as follows. For both the EMAPs and the proposed WEMAPs, although many attributes can be utilized for extracting the discriminant features, according to [42], only four attributes are considered here: the area, the moment of inertia, the standard deviation, and the length of the diagonal of the bounding box. The thresholds of the areas, the moments of inertia, the standard deviations, and the lengths of the diagonals of the bounding boxes are selected respectively from the sets of {100, 200, 500, ...} [43]. For all the kernel-based algorithms, the radial basis function (RBF) is used, where the kernel parameter σ and the penalty parameter C are fine-tuned in the training stage. The parameters C in the GELM with the composite kernels (CKs) are also fine-tuned. The parameter σ varies in the range {2^-4, 2^-3, ..., 2^3, 2^4} and the penalty parameter C varies in the range {2^1, 2^2, ..., 2^19, 2^20}. All the above parameters are automatically optimized using three-fold cross-validation. Other parameters in the KSVM, KSVM-CK, GELM, GELM-CK, KELM and KELM-CK follow [43]. The LIBSVM toolbox in MATLAB is used for implementing the SVM algorithms [44]. The parameters for the SMLR-SPATV (the SMLR with the weighted Markov random fields) follow [45]. All the experiments are conducted in MATLAB R2015a on a computer with a 2.9 GHz CPU and 32 GB RAM. Each classification experiment is randomly run ten times and the average of the ten groups of results is reported for performance assessment.

Investigation on the Effect of Different Strategies
In this subsection, we investigate the performance of different methods and the denoising performance of the proposed model by introducing various levels of noise into the initial raw dataset. In each class, 15 samples are randomly selected for training and the remaining ones are used for testing. The scales of the WMFs, WEMAPs and FF are set to be 3, i.e., the parameter c in Equation (19) is set to be 1. For the proposed JDFFF, the scales are set in the range from 3 to 9. We add i.i.d. zero-mean Gaussian noise with standard deviation σ = 0.02, 0.04, 0.06 (i.e., variance σ²) to the initial raw dataset. Table 2 summarizes the classification results obtained from the different strategies, from which the following observations can be made: (1) Adding the WMFs helps to achieve better results than the corresponding approaches without the WMFs, and the proposed WEMAPs produce better classification results than both the EMAPs and the WMFs. By combining both the WMFs and our WEMAPs, the proposed FF achieves better results than using the proposed WEMAPs alone, and the proposed JDFFF further improves the classification results. (2) When only the raw data or the WMFs are used, the noise greatly affects their performance, and the greater the noise, the greater the impact. In contrast, the performance of the EMAPs, WEMAPs, FF and JDFFF is almost immune to the amount of noise in the initial raw dataset.

Investigation on the Suitability of Different Datasets
In this subsection, we study the applicability of the proposed method on different datasets. In each class, 15 samples are randomly selected for training and the remaining samples are used for testing. The scales of the WEMAPs and the FF are set to be 3, i.e., the parameter c in Equation (19) is set to be 1. For the proposed JDFFF, the scales are set in the range from 3 to 9. Table 3 summarizes the classification results obtained from the different datasets and strategies, from which it can be observed that our proposed method can be applied to datasets of various sizes and spatial resolutions, although it takes more time when testing on larger datasets.

Investigation on the Effect of Scales
In this subsection, the impacts of the scales of the WMFs, the proposed WEMAPs and the FF on the GELM and the KELM are investigated. Again, 15 training samples per class are randomly selected for training and the remaining samples are used for testing. Figure 2 shows the classification results, from which the following observation can be made: the proposed WEMAPs achieve better classification accuracies than the WMFs, and the FF further improves the classification accuracies of the WEMAPs.



Classification Results and Comparisons on the Two Datasets
In this subsection, the performance of the proposed JDFFF is further evaluated by comparing it with some well-known state-of-the-art approaches, where different numbers of samples Q are randomly chosen from each class for training, here Q = 5, 10, 15, 20, 25, 30. Note that the number of training samples in each class is capped at 50% if Q becomes more than half of the samples in that class. We also apply the proposed FF and JDFFF to the KSVM in order to show the good performance of the proposed frameworks. For the JDFFF and the FF, the scales are set to range from 3 to 9 and to 3, respectively. For performance evaluation, four metrics are used, including the overall accuracy (OA), the average accuracy (AA), the kappa statistic (k) and the standard deviation (S) [43]. Tables 4 and 5 show the classification results for the Indian Pines dataset and the Pavia University dataset, respectively. For completeness, the average computational times of all the methods with 30 training samples are also listed in the tables for comparison. From these two tables, the following observations can be made: (1) The proposed FF achieves better classification results than the CK and the MRF methods, and is further improved by the proposed JDFFF. Furthermore, the proposed FF-GELM and FF-KELM produce better classification accuracies than the FF-SVM. The same holds for the JDFFF methods, i.e., the JDFFF-GELM and the JDFFF-KELM outperform the JDFFF-SVM.
(2) The proposed FF based on the ELM algorithms (the GELM and the KELM) requires less computational time than the other spectral-spatial SVM-based algorithms. This is because the proposed FF based on the ELM algorithms inherits the fast speed of the ELM. Likewise, the proposed JDFFF with the ELM algorithms (the GELM and the KELM) has the advantage of less computational time compared with the other spectral-spatial SVM-based methods.
In addition, Figures 3 and 4 show the results for the Indian Pines dataset and the Pavia University dataset with about 30 training samples per class.

Conclusions
In this paper, we propose a novel framework for the HSI classification. First, an improved version of the EMAPs, called the WEMAPs, is proposed, which better models the discriminant information and reduces the noise in the HSIs. Second, the features extracted by the WEMAPs and the WMFs are combined to obtain a better feature fusion (FF). Third, the proposed multiscale extension of the FF, namely the JDFFF, further improves the HSI classification results. Finally, the GELM and the KELM are applied within the proposed JDFFF to perform the HSI classification. Experimental results show the good performance of the proposed framework.
In future work, dimensionality reduction [46] will be applied to further reduce the computational time of the proposed JDFFF. Additionally, hyperspectral unmixing [47] will be explored to further improve the classification results.