SMOTE-Based Weighted Deep Rotation Forest for the Imbalanced Hyperspectral Data Classification

Abstract: Conventional classification algorithms have shown great success on balanced hyperspectral data. However, an imbalanced class distribution is a fundamental property of hyperspectral data and is regarded as one of the great challenges in classification tasks. To address this problem, a non-ANN based deep learning method, the SMOTE-Based Weighted Deep Rotation Forest (SMOTE-WDRoF), is proposed in this paper. First, the neighboring pixels of instances are introduced as spatial information and balanced datasets are created using the SMOTE algorithm. Second, these datasets are fed into the WDRoF model, which consists of a rotation forest and multi-level cascaded random forests. Specifically, the rotation forest is used to generate rotation feature vectors, which are input into the subsequent cascade forest. Furthermore, the output probability of each level and the original data are stacked as the dataset of the next level, and the sample weights are automatically adjusted according to a dynamic weight function constructed from the classification results of each level. Compared with traditional deep learning approaches, the proposed method consumes much less training time. Experimental results on four public hyperspectral datasets demonstrate that the proposed method achieves better performance than support vector machine, random forest, rotation forest, SMOTE combined rotation forest, convolutional neural network, and rotation-based deep forest in multiclass imbalance learning.


Introduction
Hyperspectral imagery is obtained by remote sensors in dozens or hundreds of narrow and contiguous wavelength bands simultaneously [1][2][3][4][5]. Compared with traditional panchromatic and multispectral remote sensing images, hyperspectral imagery carries a wealth of spectral information, which enables more accurate discrimination of different objects. Consequently, in recent years, hyperspectral imagery has gained extensive attention for a variety of applications in Earth observation [1,[6][7][8][9][10], such as urban mapping, precision agriculture, and environmental monitoring [11][12][13][14][15]. Hyperspectral image classification, which centers on assigning class labels to pixels, is a significant research topic. Class distribution, i.e., the proportion of samples belonging to each class, plays an extremely important part in classification research. Some traditional classification methods, such as maximum likelihood classification [16], the support vector machine (SVM) [17], and artificial neural networks [18], have achieved satisfactory performance on balanced hyperspectral data.
However, since a hyperspectral image scene usually contains many objects of various sizes and sample labeling is difficult in the real world, class imbalance is a fundamental problem in hyperspectral image classification [19]. Generally, the majority classes are defined as the classes with a large number of instances, while the minority classes are those with a small number of samples [9]. The cost of misclassifying a minority-class sample is usually much higher than that of a majority-class sample [20]. Yet with a skewed class distribution, a classifier is inclined to predict that input instances belong to the majority class in order to keep the overall prediction accuracy high [20][21][22][23][24]. Such a strategy is not effective for distinguishing the minority classes, even though they are usually the foreground classes of interest. Therefore, one of the biggest challenges that machine learning and remote sensing face is how to classify imbalanced data effectively.
Generally, the aim of imbalance learning is to acquire a classifier that provides high classification accuracy for the minority class without heavily compromising the accuracy of the majority classes [25][26][27]. Traditionally, the class-imbalance problem has been dealt with either at the data level [28][29][30] or at the algorithm level [31][32][33][34]. Data-level methods focus on modifying the sample distribution of classes in training sets to reduce the degree of class imbalance, making the data suitable for standard classification algorithms. The most common data-level approach is resampling, whose major advantages are that no modification of the classifier is needed and the balanced data can be reused in other applications or classification tasks [35,36]. Resampling can be further divided into two types: undersampling [37] and oversampling [38].

• Undersampling methods: Undersampling alters the size of the training set by sampling a smaller majority class, which reduces the level of imbalance [37]; it is easy to perform and has been shown to be useful in imbalanced problems [39][40][41][42]. The major advantage of undersampling is that all training instances are real [35]. Random undersampling (RUS) is a popular method designed to balance the class distribution by eliminating majority-class instances randomly. However, the main disadvantage of undersampling is that it may discard potentially useful information, which could be significant for the induction process.
• Oversampling methods: Oversampling algorithms increase the number of samples either by randomly choosing instances from the minority class and appending them to the original dataset or by synthesizing new examples [43], which reduces the degree of imbalance. Random oversampling simply copies samples of the minority class, which easily leads to overfitting [44] and has little effect on improving the classification accuracy of the minority class. The synthetic minority oversampling technique (SMOTE) is a powerful algorithm that was proposed by Chawla [29] and has shown a great deal of success in various applications [45][46][47]. SMOTE is described in detail in Section 2.1.
The main idea at the algorithm level is to modify the existing classification algorithm model appropriately in combination with the actual data distribution. The typical methods include active learning [48], cost-sensitive learning [49,50], and Kernel-based learning [51].
• Active learning methods: Traditional active learning methods are utilized to deal with problems involving unlabeled training data. In recent years, various algorithms for active learning from imbalanced data have been presented [48,52,53]. Active learning is a learning strategy that selects samples from a random set of training data; it can choose more valuable instances and discard instances that carry less information, so as to enhance classification performance. The large computation cost for large datasets is the primary disadvantage of these approaches [48].
• Cost-sensitive learning methods: Cost-sensitive learning solves class imbalance problems by using different cost matrices [50]. Currently, there are three commonly used cost-sensitive strategies. (1) Cost-sensitive sample weighting: converting the cost of misclassification into sample weights on the original data set. (2) The cost-sensitive function is directly incorporated into the existing classification algorithm, which modifies the internal structure of the algorithm. (3) The cost-sensitive ensemble: cost-sensitive factors are integrated into existing classification methods and combined with ensemble learning. Nevertheless, cost-sensitive learning methods require knowledge of misclassification costs, which is hard to obtain for real-world datasets [54,55].
• Kernel-based learning methods: Kernel-based learning is founded on statistical learning theory and Vapnik-Chervonenkis (VC) dimensions [56]. The support vector machine (SVM), a typical kernel-based learning method, can obtain relatively robust classification accuracy for imbalanced data sets [51,57]. Many methods that combine sampling and ensemble techniques with SVM have been proposed [58,59] and effectively improve performance in the case of imbalanced class distribution.
For instance, a novel ensemble method, called Bagging of Extrapolation Borderline-SMOTE SVM (BEBS) was proposed to incorporate the borderline information [60]. However, as this method is based on SVM, it is difficult to implement in a large dataset.
Classification approaches using only spectral information cannot capture the crucial spatial variability of the data, which usually leads to lower performance, especially for hyperspectral data [61]. Recently, approaches based on deep learning have been developed for spectral-spatial hyperspectral classification and have exhibited high effectiveness and performance [61,62]. Deep learning is an emerging method that has achieved excellent performance in hyperspectral image classification given sufficient well-labeled data [63,64]. Generally, a deep graph structure includes a cascade of layers which consists of multiple linear and non-linear transformations. Compared with traditional machine learning approaches, deep learning methods can automatically extract informative features from the original hyperspectral dataset through a sequence of hierarchical layers [63]. In addition, deep learning has stronger robustness and higher accuracy than machine learning methods with shallower structures. However, most deep learning approaches, such as the convolutional neural network (CNN), have no algorithmic strategy for dealing with imbalanced data [63,65,66]. As the data set grows larger, the detrimental impact of class imbalance on deep learning methods grows with it. As mentioned before, the imbalance problem has been comprehensively researched for classical machine learning approaches; nevertheless, it has received less attention in the context of deep learning [66]. Besides, the training process of traditional deep learning methods generally consumes much time. The rotation-based deep forest [67], a novel deep learning method, was proposed for the classification of hyperspectral images and achieves satisfactory results with less training time. Nevertheless, this method does not address classification when the data distribution is imbalanced.
To improve the classification ability of non-ANN based deep learning approaches for imbalanced hyperspectral datasets, a novel SMOTE-Based Weighted Deep Rotation Forest (SMOTE-WDRoF) algorithm is proposed in this paper. First, the neighboring pixels of instances are introduced as spatial information and multiple new synthetic balanced datasets are created using the SMOTE algorithm. These datasets are then fed into the WDRoF model, which consists of the rotation forest and multi-level cascaded random forests. Specifically, the rotation forest is utilized to generate rotation feature vectors, which are input into the subsequent cascade forest. Moreover, the output probability of each level and the original data are stacked as the dataset of the next level, and the sample weights are automatically adjusted according to the dynamic weight function constructed from the classification results of each level. In summary, the proposed algorithm integrates the advantages of SMOTE, spatial information, and adaptive sample weights. The main contributions of this paper are as follows: (1) The proposed SMOTE-WDRoF, based on deep ensemble learning, internally combines the deep rotation forest and SMOTE; it obtains higher accuracy and faster training for imbalanced hyperspectral data. (2) The introduction of the adaptive weight function alleviates a defect of SMOTE, namely that SMOTE may generate additional noise when synthesizing new samples.
The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 presents the details of the proposed methodology. Section 4 then shows the results and discussion. Finally, conclusions are given in Section 5.

Related Works
2.1. Synthetic Minority Over-Sampling Technique (SMOTE)
SMOTE, presented by Chawla et al. [29], is the most popular oversampling approach and alleviates the overfitting problem of random oversampling. Its main idea is to randomly synthesize new minority samples through interpolation in the k-nearest neighborhood of a selected minority sample. It should be noted that the artificial samples are created in the feature space instead of the data space. The detailed process of SMOTE is as follows: (1) For each minority instance x_i, calculate its k nearest neighbors among the minority-class samples according to the Euclidean distance. (2) A neighbor x_j is randomly chosen from the k nearest neighbors of x_i.
(3) Create a new instance x_new between x_i and x_j:

x_new = x_i + δ (x_j − x_i), (1)

where δ is a random number between 0 and 1.
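The three steps above can be sketched in a few lines of numpy. This is a minimal illustration of the interpolation rule in Equation (1), not a full SMOTE implementation (no per-class sampling ratios, and the neighbor search is a brute-force distance sort):

```python
import numpy as np

def smote_samples(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by SMOTE-style
    interpolation in feature space (minimal sketch)."""
    rng = np.random.default_rng(rng)
    n = len(minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # Euclidean distances from x_i to every minority sample
        d = np.linalg.norm(minority - minority[i], axis=1)
        # indices of the k nearest neighbors, excluding x_i itself
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        delta = rng.random()  # random number in [0, 1)
        # Equation (1): x_new = x_i + delta * (x_j - x_i)
        synthetic.append(minority[i] + delta * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new = smote_samples(minority, n_new=6, k=3, rng=0)
```

Because each synthetic point lies on a segment between two existing minority samples, all generated points stay inside the convex hull of the minority class.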

Random Forest (RF)
Inspired by the bagging algorithm [68], Breiman first proposed random forests [69] in 2001. The main ideas are random sample selection and random feature selection. In RF, all trees are independent of each other, so training and testing can be carried out in parallel. Let us suppose a dataset D_m with m samples (X, Y), where X ∈ R^D. First of all, n instances are randomly selected from the original data set D_m with replacement; these instances are utilized to build the current decision tree. Second, f features (f < D) are randomly chosen from the original D features. Based on the criterion of Gini impurity or mean squared error (MSE), Classification and Regression Trees (CART) are created. Finally, the classification result is obtained according to the majority voting criterion.
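The two sources of randomness (bootstrap rows, random feature subsets) and the majority vote can be sketched as follows. For brevity a nearest-centroid rule stands in for a CART tree; the sampling and voting scheme, not the base learner, is the point of the sketch:

```python
import numpy as np

class TinyRandomForest:
    """Bootstrap sampling + random feature subsets + majority voting.
    The base learner is a nearest-centroid rule (an assumption made
    here for brevity), not a real CART tree."""

    def __init__(self, n_trees=10, n_features=1, rng=0):
        self.n_trees, self.n_features = n_trees, n_features
        self.rng = np.random.default_rng(rng)
        self.members = []  # list of (feature subset, per-class centroids)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        n, d = X.shape
        for _ in range(self.n_trees):
            rows = self.rng.integers(n, size=n)  # bootstrap with replacement
            cols = self.rng.choice(d, self.n_features, replace=False)
            Xb, yb = X[rows][:, cols], y[rows]
            cent = {c: Xb[yb == c].mean(axis=0) if np.any(yb == c)
                    else np.full(self.n_features, np.inf)
                    for c in self.classes_}
            self.members.append((cols, cent))
        return self

    def predict(self, X):
        votes = []
        for cols, cent in self.members:
            Z = X[:, cols]
            dists = np.stack([np.linalg.norm(Z - cent[c], axis=1)
                              for c in self.classes_])
            votes.append(self.classes_[np.argmin(dists, axis=0)])
        votes = np.stack(votes)  # shape (n_trees, n_samples)
        # majority vote across the ensemble
        return np.array([np.bincount(col).argmax() for col in votes.T])

X = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4],
              [5.0, 5.0], [5.3, 4.8], [4.9, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = TinyRandomForest(n_trees=15, n_features=1, rng=0).fit(X, y)
pred = clf.predict(np.array([[0.1, 0.3], [5.1, 5.0]]))
```

On the two well-separated clusters above, the ensemble vote recovers the correct labels even though each member sees only one of the two features.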

Rotation Forest (RoF)
Drawing upon the idea of RF, Rodriguez proposed RoF in 2006 [70]. Based on the idea of feature transformation, this algorithm focuses on improving both the diversity and the accuracy of the base classifiers. An RoF model of size T is constructed by implementing the following steps.
(1) Firstly, the feature space F is split into K disjoint feature subsets, each containing N = F/K features. (2) Secondly, a new training set is obtained by using the bootstrap algorithm to randomly select 75% of the training data. (3) Then, the coefficients a_{t,g} (g ≤ G, t ≤ T) are obtained by employing principal component analysis (PCA) on each subspace F_{t,g} (g ≤ G, t ≤ T), and the coefficients of all subspaces are organized in a sparse "rotation" matrix R_t (t ≤ T).
(4) The columns of R_t are rearranged to match the order of the original features F, building the rotation matrix. Then, the new training set S_t = [S R_t, Y_t] is constructed and used to train an individual classifier.
(5) Repeat the aforementioned process for all the diverse training sets to generate a series of individual classifiers. Finally, the results are obtained by the majority vote rule.
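Steps (1)-(4) for a single rotation matrix can be sketched compactly: split the features into disjoint subsets, run PCA on a 75% bootstrap per subset, and place each subset's loadings on the block diagonal. Keeping all principal components and using equal subset sizes are simplifying assumptions of this sketch:

```python
import numpy as np

def rotation_matrix(X, n_subsets, rng=0):
    """Build one Rotation-Forest-style rotation matrix (minimal sketch):
    disjoint feature subsets -> PCA per subset -> block-diagonal R."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    feats = rng.permutation(d)                 # random split of the feature space
    subsets = np.array_split(feats, n_subsets)
    R = np.zeros((d, d))
    for idx in subsets:
        # PCA on a 75% bootstrap of the rows, restricted to this subset
        rows = rng.choice(n, size=int(0.75 * n), replace=True)
        Z = X[np.ix_(rows, idx)]
        Z = Z - Z.mean(axis=0)
        cov = Z.T @ Z / max(len(Z) - 1, 1)
        _, vecs = np.linalg.eigh(cov)          # all principal components kept
        # place the subset's loadings so they match the order of the
        # original features (the column rearrangement of step (4))
        R[np.ix_(idx, idx)] = vecs
    return R

X = np.random.default_rng(1).normal(size=(40, 6))
R = rotation_matrix(X, n_subsets=3)
X_rot = X @ R  # rotated feature vectors fed to the base classifier
```

Since each PCA block is orthonormal, the assembled R is itself orthogonal, i.e., a genuine rotation of the feature space.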

Rotation-Based Deep Forest (RBDF)
As a simple deep learning model, the rotation-based deep forest (RBDF) includes L levels of random forests, and each level contains w RF models. This approach adopts the output probability of each level as a supplementary feature for the next level [67]. The RBDF model contains three steps. First, spatial information is acquired by using a sliding window to extract the neighboring pixels of the training samples. Second, the training samples and their neighboring pixels are fed into the RoF model, and each RoF generates rotation matrices and constructs the rotation feature vector. Third, the rotation feature vector is fed into an RF model to obtain the classification probability. All the classification probability vectors of level l are then averaged, and the averaged probability vector is stacked onto the original dataset as the input of the next level. Finally, the result is generated by finding the maximum classification probability.

Method
In this section, the SMOTE-WDRoF method is proposed to deal with imbalanced hyperspectral data. Firstly, the local spatial structure of instances is introduced and balanced datasets are generated by SMOTE, which allows richer information to be obtained from the hyperspectral images and alleviates class imbalance at the data level. Then, multiple levels of forests are utilized to construct the WDRoF model, which is the key ingredient of the whole algorithm. More specifically, the rotation forest is utilized to generate rotation feature vectors, which are input into the subsequent cascade forest. Moreover, the output probability of each level and the original data are stacked as the dataset of the next level, and the sample weights are automatically adjusted according to the dynamic weight function constructed from the classification results of each level. The details of the algorithm are as follows.

Spatial Information Extraction and Balanced Datasets Generation
Objects in an image usually exhibit consistent spatial structure, i.e., neighboring pixels are likely to share the same label. Consequently, spatial-contextual information should be taken into account when classifying. The proposed algorithm combines a spatial-neighborhood information extraction strategy with the SMOTE approach to select informative spatial neighbors and balance the dataset distribution, thereby increasing classification accuracy.
First, spatial information is extracted using a sliding window. Let us assume X ∈ R^{M×N×D} is the hyperspectral image, where M, N, D represent the height, width, and number of spectral bands of the image, respectively, and let a_{m,n,d} denote the value of the pixel located at line m, column n, and band d. To obtain the spectral and spatial information of the hyperspectral dataset, a patch is constructed by extracting the pixels in a window of size w_1 × w_2 × D (step size 1) around the central pixel. Supposing the spectral vector of a pixel is x ∈ R^D, the patch A_i is the w_1 × w_2 × D block of pixels centered on x. After scanning the whole hyperspectral image, K patches can be obtained, where K = (M − w_1)(N − w_2). Taking a 3 × 3 × D sliding window, for example, each sample and its 8 neighboring pixels are extracted, as shown in Figure 1a. Due to spatial similarity, each instance is generally of the same material as its spatial neighbors and the material fractions are close to each other; therefore, they share the same label. The imbalanced hyperspectral datasets {s_1, s_2, ..., s_9}, denoted S, are formed by extracting the pixels at corresponding positions in all patches and combining them with the sample labels Y.
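The window scan described above can be sketched as follows; the sketch assumes valid (no-padding) windows, so every patch lies fully inside the image:

```python
import numpy as np

def extract_patches(img, w1=3, w2=3):
    """Slide a w1 x w2 window (step 1) over an M x N x D cube and
    return the patches plus each center pixel's spectrum."""
    M, N, D = img.shape
    patches, centers = [], []
    for m in range(M - w1 + 1):
        for n in range(N - w2 + 1):
            patch = img[m:m + w1, n:n + w2, :]       # w1 x w2 x D block
            patches.append(patch)
            centers.append(patch[w1 // 2, w2 // 2])  # central pixel spectrum
    return np.array(patches), np.array(centers)

cube = np.arange(5 * 5 * 2, dtype=float).reshape(5, 5, 2)
patches, centers = extract_patches(cube)
```

For a 5 x 5 x 2 cube and a 3 x 3 window this yields nine patches; the datasets s_1, ..., s_9 of the text correspond to slicing a fixed position out of every patch.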

Figure 1. (a) Spatial information extraction; (b) synthesizing new samples for class-imbalanced data (majority class, minority class, and newly created samples).

Second, according to the proportion of majority-class instances to minority-class instances, SMOTE oversamples each imbalanced dataset s_w (w ∈ {1, ..., 9}). As shown in Figure 1b, the circles and stars stand for the majority-class samples and minority-class instances, respectively. Suppose that a new sample is created from sample x_i with T = 5: SMOTE randomly chooses one of the five nearest minority-class neighbors of x_i, say x_j. The newly synthesized instance, highlighted by the square shape, is generated between x_i and x_j by Equation (1). The balanced datasets {s_1, s_2, ..., s_9} can then be obtained.

Weighted Deep Rotation Forest (WDRoF)
In this part, we propose the WDRoF algorithm, which is shown in detail in Figure 2. This algorithm adopts a multi-level random forest cascade to classify the hyperspectral dataset. Each level of the random forest produces the classification probabilities and misclassification information of the data, which are used as guidance for the next level. More specifically, the classification probabilities form a class vector that is concatenated with the original data to constitute the input of the next level, and the classification probability of each layer is applied to all subsequent layers. Furthermore, the misclassification probability is employed to update the sample weights adaptively. When growing a new level, the performance of that level is evaluated on the test set; if there is no obvious performance gain, the training procedure finishes. Consequently, the number of RF levels is identified automatically. The implementation steps of WDRoF are as follows.
(1) The datasets {s_1, s_2, ..., s_W} that have been generated by SMOTE are fed into the RoF models, where W = w_1 × w_2, and each s_w (w ∈ {1, ..., W}) contains K instances. In RoF, we apply PCA for feature transformation, a mathematical transformation that converts a set of variables into a set of uncorrelated ones. Its goal is to obtain the projection matrix Q = [q_1, q_2, ..., q_K]. First of all, the self-correlation matrix of X is computed:

cov(X) = E[(X − E[X])(X − E[X])^T],

where E[X] is the expectation of X and [·]^T represents transposition. Second, eigendecomposition is applied on cov(X) to calculate its eigenvalues λ_1, λ_2, ..., λ_K and corresponding eigenvectors α_1, α_2, ..., α_K; the principal component coefficients are given by these eigenvectors, q_k = α_k. Construct the rotation matrix with Equation (2) and then generate the rotation feature vectors {f_1, f_2, ..., f_W} by the RoF.
(2) The rotation feature vectors {f_1, f_2, ..., f_W} are fed into the first level of the random forest and the weight of each sample Weight_{w,l−1}(x_k) is set to 1. In level l, each RF generates the classification probability and classification error information of each instance in the dataset. All the classification probability vectors P = {p_1, p_2, ..., p_W} of the level are averaged to obtain a robust estimation P̄:

p_w = (1/N_tree) Σ_{i=1}^{N_tree} h_i,   P̄ = (1/W) Σ_{w=1}^{W} p_w, (7)

where h_i represents the output of the ith decision tree and N_tree stands for the number of decision trees in an RF. In addition, according to the classification error, the weight of sample (x_k, y_k) is updated in proportion to v_w(x_k, c), the number of votes that the wth RF model casts for any other class c ≠ y_k. The weight of a sample is increased if it is misclassified by the previous level, which makes the sample play a more significant role in the next level and forces the classifier to focus its attention on the misclassified samples.
(3) In the last level, after the average probability vector is calculated, the prediction label is acquired by finding the maximum probability.
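The bookkeeping of one cascade level (average the W probability vectors, re-weight misclassified samples, and stack the averaged probabilities onto the features) can be sketched as below. The `toy_forest` stub and the weight-doubling rule are illustrative assumptions; the paper's actual update is driven by the per-forest vote counts v_w:

```python
import numpy as np

def cascade_level(X_aug, y, weights, forests):
    """One WDRoF-style cascade level (simplified sketch): average the
    W forests' class probabilities, increase the weight of samples the
    level misclassified, and stack the averaged probabilities onto the
    features as the next level's input."""
    probs = np.stack([f(X_aug, weights) for f in forests])  # (W, n, C)
    P = probs.mean(axis=0)              # averaged probability vector
    pred = P.argmax(axis=1)
    # doubling is a stand-in for the paper's vote-count-based rule
    new_w = np.where(pred != y, weights * 2.0, weights)
    X_next = np.hstack([X_aug, P])      # features + class vector
    return X_next, new_w, pred

def toy_forest(X, weights):
    """Stub standing in for a random forest: a logistic score on the
    first feature; the sample weights are ignored here."""
    p1 = 1.0 / (1.0 + np.exp(-X[:, 0]))
    return np.stack([1.0 - p1, p1], axis=1)

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
X_next, w_next, pred = cascade_level(X, y, np.ones(4), [toy_forest, toy_forest])
```

Since every sample is classified correctly in this toy run, the weights stay at 1 and the next level receives the original feature plus the two averaged class probabilities.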

Within each level, each RF computes its classification probability p_w; the averaged probability vector P̄ is then obtained with Equation (7) and concatenated with the input feature vector to constitute the input of the next level. The final output is the prediction label y* = argmax_c Σ_{w=1}^{W} I(v_w(x_k) = c), c ∈ {1, 2, ..., C}.

Datasets
Four hyperspectral images (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes) with a high imbalance ratio (IR), including the Indian Pines AVRIS, Kennedy Space Center (KSC), Salinas, and University of Pavia scenes, are adopted to assess the effectiveness of the proposed WDRoF. For the sake of assessing the performance of the classification algorithms objectively, the training data and the test data should be independent. For Indian Pines AVRIS and KSC, 30% of the samples of each class are randomly selected to construct the training set, and the remaining 70% of the samples from each class constitute the test set. For the Salinas and University of Pavia scenes, 5% of the samples of each class are chosen to construct the training set, and the remaining samples constitute the test set. Furthermore, if the number of samples in a certain class is less than 100, half of the samples in that class are selected for training and the remaining half for testing. More detailed information on the numbers of training and testing instances is listed in Table 1.

Experiment Settings
In order to demonstrate the advantages of the proposed SMOTE-WDRoF, six popular methods, SVM, RF, RoF, SMOTE combined rotation forest (SMOTE-RoF), convolutional neural network (CNN) [71], and RBDF, are utilized in the comparative analysis. The settings of the methods are introduced as follows. In the proposed SMOTE-WDRoF, each RF also contains 20 trees and 20 features are included in each sample subset for RoF. In addition, for Indian Pines AVRIS and Kennedy Space Center (KSC), 7 × 7 neighborhood pixels are utilized for classification in RBDF and SMOTE-WDRoF; for the Salinas and University of Pavia scenes, these two algorithms use 5 × 5 neighborhood pixels. All the programs are implemented in Python. The results are generated on a PC equipped with an Intel(R) Core(TM) i5-10200H CPU at 2.4 GHz.

Assessment Metric
Because the Overall Accuracy (OA) can reflect the overall classification performance of the classifier, it is often adopted to evaluate traditional machine learning classification algorithms. However, when there is a serious imbalance between the data classes, the classification model may be strongly biased towards the majority classes, which results in poor recognition of the minority classes. Therefore, OA is not the most appropriate index to evaluate the model since it might result in inaccurate conclusions [72]. Consequently, this paper adopts five main metrics as performance measures, including the precision, average accuracy, Recall, F-measure, and Kappa.
• Precision: Precision is employed to measure the classification accuracy of each class in the imbalanced data. The precision_i measures the prediction rate over the samples predicted as class i:

precision_i = m_ii / Σ_j m_ji,

where m_ii and m_ji stand for the number of true predictions of the ith class and the number of samples of the jth class falsely predicted as the ith class, respectively.
• Average Accuracy (AA): As a performance metric, AA gives the same weight to each of the classes in the data, independently of the number of instances it has. It is defined as the mean of the per-class classification accuracies.
• Recall: The true positive rate, referred to as Recall, denotes the percentage of instances of a class that are correctly classified. Recall is particularly suitable for evaluating classification algorithms that deal with multiple imbalanced classes [73]. It can be computed as

Recall_i = m_ii / Σ_j m_ij,

where m_ij stands for the number of samples of the ith class falsely predicted as the jth class.
• F-measure: The F-measure, an evaluation index obtained by integrating precision and Recall, has been widely used in imbalanced data classification [55,74,75]. In classification, precision is expected to be as high as possible, and Recall is likewise expected to be as large as possible; in some cases, however, the two metrics are negatively correlated. The F-measure synthesizes the two, and the higher the F-measure, the better the performance of the classifier. It can be calculated as

F-measure_i = 2 · precision_i · Recall_i / (precision_i + Recall_i).

• Kappa: Kappa assesses the consistency of the predicted results and checks whether that consistency could be caused by chance; the higher the Kappa, the better the performance of the classifier. With n total samples, Kappa can be defined as

Kappa = (p_o − p_e) / (1 − p_e),   p_e = (Σ_i p_i · p̂_i) / n²,

where p_o is the observed overall agreement and p_i and p̂_i stand for the actual sample size of class i and the predicted sample size of class i, respectively.
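All five measures can be computed from a single confusion matrix. A compact sketch follows; taking AA as the unweighted mean of the per-class recalls is one common convention and an assumption of this sketch:

```python
import numpy as np

def imbalance_metrics(cm):
    """Per-class precision/recall, F-measure, AA, and Cohen's kappa
    from a confusion matrix cm[i, j] = true class i predicted as j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # column sums: predicted as class i
    recall = tp / cm.sum(axis=1)      # row sums: actual class i
    f1 = 2 * precision * recall / (precision + recall)
    aa = recall.mean()                # unweighted mean over classes
    n = cm.sum()
    po = tp.sum() / n                 # observed agreement (overall accuracy)
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (po - pe) / (1 - pe)
    return precision, recall, f1, aa, kappa

cm = [[50, 2], [5, 10]]               # a small imbalanced two-class example
precision, recall, f1, aa, kappa = imbalance_metrics(cm)
```

On this example the majority class dominates overall accuracy (60/67), while the per-class recalls (50/52 and 10/15) and kappa expose the weaker minority-class performance, which is exactly why these metrics are preferred over OA here.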

Performance Comparative Analysis
In the experiments, the results acquired according to precision, AA, Recall, F-measure, and Kappa are exhibited in Tables 2-5 for SVM, RF, RoF, SMOTE-RoF, CNN, RBDF and the proposed SMOTE-WDRoF on the four imbalanced hyperspectral datasets. The best results in each hyperspectral dataset are highlighted in bold.

Experimental Results on Indian Pines AVRIS
The results of the seven algorithms for Indian Pines AVRIS are listed in Table 2. The first 16 rows give the per-class precision, while the AA, Recall, F-measure, and Kappa coefficients are shown in the last four rows. Among the seven methods, SMOTE-WDRoF achieves the best classification performance in most cases, because it not only introduces spatial neighborhood pixels and synthesizes samples to increase the sample size and balance the dataset but also adjusts the sample weights adaptively. The proposed method obtains an AA of 91.55%, Recall of 91.67%, F-measure of 91.51%, and Kappa of 88.64%, which are the best classification results among the seven methods. Compared with the other methods, SMOTE-WDRoF gains at least 2.61% in AA, 1.90% in Recall, 3.30% in F-measure, and 2.29% in Kappa. Moreover, the SMOTE-WDRoF algorithm obtains 10 of the highest class accuracies among the 16 classes in all. Besides, for the class with the least number of training samples, namely Class 9, the accuracy of the proposed algorithm reaches 96.39%, which is at least 14.50% and at most 53.30% higher than the other methods. The proposed algorithm is thus superior to the other methods both in the precision of the minority classes and in overall performance. Figure 3 shows the classification maps obtained by the different classification methods for Indian Pines AVRIS; the proposed SMOTE-WDRoF acquires the best performance on this dataset.

Experimental Results on KSC
For the KSC dataset, the statistical classification results are summarized in Table 3, and the classification results of the different methods are shown in Figure 4. As can be observed in Table 3, SMOTE-WDRoF is superior to the other six comparison methods thanks to its balanced data set generation and multi-level forest feature learning. For the KSC data containing 13 classes, SMOTE-WDRoF obtains the highest classification accuracy in 10 classes, including multiple minority classes such as Class 2, Class 4, and Class 7. Furthermore, among all the methods, SMOTE-WDRoF acquires the best statistical results in terms of the AA, Recall, F-measure, and Kappa, improving the four metrics by at least 3.63%, 5.20%, 4.54%, and 3.36%, respectively. Although the RF and RoF algorithms achieve 100.00% accuracy for Class 16, they are far less effective than SMOTE-WDRoF on the other measures, especially for the minority classes. In addition, although the SMOTE-RoF algorithm balances the dataset by synthesizing new samples, its classification performance is worse than that of SMOTE-WDRoF. It is also worth noting that the SVM algorithm is the worst performer, as it pays no attention to the recognition of the minority classes, and its classification accuracy for Class 7 is 0. These results demonstrate that the proposed SMOTE-WDRoF has the best classification performance when processing the KSC dataset.

Experimental Results on Salinas
The classification results of the seven different methods on the Salinas dataset are shown in Table 4. SMOTE-WDRoF is superior to the other six comparison methods, acquiring an AA of 95.92%, Recall of 96.05%, F-measure of 95.73%, and Kappa of 91.01%. In addition, SMOTE-WDRoF obtains the highest accuracy for half of the classes on the Salinas dataset. For the two classes with the least number of training samples, namely Class 13 and Class 14, the precision of SMOTE-WDRoF reaches 97.92% and 98.81%, respectively, which proves its ability to handle the minority classes better than the other comparison methods; although SMOTE-RoF has the highest accuracy on these two classes, its performance on the other classes is not superior. The corresponding classification maps on this data set are illustrated in Figure 5. The experimental results on this dataset testify that SMOTE-WDRoF shows better classification performance than the traditional methods when dealing with class-imbalanced data.

Experimental Results on University of Pavia scenes
The results for the proposed SMOTE-WDRoF and the six comparison methods on the University of Pavia ROSIS are exhibited in Table 5. Compared with the other methods, SMOTE-WDRoF improves the classification performance by creating new samples to construct a balanced dataset and automatically updating the sample weights based on the classification error information. The proposed SMOTE-WDRoF surpasses RBDF by 2.59%, 2.21%, and 2.32% in terms of Recall, F-measure, and Kappa. Although the AA of the RBDF algorithm is slightly higher than that of SMOTE-WDRoF, its F-measure, which synthesizes precision and Recall, is significantly lower. For the minority classes, such as Class 5 and Class 7, SMOTE-WDRoF performs better than CNN, RBDF, and the other four traditional methods. For visual comparison, Figure 6 shows the classification maps of all these methods; the proposed method exhibits the best result with the least noise. It is evident that SMOTE-WDRoF obtains the best performance on the University of Pavia ROSIS dataset.

Training Time of Different Deep Learning Methods
The training times of CNN and SMOTE-WDRoF are shown in Table 6. The CNN must continuously adjust its parameters through backpropagation to achieve good performance; consequently, a large number of parameters have to be computed in a time-consuming training process. Unlike traditional deep learning methods that require backpropagation, SMOTE-WDRoF needs much less training time.

Influence of the Number of Levels
To study the influence of the number of levels on SMOTE-WDRoF, Figure 7 presents the evolution of AA and Recall on Indian Pines AVRIS, KSC, Salinas, and University of Pavia ROSIS. Similar to traditional deep models, the deep forest structure of SMOTE-WDRoF is essential for improving classification performance. When the output of each level is stacked with the original features as the input of the next level, the sample weights are adjusted accordingly; consequently, the classification accuracy improves as the number of levels grows. As can be seen from Figure 7a, the AA of all four hyperspectral datasets increases significantly as the level grows from 1 to 3. At level 4 the growth of AA gradually slows, and when the number of levels exceeds 5, the AA of the four datasets reaches a stable value: 91.55%, 91.87%, 95.44%, and 88.37% for Indian Pines AVRIS, KSC, Salinas, and University of Pavia ROSIS, respectively. The evolution of Recall on the four datasets is shown in Figure 7b. Recall increases greatly at first and then converges to a relatively stable value as the number of levels grows. When the level is set to 5, the stable values are 91.67%, 92.40%, 96.05%, and 91.28% on Indian Pines AVRIS, KSC, Salinas, and University of Pavia ROSIS, respectively. These results demonstrate that when there are too many levels in the proposed model, the output of the last several levels can no longer provide helpful information for classification.
Therefore, statistically better performance is achieved when L equals 5, and the number of levels is set to 5 in the other experiments.
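The level-by-level mechanism described above can be sketched roughly as follows. This is our simplified illustration built on scikit-learn random forests: the rotation-forest front end and the paper's exact dynamic weight function are not reproduced; misclassified samples are simply up-weighted by a fixed factor at each level, which stands in for the adaptive re-weighting.

```python
# Simplified cascade sketch (ours): each level trains a forest on the
# original features stacked with the previous level's class
# probabilities, then re-weights misclassified samples for the next
# level. The paper's exact weight function is not reproduced.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cascade_fit_predict(X, y, levels=5, seed=0):
    w = np.ones(len(y))
    feats = X
    pred = None
    for lv in range(levels):
        rf = RandomForestClassifier(n_estimators=50, random_state=seed + lv)
        rf.fit(feats, y, sample_weight=w)
        proba = rf.predict_proba(feats)
        pred = rf.classes_[proba.argmax(axis=1)]
        w = np.where(pred == y, w, w * 2.0)   # boost misclassified samples
        w = w / w.sum() * len(y)              # renormalize weights
        feats = np.hstack([X, proba])         # stack probabilities as new features
    return pred
```

The key structural point matches the text: the probability vector of level l is appended to the original features before level l + 1 is trained, so later levels can correct the errors that earlier levels emphasized through the weights.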

Influence of the Window Size
Due to the spatial homogeneity of hyperspectral images, neighboring samples are likely to belong to the same class. Consequently, neighboring pixels are introduced as local spatial information through a sliding window in SMOTE-WDRoF. To study the influence of window size on classification accuracy, we vary this parameter from 1 × 1 × D to 7 × 7 × D on the four hyperspectral datasets, thereby introducing different numbers of spatial neighbor pixels; D denotes the number of bands of the hyperspectral data, namely 220, 176, 224, and 103 for Indian Pines AVRIS, KSC, Salinas, and University of Pavia ROSIS, respectively. The results with different window sizes are shown in Figure 8. As the window size increases, the classification accuracy also presents an upward trend. More specifically, for Indian Pines AVRIS, AA, Recall, F-measure, and Kappa increase from 87.69%, 71.27%, 74.81%, and 85.05% to 91.71%, 91.12%, 91.29%, and 88.41%, respectively, when the window size is changed from 1 × 1 × 220 to 7 × 7 × 220, and the highest precision on Indian Pines AVRIS is obtained at 7 × 7 × 220. For KSC, the three indexes Recall, AA, and F-measure reach their highest values at 7 × 7 × 176, while Kappa first rises and then falls, achieving its highest value at 5 × 5 × 176. For Salinas, the highest precision is obtained at 5 × 5 × 224, after which the precision hardly increases with further expansion of the window. In addition, SMOTE-WDRoF with a window size of 5 × 5 × 103 delivers the best performance on University of Pavia ROSIS. This phenomenon is not surprising: a relatively large window introduces more useful spatial information, which benefits classification performance. However, if the window is too large, samples that do not belong to the same class as the central pixel will be included, which decreases accuracy.
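The sliding-window extraction above can be sketched as follows. This is a minimal version of ours, assuming edge padding so that border pixels also receive a full w × w × D neighborhood; the paper does not specify its border handling.

```python
# Sketch (ours): extract a w x w x D neighbourhood for every pixel of a
# hyperspectral cube. Edge padding (an assumption) keeps border pixels
# from losing part of their window.
import numpy as np

def extract_patches(cube, window=5):
    # cube: H x W x D hyperspectral image
    r = window // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="edge")
    H, W, D = cube.shape
    patches = np.empty((H * W, window, window, D), dtype=cube.dtype)
    k = 0
    for i in range(H):
        for j in range(W):
            patches[k] = padded[i:i + window, j:j + window, :]
            k += 1
    return patches  # each patch flattens to a w*w*D feature vector
```

Flattening each patch yields the w × w × D feature vectors the experiments vary, e.g. 7 × 7 × 220 = 10,780 features per pixel on Indian Pines AVRIS.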

Conclusions
In this paper, the SMOTE-Based Weighted Deep Rotation Forest (SMOTE-WDRoF) algorithm is proposed for imbalanced hyperspectral data classification. First, the local spatial structure of samples is extracted to enrich the data information, and balanced datasets are built by SMOTE. Second, RoF and a multi-layer cascade RF form the WDRoF model, which uses the output probability of each layer as a supplementary feature of the next layer and updates the sample weights adaptively to improve classification performance. The proposed method is validated on four public hyperspectral image datasets. Compared with traditional deep learning models, SMOTE-WDRoF consumes much less training time. Experimental results show that the proposed SMOTE-WDRoF is effective for dealing with multi-class imbalanced data and significantly outperforms SVM, RF, RoF, SMOTE-RoF, CNN, and RBDF. Besides, parameter analysis has also been carried out, and the results demonstrate the advantages of our algorithm in terms of accuracy and robustness.