Article

A Deep Random Forest Model with Symmetry Analysis for Hyperspectral Image Data Classification Based on Feature Importance

1 The College of Computer, National University of Defense Technology, Changsha 410073, China
2 Beijing Institution of Remote Sensing Equipment, Beijing 100039, China
3 School of Information Mechanics and Sensing Engineering, Xidian University, Xi’an 710071, China
4 Xi’an Key Laboratory of Advanced Remote Sensing, Xi’an 710071, China
5 Shaanxi Innovation Center for Multi-Source Fusion Detection and Recognition, Xi’an 710071, China
6 Hangzhou Institute of Technology, Xidian University, Hangzhou 311200, China
7 School of Electronic Engineering, Xidian University, Xi’an 710071, China
8 Laboratory of Information Processing and Transmission (L2TI), Institut Galilée, University Paris XIII, Sorbonne Paris Cité, 93430 Villetaneuse, France
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(12), 2172; https://doi.org/10.3390/sym17122172
Submission received: 7 November 2025 / Revised: 30 November 2025 / Accepted: 11 December 2025 / Published: 17 December 2025
(This article belongs to the Section Computer)

Abstract

Hyperspectral imagery (HSI), as a core data carrier in remote sensing, plays a crucial role in many fields, yet it also faces numerous challenges, including the curse of dimensionality, noise interference, and small sample sizes. These problems severely affect the generalization ability and classification accuracy of traditional machine learning and deep learning algorithms. Existing solutions suffer from bottlenecks such as unknown cost matrices and excessive computational overhead, and ensemble learning fails to fully exploit the deep semantic features and feature-importance relationships of high-dimensional data. To address these issues, this paper proposes a dual-ensemble classification framework (DRF-FI) based on feature importance analysis and a deep random forest. The method integrates feature selection with two-layer ensemble learning: it first identifies discriminative spectral bands through feature importance quantification, then constructs balanced training subsets through random oversampling, and finally integrates four different ensemble strategies. Experimental results on three benchmark hyperspectral datasets demonstrate that DRF-FI performs strongly across multiple datasets, particularly in handling highly imbalanced data. Compared to a traditional random forest, the proposed method achieves stable improvements in both overall accuracy (OA) and average accuracy (AA); on specific datasets, OA and AA were improved by up to 0.84% and 1.24%, respectively. This provides an effective solution to the class imbalance problem in hyperspectral images.

1. Introduction

Hyperspectral images (HSI) are widely regarded as essential in the realm of remote sensing [1]. These high-dimensional images provide abundant spatial–spectral information and have been widely applied in domains such as land use surveys, environmental monitoring, and mineral exploration. Nevertheless, several important issues in hyperspectral image classification cannot be ignored. First, although hyperspectral data offer rich, exploitable features, the “curse of dimensionality” caused by the high data dimension cannot be overlooked [2]. As the number of dimensions grows, the volume of the feature space expands exponentially: training efficiency plummets, generalization deteriorates, computational cost rises significantly, and overfitting may occur. This makes hyperspectral data more challenging to process than many other classification tasks. In addition, hyperspectral images are inherently vulnerable to multiple noise sources, such as random noise, striping artifacts, and dead pixels; while hyperspectral imaging captures both spectral and spatial information, noise further increases the difficulty of processing. In hyperspectral data processing, it is also of great significance to explore symmetry in depth. In high-dimensional, small-sample hyperspectral data, the symmetry of spectral features manifests as interrelationships between bands: some bands may show symmetrical response patterns across different categories of ground objects, i.e., the spectral signatures of certain ground objects under specific band combinations are similar. This similarity reflects a symmetry in the intrinsic attributes of ground objects and plays a key role in identifying them.
HSI is well-suited for deep learning and machine learning, as these models learn discriminative features from the data automatically [3,4]. Machine learning has become a fundamental methodology in HSI research, facilitating the automated extraction of discriminative features and hidden data structures [5]. From the methodological standpoint, classification techniques are broadly categorized as either supervised or unsupervised [6]. The absence of label dependency makes unsupervised approaches—such as K-means clustering and graph-based partitioning—particularly effective for exploratory analysis of data with poorly defined or unknown structural patterns [7]. But these models tend to be unstable, yield low classification accuracy, and often fail to match actual categories. Moreover, in the absence of prior knowledge about the relationship between classes, class partitioning becomes indeterminate [8].
Classical supervised classification is typically realized through well-established models such as Support Vector Machines (SVM), Random Forests (RF), and k-Nearest Neighbors (k-NN), alongside probabilistic approaches like Naive Bayes and Logistic Regression. The key advantage of these is their capacity to achieve high classification accuracy and effectively learn discriminative features when provided with sufficient labeled training data. However, they rely heavily on large volumes of high-quality labeled samples, are sensitive to small samples, and can be negatively affected by high dimensionality. Despite their superior performance, they require labeled training data [9,10,11]. For instance, algorithms like RF and other ensemble methods are often biased toward the majority classes in imbalanced datasets. While effective for high-dimensional and small-sample problems, the SVM algorithm is highly sensitive to parameter tuning and kernel selection. Intuitive and simple distance-based methods like the k-NN algorithm perform poorly on high-dimensional data and are sensitive to noise. Furthermore, Naive Bayes, despite its computational efficiency, operates on a strong assumption of feature independence, a condition rarely met by the highly correlated spectral bands in HSI. Consequently, these traditional classifiers often struggle to deliver optimal performance when faced with the combined challenges of high dimensionality, class imbalance, and the intricate data structures inherent to hyperspectral imagery [12,13].
In hyperspectral image classification research, traditional machine learning methods have long dominated. The rapid advancement of deep learning has led to its widespread adoption in hyperspectral image classification, enabling a shift from handcrafted feature extraction toward end-to-end learning and overcoming key limitations of conventional approaches. The current mainstream deep learning methods are mainly based on convolutional neural networks [14,15,16], graph convolutional networks [17,18], and Transformer architectures [19]. Among these approaches, 3D-CNNs capture joint spatial–spectral representations via stacked 3D convolutional and max-pooling operations, thereby mitigating the curse of dimensionality inherent in hyperspectral data [20]. Subsequently, to balance efficiency and performance, hybrid 2D-3D CNN architectures [21] emerged. The MRGAT architecture [22] proposed by Ding et al., which captures local–global semantic relationships using 1D-CNNs and Graph Attention Networks (GAT), offers advantages but relies on precise superpixel construction and incurs high computational costs. The Convolutional–Spectral Space Transformer (CMSST) network designed by Chen et al. [23] enables information flow and fusion across different scales and modalities, yet involves complex architecture and a large parameter scale.
The Transformer architecture takes this further, leveraging its powerful global context modeling capabilities to overcome the limitation of convolutional neural networks that only consider local information [24]. The Transformer-based WaveFormer developed by Ahmad et al. [25] can capture multi-scale spectral–spatial dependencies, but Transformer models require high computational power and large amounts of labeled data. Hu et al. [26] proposed a Hybrid Convolutional Attention Network (HCANet) that combines the strengths of CNNs and Transformers to jointly model global and local features, thereby effectively enhancing the denoising performance of hyperspectral images. However, it remains fundamentally a computationally complex “black-box” model with low interpretability, and it heavily relies on large amounts of labeled data for end-to-end training.
Meanwhile, alternative deep architectures have emerged to capture complex data characteristics more effectively. For instance, networks based on Long Short-Term Memory (LSTM) are adept at extracting discriminative spectral–spatial features, enabling robust crop classification even with incomplete data sequences [27]. Spectral–spatial fractal residual networks mitigate sample imbalance through data balancing enhancement [28], yet their complex network architecture relies heavily on large amounts of labeled data. The Higher-Order Deformable Differential Convolutional Network (HorD2CN) explicitly models feature interactions through higher-order convolutional blocks [29], but its black-box nature makes the decision process difficult to interpret. In a different approach, deep forest frameworks perform hierarchical feature learning through a cascade of random forests [30]. These models iteratively refine the feature representation at each level to construct a robust representation without relying on gradient-based training [31]. Tong et al. [32] proposed a Spectral–Spatial Depth Random Forest (SSDRF) method that leverages more precise spatial information by combining fixed-size image patches with shape-adaptive superpixels. However, its “depth” primarily manifests in input feature engineering rather than deepening the feature representation itself. In the absence of an iterative refinement process, the method remains constrained by hand-engineered feature representation and limited model depth, hindering its ability to learn hierarchical patterns. A growing body of research highlights that deep neural networks face persistent challenges in adapting to complex and heterogeneous data distributions, as well as in capturing discriminative features under limited supervision [33]. 
These methods still fall short in achieving deep feature learning while maintaining good interpretability, and it remains challenging to realize deeper representation learning under lightweight computational constraints.
Algorithm-level methods modify the learning algorithm to adapt to the true class distribution [5]. Notable strategies in this context include cost-sensitive training frameworks [34], which account for class imbalance through weighted loss functions, and active learning paradigms [35], designed to reduce annotation effort by selecting the most informative samples. Cost-sensitive learning addresses class imbalance by assigning higher penalty weights to misclassifications of minority class samples during training. However, in real-world applications, cost matrices are typically unknown, making accurate estimation of true costs difficult. Active learning, on the other hand, selects informative instances and discards low-information samples to improve classification performance, though this often results in high computational costs [36]. Among algorithm-level techniques, ensemble methods are the most widely used. They combine multiple base learners to build a strong predictive model [37,38,39] and are among the most effective strategies [13,39,40]. Unlike feature selection or data sampling, ensemble methods integrate various sampling and optimization strategies to mitigate data imbalance indirectly. Wang et al. proposed a hybrid framework called Sample-and-Feature Selection Hybrid Ensemble Learning [41]. Qin et al. integrated a wrapper method based on particle swarm optimization with Adaptive Boosting to propose an ensemble-based strategy for handling class imbalance. Cmv et al. [40] designed a novel sequential ensemble framework that partitions the majority class into disjoint subsets to train multiple weak learners without sacrificing accuracy. Building on these insights, this paper proposes a novel method that combines ensemble learning with Deep Random Forest (DRF). The ensemble component enhances robustness and generalization, while DRF captures complex feature representations through its deep-layered structure. 
This integration fully leverages the advantages of both technologies, thereby improving classification performance, especially in situations where high-dimensional, small samples are difficult to identify.
To address the limitations of existing feature-level and algorithm-level solutions, this study proposes a Deep Random Forest based on Feature Importance (DRF-FI), which is specifically designed for high-dimensional, small-sample hyperspectral data. The DRF-FI method integrates feature selection and ensemble learning. It first computes spectral feature importance scores to select informative bands, then constructs multiple balanced training subsets using random oversampling, and finally trains an ensemble of Random Forest classifiers.
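The balanced-subset step described above can be illustrated with plain NumPy. This is a minimal sketch of random oversampling under our own helper name (`random_oversample` is not from the paper's code), re-drawing minority-class samples with replacement until every class matches the majority count:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Balance a training set by randomly re-drawing minority-class samples
    with replacement until every class matches the majority-class count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        # draw the shortfall with replacement (empty draw for the majority class)
        extra = rng.choice(c_idx, size=n_max - len(c_idx), replace=True)
        idx.append(np.concatenate([c_idx, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]
```

Because only existing pixels are duplicated, the true spectral distribution of each class is preserved, which is the property the paper cites for preferring random oversampling over synthetic methods such as SMOTE.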
In our experiments, we compare the proposed DRF-FI method with a series of baseline ensemble models, including Boosting, Bagging, Random Forest (RF), and Deep Random Forest (DRF). To ensure a fair comparison, we adopt a two-stage framework: first, features are extracted using either RF or DRF. Then, each of the four ensemble methods is applied to the extracted features. This results in a total of eight experimental configurations, as shown in the table header structure. The performance of all methods is assessed using three well-established benchmark datasets: Indian Pines, University of Pavia, and Salinas. Performance is measured using Overall Accuracy (OA), Average Accuracy (AA), and min-Recall. Experimental results show that the proposed DRF-FI consistently surpasses all baseline methods, especially on data characterized by elevated feature dimensionality and pronounced class skew. These results validate the effectiveness and robustness of the DRF-FI method in real-world hyperspectral image classification tasks.

2. Related Works

2.1. Adaboost

Boosting is a central approach in ensemble learning, comprising a class of algorithms designed to upgrade weak learners into strong ones. Among them, AdaBoost [38] is widely regarded as the prototypical and most influential method. Its main procedure is outlined in Algorithm 1.
Algorithm 1: AdaBoost
Input: Training data set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
Base learning model $\zeta$;
Number of boosting iterations $T$.
Initialization: Initialize sample weights $D_1(x) = \frac{1}{m}$ for all $x \in S$.
Iterative Process:
for $t = 1$ to $T$ do
1. Train the weak classifier: $h_t = \zeta(S, D_t)$;
2. Compute the weighted classification error:
                $\varepsilon_t = P_{x \sim D_t}\left(h_t(x) \neq y\right)$;
3. if $\varepsilon_t > 0.5$ then exit the loop;
4. Compute the classifier weight:
                $\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}$;
5. Update the sample weights according to the rule:
                $D_{t+1}(x) = \frac{D_t(x) \cdot e^{-\alpha_t \, y \, h_t(x)}}{Z_t}$
      where $Z_t$ is the normalization constant ensuring that the sum of $D_{t+1}$ equals 1.
end
Output: The final classifier is given by:
                $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
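As a concrete illustration, Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch for the binary case (labels in {−1, +1}), using scikit-learn decision stumps as the base learner $\zeta$; the function names and data are our own illustration, not part of the original method:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=20):
    """Algorithm 1 (AdaBoost) for binary labels y in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D_1(x) = 1/m
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()             # weighted error eps_t
        if eps > 0.5:                        # no better than chance: stop
            break
        eps = max(eps, 1e-12)                # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)    # up-weight misclassified samples
        D /= D.sum()                         # normalize by Z_t
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    return np.sign(sum(a * h.predict(X) for h, a in zip(learners, alphas)))
```

The weight update concentrates subsequent stumps on the samples the current ensemble gets wrong, which is the mechanism that upgrades weak learners into a strong one.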

2.2. Bagging

The bagging technique, short for Bootstrap Aggregating [40], relies on two core components: bootstrap sampling and aggregation [42]. Algorithm 2 provides a summary of the bagging process. In bagging, a bootstrap sample is generated by randomly sampling the training data with replacement. For prediction, the test instance is passed through the base classifiers, and their outputs are aggregated using a majority voting strategy, where the most frequent output becomes the final prediction [42].
Algorithm 2: Bagging
Input: Training data set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
Base learning method $\zeta$;
Number of bagging iterations $T$.
Iterative Process:
for $t = 1$ to $T$ do
1. Draw a bootstrap sample $D_{bs}$ from $S$;
2. Train a weak classifier $h_t = \zeta(D_{bs})$;
end
Output: The ensemble classifier is:
                $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} h_t(x)\right)$
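Algorithm 2 can be sketched compactly as well, with full decision trees as the base method $\zeta$ and majority voting over integer class labels (the helper names and data here are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=15, seed=0):
    """Algorithm 2: train T base classifiers, each on a bootstrap sample of S."""
    rng = np.random.default_rng(seed)
    m = len(y)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)     # draw m indices with replacement
        learners.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Aggregate by majority vote: the most frequent output becomes the prediction.
    Assumes non-negative integer class labels (required by np.bincount)."""
    votes = np.stack([h.predict(X) for h in learners])          # shape (T, n)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```

Each tree sees a different bootstrap sample, so the vote averages away much of the variance of any single tree.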

2.3. Random Forest

Random Forest, proposed by Leo Breiman, is an ensemble learning algorithm based on decision trees. Its core lies in improving generalization performance by constructing multiple decision trees and fusing their prediction results. It ensures the diversity of base learners through two randomization strategies: first, generating different training subsets via Bootstrap Sampling; second, when splitting each decision tree node, randomly selecting a subset of features and choosing the optimal feature for splitting [43], which effectively avoids the overfitting problem of a single decision tree.
In prediction, for classification tasks, results are determined by majority voting, while for regression tasks, the average of predictions from multiple trees is taken. RF requires no complex hyperparameter tuning, has strong robustness to noisy data and outliers [44], and can evaluate feature importance, making it widely used in classification, regression, and feature selection tasks. However, its performance may be inferior to deep models on high-dimensional data. Algorithm 3 summarizes the procedure of Random Forest.
Algorithm 3: Random Forest
Input: Data set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
Base decision tree algorithm $\zeta$;
Number of decision trees $T$;
Number of features sampled at each split $k$.
Initialization: $\text{Forest} = \emptyset$.
Iterative process:
for $t = 1$ to $T$ do
1. Bootstrap Sampling: Draw a sample set $S_t$ of size $m$ with replacement from $S$;
2. Build Decision Tree: Use $\zeta$ to train a decision tree $h_t$ on $S_t$. At each node split, randomly
select $k$ features and choose the optimal one for splitting.
end
Output: For classification: $H(x) = \text{MajorityVote}\left(\{h_1(x), h_2(x), \ldots, h_T(x)\}\right)$
               For regression: $H(x) = \text{Average}\left(\{h_1(x), h_2(x), \ldots, h_T(x)\}\right)$
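In practice, the impurity-based feature importance mentioned above is exposed directly by library implementations. Below is a brief sketch using scikit-learn's `RandomForestClassifier` on synthetic "spectral" data; the band indices and data are illustrative assumptions, not from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# 300 samples, 20 "bands"; only bands 3 and 7 carry class information
X = rng.normal(size=(300, 20))
y = (X[:, 3] + X[:, 7] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)

imp = rf.feature_importances_        # impurity-decrease importance per band
ranking = np.argsort(imp)[::-1]      # informative bands should rank near the top
print(ranking[:5])
```

Bands 3 and 7 should receive clearly above-average importance scores; this is the same per-band signal that DRF-FI later exploits for spectral band selection.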

2.4. Deep Forest

Deep Forest [43] is an ensemble learning model built on a cascade forest, whose main idea is to iteratively extract and combine features through multiple rounds of random forests or extremely randomized trees. The model’s two fundamental components are the cascade structure and feature augmentation. In this approach, each cascade level consists of several forest groups that produce class probability vectors for the input samples. This probability vector is then concatenated with the original features to form a new feature representation, which is passed to the next cascade level. For test samples, Deep Forest aggregates the outputs of the forests across all cascade levels and finally applies majority voting to determine the predicted class (or averaging to determine the predicted value). Algorithm 4 summarizes the procedure of Deep Forest.
Algorithm 4: Deep Forest
Input: Data set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
Base decision tree algorithm $\zeta$;
Number of cascade levels $L$;
Number of random forests in each level $T$;
Number of features sampled at each tree split $k$.
Initialization:
Split $S$ into training set $S_{\text{train}}$ and validation set $S_{\text{val}}$;
Initialize the cascade structure:
$\text{Cascade} = [\text{Level}_1, \text{Level}_2, \ldots, \text{Level}_L]$, where each level contains $T$ random forests.
Iterative process:
for $l = 1$ to $L$ do
1. Train Random Forests in Current Level:
For each random forest $RF_t$ in $\text{Level}_l$:
Use $\zeta$ to train $RF_t$ on the current data representation.
2. Extract Prediction Features:
For each sample $x$, collect the prediction probabilities (or predictions) from all forests
  in $\text{Level}_l$, and concatenate these with the original features to form a new feature
  representation.
3. Early Stopping Check:
If the performance on $S_{\text{val}}$ stops improving, break the cascade loop.
end
Output: The final prediction is given by the aggregated output of the last cascade level.
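The cascade loop of Algorithm 4 can be sketched as follows. This minimal version uses scikit-learn forests, averages the class probabilities within a level, and augments the original features with them; unlike the full model, it omits early stopping and the cross-validated probability generation used in practice, and all names here are our own illustration. It assumes integer labels 0..C−1 so that `argmax` over probability columns maps back to labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cascade_forest(X_tr, y_tr, X_val, y_val, L=3, T=2, seed=0):
    """Run L cascade levels of T forests each; return per-level validation accuracy."""
    aug_tr, aug_val = X_tr, X_val
    accs = []
    for l in range(L):
        forests = [
            RandomForestClassifier(n_estimators=50, random_state=seed + 10 * l + t)
            .fit(aug_tr, y_tr)
            for t in range(T)
        ]
        # class-probability vectors produced by the current level
        P_tr = np.mean([f.predict_proba(aug_tr) for f in forests], axis=0)
        P_val = np.mean([f.predict_proba(aug_val) for f in forests], axis=0)
        accs.append(float((P_val.argmax(axis=1) == y_val).mean()))
        # feature augmentation: original features + level-l probabilities
        aug_tr = np.hstack([X_tr, P_tr])
        aug_val = np.hstack([X_val, P_val])
    return accs
```

Each level therefore re-learns from the original features plus the previous level's class beliefs, which is the cascade's substitute for gradient-based depth.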

3. Proposed Method

To further enhance the feature representation capability and classification performance of random forests on complex, high-dimensional, and imbalanced datasets, this study proposes a Deep Random Forest model based on feature importance analysis (DRF-FI). The proposed method introduces a layer-wise iterative learning mechanism on top of the conventional random forest framework, through which deeper feature representations are progressively extracted by repeated prediction and pseudo-label expansion. Resampling and reweighting methods such as SMOTE, ADASYN, and focal loss were not adopted in this study, because they are incompatible with the prediction-label/feature-importance mechanism of DRF-FI, may disrupt the spectral structure, or are inapplicable to non-gradient models. Random oversampling preserves the true spectral distribution and is therefore better suited to the DRF-FI framework.
The core concept of the Deep Random Forest lies in the iterative refinement of the training data. During each iteration, the classification task employs a randomly constructed forest-based ensemble to learn from the updated training dataset, yielding final hard label assignments. These hard labels represent the classification decisions made by the previous layer’s model, transmitting information between layers in this manner. This mechanism enables subsequent layers to simultaneously leverage both the original data distribution and the collective decision history of preceding layers. The newly constructed dataset is subsequently utilized as the input for the next round of random forest training.
In the proposed DRF-FI framework, the deep hierarchical structure is primarily achieved through multiple rounds of random forest training and pseudo-label generation. Specifically, the labeled data are initially divided into two disjoint subsets designated for model learning and performance evaluation, respectively. In the k-th stage, an ensemble of decision trees is constructed from the high-level features extracted at the preceding (k − 1)-th stage, utilizing the training subset. Crisp class predictions for both subsets are subsequently derived through majority voting mechanisms. The resulting label outputs are then fused with the initial spectral–spatial features to yield an augmented feature space, serving as the input representation for the next hierarchical level.
To prevent unlimited iteration and error accumulation, this study adopts a bounded-depth strategy coupled with adaptive layer selection. For each dataset, a maximum depth $K_{\max}$ is predefined (set to 5 in our experiments), and the overall accuracy $OA_k$ of each layer is evaluated on the hold-out validation subset. Among all candidate layers, the one achieving peak validation accuracy, denoted as
$k^* = \arg\max_{1 \le k \le K_{\max}} OA_k,$
is selected as the “optimal layer.” All subsequent feature importance analysis and feature selection are conducted exclusively within the feature space of this optimal layer. Since labels are only generated on the training/validation data, and the test set participates neither in iterative training nor in label updating, no training–test information leakage is introduced.
After several iterations, the model produces a comprehensive training dataset that contains both the original features and the multi-level pseudo-label information. This enriched dataset, after iteration, is sorted according to feature importance. Then, these data are used as input to the ensemble classification algorithm according to a certain ratio to obtain the final classification results. Such an architecture enables the model to progressively exploit the high-level semantic information captured by successive random forest learners, thereby improving the separability of the data and enhancing overall classification accuracy.
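The layer-wise loop described above (train a forest, append its hard labels as new feature columns, repeat up to $K_{\max}$, and keep the layer with the best validation accuracy) can be sketched as follows. The helper names and synthetic data are our own illustration of the mechanism, not the paper's exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def drf_layers(X_tr, y_tr, X_val, y_val, K_max=5, seed=0):
    """Bounded-depth DRF loop; returns the optimal layer k* and per-layer accuracy."""
    aug_tr, aug_val = X_tr, X_val
    accs = []
    for k in range(K_max):
        rf = RandomForestClassifier(n_estimators=100, random_state=seed + k)
        rf.fit(aug_tr, y_tr)
        accs.append(float((rf.predict(aug_val) == y_val).mean()))
        # hard-label augmentation: features become [X, y_hat(1), ..., y_hat(k)]
        aug_tr = np.hstack([aug_tr, rf.predict(aug_tr)[:, None]])
        aug_val = np.hstack([aug_val, rf.predict(aug_val)[:, None]])
    k_star = int(np.argmax(accs))        # layer with peak validation accuracy
    return k_star, accs
```

Only validation accuracy guides the choice of $k^*$; the test set never enters `fit`, matching the no-leakage argument above.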

3.1. Feature Importance in Deep Random Forests

In the proposed DRF-FI framework, feature extraction is performed through a multi-level iterative learning process, which extends the traditional random forest feature importance estimation. Let the initial training set be denoted as
$S^{(0)} = \{(X, Y)\} = \{(x_i, y_i)\}_{i=1}^{n}.$
At each iteration $k$, a random forest $RF_k$ is trained on the current dataset $S^{(k-1)}$. After training, the labels predicted by $RF_k$ are denoted as $\hat{Y}^{(k)} = RF_k(X^{(k-1)})$. The predicted label outputs are integrated with the initial labeled data to construct an expanded training corpus:
$S^{(k)} = \left\{\left(X, \left[Y, \hat{Y}^{(1)}, \hat{Y}^{(2)}, \ldots, \hat{Y}^{(k)}\right]\right)\right\}.$
It is worth noting that each layer of the random forest does not output labels for self-training, but rather contributes intermediate representational features similar to stacking to subsequent layers. Consequently, the final feature importance is calculated based on actual labels rather than hard labels themselves, thereby avoiding the overfitting issue inherent in typical pseudo label learning—where the model predicts and then re-trains itself.
For each random forest $RF_k$, let $v$ index an individual tree in the ensemble, and let $N_v^{(k)}$ represent the count of non-leaf nodes in the $v$-th tree at layer $k$. Each node $n$ in the tree corresponds to a subset of samples with indices $K_{v,n}^{(k)}$, feature vectors $x_i^{v,k}$, and labels $y_i^{v,k}$. At node $n$, the samples are split by feature $f_{v,q}^{(k)}$ and threshold $s_t^{(k)}$:
$K_{v,n}^{-(k)} = \left\{ i \in K_{v,n}^{(k)} \,\middle|\, x_{i, f_{v,q}^{(k)}}^{v,k} \le s_t^{(k)} \right\}, \qquad K_{v,n}^{+(k)} = \left\{ i \in K_{v,n}^{(k)} \,\middle|\, x_{i, f_{v,q}^{(k)}}^{v,k} > s_t^{(k)} \right\}.$
The class probabilities at node $n$ and its two children are computed as
$p_{v,n,c}^{(k)} = \frac{1}{|K_{v,n}^{(k)}|} \sum_{i \in K_{v,n}^{(k)}} \mathbb{I}\left(y_i^{v,k} = c\right), \qquad p_{v,n,c}^{-(k)} = \frac{1}{|K_{v,n}^{-(k)}|} \sum_{i \in K_{v,n}^{-(k)}} \mathbb{I}\left(y_i^{v,k} = c\right), \qquad p_{v,n,c}^{+(k)} = \frac{1}{|K_{v,n}^{+(k)}|} \sum_{i \in K_{v,n}^{+(k)}} \mathbb{I}\left(y_i^{v,k} = c\right).$
The core of feature importance calculation lies in quantifying the purity gain achieved by splitting a node. This is measured by the reduction in Shannon entropy [45], a fundamental metric of impurity or uncertainty in information theory. For a given node $n$, its entropy $H(n)$ is defined as
$H(n) = -\sum_{c=1}^{C} p_{v,n,c} \log_2 p_{v,n,c}.$
Here, $C$ denotes the overall number of semantic categories, while $p_{v,n,c}$ represents the fraction of training instances assigned to category $c$ within node $n$ of the $v$-th tree (as computed in Equation (5)). A lower entropy value indicates a purer node (samples predominantly belong to a single class), while a higher value indicates greater mixing of classes.
The significance of a partition at node $n$ induced by feature $f_{v,q}^{(k)}$ is quantified via information gain, computed as the reduction in Shannon entropy from the parent node to its descendants, weighted by the sample distribution:
$\text{Gain}\left(f_{v,q}^{(k)}, n\right) = H(n) - \frac{|K_{v,n}^{-}|}{|K_{v,n}|} H\left(n^{-}\right) - \frac{|K_{v,n}^{+}|}{|K_{v,n}|} H\left(n^{+}\right).$
Here, $H(n^{-})$ and $H(n^{+})$ are the entropies of the left and right child nodes, respectively, and the ratios $\frac{|K_{v,n}^{-}|}{|K_{v,n}|}$ and $\frac{|K_{v,n}^{+}|}{|K_{v,n}|}$ are weights representing the proportion of samples routed to each child. A larger gain signifies that the feature split effectively reduces uncertainty and increases the purity of the resulting subsets.
Therefore, the feature importance $\Theta_{v,n}^{(k)}(f)$ for feature $f$ at node $n$ (Equation (8)) is precisely this information gain, but calculated exclusively when the splitting feature is $f$ (enforced by the indicator function $\mathbb{I}\left(f = f_{v,q}^{(k)}\right)$).
The feature importance of $f_{v,q}^{(k)}$ at node $n$ in layer $k$ is defined as the reduction in entropy caused by the split:
$\Theta_{v,n}^{(k)}(f) = \mathbb{I}\left(f = f_{v,q}^{(k)}\right) \left[ \sum_{c=1}^{C} p_{v,n,c}^{(k)} \log_2 \frac{1}{p_{v,n,c}^{(k)}} - \frac{|K_{v,n}^{-}|}{|K_{v,n}|} \sum_{c=1}^{C} p_{v,n,c}^{-(k)} \log_2 \frac{1}{p_{v,n,c}^{-(k)}} - \frac{|K_{v,n}^{+}|}{|K_{v,n}|} \sum_{c=1}^{C} p_{v,n,c}^{+(k)} \log_2 \frac{1}{p_{v,n,c}^{+(k)}} \right].$
When averaging $\Theta_{v,n}^{(k)}$, each node's weight is determined by the relative number of samples it handles:
$p_{v,n}^{(k)} = \frac{|K_{v,n}^{(k)}|}{N_v^{(k)}}.$
Accordingly, the feature importance of the $k$-th layer random forest is expressed as
$\Theta^{(k)}(f) = \frac{1}{V_k} \sum_{v=1}^{V_k} \sum_{n=1}^{N_v^{(k)}} p_{v,n}^{(k)} \, \Theta_{v,n}^{(k)}(f),$
where $V_k$ is the number of decision trees in layer $k$.
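The node entropy and split-gain quantities underlying this importance score can be checked numerically. A small sketch (the function names are ours, not the paper's):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(n) of a node's label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def split_gain(y_parent, y_left, y_right):
    """Information gain of a split: parent entropy minus the
    sample-weighted entropies of the two child nodes."""
    n = len(y_parent)
    return (entropy(y_parent)
            - len(y_left) / n * entropy(y_left)
            - len(y_right) / n * entropy(y_right))

# Worked example: a perfectly separating split of a balanced binary node
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(entropy(y))                      # 1.0 bit for a 50/50 node
print(split_gain(y, y[:4], y[4:]))     # 1.0: both children are pure
```

A split on an uninformative feature would send mixed labels to both children and yield a gain near zero, which is why such features accumulate low importance.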
It is crucial to emphasize that our feature importance calculation is based on the hard labels generated by the multi-layer random forests, rather than probabilistic soft labels. As random forests possess inherent robustness to label noise, the use of hard labels does not amplify label uncertainty. Furthermore, DRF-FI computes feature importance solely at the optimal iteration layer $k^*$, rather than accumulating scores across layers. This strategy avoids the potential bias transfer that might occur in the early stages of pseudo-label generation.
In each layer of the DRF, we first compute the classification accuracy on the validation set and select the best-performing layer $k^*$. Since the predictive labels from different layers represent varying semantic depths, cross-layer fusion could lead to inconsistencies in feature importance measurement; therefore, this paper adopts a “best-layer importance” strategy. Subsequently, within layer $k^*$, feature importance is calculated based on the entropy reduction at the nodes across all categories, and a weighted aggregation is performed over all trees within that layer to obtain the final global feature importance ranking. Because this node entropy-reduction importance is derived through a weighted average across trees and nodes, it does not rely on the prediction quality of any single pseudo-label. Moreover, the importance ranking is only used to select the top-K optimal features for dimensionality reduction, and the feature compression process itself helps suppress overfitting. In summary, the use of labels in DRF-FI does not lead the feature importance into a typical overfitting pattern; instead, it forms an interpretable and stable feature selection mechanism.
Finally, the overall feature importance in the Deep Random Forest is derived exclusively from the optimal layer $k^*$ identified during the iterative process:
$\Theta_{\text{DRF}}(f) = \Theta^{(k^*)}(f).$
The proposed feature extraction scheme thus integrates the hierarchical structure of the Deep Random Forest with entropy-based feature importance evaluation at the optimal layer $k^*$. This “best-layer importance” strategy directly leverages the most discriminative and stable feature representations, avoiding inconsistencies from cross-layer merging. The number of iteration layers is very limited (for example, at most 3–4 layers), so the kind of error accumulation seen over dozens of layers in deep semi-supervised networks does not arise. Random forests are inherently less sensitive to a small amount of label noise, and a few erroneous pseudo-labels will not immediately destroy the overall structure. Our architecture further mitigates error propagation through the optimal-layer selection mechanism, avoiding the accumulation of uncertainties from earlier layers. Deep networks can use softmax probabilities to set confidence thresholds, whereas the output probabilities of random forests are often less stable than hard voting in high-dimensional, small-sample HSI settings. Additionally, the predicted labels are hard labels obtained by training a random forest on the training set; predicted labels on the test set are used only for forward inference and performance evaluation and do not participate in any model training. Therefore, from the perspective of whether predicted labels feed test information back into training, there is no training–test data leakage.

3.2. Double Ensemble Classification Based on Deep Random Forests

Building upon the ensemble feature extraction scheme described above, the proposed DRF-FI framework is further extended into an ensemble classification algorithm. After multi-layer pseudo-label integration, the final feature matrix
$X^{(K)} = [X, \hat{Y}^{(1)}, \hat{Y}^{(2)}, \ldots, \hat{Y}^{(K)}]$
is obtained, which encodes both the low-level input features and the high-level semantic information derived from the pseudo-label propagation process. This augmented matrix, restricted to the selected feature proportion, serves as input to the final ensemble classification stage, where multiple ensemble strategies are incorporated for robust decision fusion and improved generalization. Specifically, four ensemble configurations are considered:
  • Random Forest: The baseline ensemble model consisting of multiple decision trees trained on bootstrapped subsets of the feature space. The ultimate class label is determined by aggregating individual tree outputs via a majority rule, while the random subspace strategy promotes diversity among constituent models.
  • Deep Random Forest: An extension of the standard RF that performs layer-wise iterative learning. Each layer refines the feature representation by integrating pseudo-labels from the preceding stage, thus progressively capturing higher-order feature interactions before final aggregation.
  • Bagging-based DRF Ensemble: In this configuration, multiple DRF models are trained on different bootstrap samples of the training set. The final output is produced by averaging or voting over the predictions of all sub-models:
$\hat{Y}_{\mathrm{Bagging}} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{DRF}^{(m)}\big(X^{(K)}\big),$
    where M denotes the number of base DRF learners. This approach effectively reduces variance and mitigates overfitting while maintaining the depth-aware representation capability.
  • Boosting-based DRF Ensemble: To further enhance discriminative power, a sequential ensemble of DRF models is constructed where each learner focuses on the misclassified samples of its predecessors. The final prediction is expressed as a weighted combination:
$\hat{Y}_{\mathrm{Boosting}} = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m \, \mathrm{DRF}^{(m)}\big(X^{(K)}\big) \right),$
    where α m represents the adaptive weight determined by the performance of the m-th DRF learner. This mechanism improves classification accuracy through iterative error correction and feature reweighting across layers.
By integrating the proposed iterative feature extraction process with diverse ensemble learning strategies, the framework achieves a balance between representation richness and classification robustness. In particular, the DRF-based Bagging and Boosting ensembles effectively exploit the multi-level structural learning of the DRF model, providing improved generalization on high-dimensional and imbalanced datasets compared with conventional ensemble methods.
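The two fusion rules above can be illustrated with plain vote aggregators. The following Python sketch is purely illustrative (the paper's own experiments use R's adabag): `bagging_drf_predict` and `boosting_drf_predict` are hypothetical helper names, each sub-model stands in for one trained DRF, and integer class labels 0..L−1 are assumed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bagging_drf_predict(models, X):
    """Majority vote over M sub-models (the Y_Bagging rule above)."""
    votes = np.stack([m.predict(X) for m in models])       # shape (M, n_samples)
    # per-sample majority class across the M sub-model predictions
    return np.array([np.bincount(col).argmax() for col in votes.T])

def boosting_drf_predict(models, alphas, X, n_classes):
    """Alpha-weighted vote (multi-class analogue of the Y_Boosting rule)."""
    scores = np.zeros((X.shape[0], n_classes))
    for model, alpha in zip(models, alphas):
        pred = model.predict(X)
        scores[np.arange(X.shape[0]), pred] += alpha       # accumulate alpha_m
    return scores.argmax(axis=1)
```

In the boosting variant, higher-weighted sub-models (larger alpha_m, reflecting lower training error) dominate the final decision, matching the adaptive-weight description above.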
The process of model training and pseudo-label integration in the Deep Random Forest Model based on the Feature Importance Analysis method can be summarized in Algorithm 5:
Algorithm 5: Deep Random Forest (DRF) with Ensemble Feature Importance Calculation
Inputs:
  • Training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$;
  • K: number of iterations (layers);
  • Base learner: Random Forest $RF$;
  • Ensemble classifier type $F_{\mathrm{ens}} \in \{\mathrm{RF}, \mathrm{DRF}, \mathrm{Bagging}, \mathrm{Boosting}\}$;
  • Feature selection ratio $\alpha \in (0, 1]$.
Process:
  • Initialization: set $S^{(0)} = S = \{(X, Y)\}$.
  • For $k = 1$ to $K$ do
    (a) Train base random forest: train a random forest $RF_k$ on $S^{(k-1)}$.
    (b) Generate pseudo-labels: obtain predictions for all samples, $\hat{Y}^{(k)} = RF_k(X^{(k-1)})$.
    (c) Construct augmented dataset: concatenate ground-truth and pseudo-labels to form $S^{(k)} = \{(X, [Y, \hat{Y}^{(1)}, \hat{Y}^{(2)}, \ldots, \hat{Y}^{(k)}])\}$.
  • Feature importance calculation and ranking:
    (a) Compute feature importance scores $I = \{I_1, I_2, \ldots, I_d\}$ for all $d$ features in $S^{(K)}$ using the trained random forests.
    (b) Sort the features in descending order of importance: $I_{(1)} \geq I_{(2)} \geq \cdots \geq I_{(d)}$.
    (c) Select the top $\alpha \cdot d$ features to form the reduced feature set $X^{(K)}_{\mathrm{selected}}$.
  • Final ensemble classification with feature selection: using the selected representation $X^{(K)}_{\mathrm{selected}}$, train an ensemble classifier $F_{\mathrm{ens}}$ on the reduced dataset $S^{(K)}_{\mathrm{selected}} = \{(X^{(K)}_{\mathrm{selected}}, Y)\}$ according to the selected strategy:
    • If $F_{\mathrm{ens}} = \mathrm{RF}$: train a standard random forest;
    • If $F_{\mathrm{ens}} = \mathrm{DRF}$: perform additional hierarchical DRF learning;
    • If $F_{\mathrm{ens}} = \mathrm{Bagging}$: aggregate multiple DRF models by averaging or voting;
    • If $F_{\mathrm{ens}} = \mathrm{Boosting}$: construct a sequential ensemble with adaptive weighting.
  • Prediction: obtain the final outputs $\hat{Y}_{\mathrm{final}} = F_{\mathrm{ens}}(X^{(K)}_{\mathrm{selected}})$.
End
Output: ensemble classifier $F_{\mathrm{ens}}$ that integrates multi-level pseudo-label representations with feature selection and enhanced discriminative learning.
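Algorithm 5 can be condensed into the following Python sketch (illustrative only; the paper's experiments use R, and the function and variable names here are hypothetical). The sketch shows the K-layer pseudo-label augmentation, importance ranking, top-(α·d) selection, and the F_ens = RF branch of the final classification step:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def drf_fi_train(X, y, K=3, alpha=0.4, seed=0):
    """Sketch of Algorithm 5 with F_ens = RF: K-layer pseudo-label
    augmentation, importance ranking, top-(alpha*d) selection, final RF."""
    d = X.shape[1]
    Xk, forests = X, []
    for k in range(K):                                    # layer-wise loop
        rf = RandomForestClassifier(n_estimators=100, random_state=seed + k)
        rf.fit(Xk, y)
        forests.append(rf)
        Xk = np.column_stack([Xk, rf.predict(Xk)])        # S^(k) augmentation
    # aggregate the original-band importances over the trained forests
    imp = np.mean([f.feature_importances_[:d] for f in forests], axis=0)
    keep = np.argsort(imp)[::-1][:max(1, int(alpha * d))] # top alpha*d bands
    final = RandomForestClassifier(n_estimators=100, random_state=seed)
    final.fit(X[:, keep], y)                              # F_ens = RF branch
    return final, keep
```

For the DRF, Bagging, or Boosting branches, only the last two lines change: the final RF is replaced by another layer-wise DRF, or by a bagged/boosted ensemble of DRF sub-models.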

4. Experimental Study

To comprehensively evaluate the performance of the proposed DRF algorithm, a comparative study is conducted across three major experimental groups: (1) the Raw group, which uses the original data without preprocessing; (2) the RF-based feature extraction group, where Random Forest feature importance is employed for data preprocessing; and (3) the DRF-based feature extraction group, where DRF feature importance is utilized for preprocessing. Each group includes four ensemble classifiers (Boosting, Bagging, Random Forest (RF), and Deep Random Forest (DRF)), yielding a total of 12 combinations (e.g., Raw_Boosting, RF_DRF). All ensemble classifiers were implemented with 100 decision trees. For AdaBoost, Bagging, and Random Forest, we retained the default parameter settings of the corresponding R libraries (randomForest and adabag). Each algorithm was executed ten times independently, and the final performance values were obtained by averaging the results across these runs.

4.1. Evaluative Performance Metrics

Accuracy is widely used as a metric for evaluating classifier performance, but it has been shown to be inappropriate for studies involving imbalanced data. Therefore, we adopt AA, OA, F-measure, G-mean, and m-recall as performance measures in our experiments [5].
  • Notations. Assume an $L$-class classification problem and let $n_{ij}$ denote the number of samples whose true label is class $i$ but which are predicted as class $j$. Let $\mathrm{sum}(S)$ denote the total number of samples in the evaluated set $S$.
  • Overall accuracy (OA). OA is an overall correctness indicator that can be derived from the per-class recalls weighted by class size:
    $\mathrm{OA} = \frac{\sum_{i=1}^{L} Recall_i \cdot \sum_{j=1}^{L} n_{ij}}{\mathrm{sum}(S)} = \frac{\sum_{i=1}^{L} n_{ii}}{\mathrm{sum}(S)}$
    where the recall of class $i$ (also called class-wise accuracy) is computed from the confusion counts as
    $Recall_i = \frac{n_{ii}}{\sum_{j=1}^{L} n_{ij}}$
  • Average accuracy (AA). To avoid dominance by majority classes, AA assigns identical importance to each class by averaging the recalls:
    $\mathrm{AA} = \frac{1}{L} \sum_{i=1}^{L} Recall_i$
  • F-measure. For imbalanced learning, we additionally report an F-type score that combines the macro-averaged recall and precision:
    $F\text{-}measure = \frac{2}{L} \cdot \frac{\sum_{i=1}^{L} Recall_i \cdot \sum_{i=1}^{L} Precision_i}{\sum_{i=1}^{L} Recall_i + \sum_{i=1}^{L} Precision_i}$
    where the class-$i$ precision is defined by $Precision_i = \frac{n_{ii}}{\sum_{j=1}^{L} n_{ji}}$.
  • G-mean. As another imbalance-sensitive criterion, G-mean summarizes the recalls through a geometric aggregation:
    $G\text{-}mean = \left( \prod_{i=1}^{L} Recall_i \right)^{1/L}$
  • M-recall. To provide a macro-level view of sensitivity over all categories, we compute M-recall as the mean of the class-wise recall values:
    $M\text{-}recall = \frac{1}{L} \sum_{i=1}^{L} Recall_i$
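All five measures can be computed directly from a confusion matrix. The following Python sketch (illustrative; class-absent rows or columns would require zero-division guards) assumes a matrix C with C[i, j] = n_ij:

```python
import numpy as np

def imbalance_metrics(C):
    """Metrics of Section 4.1 from a confusion matrix C, with
    C[i, j] = n_ij (true class i predicted as class j)."""
    C = np.asarray(C, dtype=float)
    L = C.shape[0]
    recall = np.diag(C) / C.sum(axis=1)      # Recall_i = n_ii / sum_j n_ij
    precision = np.diag(C) / C.sum(axis=0)   # Precision_i = n_ii / sum_j n_ji
    oa = np.trace(C) / C.sum()               # overall accuracy
    aa = recall.mean()                       # average accuracy
    f = (2.0 / L) * recall.sum() * precision.sum() / (recall.sum() + precision.sum())
    gmean = recall.prod() ** (1.0 / L)       # geometric mean of recalls
    m_recall = recall.mean()                 # macro recall (equals AA here)
    return dict(OA=oa, AA=aa, F=f, Gmean=gmean, M_recall=m_recall)
```

For example, for a two-class matrix [[8, 2], [1, 9]], the recalls are 0.8 and 0.9, giving OA = AA = 0.85 and G-mean = sqrt(0.72).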

4.2. Data Information

We randomly partition each dataset into two disjoint subsets for model development and evaluation, i.e., a training set and a test set. To illustrate the split protocol, Table 1 lists the corresponding statistics when 5% of the labeled (reference) samples are selected for training and the remaining samples are assigned to testing.
To examine the effect of feature retention, we considered a set of candidate ratios (5%, 10%, 20%, 40%, 60%, 80%, and 100%). For each candidate, a single Random Forest layer was executed and the AA score was recorded. The ratio yielding the highest AA was taken as the best feature proportion.
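This single-layer, AA-driven selection of the feature retention ratio can be sketched as follows (a minimal Python illustration with scikit-learn; the paper's experiments use R, and `best_feature_ratio` is a hypothetical helper name, assuming an importance vector computed beforehand):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score  # stands in for AA

def best_feature_ratio(X_tr, y_tr, X_val, y_val, importance,
                       ratios=(0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00)):
    """Run one RF per candidate ratio on the top-ranked bands and
    return the ratio achieving the highest validation AA."""
    order = np.argsort(importance)[::-1]     # bands ranked by importance
    best_aa, best_r = -1.0, None
    for r in ratios:
        k = max(1, int(r * X_tr.shape[1]))   # number of retained bands
        cols = order[:k]
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(X_tr[:, cols], y_tr)
        aa = balanced_accuracy_score(y_val, rf.predict(X_val[:, cols]))
        if aa > best_aa:
            best_aa, best_r = aa, r
    return best_r, best_aa
```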

4.3. Results and Analysis

This section reports the empirical evaluation of the baseline models (Raw and RF) together with the proposed DRF-FI framework. The main aims of this section are summarized below:
  • Examine how the DRF-FI model performs on the benchmark hyperspectral datasets.
  • Compare the DRF-FI approach against several data sampling strategies to assess its robustness.
  • Investigate the sensitivity of the DRF-FI model to key parameter settings.
The feature importance figure (Figure 1) is plotted with the feature indices on the x-axis and their corresponding importance values on the y-axis. Each value on the x-axis represents an individual spectral band, while the magnitude of the associated y-axis value reflects the relative contribution of that band to the classification process. A higher importance value indicates that the corresponding band plays a more influential role in distinguishing between classes within the hyperspectral data.
Table 2, Table 3 and Table 4 present the experimental results in terms of average accuracy (AA), overall accuracy (OA), F-measure, G-mean, and m-recall for the twelve ensemble configurations. These configurations are organized into three main groups (Raw, RF, and DRF), each including four ensemble variants (Boosting, Bagging, RF, and DRF), resulting in combinations such as RF_Boosting and DRF_RF. The Raw group corresponds to the baseline models that do not employ any feature importance information, whereas the RF and DRF groups incorporate feature importance derived from the standard Random Forest and from the proposed deep random forest feature importance (DRF-FI) scheme, respectively.
All experiments were conducted on three widely used hyperspectral datasets: Indian Pines (AVIRIS), University of Pavia (ROSIS), and Salinas. For clarity, the highest scores in each table are marked in bold.
From these results, it can be observed that the DRF-based ensembles achieve statistically significant improvements over both the Raw and RF groups, demonstrating that feature importance derived from DRF can effectively enhance the discriminative capability of ensemble models on high-dimensional imbalanced hyperspectral data.
Specifically, on the Indian Pines (AVIRIS) dataset, the DRF-based ensembles achieved noticeable improvements over the Raw and RF groups: the overall accuracy (OA) increased by approximately 14%, 9%, 1%, and 1%, and the average accuracy (AA) by about 33%, 25%, 1%, and 2%, respectively. Similarly, on the University of Pavia (ROSIS) dataset, the OA increased by approximately 13%, 10%, 1%, and 1%, and the AA by around 31%, 19%, 1%, and 1%. On the Salinas dataset, the DRF models continued to exhibit gains: the OA increased by roughly 1%, 5%, 1%, and 1%, and the AA by about 1%, 6%, 1%, and 1%.
In the adabag package, the implementation of Bagging requires explicit bootstrap sampling and data replication at each iteration, whereas Boosting performs resampling implicitly through iterative weight updates. Under the same number of base learners and identical rpart control parameters, this implementation-level difference results in slightly longer running times for Bagging, even though the theoretical computational complexity of the two methods is comparable. Additionally, it is possible that on the evaluated datasets, the optimal feature subset size (Best_K) selected during Bagging is larger than that of Boosting. If Bagging consistently operates with a higher number of active feature dimensions during later iterations, the associated computational burden would naturally increase, further contributing to its longer execution time.
Furthermore, the results for F-measure, G-mean, and M-recall across all three datasets consistently confirm the superior and more robust performance of the DRF-based ensembles relative to the other methods.
Although a few negative t-values appear in Table 5, Table 6 and Table 7, these cases are primarily attributable to the ordering of the paired comparisons rather than to a systematic underperformance of the proposed method. Overall, DRF_RF and DRF_DRF outperform the competing algorithms in most pairwise evaluations; the isolated instances in which they yield slightly lower values reflect dataset-specific variation rather than a consistent weakness of the framework.
To facilitate a qualitative assessment, we present classification maps for the Raw method, RF, and the proposed DRF-FI (see Figure 2, Figure 3, Figure 4 and Figure 5; Figure 2 shows the classification of the raw data before processing). The maps are generated by the twelve competing classifiers on the three benchmark scenes, namely Indian Pines (AVIRIS), University of Pavia (ROSIS), and Salinas. Visual inspection indicates that DRF produces cleaner thematic maps than Raw and RF. In particular, it better preserves minority-class regions while keeping the overall performance stable; this effect is especially pronounced on the Indian Pines data. These observations are consistent with the quantitative results in the tables above.
The computational complexity of the proposed DRF-FI framework arises primarily from two components: the feature importance estimation stage and the construction of the double-ensemble model. The time complexity of building a single decision tree is $O(N \log N \times D)$, where $N$ is the number of training samples and $D$ denotes the feature dimensionality, so constructing a random forest of $T$ base learners costs $O(N \log N \times D \times T)$. Because DRF-FI employs only a single random forest to estimate feature importance, this step incurs no ensemble-level overhead beyond $O(N \log N \times D \times T)$.
In the subsequent double-ensemble phase, several DRF models are executed in parallel. Although this stacked design adds extra learning procedures, the feature dimensionality of this stage is greatly compressed to $D'$ ($D' < D$) through DRF-based feature selection, so the computational burden of the second stage becomes $O(N \log N \times D' \times T^2)$.
Since each balanced subset retains all positive examples and the number of negative samples is approximately twice that of the positives, the overall computational complexity of the full framework can be summarized as
$O\big(N \log N \times (D \cdot T + D' \cdot T^2) + N\big).$
While this complexity is moderately higher than that of a standard random forest, such an increase is typical when addressing class-imbalanced learning tasks. More importantly, the substantial reduction in the effective feature dimension from $D$ to $D'$ lowers the computational burden compared with conventional imbalance-handling approaches that operate in the original high-dimensional spectral space.
With respect to space complexity, memory consumption is dominated by the storage of the $T$ tree-based learners in both stages of the framework. As each learner in the second stage stores only the $D'$ selected features instead of the full $D$-dimensional space, the space requirement is bounded by
$O(T \times D').$
This linear memory growth is significantly more efficient than that of feature-agnostic ensemble methods, where each tree operates over all original features. Overall, the proposed DRF-FI achieves a favorable balance between computational efficiency and classification performance, and remains scalable to large-scale hyperspectral data despite its double-ensemble design.

4.4. Parameter Analysis

To examine how the number of trees and the size of the feature subset influence the performance of the proposed approach, we evaluate the average accuracy, overall accuracy, and m-recall on the Indian Pines (AVIRIS), University of Pavia (ROSIS), and Salinas datasets. The candidate numbers of trees are selected from (10, 40, 70, 100, 130, 160, 190), with T = 100 used as the default ensemble size elsewhere, while the tested feature counts are chosen from (10, 20, 40, 80, 120, 160, 200). Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11 show that DRF attains its optimal performance with a feature size of 160 for the Indian Pines (AVIRIS) dataset, 130 for the University of Pavia (ROSIS) dataset, and 60 for Salinas. These results demonstrate that DRF can maintain high predictive capability even with a substantially reduced feature set, thereby easing the computational load.
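Such a two-dimensional sweep can be sketched as a simple double loop that records the validation AA surface over (number of trees, number of top-ranked features). This Python illustration is a stand-in for the full protocol (the paper's experiments use R; `sweep_trees_features` and the use of a plain RF as the learner are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score  # stands in for AA

def sweep_trees_features(X_tr, y_tr, X_val, y_val, order,
                         tree_grid=(10, 40, 70, 100, 130, 160, 190),
                         feat_grid=(10, 20, 40, 80, 120, 160, 200)):
    """AA surface over (number of trees, number of top-ranked features);
    `order` is the importance-sorted band index array."""
    aa = np.zeros((len(tree_grid), len(feat_grid)))
    for i, t in enumerate(tree_grid):
        for j, f in enumerate(feat_grid):
            cols = order[:min(f, X_tr.shape[1])]   # cap at available bands
            rf = RandomForestClassifier(n_estimators=t, random_state=0)
            rf.fit(X_tr[:, cols], y_tr)
            aa[i, j] = balanced_accuracy_score(y_val, rf.predict(X_val[:, cols]))
    return aa
```

The argmax of the returned surface gives the best (trees, features) pair for a given scene.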

5. Discussion

  • The experimental results presented in Section 4 demonstrate that the proposed DRF-FI framework substantially improves the classification of multi-class imbalanced hyperspectral data. Beyond the numerical advantages reflected in AA, OA, F-measure, and G-mean across the three benchmark datasets, several underlying mechanisms explain these consistent gains. First, by integrating feature importance-guided feature selection with oversampling-based ensemble learning, DRF-FI enhances the discriminative capacity of the spectral bands most relevant to minority classes. As shown in Figure 1, the importance values across spectral bands vary significantly, especially for the Indian Pines and Salinas datasets, indicating that feature redundancy and noise are non-negligible. By selecting the most informative spectral features before training, DRF-FI lowers the within-class overlap and reduces the bias that noise may introduce. As a result, the ensemble becomes more stable when the data distribution is skewed, and the model handles such imbalance more effectively. To further analyze the class-wise behavior of the competing methods, we present the confusion matrices of DRF, RF_DRF, and DRF_DRF for the three hyperspectral datasets in Figure 12, Figure 13 and Figure 14. Compared with DRF and RF_DRF, the proposed DRF_DRF exhibits fewer off-diagonal errors, especially for minority classes, indicating improved discrimination under severe class imbalance.
  • The class-wise analysis reveals that the performance improvement is not uniformly distributed across categories. The greatest gains are consistently observed in minority classes, which traditionally pose the greatest challenges for standard RF, Bagging, and Boosting methods. This trend is evident in Table 2, Table 3 and Table 4, where DRF-FI substantially elevates the recall and F-measure of small classes without sacrificing majority-class accuracy. The adaptive oversampling mechanism ensures that minority samples are sufficiently represented in each bootstrap subset, enabling the base classifiers to better capture their decision boundaries. Nevertheless, certain extremely scarce classes, such as Oats in the Indian Pines dataset and several highly fragmented categories in the University of Pavia scene, still exhibit limited improvement. This suggests that oversampling alone cannot fully compensate for severely insufficient or highly noisy samples, and it highlights the need for more advanced minority-class enhancement strategies in future work.
  • The findings of this study align with and extend prior research on ensemble-based imbalance learning. Earlier studies have shown that Bagging extensions, SMOTE-based ensembles, and DRF variants can mitigate class imbalance by increasing sample diversity or adjusting decision boundaries. However, these methods rarely address the high-dimensional nature of hyperspectral data. Compared with traditional random forests and earlier DRF-style models, DRF-FI introduces an explicit feature importance-guided mechanism that simultaneously promotes dimensionality reduction and noise suppression prior to ensemble construction. This hybrid design not only reduces the computational burden (as discussed in Section 4.4) but also enhances minority-class recognition more effectively than purely sampling-based or algorithm-level methods. In addition, studies on multi-class imbalanced data point out that some samples are intrinsically harder to learn; William et al. [35] stress this idea and show that such hard instances need special attention. DRF-FI inherently incorporates this notion by amplifying the representation of difficult minority instances through informed oversampling and feature refinement. As a result, the method bridges the gap between feature-level and data-level imbalance mitigation, offering a more comprehensive and effective learning framework than existing approaches.
  • Deep learning models have been used in hyperspectral image classification, including 3D-CNNs, hybrid CNN designs, and Transformer-based networks. These methods can reach very high accuracy. However, they often rely on many parameters. They also require strong GPU support and long training time, which limits their practical use. Moreover, they generally rely on significantly larger proportions of labeled samples (often 10–40%), and their performance gains often stem from exploiting joint spatial–spectral information. Representative examples include 3D-CNN models [20] (requiring approximately 10% labeled samples), Hybrid 3D–2D CNNs [21] (around 20% labeled samples), SpectralFormer Transformers [46] (about 30% labeled samples), SSFTT spectral–spatial token Transformers [47], and graph-based convolutional networks such as SGC/GCN [18], all of which generally depend on either substantial annotation ratios or spatial-context construction.
    Such assumptions are fundamentally different from the setting of this study, where only spectral features are modeled, making direct comparisons under identical conditions neither fair nor meaningful. The goal of this work is not to compete with large-scale deep learning architectures under fully supervised settings, but rather to develop a non-parametric, lightweight, interpretable, and robust framework suitable for small-sample and highly imbalanced scenarios. Unlike deep networks, the proposed method does not depend on gradient-based training or heavy annotation requirements, and it maintains stable performance even without spatial context. Therefore, classical ensemble models such as Boosting, Bagging, RF, and DRF are selected as baselines to ensure comparability in terms of data assumptions, computational cost, and methodological principles. Nonetheless, integrating DRF-FI with modern deep spectral–spatial feature extraction architectures, or benchmarking it against deep learning models under controlled and comparable settings, remains an important direction for future research. Additionally, exploring more advanced generative oversampling techniques or uncertainty-aware ensemble mechanisms may further enhance the applicability of the proposed framework to highly imbalanced and noise-prone hyperspectral scenes. Future research will also extend towards deep learning models in order to obtain more comprehensive classification algorithms for processing hyperspectral data.

6. Conclusions

A dual ensemble classification framework (DRF-FI) is proposed in this study. It is based on feature importance analysis and deep random forest (DRF). It addresses three key challenges in hyperspectral image (HSI) classification: high dimensionality, noise interference, and class imbalance. These challenges have long plagued traditional machine learning and deep learning algorithms.
The DRF-FI framework integrates feature selection and two-layer ensemble learning. First, feature importance is quantified to identify discriminative spectral bands. Balanced training subsets are constructed through random oversampling. Finally, four ensemble strategies are integrated: Random Forest (RF), Deep Random Forest (DRF), Bagging-based DRF, and Boosting-based DRF.
The framework uses a layer-wise iterative learning mechanism. Pseudo-labels are generated in each round of random forest training. They are fused with original labels to form information-augmented datasets. Entropy reduction is used to evaluate and aggregate feature importance across layers.
Experiments are conducted on three benchmark HSI datasets: Indian Pines (AVIRIS), University of Pavia (ROSIS), and Salinas. The results show that DRF-FI outperforms the mainstream baselines, including Boosting, Bagging, RF, and DRF, across the key metrics: Overall Accuracy (OA), Average Accuracy (AA), F-measure, G-mean, and m-recall. For example, on the highly imbalanced Indian Pines dataset, OA is increased by up to 14% and AA by up to 33%.
Parameter analysis reveals that DRF-FI maintains excellent performance with a small feature set. Computational complexity is thus reduced.
DRF-FI provides an effective and robust solution for HSI classification. It is particularly suitable for scenarios with high dimensionality, small samples, and class imbalance. It exhibits broad application potential in fields such as land use surveys, environmental monitoring, and mineral exploration.

Author Contributions

J.L., J.B. and W.F. conceived and designed the experiments. Y.D. performed the experiments. Q.W. and G.D. edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Natural Science Foundation of Ningxia Province of China (grant numbers 2024AAC05057 and 2024AAC02035), the National Natural Science Foundation of China (62201438, 62331019), the National Key R&D Program of China (2022YFA1604803), and the Natural Science Basic Research Program of Shaanxi (No. 2025JC-YBMS-020).

Data Availability Statement

The original data presented in the study are openly available in the Hyperspectral Remote Sensing Scenes repository at https://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes (accessed on 23 August 2024).

Acknowledgments

Special thanks to Yan Cao, Dongting Yan, Junting Guo, Yang Meng, and Bingxu Chen for their support of this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cao, X.; Yao, J.; Xu, Z.; Meng, D. Hyperspectral Image Classification with Convolutional Neural Network and Active Learning. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4604–4616. [Google Scholar] [CrossRef]
  2. Yao, J.; Hong, D.; Li, C.; Chanussot, J. SpectralMamba: Efficient Mamba for Hyperspectral Image Classification. arXiv 2024, arXiv:2404.08489. [Google Scholar] [CrossRef]
  3. Li, S.; Song, W.; Fang, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J. Deep Learning for Hyperspectral Image Classification: An Overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
  4. Yang, J.; Wu, C.; Du, B.; Zhang, L. Enhanced Multiscale Feature Fusion Network for HSI Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10328–10347. [Google Scholar] [CrossRef]
  5. Feng, W.; Huang, W.; Ren, J. Class Imbalance Ensemble Learning Based on the Margin Theory. Appl. Sci. 2018, 8, 815. [Google Scholar] [CrossRef]
  6. Paoletti, M.; Haut, J.; Plaza, J.; Plaza, A. Deep learning classifiers for hyperspectral imaging: A review. ISPRS J. Photogramm. Remote Sens. 2019, 158, 279–317. [Google Scholar] [CrossRef]
  7. Tao, C.; Pan, H.; Li, Y.; Zou, Z. Unsupervised Spectral-Spatial Feature Learning with Stacked Sparse Autoencoder for Hyperspectral Imagery Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2438–2442. [Google Scholar] [CrossRef]
  8. He, Z.; Liu, H.; Wang, Y.; Hu, J. Generative Adversarial Networks-Based Semi-Supervised Learning for Hyperspectral Image Classification. Remote Sens. 2017, 9, 1042. [Google Scholar] [CrossRef]
  9. Garcia, S.; Zhang, Z.; Altalhi, A.; Alshomrani, S.; Herrera, F. Dynamic ensemble selection for multi-class imbalanced datasets. Inf. Sci. 2018, 445–446, 22–37. [Google Scholar] [CrossRef]
  10. Sun, T.; Jiao, L.; Feng, J.; Liu, F.; Zhang, X. Imbalanced Hyperspectral Image Classification Based on Maximum Margin. IEEE Geosci. Remote Sens. Lett. 2015, 12, 522–526. [Google Scholar] [CrossRef]
  11. Feng, W.; Dauphin, G.; Huang, W.; Quan, Y.; Liao, W. New Margin-Based Subsampling Iterative Technique In Modified Random Forests For Classification. Knowl.-Based Syst. 2019, 182, 104845. [Google Scholar] [CrossRef]
  12. Zhu, J.; Fang, L.; Ghamisi, P. Deformable Convolutional Neural Networks for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1254–1258. [Google Scholar] [CrossRef]
  13. Roy, S.K.; Haut, J.M.; Paoletti, M.E.; Dubey, S.R.; Plaza, A. Generative Adversarial Minority Oversampling for Spectral-Spatial Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  14. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 6999–7019. [Google Scholar] [CrossRef]
  15. Krichen, M. Convolutional Neural Networks: A Survey. Computers 2023, 12, 151. [Google Scholar] [CrossRef]
  16. Taye, M.M. Theoretical Understanding of Convolutional Neural Network: Concepts, Architectures, Applications, Future Directions. Computation 2023, 11, 52. [Google Scholar] [CrossRef]
  17. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef]
  18. Wu, F.; Souza, A.; Zhang, T.; Fifty, C.; Yu, T.; Weinberger, K. Simplifying Graph Convolutional Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Chaudhuri, K., Salakhutdinov, R., Eds.; Proceedings of Machine Learning Research (PMLR). Volume 97, pp. 6861–6871.
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
  20. Sellami, A.; Farah, M.; Riadh Farah, I.; Solaiman, B. Hyperspectral imagery classification based on semi-supervised 3-D deep neural network and adaptive band selection. Expert Syst. Appl. 2019, 129, 246–259.
  21. Ghaderizadeh, S.; Abbasi-Moghadam, D.; Sharifi, A.; Zhao, N.; Tariq, A. Hyperspectral Image Classification Using a Hybrid 3D-2D Convolutional Neural Networks. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 7570–7588.
  22. Ding, Y.; Zhang, Z.; Zhao, X.; Hong, D.; Cai, W.; Yang, N.; Wang, B. Multi-scale receptive fields: Graph attention neural network for hyperspectral image classification. Expert Syst. Appl. 2023, 223, 119858.
  23. Chen, Y.; Wei, M.; Chen, Y. A method based on hybrid cross-multiscale spectral-spatial transformer network for hyperspectral and multispectral image fusion. Expert Syst. Appl. 2025, 263, 125742.
  24. Niu, D.; Zhang, X.; Li, L.; Zhou, Y. HSI-SSTRANS: Hyperspectral Image Classification with Spectral and Space Transformer. In Proceedings of the IGARSS 2023—2023 IEEE International Geoscience and Remote Sensing Symposium, Pasadena, CA, USA, 16–21 July 2023; pp. 7625–7628.
  25. Ahmad, M.; Ghous, U.; Usama, M.; Mazzara, M. WaveFormer: Spectral–Spatial Wavelet Transformer for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
  26. Hu, S.; Gao, F.; Zhou, X.; Dong, J.; Du, Q. Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5.
  27. Tong, Y.; Quan, Y.; Feng, W.; Dauphin, G.; Wang, Y.; Wu, P.; Xing, M. Multi-Scale Feature Extraction and Total Variation Based Fusion Method For HSI and Lidar Data Classification. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 5433–5436.
  28. Ahmad, M.; Distefano, S.; Khan, A.M.; Mazzara, M.; Li, C.; Li, H.; Aryal, J.; Ding, Y.; Vivone, G.; Hong, D. A comprehensive survey for Hyperspectral Image Classification: The evolution from conventional to transformers and Mamba models. Neurocomputing 2025, 644, 130428.
  29. Zhang, Z.; Jiang, F.; Zhong, C.; Ma, Q. HorD2CN: High-order deformable differential convolution network for hyperspectral image classification. Expert Syst. Appl. 2026, 296, 129198.
  30. Feng, W.; Gao, X.; Dauphin, G.; Quan, Y. Rotation XGBoost Based Method for Hyperspectral Image Classification with Limited Training Samples. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; pp. 900–904.
  31. Quan, Y.; Zhong, X.; Feng, W.; Chan, J.C.W.; Li, Q.; Xing, M. SMOTE-Based Weighted Deep Rotation Forest for the Imbalanced Hyperspectral Data Classification. Remote Sens. 2021, 13, 464.
  32. Tong, F.; Zhang, Y. Exploiting Spectral–Spatial Information Using Deep Random Forest for Hyperspectral Imagery Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
  33. Zhang, X.; Wang, Y.; Zhang, N.; Xu, D.; Luo, H.; Chen, B.; Ben, G. Spectral–Spatial Fractal Residual Convolutional Neural Network with Data Balance Augmentation for Hyperspectral Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10473–10487.
  34. Liu, W.; Zhang, H.; Ding, Z.; Liu, Q.; Zhu, C. A comprehensive active learning method for multiclass imbalanced data streams with concept drift. Knowl.-Based Syst. 2021, 215, 106778.
  35. William, I.V.; Krawczyk, B. Multi-class imbalanced big data classification on Spark. Knowl.-Based Syst. 2020, 212, 106598.
  36. Abdi, L.; Hashemi, S. To combat multi-class imbalanced problems by means of over-sampling and boosting techniques. Soft Comput. 2015, 19, 3369–3385.
  37. Janicka, M.; Lango, M.; Stefanowski, J. Using Information on Class Interrelations to Improve Classification of Multiclass Imbalanced Data: A New Resampling Algorithm. Int. J. Appl. Math. Comput. Sci. 2019, 29, 769–781.
  38. Freund, Y.; Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  39. Wang, S.; Yao, X. Multiclass Imbalance Problems: Analysis and Potential Solutions. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2012, 42, 1119–1130.
  40. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
  41. Liu, X.Y.; Zhou, Z.H. Ensemble Methods for Class Imbalance Learning. In Imbalanced Learning: Foundations, Algorithms, and Applications; He, H., Ma, Y., Eds.; Wiley-IEEE Press: Hoboken, NJ, USA, 2013; pp. 61–82.
  42. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2025.
  43. Zhou, Z.H.; Feng, J. Deep forest. Natl. Sci. Rev. 2019, 6, 74–86.
  44. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; IEEE: Pisa, Italy, 2008; pp. 413–422.
  45. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
  46. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: A Transformer-based hyperspectral image classifier. arXiv 2022, arXiv:2107.02988.
  47. Sun, L.; Zhao, G.; Zheng, Y.; Wu, Z. Spectral–Spatial Feature Tokenization Transformer for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
Figure 1. Feature importance distributions computed on the Indian, Pavia University, and Salinas hyperspectral datasets using the RF and DRF models.
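The per-band importance curves in Figure 1 come from the forests' internal feature-importance scores. As a minimal sketch of how such a distribution can be obtained, the snippet below uses scikit-learn's impurity-based (Gini) importances on synthetic spectra; the array shapes, band count, and synthetic labels are illustrative stand-ins, not the paper's data or exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_bands = 200, 50                 # stand-ins for HSI pixels x spectral bands
X = rng.normal(size=(n_samples, n_bands))
y = (X[:, 10] + X[:, 30] > 0).astype(int)    # labels driven by two informative "bands"

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importance = rf.feature_importances_         # one Gini-importance score per band; sums to 1
print(importance.argmax())                   # should point at one of the informative bands
```

Sorting `importance` in descending order and keeping the top-k bands is the usual way such a curve is turned into a band-selection step.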
Figure 2. Ground-truth classification maps of the Indian, Pavia University, and Salinas hyperspectral datasets, and their explanations.
Figure 3. Classification maps produced by Boosting, Bagging, RF, and DRF on the Indian hyperspectral dataset.
Figure 4. Classification maps produced by Boosting, Bagging, RF, and DRF on the Pavia University hyperspectral dataset.
Figure 5. Classification maps produced by Boosting, Bagging, RF, and DRF on the Salinas hyperspectral dataset.
Figure 6. Performance as a function of the number of trees, measured by AA, on the hyperspectral datasets.
Figure 7. Performance as a function of the number of trees, measured by OA, on the hyperspectral datasets.
Figure 8. Performance as a function of the number of trees, measured by m-recall, on the hyperspectral datasets.
Figure 9. Performance as a function of the number of features, measured by AA, on the hyperspectral datasets.
Figure 10. Performance as a function of the number of features, measured by OA, on the hyperspectral datasets.
Figure 11. Performance as a function of the number of features, measured by m-recall, on the hyperspectral datasets.
Figure 12. Confusion matrices of DRF, RF_DRF, and DRF_DRF on the Indian hyperspectral dataset.
Figure 13. Confusion matrices of DRF, RF_DRF, and DRF_DRF on the Pavia University hyperspectral dataset.
Figure 14. Confusion matrices of DRF, RF_DRF, and DRF_DRF on the Salinas hyperspectral dataset.
Table 1. Data information: number of training and test samples per class for the three datasets.

Indian Pines (AVIRIS)
No. | Class | Train | Test
1 | Alfalfa | 23 | 23
2 | Corn-notill | 428 | 1000
3 | Corn-mintill | 249 | 581
4 | Corn | 71 | 166
5 | Grass-pasture | 144 | 339
6 | Grass-trees | 219 | 511
7 | Grass-pasture-mowed | 14 | 14
8 | Hay-windrowed | 143 | 335
9 | Oats | 10 | 10
10 | Soybean-notill | 291 | 681
11 | Soybean-mintill | 736 | 1719
12 | Soybean-clean | 177 | 416
13 | Wheat | 61 | 144
14 | Woods | 379 | 886
15 | Buildings-Grass-Trees-Drives | 115 | 271
16 | Stone-Steel-Towers | 46 | 47
Total | | 3106 | 7143

University of Pavia (ROSIS)
No. | Class | Train | Test
1 | Asphalt | 331 | 6300
2 | Meadows | 932 | 17,717
3 | Gravel | 104 | 1995
4 | Trees | 153 | 2911
5 | Painted metal sheets | 67 | 1278
6 | Bare Soil | 251 | 4778
7 | Bitumen | 66 | 1264
8 | Self-Blocking Bricks | 184 | 3498
9 | Shadows | 47 | 900
Total | | 2135 | 40,641

Salinas
No. | Class | Train | Test
1 | Brocoli_green_weeds_1 | 100 | 1909
2 | Brocoli_green_weeds_2 | 186 | 3540
3 | Fallow | 98 | 1878
4 | Fallow_rough_plow | 69 | 1325
5 | Fallow_smooth | 133 | 2545
6 | Stubble | 197 | 3762
7 | Celery | 178 | 3401
8 | Grapes_untrained | 563 | 10,708
9 | Soil_vinyard_develop | 310 | 5893
10 | Corn_senesced_green_weeds | 163 | 3115
11 | Lettuce_romaine_4wk | 53 | 1015
12 | Lettuce_romaine_5wk | 96 | 1831
13 | Lettuce_romaine_6wk | 45 | 871
14 | Lettuce_romaine_7wk | 53 | 1017
15 | Vinyard_untrained | 363 | 6905
16 | Vinyard_vertical_trellis | 90 | 1717
Total | | 2697 | 51,432
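Table 1 shows the strong class imbalance the method has to cope with (e.g., 10 training samples for Oats versus 736 for Soybean-mintill on Indian Pines). As a minimal sketch of the random-oversampling step used to build balanced training subsets, the snippet below replicates minority-class samples with replacement up to the majority-class size; the replicate-to-majority policy and the toy data are illustrative assumptions, not the paper's exact configuration.

```python
import random

def random_oversample(samples_by_class, seed=0):
    """Replicate minority-class samples (with replacement) until every class
    matches the size of the largest class, yielding a balanced subset."""
    rng = random.Random(seed)
    target = max(len(s) for s in samples_by_class.values())
    balanced = {}
    for label, samples in samples_by_class.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        balanced[label] = list(samples) + extra
    return balanced

# Toy spectra: class sizes mimic the Indian Pines imbalance in miniature.
train = {"Oats": [[0.1, 0.2]],
         "Soybean-mintill": [[0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]}
print({k: len(v) for k, v in random_oversample(train).items()})
```

Drawing a fresh oversampled subset per ensemble member (different seeds) keeps the base learners diverse while each still sees a balanced class distribution.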
Table 2. Per-class recall rates (in %) on the Indian dataset, produced by Boosting, Bagging, RF, DRF, and their hybrid variants: RF_Boosting, RF_Bagging, RF_RF, RF_DRF, DRF_Boosting, DRF_Bagging, DRF_RF, and DRF_DRF.

Class | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF | DRF_Boosting | DRF_Bagging | DRF_RF | DRF_DRF
1 | 0 | 0 | 65.22 | 69.57 | 0 | 0 | 65.22 | 65.22 | 0 | 69.57 | 69.57 | 56.52
2 | 40 | 43 | 77.3 | 77.1 | 33.7 | 43.3 | 76.4 | 77.1 | 76.6 | 76.3 | 76.4 | 75.8
3 | 0 | 26.33 | 59.21 | 60.41 | 0 | 27.88 | 60.76 | 61.27 | 0 | 60.93 | 61.1 | 63.51
4 | 0 | 0 | 57.23 | 54.82 | 0 | 0 | 61.45 | 53.61 | 0 | 57.83 | 57.83 | 58.43
5 | 20.06 | 66.67 | 89.68 | 92.04 | 71.68 | 69.03 | 91.74 | 90.86 | 0 | 90.56 | 90.56 | 90.86
6 | 99.8 | 97.46 | 97.06 | 96.67 | 92.56 | 97.85 | 96.87 | 97.06 | 0 | 96.48 | 96.67 | 96.28
7 | 0 | 0 | 57.14 | 57.14 | 0 | 0 | 57.14 | 57.14 | 0 | 0 | 57.14 | 57.14
8 | 94.03 | 93.43 | 98.81 | 98.51 | 85.97 | 95.22 | 98.51 | 98.21 | 0 | 98.51 | 98.81 | 98.21
9 | 0 | 0 | 50 | 50 | 0 | 0 | 50 | 50 | 0 | 0 | 50 | 60
10 | 0 | 24.38 | 79.59 | 79.44 | 0 | 24.96 | 79.74 | 80.62 | 82.09 | 80.47 | 80.18 | 81.79
11 | 93.31 | 82.78 | 89.41 | 89.76 | 94.94 | 83.25 | 89.99 | 89.53 | 89.88 | 89.94 | 89.94 | 89.3
12 | 0 | 28.37 | 68.51 | 70.19 | 0 | 27.64 | 69.71 | 70.91 | 0 | 70.19 | 69.95 | 70.91
13 | 0 | 92.36 | 88.89 | 89.58 | 0 | 92.36 | 90.28 | 88.89 | 0 | 89.58 | 89.58 | 90.97
14 | 93.57 | 93.68 | 95.94 | 96.28 | 89.39 | 93.23 | 96.28 | 96.61 | 96.28 | 96.28 | 96.28 | 96.16
15 | 0 | 11.07 | 52.4 | 54.98 | 0 | 14.02 | 53.14 | 53.14 | 0 | 55.35 | 54.98 | 55.35
16 | 0 | 48.94 | 91.49 | 91.49 | 0 | 53.19 | 93.62 | 95.74 | 0 | 91.49 | 91.49 | 91.49
OA | 52.16 | 60.8 | 82.57 | 82.99 | 52.71 | 61.4 | 83.07 | 83.06 | 52.12 | 82.88 | 83.06 | 83.2
AA | 27.55 | 44.28 | 76.12 | 76.75 | 29.27 | 45.12 | 76.93 | 76.62 | 21.55 | 70.22 | 76.9 | 77.05
F_measure | 26.29 | 46.96 | 80.75 | 81.69 | 26.25 | 47.56 | 81.25 | 81.3 | 18.86 | 71.06 | 81.62 | 81.25
G_mean | 0 | 0 | 74.16 | 74.83 | 0 | 0 | 75.02 | 74.57 | 0 | 0 | 75.06 | 75.33
m_recall | 0 | 0 | 50 | 50 | 0 | 0 | 50 | 50 | 0 | 0 | 50 | 55.35
Runtime | 159.46 | 119.04 | 3.63 | 58.78 | 559.74 | 859.27 | 22.09 | 342.25 | 745.80 | 707.00 | 284.83 | 4323.52
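The summary metrics in Tables 2–4 can all be derived from the confusion matrix and the per-class recalls. The sketch below assumes the standard definitions (OA = overall fraction correct, AA = mean per-class recall, m-recall = worst-class recall, G-mean = geometric mean of per-class recalls); the F-measure additionally needs per-class precision and is omitted here.

```python
import math

def per_class_recall(confusion):
    """confusion[i][j] = count of class-i samples predicted as class j."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]

def summary_metrics(confusion):
    recalls = per_class_recall(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return {
        "OA": 100.0 * correct / total,
        "AA": 100.0 * sum(recalls) / len(recalls),
        "m_recall": 100.0 * min(recalls),                         # worst-class recall
        "G_mean": 100.0 * math.prod(recalls) ** (1 / len(recalls)),
    }

# Toy 2-class confusion matrix: 8/10 and 5/10 samples correct per class.
print(summary_metrics([[8, 2], [5, 5]]))
```

Note how a single fully missed class (recall 0) drives both G-mean and m-recall to 0 even when OA stays high, which is why the zeros in the G_mean and m_recall rows above coincide with classifiers that miss an entire minority class.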
Table 3. Per-class recall rates (in %) on the University of Pavia (ROSIS) dataset, produced by Boosting, Bagging, RF, DRF, and their hybrid variants: RF_Boosting, RF_Bagging, RF_RF, RF_DRF, DRF_Boosting, DRF_Bagging, DRF_RF, and DRF_DRF.

Class | Boosting | Bagging | RF | DRF | RF_FE_Boosting | RF_FE_Bagging | RF_FE_RF | RF_FE_DRF | DRF_FE_Boosting | DRF_FE_Bagging | DRF_FE_RF | DRF_FE_DRF
1 | 92.65 | 93.52 | 92.19 | 92.32 | 93.92 | 93.6 | 92.29 | 92.14 | 92.22 | 92.22 | 92.19 | 92.22
2 | 95.59 | 97.31 | 97.15 | 97.27 | 96.78 | 97.16 | 96.87 | 96.61 | 97.23 | 97.23 | 97.23 | 97.23
3 | 0 | 21.55 | 57.39 | 58.55 | 1.5 | 13.73 | 60.15 | 60.85 | 58.65 | 58.65 | 58.5 | 58.65
4 | 67.67 | 79.7 | 86.57 | 86.5 | 67.78 | 81.9 | 86.91 | 86.71 | 86.71 | 86.71 | 86.64 | 86.71
5 | 96.4 | 96.56 | 98.83 | 98.51 | 94.13 | 96.32 | 99.14 | 99.06 | 98.75 | 98.75 | 99.06 | 98.75
6 | 22.96 | 28.07 | 59.42 | 58.31 | 24.4 | 31.06 | 62.75 | 62.35 | 58.83 | 58.83 | 58.87 | 58.83
7 | 0 | 0 | 77.14 | 78.8 | 0 | 0 | 78.72 | 76.9 | 78.32 | 78.32 | 78.16 | 78.32
8 | 92.48 | 88.05 | 85.53 | 85.33 | 90.85 | 89.65 | 85.45 | 84.56 | 85.91 | 85.91 | 86.05 | 85.91
9 | 98.33 | 99 | 99 | 99.11 | 99.67 | 99.11 | 99.89 | 100 | 99 | 99 | 100 | 99
OA | 76.75 | 79.79 | 87.71 | 87.73 | 77.54 | 80 | 88.22 | 87.93 | 87.82 | 87.82 | 90.2 | 88.57
AA | 62.9 | 67.08 | 83.69 | 83.86 | 63.23 | 66.95 | 84.68 | 84.35 | 83.96 | 83.96 | 94.2 | 85.1
F_measure | 62.76 | 70.26 | 86.04 | 86.24 | 67.93 | 70.56 | 86.69 | 86.26 | 86.3 | 86.3 | 94.08 | 86.81
G_mean | 0 | 0 | 82.16 | 82.36 | 0 | 0 | 83.41 | 83.08 | 82.47 | 82.47 | 93.64 | 84.05
m_recall | 0 | 0 | 57.39 | 58.31 | 0 | 0 | 60.15 | 60.85 | 58.65 | 58.65 | 62.09 | 62.31
Runtime | 54.11 | 51.44 | 1.65 | 18.37 | 259.25 | 346.18 | 10.73 | 92.56 | 343.42 | 282.03 | 90.96 | 192.66
Table 4. Per-class recall rates (in %) on the Salinas dataset, produced by Boosting, Bagging, RF, DRF, and their hybrid variants: RF_Boosting, RF_Bagging, RF_RF, RF_DRF, DRF_Boosting, DRF_Bagging, DRF_RF, and DRF_DRF.

Class | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF | DRF_Boosting | DRF_Bagging | DRF_RF | DRF_DRF
1 | 0 | 97.28 | 99.48 | 99.48 | 93.14 | 97.43 | 99.48 | 99.48 | 0 | 99.48 | 99.48 | 99.48
2 | 99.72 | 98.64 | 99.77 | 99.75 | 98.47 | 98.59 | 99.83 | 99.75 | 0 | 99.75 | 99.75 | 99.75
3 | 0 | 86.79 | 95.37 | 95.85 | 0 | 86.79 | 96.96 | 95.9 | 0 | 95.74 | 95.74 | 95.69
4 | 0 | 91.62 | 99.62 | 99.62 | 0 | 92.45 | 99.62 | 99.55 | 0 | 99.55 | 99.62 | 99.47
5 | 0 | 95.01 | 96.9 | 96.74 | 0 | 95.6 | 96.94 | 96.9 | 0 | 96.78 | 96.78 | 96.9
6 | 99.12 | 99.15 | 99.68 | 99.73 | 99.34 | 99.15 | 99.71 | 99.73 | 100 | 99.71 | 99.71 | 99.73
7 | 99.18 | 99 | 99 | 99.12 | 99.5 | 99.26 | 99.12 | 99.29 | 0 | 99.06 | 99.15 | 99.24
8 | 60.63 | 76.97 | 83.83 | 84.67 | 91.64 | 77.83 | 84.53 | 84.85 | 84.63 | 84.81 | 84.81 | 84.41
9 | 97.18 | 97.69 | 99.13 | 99.1 | 98.42 | 97.68 | 99.03 | 99.08 | 99.13 | 99.1 | 99.1 | 99.08
10 | 0 | 72.62 | 89.6 | 89.76 | 94.96 | 72.36 | 89.66 | 89.47 | 0 | 89.7 | 89.7 | 89.66
11 | 0 | 86.21 | 91.23 | 92.71 | 0 | 86.5 | 92.51 | 93.89 | 0 | 93.79 | 93.3 | 93.1
12 | 0 | 95.3 | 98.63 | 98.96 | 0 | 95.58 | 98.96 | 98.74 | 0 | 98.91 | 98.91 | 98.96
13 | 0 | 95.18 | 95.64 | 95.87 | 0 | 95.18 | 95.87 | 95.87 | 0 | 95.98 | 95.98 | 96.33
14 | 0 | 94.2 | 96.85 | 96.66 | 0 | 95.28 | 96.76 | 96.95 | 0 | 96.76 | 96.76 | 96.66
15 | 62.81 | 53.98 | 62.2 | 62.04 | 0 | 53.43 | 61.81 | 61.65 | 62.52 | 62.42 | 62.43 | 62.82
16 | 95.22 | 94.93 | 97.5 | 97.55 | 44.44 | 94.99 | 97.15 | 97.32 | 0 | 97.44 | 97.5 | 97.73
OA | 56.04 | 85.03 | 89.93 | 90.15 | 61.67 | 85.23 | 90.12 | 90.15 | 44.69 | 90.24 | 90.24 | 90.22
AA | 38.37 | 89.66 | 94.03 | 94.22 | 44.99 | 89.88 | 94.25 | 94.28 | 21.64 | 94.31 | 94.29 | 94.31
F_measure | 35.59 | 89.19 | 93.87 | 94.04 | 41.88 | 89.4 | 94.08 | 94.12 | 18.76 | 94.13 | 94.12 | 94.15
G_mean | 0 | 88.69 | 93.47 | 93.67 | 0 | 88.88 | 93.68 | 93.71 | 0 | 93.77 | 93.75 | 93.78
m_recall | 0 | 53.98 | 62.2 | 62.04 | 0 | 53.43 | 61.81 | 61.65 | 0 | 62.42 | 62.43 | 62.82
Runtime | 102.81 | 92.59 | 5.94 | 102.87 | 494.50 | 371.18 | 31.25 | 579.96 | 681.52 | 662.66 | 273.86 | 633.64
Table 5. Paired t-test statistics (t-values and p-values) for the OA metric of DRF_RF and DRF_DRF compared with Boosting, Bagging, RF, DRF, RF_Boosting, RF_Bagging, RF_RF, and RF_DRF.

t-value | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF
DRF_RF | 3.52 | 2.24 | 2.98 | 7.79 | 3.60 | 2.24 | −0.60 | 0.00
DRF_DRF | 3.67 | 2.31 | 3.58 | 1.58 | 3.77 | 2.31 | 2.45 | 1.58

p-value | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF
DRF_RF | 0.072 | 0.155 | 0.105 | 0.049 | 0.069 | 0.155 | 0.539 | 0.776
DRF_DRF | 0.067 | 0.147 | 0.070 | 0.256 | 0.064 | 0.148 | 0.134 | 0.255
Table 6. Paired t-test statistics (t-values and p-values) for the AA metric of DRF_RF and DRF_DRF compared with Boosting, Bagging, RF, DRF, RF_Boosting, RF_Bagging, RF_RF, and RF_DRF.

t-value | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF
DRF_RF | 3.96 | 2.23 | 3.05 | 3.38 | 4.26 | 2.25 | −0.97 | 0.04
DRF_DRF | 4.11 | 2.28 | 2.67 | 1.54 | 4.45 | 2.29 | 1.80 | 1.94

p-value | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF
DRF_RF | 0.058 | 0.155 | 0.093 | 0.077 | 0.051 | 0.154 | 0.434 | 0.970
DRF_DRF | 0.054 | 0.151 | 0.117 | 0.264 | 0.047 | 0.149 | 0.214 | 0.192
Table 7. Paired t-test statistics (t-values and p-values) for the m-recall metric of DRF_RF and DRF_DRF compared with Boosting, Bagging, RF, DRF, RF_Boosting, RF_Bagging, RF_RF, and RF_DRF.

t-value | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF
DRF_RF | 15.53 | 2.52 | 1.32 | 1.72 | 15.53 | 2.56 | −0.51 | −0.56
DRF_DRF | 24.97 | 2.51 | 2.40 | 2.49 | 24.97 | 2.55 | 2.19 | 1.97

p-value | Boosting | Bagging | RF | DRF | RF_Boosting | RF_Bagging | RF_RF | RF_DRF
DRF_RF | 0.058 | 0.155 | 0.093 | 0.077 | 0.051 | 0.154 | 0.434 | 0.970
DRF_DRF | 0.054 | 0.151 | 0.117 | 0.264 | 0.047 | 0.149 | 0.214 | 0.192
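The significance tests in Tables 5–7 are paired t-tests over repeated runs of each method pair. A self-contained sketch of the test statistic (equivalent to `scipy.stats.ttest_rel`, with n − 1 degrees of freedom) is shown below; the run-level OA scores are illustrative values, not the paper's data.

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Paired t statistic for per-run score differences d_i = a_i - b_i:
    t = mean(d) / (stdev(d) / sqrt(n))."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Illustrative OA scores over three paired runs (not the paper's data).
drf_drf = [83.1, 83.3, 83.2]
rf      = [82.5, 82.6, 82.6]
print(round(paired_t(drf_drf, rf), 2))
```

Pairing matters because both methods are evaluated on the same random splits per run, so the per-run difference removes split-to-split variance that an unpaired test would count as noise.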
Lian, J.; Feng, W.; Wang, Q.; Dong, Y.; Dauphin, G.; Bai, J. A Deep Random Forest Model with Symmetry Analysis for Hyperspectral Image Data Classification Based on Feature Importance. Symmetry 2025, 17, 2172. https://doi.org/10.3390/sym17122172
