An Adaptive Multi-Splitting Multivariate Decision Tree for Multi-Class Classification Applied on High-Resolution and Hyperspectral Remote Sensing Images

Wang, Quan; Zheng, Zheng; Lei, Hao; Wang, Fei; Zhang, Zitong; Zou, Xiaowu; Nie, Feiping

doi:10.3390/rs18111790


by Quan Wang ¹, Zheng Zheng ², Hao Lei ¹,*, Fei Wang ¹, Zitong Zhang ¹, Xiaowu Zou ¹ and Feiping Nie ³

¹ State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an 710049, China

² Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

³ School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1790; https://doi.org/10.3390/rs18111790

Submission received: 8 April 2026 / Revised: 16 May 2026 / Accepted: 28 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue From Pixels to Spectra: Towards Generalizable Large Models for Hyperspectral Remote Sensing)


Highlights

What are the main findings?

The proposed adaptive multi-splitting multivariate decision tree achieves strong multi-class classification performance by directly optimizing a multi-class sample separation criterion at each node, eliminating the need for decomposition schemes and mitigating the class imbalance problem they introduce.
The model maintains compact tree structures and competitive computational efficiency compared to existing decision trees, as validated on synthetic data, noisy RGB scene images, and hyperspectral remote sensing data.

What are the implications of the main findings?

Eliminating decomposition schemes and adding an adaptive splitting mechanism offers a practical solution for multi-class tasks where traditional methods suffer from imbalance or fragmented node splits, potentially improving robustness in real-world applications such as remote sensing.
The combination of good classification accuracy with compact tree structures and efficiency suggests that the model holds good potential for resource-constrained or large-scale classification scenarios, where both interpretability and computational speed are important.

Abstract

In remote sensing data analysis, multi-class classification plays a critical role in distinguishing multiple pattern types, and decision trees are particularly well-suited for this task due to their computational efficiency and interpretability. Existing decision tree approaches often suffer from suboptimal handling of multi-class problems, vulnerability to class imbalance, or degraded generalization ability. To address these limitations, this paper proposes an adaptive multi-splitting multivariate decision tree designed explicitly for multi-class classification. The core of our approach is an effective homogeneity cluster discovery strategy that directly optimizes a multi-class sample separation criterion at each node, eliminating dependency on decomposition schemes and mitigating the associated class imbalance problem. This is coupled with an adaptive splitting mechanism that dynamically chooses between multi-splitting and bi-splitting at each node based on local data geometry and class labels. Experimental evaluations on a synthetic multi-class dataset and noisy remote sensing RGB scene image datasets demonstrate that the proposed model outperforms existing decision tree methods in classification accuracy and F1 score, produces compact tree structures, and maintains competitive computational efficiency. In the remote sensing hyperspectral image classification application, the proposed model improves overall accuracy by up to 6.99% over the baseline deep learning model on the highly class-imbalanced Indian Pines dataset. This work provides a flexible and effective multivariate decision tree classifier that improves multi-class classification performance while remaining efficient.

Keywords:

multivariate decision tree; multi-class classification; remote sensing image classification

1. Introduction

Multi-class classification is one of the key techniques in the machine learning, data mining, and pattern recognition domains, with widespread applications including image recognition, text classification, medical diagnosis [1], fault detection, and remote sensing image analysis [2]. In recent years, deep learning methods have achieved remarkable progress across various multi-class classification tasks, largely due to their powerful feature extraction capabilities. However, existing end-to-end deep learning approaches still face several challenges at the classification mechanism level, such as a lack of decision interpretability, sensitivity to noise, and an insufficient ability to handle class imbalance.

Among the numerous classification models available, decision trees remain highly attractive due to their intuitive structure, high computational efficiency, and strong interpretability. Notably, decision trees can inherently handle multi-class problems within a single model, a property that is particularly valuable in remote sensing applications where model transparency is often required for validation and trust. Classical univariate decision trees such as ID3 [3], C4.5 [4], and CART [5] recursively partition the example data based on a single attribute at each internal node. Nevertheless, when faced with complex intrinsic patterns and strong attribute interactions, such trees tend to overfit and exhibit limited generalization ability.

To enhance the discriminative power of decision trees, research has shifted towards multivariate decision trees. Unlike univariate trees that produce axis-parallel splits, multivariate trees use multiple attributes to form a splitting criterion, thereby allowing the formation of complex decision boundaries. Early multivariate trees used linear splits while still relying on impurity measures, as in univariate decision trees, to select the optimal split. For example, OC1 [6] combines deterministic hill climbing with randomization to find the linear combination coefficients that minimize impurity. Some methods first employ Fisher’s linear discriminant [7], Linear Discriminant Analysis (LDA) [8], or Householder reflection [9] to generate one or more coefficient vectors or transformation matrices for constructing linear combinations, and subsequently apply an impurity measure to determine the best oblique split. Although these methods often yield more compact and accurate models, searching for an optimal hyperplane at each splitting node remains computationally intensive. Moreover, impurity measures may fail to adequately capture the underlying geometric structure of the data, as noted in the literature [10].

Researchers have also integrated supervised linear classifiers into the splitting process of multivariate trees. For example, Cline [11] uses heuristic distance-based strategies to construct oblique hyperplanes, while MDT-DevM [12] formulates a deviation model as a linear programming problem to derive a hyperplane at each node. Several methods, including BTS [13], STree [14,15], LROTree [16], and NDT-IFTSVM [17], employ Support Vector Machines (SVMs) or their variants as splitting functions within binary tree frameworks. Manwani and Sastry [10] proposed the Geometric Decision Tree (GDT), which incorporates geometric considerations by adopting the Multisurface Proximal Support Vector Machine (MPSVM) [18] to obtain an angle bisector of two clustering hyperplanes as an oblique split rule. Zhang et al. [19] later used such trees as base learners in an ensemble model. However, many of these multivariate techniques, including Cline [11], MDT-DevM [12], and LROTree [16], were originally designed for binary classification and require adaptation via One-vs-All (OvA) or One-vs-One (OvO) schemes for multi-class problems, which can introduce class imbalance or high computational overhead. Methods such as SVM1-ODT [20] and MOCTSVM [21] also build on SVM concepts but are formulated as mixed-integer programming problems, and they often incur substantially higher training time.

Concurrently, research has evolved along other avenues to overcome the limitations of univariate decision trees. VM-DT [22] combines multiple split evaluation measures via voting. Jin et al. [23] proposed a Sampling-based decision tree Classification Rule Mining (SCRM) method for big-data scenarios. Some studies, such as MVDT [24] and reference [25], combine univariate decision trees with OvA, while others use unsupervised geometric heuristics such as PCA (MDT2 [26]) or K-means with K = 2 (BDTKS [27]) to construct oblique splits. Another line of work investigates multi-way splitting, which can produce more compact and interpretable trees than binary splits [28]. In the multivariate version of CRUISE [28], LDA is adopted to split an internal node into multiple subnodes. More recently, Zhao et al. [29] proposed BoostTree, which integrates a Gradient Boosting Machine (GBM) into a single decision tree and accommodates both bi-splitting and multi-splitting.

Despite these advancements, a significant gap remains, particularly in the context of remote sensing image analysis. Remote sensing data often exhibit high dimensionality and varying levels of noise due to atmospheric effects [30], sensor limitations, and illumination conditions. Moreover, class imbalance is a pervasive issue in land cover and land use classification, where rare classes such as wetlands or specific crop types are underrepresented. Existing multivariate trees largely prioritize binary splits and face a persistent trade-off between expressiveness, computational cost, and overfitting, especially in such complex multi-class scenarios. Methods tailored for multi-class classification often rely on external OvA or OvO schemes or internal grouping heuristics, which may not optimally capture the native multi-class geometry. Meanwhile, sophisticated optimization-based trees are often caught in a dilemma between computational efficiency and handling larger datasets, while ensemble-embedded trees frequently sacrifice the inherent interpretability and simplicity that make decision trees attractive.

There is a pressing need for a decision tree that natively and efficiently handles multi-class data, adaptively chooses split complexity, and maintains a favorable balance between accuracy, simplicity, and computational efficiency, especially for noisy remote sensing image classification tasks where both robustness and interpretability are critical. The BDTKS method [27], which employs K-means clustering with $k = 2$ at each node, captures intrinsic data structure but inherits the limitations of binary splitting: it is inflexible and often disrupts natural multi-class distributions, leading to excessively deep trees, overfitting, and non-compact models. To address this, the present study introduces an adaptive multi-splitting multivariate decision tree called the G-means Multivariate Decision Tree (GMDT) for multi-class classification tasks.

The novelty of GMDT lies in a set of targeted engineering refinements:

First, replacing K-means with G-means [31] at each splitting node to automatically determine the number of child clusters without requiring a prespecified k, thereby enabling adaptive multi-way splits.

Second, adding Z-score standardization to the original G-means clustering to improve reliability on overlapping clusters.

Third, constraining the maximum number of clusters $k_{\max}$ to the number of classes c, which prevents over-fragmentation and encourages early isolation of pure Gaussian clusters.

Fourth, retaining a K-means bi-splitting fallback for multi-class Gaussian clusters that G-means (with $k_{\max} = c$) will not split further.

These refinements build upon existing components (G-means clustering [31], K-means-based splitting [27], and multi-way splits [28,29]), but their specific combination and adaptation for multi-class decision tree induction is novel. Node splitting via G-means provides a homogeneity cluster discovery strategy for adaptive multi-splitting multivariate partitions that directly optimizes a multi-class sample separation criterion without relying on OvO or OvA decomposition, thus avoiding the class imbalance problems inherent to those schemes. At each node, GMDT dynamically selects either multi-splitting with a data-dependent number of branches or bi-splitting, based on local data geometry and class labels, which allows early discovery of pure single-class clusters and simplification of complex overlapping multi-class clusters where necessary. Consequently, GMDT combines strong classification performance with a compact model structure and competitive running efficiency, and it offers a robust solution for noisy remote sensing hyperspectral image classification, where model interpretability [32] and resistance to class imbalance and noise are essential.

2. Methods

The BDTKS method [27], which employs K-means clustering to capture intrinsic data structure, has demonstrated considerable effectiveness. However, its mandatory binary splitting at each node can be restrictive in multi-class settings. A single linear hyperplane often fails to separate complex multi-class data into two well-separated subsets. Consequently, the tree tends to grow excessively deep, increasing the risk of over-fitting and yielding redundant and non-compact models. Multi-way splitting has been explored in earlier works such as CRUISE [28] and BoostTree [29], but those approaches either rely on linear discriminant analysis or require a predefined number of splits.

To address these limitations while staying within a decision tree framework, we propose an adaptive multi-splitting multivariate decision tree model for multi-class classification. The core idea is to replace the binary K-means splitting of BDTKS with G-means clustering [31], which automatically determines the number of child nodes at each splitting node based on a Gaussianity test. To adapt G-means for supervised decision tree induction, we introduce three modifications: (i) Z-score standardization to handle overlapping clusters, (ii) a user-specified upper bound $k_{\max}$ (set to the number of classes c) to prevent over-fragmentation, and (iii) integration as a node-splitting subroutine within a recursive tree-growing process. Furthermore, because G-means will not split a Gaussian cluster even if it remains multi-class, we retain the K-means bi-splitting step from BDTKS as a fallback. The resulting model is referred to as the G-means Multivariate Decision Tree (GMDT).

The remainder of this section is organized as follows: First, a review and defect rectification of G-means are provided; second, the GMDT model generation process is described; finally, the method for class label prediction is presented.

2.1. G-Means Review and Defect Rectification

G-means [31], short for Gaussian-means, is a clustering algorithm designed to reach a final clustering state in which the data within each cluster are drawn from a Gaussian distribution. The algorithm adaptively determines the number of clusters by alternately splitting every existing non-Gaussian cluster into two and running K-means to update the clusters, until every resulting cluster passes a test for Gaussianity. To assess whether the data in a cluster follow a Gaussian distribution, G-means employs the Anderson–Darling statistical test [33] with a prescribed significance level $α$.

Although G-means clustering offers the advantage of not requiring prior knowledge of the exact number of clusters, it also has certain limitations. Here, we identify two main deficiencies and propose corresponding adjustments.

First, the original G-means algorithm [31] does not emphasize data preprocessing and operates directly on the raw input data. While this approach performs well when clusters are well-separated, it tends to yield unrealistic results (Figure 1b) when clusters overlap as shown in Figure 1a. To address this issue, we recommend applying Z-score standardization before clustering. The Z-score standardization can be expressed as

$$\hat{x} \leftarrow (x - \bar{x}) ⊘ σ_{x}. \tag{1}$$

Here, $\hat{x}$ is the standardized example point, $x$ is the example point in the raw training set, $\bar{x}$ is the sample mean, $σ_{x}$ is the standard deviation vector, and ⊘ represents element-wise division. The final cluster centroids should then be transformed back to the original scale via the inverse transformation

$$c_{i}^{x} \leftarrow c_{i} ⊙ σ_{x} + \bar{x}, \quad i = 1, \dots, k, \tag{2}$$

where $c_{i}^{x}$ is the cluster centroid in the raw space that $x$ lies in, $c_{i}$ is the cluster centroid in the standardized space, and ⊙ represents element-wise multiplication. Z-score standardization helps to normalize the data to a common scale, thereby improving the reliability of distance-based similarity measures and promoting coherent grouping of points that truly belong to the same cluster. As a result, integrating Z-score standardization enables G-means to discover and concentrate on the challenging overlapping clusters, as shown in Figure 1c.
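Equations (1) and (2) can be illustrated with a few lines of NumPy. This is a minimal sketch rather than the authors' code: the function names are hypothetical, and the use of the population standard deviation and the handling of constant features are assumptions that the text does not specify.

```python
import numpy as np

def zscore_fit(X):
    """Per-feature mean and standard deviation of X, shape (n_samples, n_features)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)          # population std (ddof=0): an assumption
    std[std == 0.0] = 1.0        # assumption: leave constant features unscaled
    return mean, std

def standardize(X, mean, std):
    """Eq. (1): element-wise (x - mean) / std."""
    return (X - mean) / std

def centroids_to_raw(C_hat, mean, std):
    """Eq. (2): map centroids from standardized space back to raw space."""
    return C_hat * std + mean

rng = np.random.default_rng(0)
# Two features on very different scales.
X = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.2], size=(500, 2))
mean, std = zscore_fit(X)
X_hat = standardize(X, mean, std)

# A centroid computed in standardized space (here: the mean of the first 50 points)
c_hat = X_hat[:50].mean(axis=0)
# maps back onto the raw-space mean of the same points after Eq. (2).
c_raw = centroids_to_raw(c_hat, mean, std)
```

Because both maps are affine, a centroid found in the standardized space corresponds exactly to the centroid of the same points in the raw space, which is why distances at prediction time can be computed on unstandardized inputs.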

Second, the original G-means algorithm [31] iteratively splits clusters until each one passes a normality test. This can lead to excessive fragmentation in regions where clusters overlap, as shown in Figure 1b,c. To mitigate this issue, we introduce a user-specified parameter $k_{\max}$, representing the maximum allowable number of clusters. When an approximate cluster count is known in advance, this parameter can be set accordingly. In cases where no prior knowledge is available, $k_{\max}$ defaults to infinity, effectively reducing the algorithm to the original G-means behavior without such a constraint.

Based on these refinements, we develop an adjusted G-means incorporating Z-score standardization and the $k_{\max}$ constraint, whose resulting effect is shown in Figure 1d. Since the core framework remains unchanged, we continue to refer to it as G-means. The complete procedure is summarized in Algorithm 1, with the rectangular frame indicating the specific adjustments made to the original G-means algorithm [31].

Algorithm 1 G-means: $\{c_{1}^{x}, \dots, c_{k}^{x}\} \leftarrow Gmeans(x_{i} ∣ i = 1, 2, \dots, n; α, k_{\max})$
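The following Python sketch shows one way the adjusted G-means described above could be put together. It is an illustration under stated assumptions, not a transcription of Algorithm 1: the child-center initialization along the principal direction and the projection onto the line joining the two children follow the original G-means paper [31], while the `min_size` guard, the `max_rounds` limit, the hand-written Lloyd iterations, and the rule of simply stopping further splits once $k_{\max}$ is reached are assumptions. The Anderson–Darling test uses `scipy.stats.anderson`, so `alpha_percent` must be one of the levels it tabulates (15, 10, 5, 2.5, 1), and the sketch expects at least two features.

```python
import numpy as np
from scipy.stats import anderson

def lloyd(Z, centers, n_iter=50):
    """Plain Lloyd (K-means) iterations started from the given centers."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    labels = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return centers, labels

def is_gaussian(pts, kids, alpha_percent):
    """Anderson-Darling test on pts projected onto the line joining two child centers."""
    v = kids[0] - kids[1]
    res = anderson(pts @ v / (v @ v), dist="norm")
    idx = list(res.significance_level).index(alpha_percent)  # 15, 10, 5, 2.5 or 1
    return res.statistic <= res.critical_values[idx]

def gmeans(X, alpha_percent=5.0, k_max=np.inf, min_size=8, max_rounds=20):
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0.0] = 1.0
    Z = (X - mean) / std                       # Z-score standardization, Eq. (1)
    centers = Z.mean(axis=0, keepdims=True)    # start with a single cluster
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(max_rounds):
        proposed, k = [], len(centers)         # k counts clusters incl. pending splits
        for j, c in enumerate(centers):
            pts = Z[labels == j]
            kids = None
            if len(pts) >= min_size and k < k_max:
                evals, evecs = np.linalg.eigh(np.cov(pts.T))
                if evals[-1] > 0:
                    # Children start at c +/- m along the principal direction.
                    m = evecs[:, -1] * np.sqrt(2.0 * evals[-1] / np.pi)
                    kids, _ = lloyd(pts, [c + m, c - m])
                    if is_gaussian(pts, kids, alpha_percent):
                        kids = None            # cluster passes the test: keep it whole
            if kids is None:
                proposed.append(c)
            else:
                proposed.extend(kids)
                k += 1
        if len(proposed) == len(centers):      # nothing was split this round
            break
        centers, labels = lloyd(Z, proposed)   # refine all centers jointly
    return centers * std + mean, labels        # back to raw space, Eq. (2)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (300, 2)), rng.normal([8, 8], 1.0, (300, 2))])
centers, labels = gmeans(X, alpha_percent=1.0, k_max=2)
```

With `k_max=2` the two well-separated blobs are recovered and no further splitting is attempted; leaving `k_max` at its default of infinity gives the unconstrained behavior described in the text.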

2.2. G-Means Multivariate Decision Tree (GMDT) Model Generation

The objective of this study is to develop a supervised classification model based on a decision tree structure. We posit that the distribution patterns of data samples within the same class reflect their intrinsic characteristics, which in turn play a critical role in accurately classifying new instances of the class. Given the prevalence of Gaussian distributions in many real-world scenarios, we assume that samples from a single class typically conform to a Gaussian distribution in most cases. G-means clustering is particularly suitable for this context, as it identifies clusters that follow a Gaussian distribution. Therefore, we integrate G-means into the decision tree framework to detect single-class sample clusters as early as possible, thereby promoting a more compact tree structure.

In practice, actual distribution patterns of classes can be categorized into six cluster types listed in Figure 2. Non-Gaussian clusters can be further decomposed via G-means in Algorithm 1 into multiple Gaussian sub-clusters, and occasionally into non-Gaussian sub-clusters. The transformation relationships among these cluster types, derived from actual class distributions, are illustrated in Figure 2. After applying G-means recursively, the resulting clusters will become either single-class Gaussian clusters or multi-class Gaussian clusters (which may be non-overlapping or overlapping). The former ones correspond to the ideal case, readily assignable to their true class labels, whereas the latter ones, being impure, do not reveal clear classification rules and thus require further node bi-splitting. The proposed GMDT classification model is built by recursively performing adaptive multi-way node splitting via G-means (as described in Section 2.2.1), supplemented with a node bi-splitting strategy (introduced in Section 2.2.2).

2.2.1. Adaptive Multi-Splitting Multivariate Partition

Adaptive multi-splitting multivariate partition is designed to handle multi-class non-Gaussian clusters, specifically the cluster types iv and vi illustrated in Figure 2. This method employs G-means of Algorithm 1 to perform node splitting, as depicted in Figure 3. A key advantage of G-means is its ability to automatically determine the optimal number k of sub-clusters without requiring prior specification. As a result, G-means enhances the flexibility of local structures within the decision tree.

Figure 4 provides an example in which a multi-class non-overlapping non-Gaussian cluster (type iv) is partitioned by G-means into several single-class Gaussian sub-clusters (type i), all of which become leaf nodes. This represents the most ideal splitting scenario. In other cases, however, some of the resulting sub-clusters may remain multi-class and require further splitting. This is particularly challenging when the cluster at the current splitting node belongs to a multi-class overlapping non-Gaussian type (type vi). In such situations, G-means may generate sub-clusters of any of the six cluster types, as shown in Figure 3b.

When a multi-class overlapping non-Gaussian cluster (type vi) emerges, G-means tends to generate an excessive number of sub-clusters if the parameter $k_{\max}$ is set to infinity. A more effective strategy is to first separate the easily identifiable Gaussian sub-clusters, while deferring the handling of multi-class overlapping regions, which are difficult to partition into Gaussian sub-clusters, to subsequent local analysis. This suggests that setting $k_{\max}$ to a finite positive integer rather than infinity is beneficial. Since our GMDT is a supervised classification model with known class labels for training samples, we can conveniently use the total number of classes c to specify the parameter $k_{\max}$ in the G-means algorithm.

In short, adaptive multi-way node splitting discovers the easily identifiable Gaussian sub-clusters as early as possible through the following G-means procedure:

$$\{c_{1}^{x}, \dots, c_{k}^{x}\} \leftarrow Gmeans(x_{i} ∣ i = 1, 2, \dots, n; α, k_{\max} = c), \tag{3}$$

whose detailed steps follow Algorithm 1. However, G-means with $k_{\max} = c$ will not split a Gaussian cluster further, even if it is still multi-class. Therefore, a subsequent bi-splitting step is designed to handle such cases.

2.2.2. Node Bi-Splitting for Multi-Class Gaussian Clusters

Node bi-splitting is designed as a supplementary mechanism to handle multi-class Gaussian clusters (including types iii and v). G-means does not perform further splitting on Gaussian clusters, regardless of whether they contain multiple classes. To split data into more approximately pure clusters, the node bi-splitting procedure is activated if a Gaussian cluster encompasses more than one class.

We perform K-means with $k = 2$ to split multi-class Gaussian clusters, as shown in Figure 5. This follows the practice adopted in BDTKS [27]; in our method, however, the step is integrated into the GMDT framework as a component that addresses the incomplete splitting of multi-class Gaussian clusters. The centroids of the two sample subsets obtained by partitioning along the principal direction serve as the initial centers of K-means. The detailed procedure of node bi-splitting is summarized in Algorithm 2. Given the data points of a multi-class Gaussian cluster, $\{x ∣ x \in c^{x}\}$, the node bi-splitting procedure is represented as

$$\{c_{1}^{x}, c_{2}^{x}\} \leftarrow BiSplitting(x ∣ x \in c^{x}). \tag{4}$$

Figure 5. Local model structures produced by node bi-splitting. The Roman numerals refer to the cluster types in Figure 2. (a) The possibly produced local structures for a cluster of type iii. (b) The possibly produced local structures for a cluster of type v.

Algorithm 2 Node bi-splitting: $\{c_{1}^{x}, c_{2}^{x}\} \leftarrow BiSplitting(x ∣ x \in c^{x})$

Input: Multi-class Gaussian cluster data points $\{x ∣ x \in c^{x}\}$

Output: Clustering results with 2 clusters

1: Standardize data: $\hat{x} \leftarrow (x - \bar{x}) ⊘ σ_{x} ∣ x \in c^{x}$.
2: Perform PCA to obtain the principal component corresponding to the largest eigenvalue: $w \leftarrow PCA(\hat{x} ∣ x \in c^{x})$.
3: Project data: $p(\hat{x}) \leftarrow w^{⊤} \hat{x} ∣ x \in c^{x}$.
4: Obtain median: $m \leftarrow Median(p(\hat{x}) ∣ x \in c^{x})$.
5: Generate initial centers: $initialC \leftarrow [c_{1}^{\hat{x}}, c_{2}^{\hat{x}}]$, where $c_{1}^{\hat{x}} \leftarrow Mean(\hat{x} ∣ p(\hat{x}) > m, x \in c^{x})$, $c_{2}^{\hat{x}} \leftarrow Mean(\hat{x} ∣ p(\hat{x}) \leq m, x \in c^{x})$.
6: Split cluster: $c_{1}^{\hat{x}}, c_{2}^{\hat{x}} \leftarrow Kmeans(\hat{x} ∣ x \in c^{x}, initialC)$.
7: Get final centroids: $C \leftarrow \{c_{1}^{x}, c_{2}^{x}\}$, where $c_{1}^{x} \leftarrow c_{1}^{\hat{x}} ⊙ σ_{x} + \bar{x}$, $c_{2}^{x} \leftarrow c_{2}^{\hat{x}} ⊙ σ_{x} + \bar{x}$.
8: return Two clusters with centers $C = \{c_{1}^{x}, c_{2}^{x}\}$.
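The eight steps of Algorithm 2 translate fairly directly into NumPy. The sketch below is one possible rendering, with step numbers in the comments; the function name, the hand-written Lloyd loop standing in for the K-means call, the iteration cap, and the treatment of constant features are assumptions, and the input is expected to have at least two features and some spread along the principal direction.

```python
import numpy as np

def bi_splitting(X, n_iter=100):
    """Algorithm 2 sketch: PCA-median initialization followed by K-means with k = 2."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0.0] = 1.0                           # assumption: constant features unscaled
    Z = (X - mean) / std                            # step 1: standardize
    _, evecs = np.linalg.eigh(np.cov(Z.T))          # step 2: PCA (eigenvalues ascending)
    w = evecs[:, -1]                                #         direction of largest eigenvalue
    p = Z @ w                                       # step 3: project
    m = np.median(p)                                # step 4: median of projections
    centers = np.vstack([Z[p > m].mean(axis=0),     # step 5: initial centers from the
                         Z[p <= m].mean(axis=0)])   #         two median-split halves
    for _ in range(n_iter):                         # step 6: Lloyd iterations (K-means)
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.vstack([Z[labels == j].mean(axis=0) if np.any(labels == j)
                             else centers[j] for j in range(2)])
        if np.allclose(updated, centers):
            break
        centers = updated
    return centers * std + mean, labels             # step 7: centroids in raw space

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)), rng.normal([6, 6], 1.0, (200, 2))])
centers, labels = bi_splitting(X)
```

Seeding K-means from the two halves of a median cut along the principal direction makes the split deterministic for a given node, which avoids the run-to-run variation of random initialization.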

2.2.3. GMDT Model Generation Algorithm

Our proposed GMDT model, like conventional decision tree frameworks, is constructed by recursively partitioning the sample point set at each node into subsets, which subsequently become either leaf nodes or new splitting nodes. The key distinction lies in its node-splitting strategy: G-means clustering is integrated to enable adaptive multi-way splitting, while bi-splitting is retained as a supplementary mechanism. By incorporating G-means, the GMDT model isolates single-class Gaussian clusters into leaf nodes at earlier stages, thereby promoting a more compact structure.

To fully delineate the GMDT generation algorithm, we elaborate on several additional details. First, the conditions for forming a leaf node in the GMDT model are defined as follows: (1) all samples at the node belong to the same class, or (2) the number of samples is lower than a predefined threshold $min\_samples\_split$. Notably, by virtue of condition (1), single-class non-Gaussian clusters (type ii) produced by G-means can also directly become leaf nodes. Second, imposing a constraint on the maximum number of sub-clusters generated by G-means (specifically, setting $k_{\max} = c$) enhances both model conciseness and generalization. Under this constraint, certain clusters may remain non-Gaussian after applying G-means. If a non-Gaussian cluster contains samples from only one class, it will directly become a leaf node in the next recursion without further partitioning. Conversely, if a non-Gaussian cluster encompasses multiple classes, deferring its detailed analysis to the next recursion, rather than subdividing it immediately, proves more beneficial for generalization.

Based on the aforementioned descriptions, the procedure for generating the GMDT model is accordingly summarized in Algorithm 3.

Algorithm 3 GMDT generation: $genGMDT((x, y) ∣ x \in c^{x}; α, min\_samples\_split)$

Input: Labeled training set $\{(x, y) ∣ x \in c^{x}\}$, where $c^{x}$ is the centroid and $y \in \{l_{1}, \dots, l_{c}\}$.

Parameters: Significance level $α$ and $min\_samples\_split$

Output: Tree model with centroid sets $\{C\}$ and leaf labels $\{l\}$

1: Sample set at current node: $S \leftarrow \{(x, y) ∣ x \in c^{x}\}$.
2: if S is pure or $|S| < min\_samples\_split$ then
3:     Turn into leaf node: mark the current node as leaf node.
4:     Store leaf label: $l \leftarrow l_{q^{*}}$, where $q^{*} = \arg\max_{q} \{\sum_{i = 1}^{|S|} δ(y_{i} = l_{q}) ∣ q = 1, 2, \dots, c\}$.
5: else
6:     Perform G-means: $\{c_{1}^{x}, \dots, c_{k}^{x}\} \leftarrow Gmeans(x ∣ x \in c^{x}; α, k_{\max} = c)$.
7:     if $k = 1$ then
8:         Conduct node bi-splitting: $\{c_{1}^{x}, c_{2}^{x}\} \leftarrow BiSplitting(x ∣ x \in c^{x})$.
9:         $k \leftarrow 2$.
10:     end if
11:     Store centroid set: $C \leftarrow \{c_{1}^{x}, \dots, c_{k}^{x}\}$.
12:     for $j = 1$ to k do
13:         $c^{x} \leftarrow c_{j}^{x}$.
14:         $genGMDT((x, y) ∣ x \in c^{x}; α, min\_samples\_split)$.
15:     end for
16: end if
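The recursive control flow of Algorithm 3 can be sketched independently of the clustering routine by passing the node splitter in as a function. In the illustration below, the dictionary-based node layout, the `split_fn` interface, and the guard that turns a node into a leaf when a split fails to separate its samples are all assumptions. The `median_split` function is only a stand-in so that the example runs on its own; it is not G-means, and in GMDT its place would be taken by G-means with $k_{\max} = c$ followed by the bi-splitting fallback when a single cluster is returned.

```python
import numpy as np
from collections import Counter

def gen_tree(X, y, split_fn, min_samples_split=2):
    """Recursive tree growth: leaf if pure or too small, otherwise split and recurse."""
    counts = Counter(y.tolist())
    majority = counts.most_common(1)[0][0]           # leaf label by majority vote
    if len(counts) == 1 or len(y) < min_samples_split:
        return {"leaf": majority}
    centroids = np.asarray(split_fn(X))              # centroids in raw feature space
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dist.argmin(axis=1)                     # each sample joins its nearest centroid
    used = np.unique(assign)
    if len(used) < 2:                                # guard (assumption): nothing separated
        return {"leaf": majority}
    children = [gen_tree(X[assign == j], y[assign == j], split_fn, min_samples_split)
                for j in used]
    return {"centroids": centroids[used], "children": children}

def median_split(X):
    """Stand-in splitter (NOT G-means): median cut along the highest-variance feature."""
    p = X[:, X.var(axis=0).argmax()]
    m = np.median(p)
    if not np.any(p > m):                            # degenerate node: return one centroid
        return X.mean(axis=0, keepdims=True)
    return np.vstack([X[p > m].mean(axis=0), X[p <= m].mean(axis=0)])

rng = np.random.default_rng(0)
means = [(0, 0), (0, 10), (10, 0), (10, 10)]
X = np.vstack([rng.normal(mu, 0.5, (30, 2)) for mu in means])
y = np.repeat(np.arange(4), 30)
tree = gen_tree(X, y, median_split)
```

Because every accepted split sends samples to at least two non-empty children, each recursive call receives strictly fewer samples than its parent, so the recursion terminates.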

2.3. Class Label Prediction

Upon generation of the GMDT model, we can leverage it to predict class labels for testing instances. Since all stored centroid sets have been inversely transformed back into the original feature space of the training data, we can directly compute the distance from a test example to the centroids at any splitting node during inference.

Given a test instance $x$, it is passed through the GMDT model by computing, at each splitting (non-leaf) node, the Euclidean distances between $x$ and the stored centroids. The instance is then routed to the child node $j^{*}$ corresponding to the closest centroid, determined as follows:

$$j^{*} = \arg\min_{j} \{∥x - c_{j}^{x}∥_{2} ∣ j = 1, 2, \dots, k\}. \tag{5}$$

This process continues until a leaf node is reached. The label l associated with this leaf node is assigned as the predicted class of $x$, that is,

$$y_{pre}(x) = l. \tag{6}$$
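The routing rule of Equations (5) and (6) amounts to a short loop. The sketch below uses only the standard library; the nested-dictionary tree layout, the hand-built example tree, and the class names in it are hypothetical and serve only to make the example runnable.

```python
import math

def predict(tree, x):
    """Route x to the child with the nearest centroid until a leaf is reached."""
    node = tree
    while "leaf" not in node:
        dists = [math.dist(x, c) for c in node["centroids"]]   # Euclidean distances
        node = node["children"][dists.index(min(dists))]       # Eq. (5)
    return node["leaf"]                                        # Eq. (6)

# Hand-built toy tree: a three-way split at the root, then a two-way split.
tree = {
    "centroids": [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)],
    "children": [
        {"leaf": "water"},
        {
            "centroids": [(9.0, -1.0), (11.0, 1.0)],
            "children": [{"leaf": "forest"}, {"leaf": "crop"}],
        },
        {"leaf": "urban"},
    ],
}

print(predict(tree, (10.8, 0.9)))   # -> crop
print(predict(tree, (1.0, 1.0)))    # -> water
```

Each prediction costs one distance computation per centroid on the path from the root to a leaf, so inference time grows with tree depth and branching factor rather than with the size of the training set.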

3. Results

All experiments were implemented in Python 3.11 and carried out on a computational server equipped with 64 GB of RAM and an Intel(R) Core(TM) i7-8700 CPU running at 3.20 GHz. We first evaluated the performance of our GMDT on synthetic multi-class data to provide a visual assessment, and subsequently validated its practical effectiveness on noisy remote sensing RGB image scene classification and noisy remote sensing hyperspectral image classification. In all GMDT experiments, the training set is used to build models and the validation set is used to tune the significance level $α$ and the parameter $min\_samples\_split$ over the 25 combinations of $α \in \{0.01, 0.025, 0.05, 0.1, 0.15\}$ and $min\_samples\_split \in \{2, 5, 10, 15, 20\}$; in the testing stage, the testing set is used to predict the classification results with the selected GMDT model and to evaluate its performance.

3.1. Simulation Experiments on Synthetic Multi-Class Dataset

3.1.1. Synthetic Dataset

To visually and quantitatively assess the proposed GMDT, we first constructed a 2D synthetic multi-class dataset comprising 12 classes with varied distributions. Ten of these classes are Gaussian clusters, one class follows a heavy-tailed distribution, and one is drawn from a copula distribution. Each Gaussian cluster contains 200 example points with standard deviation 0.7. The heavy-tailed class is generated from a multivariate Student’s t-distribution with 3 degrees of freedom and comprises 1000 example points whose mean and covariance are $[0, 0]$ and $[2, 0.8; 0.8, 1]$. The copula distribution class is generated by first sampling 900 points from a bivariate Gaussian distribution with mean [1, −1] and correlation 0.8, then transforming these points to uniform marginals using the normal Cumulative Distribution Function (CDF), and finally applying the inverse CDFs of target distributions: exponential with scale = 2 for the first feature and gamma with shape a = 2 and scale = 1 for the second feature. Training/validation/test splits were 80%/10%/10%. As shown in Figure 6a,b, some of the classes overlap with each other.
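The two non-Gaussian classes can be generated along the following lines. This is a hedged sketch, not the authors' generator: it treats the stated matrix as the shape matrix of the t-distribution, assumes unit variances for the bivariate Gaussian (only its correlation is given), and applies the standard normal CDF as a literal reading of the text, in which case the non-zero mean makes the intermediate marginals only approximately uniform. The seed and function names are arbitrary, and the centers of the ten Gaussian clusters are not given in the text, so they are omitted.

```python
import numpy as np
from scipy.stats import norm, expon, gamma

rng = np.random.default_rng(42)

def heavy_tailed_class(n=1000, df=3):
    """Multivariate Student's t samples via a Gaussian / chi-square mixture."""
    mean = np.array([0.0, 0.0])
    shape = np.array([[2.0, 0.8], [0.8, 1.0]])   # assumption: used as the shape matrix
    z = rng.multivariate_normal(np.zeros(2), shape, size=n)
    u = rng.chisquare(df, size=n)
    return mean + z * np.sqrt(df / u)[:, None]

def copula_class(n=900, rho=0.8):
    """Gaussian copula with exponential and gamma marginals."""
    corr = np.array([[1.0, rho], [rho, 1.0]])    # assumption: unit variances
    g = rng.multivariate_normal([1.0, -1.0], corr, size=n)
    u = norm.cdf(g)                              # standard normal CDF, values in (0, 1)
    x1 = expon.ppf(u[:, 0], scale=2.0)           # exponential marginal, scale = 2
    x2 = gamma.ppf(u[:, 1], a=2.0, scale=1.0)    # gamma marginal, shape a = 2, scale = 1
    return np.column_stack([x1, x2])

X_t = heavy_tailed_class()
X_cop = copula_class()
```

With ten Gaussian clusters of 200 points each, the full dataset has 2000 + 1000 + 900 = 3900 points, so the 10% test split contains 390 points.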

3.1.2. Comparison Baselines and Configurations

Five representative and state-of-the-art decision trees were used as baselines: C4.5 [4], CART [5], MDT2 [26], BDTKS [27], and STree [14,15]. Among them, C4.5 and CART are univariate decision trees, whereas MDT2, BDTKS, and STree are multivariate decision trees. In addition, a linear SVM with an OvA strategy [34] (referred to simply as SVM) and Random Forest (RaF) [35] were compared as classifiers of a different category. For all five baseline decision trees as well as GMDT, the $min\_samples\_split$ parameter was tuned over $\{2, 5, 10, 15, 20\}$, and the larger value was preferred when several values achieved the same performance. For MDT2, the purity threshold $\delta$ was tuned over $\{0.6, 0.7, 0.8, 0.9, 1.0\}$, and for BDTKS, the parameter $\lambda$ was selected from $\{0.0001, 0.001, 0.01, 0.1, 1\}$, with the smaller $\delta$ and the larger $\lambda$ preferred in case of ties. For STree, the internal SVM used the same default settings as STree-default in [14]. For SVM, the parameter C was tuned over $\{0.01, 0.1, 1, 10, 100\}$. RaF used the default settings of the scikit-learn 1.8.0 package. For every tuned method, the configuration with the highest validation accuracy was used for the final evaluation.
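The tie-breaking rule in this selection procedure can be made concrete with a short sketch. The helper name `select_config`, the candidate accuracies, and the priority order between parameters when both tie are assumptions made for illustration; only the preference directions (larger $min\_samples\_split$, smaller $\delta$) come from the text.

```python
def select_config(results, prefer):
    """Pick the configuration with the highest validation accuracy.

    results: list of (config_dict, validation_accuracy) pairs.
    prefer:  maps a parameter name to +1 (larger value wins ties) or
             -1 (smaller value wins ties); earlier entries take priority.
    """
    def key(item):
        config, acc = item
        return (acc,) + tuple(sign * config[name] for name, sign in prefer.items())

    return max(results, key=key)[0]


# Hypothetical validation results for MDT2 (values are made up).
candidates = [
    ({"min_samples_split": 2, "delta": 0.8}, 0.90),
    ({"min_samples_split": 10, "delta": 0.8}, 0.90),
    ({"min_samples_split": 10, "delta": 0.6}, 0.90),
    ({"min_samples_split": 20, "delta": 0.6}, 0.85),
]
best = select_config(candidates, {"min_samples_split": +1, "delta": -1})
```

Here three configurations tie at 0.90, and the rule keeps the one with the larger $min\_samples\_split$ and then the smaller $\delta$.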

3.1.3. Performance Comparison

As shown in Figure 7, GMDT produced the cleanest classification regions, with fewer misclassifications around overlapping class regions, which suggests that GMDT extracts more discriminative information from the training data. In Figure 7, numerical labels mark the training class centers; only the sample points misclassified by each of the eight classifiers are plotted, and correctly classified points are omitted for visual clarity. Quantitatively, GMDT achieves a classification accuracy (ACC) of 0.8231 and a macro F1-score (F1) of 0.8310. It ties in ACC with RaF, the second-best of the eight classifiers, and surpasses it in F1 by +1.01%. Since this synthetic dataset is class-imbalanced, SVM (ACC 0.6333, F1 0.5552) and STree (ACC 0.6256, F1 0.7061), both of which use an OvA strategy, clearly show their shortcomings. Thus, both the visual analysis and the quantitative metrics indicate the superior performance of GMDT on this dataset.

In addition, GMDT produced a compact tree model. It has a tree depth of 7 with 215 split nodes, compared to C4.5 (depth 118, 613 split nodes), CART (19, 193), MDT2 (9, 250), BDTKS (13, 178), and STree (6, 13). Although STree appears even more compact, its classification performance is markedly inferior, achieving an ACC of 0.6256 and an F1 of 0.7061—both roughly 10 to 20 percentage points lower than those of GMDT.

3.1.4. Ablation Study of $k_{max}$

We additionally conducted ablation experiments on the synthetic 12-class dataset to investigate the influence of the parameter $k_{max}$ in the G-means procedure within GMDT. We varied $k_{max}$ over $\frac{c}{2}, c, 2c, 4c, 8c, 16c$, where $c = 12$ in our case, and recorded the corresponding metrics of GMDT, including ACC, F1, the maximum depth $N_{dp}$, the number of split nodes $N_{sp}$, and the number of leaf nodes $N_{lf}$. All the results are summarized in Table 1. When $k_{max}$ is set too small ($\frac{c}{2}$), GMDT loses some of its adaptive multi-splitting capability, leading to a more complex tree and lower classification performance (ACC 0.8051, F1 0.8060). As $k_{max}$ increases to $c$, performance improves (ACC 0.8231, F1 0.8310), indicating that allowing more sub-clusters per split helps separate Gaussian-like class components. The best performance (ACC 0.8308, F1 0.8403) is achieved at $k_{max} = 4c$. Increasing $k_{max}$ beyond $4c$ decreases performance and may cause over-fragmentation. Although $k_{max} = c$ gives a second-best but acceptable performance, the number of classes $c$ is always readily available, so we set $k_{max} = c$ in GMDT in all the subsequent experiments.

3.2. Noisy Remote Sensing RGB Image Scene Classification

We applied the proposed GMDT classifier to noisy remote sensing image scene classification to validate its efficacy. The following subsections describe the datasets, the feature extraction procedure, the classification performance of GMDT, and a comprehensive comparison with other decision tree models.

3.2.1. Datasets

We collected eight remote sensing RGB scene image datasets: “AID” [36], “MASATI” [37], “PatternNet” [38], “RSC11” [39], “RSI-CB256” [40], “RSSCN7” [41], “UCM” [42], and “WHU-RS19” [43]. For each raw dataset, we first eliminated duplicate images, since a large number of identical images in the training set increases the risk of overfitting. All images were then resized to $224 \times 224$ pixels to match the input size of the network. A multitude of factors such as sensor characteristics, signal transmission, and environmental conditions can introduce noise into remote sensing images, and Gaussian noise is the most common type [44]. Therefore, Gaussian noise with zero mean and a standard deviation of 25 was added to each color channel of every RGB image to form the noisy remote sensing scene image datasets used here. Each dataset was subsequently split into 70% training, 10% validation, and 20% testing sets using stratified random sampling. Information about the remote sensing RGB scene image datasets is provided in Table 2.
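The noise injection and the stratified 70%/10%/20% split can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: both function names are hypothetical, rounding and clipping the noisy values to the 8-bit range [0, 255] is an assumption (the text does not say how out-of-range values are handled), and per-class split sizes are rounded to the nearest integer.

```python
import random
from collections import defaultdict


def add_gaussian_noise(pixels, sigma=25.0, rng=None):
    """Add zero-mean Gaussian noise to a flat list of 8-bit channel values."""
    rng = rng or random.Random()
    # Assumption: results are rounded and clipped back to [0, 255].
    return [min(255, max(0, round(v + rng.gauss(0.0, sigma)))) for v in pixels]


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy example: an imbalanced two-class label list and one tiny "image".
labels = [0] * 100 + [1] * 50
train_idx, val_idx, test_idx = stratified_split(labels)
noisy = add_gaussian_noise([0, 128, 255, 64], sigma=25.0, rng=random.Random(1))
```

Splitting within each class keeps rare classes represented in all three sets, which matters for the imbalanced datasets discussed later.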

3.2.2. Feature Extraction

To extract informative features from the image instances, we adapted the ResNet18 architecture [45] by removing its final fully connected layer and inserting two new fully connected layers: one with 32 output units and another with c output units, where c denotes the number of classes. This adapted version is referred to as MR18. We retained the pre-trained ResNet18 weights as initial weights for the unaltered layers, while the weights of the newly added layers were randomly initialized. The model was trained with cross-entropy loss on our training set for 10 epochs with a batch size of 16, using the Adam optimizer with an initial learning rate of 0.001. The model with the best validation accuracy was saved as the final MR18 model.

Using the trained MR18 model, we extracted 32-dimensional features from the 32-output-unit fully connected layer for the training, validation, and testing sets.

3.2.3. GMDT Classification

We applied our proposed GMDT to conduct classification based on the extracted 32-dimensional features for all eight remote sensing RGB scene image datasets. Figure 8 shows the classification accuracy values of MR18 and MR18 + GMDT.

The outcomes in Figure 8 can be summarized as follows. MR18 + GMDT attains notably higher classification accuracy than MR18 on five of the eight datasets. The remaining three datasets are “AID”, “PatternNet”, and “WHU-RS19”: MR18 + GMDT is marginally better than or very similar to MR18 on the first two, and it underperforms MR18 on “WHU-RS19”. We attribute these differences to the intrinsic characteristics of the datasets. Since “AID” is moderately imbalanced, with class sizes ranging from 154 to 294 training samples, MR18 + GMDT offers only a slight improvement in classification accuracy over MR18. For “PatternNet”, which features perfect class balance and a large number of training samples, MR18 + GMDT maintains a classification performance similar to that of MR18. As for “WHU-RS19”, GMDT fails to fully realize its potential because of the near-perfect class balance and the limited number of training samples. GMDT needs multiple classes with sufficiently large sample sizes to separate pure Gaussian clusters effectively, and it is less effective when every class has very few samples. Overall, these findings suggest that although deep learning-based feature extraction is essential, GMDT still plays a valuable role in enhancing classification performance, particularly under conditions of extreme class imbalance and moderate training sample sizes.

Minority-class recall was also recorded to assess the benefit of GMDT under class imbalance. The five class-imbalanced datasets are “AID”, “MASATI”, “RSC11”, “RSI-CB256”, and “WHU-RS19”, and Table 3 reports the minority-class recall achieved by MR18 and MR18 + GMDT on them. Notably, for highly imbalanced datasets such as “AID”, “MASATI”, and “RSI-CB256”, MR18 + GMDT yields a substantial improvement in minority-class recall over MR18. In contrast, for slightly class-imbalanced datasets such as “RSC11” and “WHU-RS19”, the gain in minority-class recall vanishes. Nevertheless, GMDT still brings some improvement in classification accuracy on “RSC11”, which is not entirely composed of small-sample classes. Overall, GMDT proves effective in highly imbalanced multi-class classification scenarios.
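For readers who want to reproduce this metric, one plausible way to compute minority-class recall is sketched below. The definition of the minority class as the class with the fewest training samples is an assumption, since this section does not spell it out, and the function name and toy labels are hypothetical.

```python
from collections import Counter


def minority_class_recall(train_labels, y_true, y_pred):
    """Recall on the class that has the fewest training samples (assumed definition)."""
    minority = min(Counter(train_labels).items(), key=lambda kv: kv[1])[0]
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == minority)
    total = sum(1 for t in y_true if t == minority)
    return minority, (hits / total if total else float("nan"))


# Toy example: class "b" is the minority; 3 of its 4 test samples are recovered.
train_labels = ["a"] * 50 + ["b"] * 5
y_true = ["a"] * 10 + ["b"] * 4
y_pred = ["a"] * 10 + ["b", "a", "b", "b"]
minority, recall = minority_class_recall(train_labels, y_true, y_pred)
```

Unlike overall accuracy, which is 13/14 in this toy case, the minority-class recall of 0.75 is not diluted by the majority class.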

We further explored the sensitivity of GMDT to two key parameters, $\alpha$ and $min\_samples\_split$. On both the small “UCM” and the large “PatternNet” datasets, we varied $\alpha = 0.01, 0.025, 0.05, 0.1, 0.15$ and $min\_samples\_split = 2, 5, 10, 15, 20$. For each combination, a GMDT model was trained on the training set and evaluated on the testing set using classification accuracy. The results are visualized as heatmaps in Figure 9. On the “UCM” dataset, classification accuracy remains relatively stable with respect to $\alpha$, with fluctuations confined to within 1.5 percentage points across $\alpha$ values, whereas it changes more markedly across $min\_samples\_split$ values, where the variation may exceed 1.5 percentage points. On the “PatternNet” dataset, the classification accuracy is generally high (mostly above 0.97) across all parameter combinations, indicating that GMDT remains effective on this larger and more complex dataset. However, the heatmap reveals a clear sensitivity to $min\_samples\_split$: smaller values such as 2 or 5 consistently yield higher accuracy, while larger values lead to a slight but noticeable drop. In contrast, the influence of $\alpha$ appears secondary. In summary, the large dataset “PatternNet” shows more stable performance than the small dataset “UCM”, and $\alpha$ has a milder impact on both datasets than $min\_samples\_split$.
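The statement that accuracy fluctuates less along $\alpha$ than along $min\_samples\_split$ corresponds to comparing accuracy ranges along the two axes of the heatmap. A minimal sketch of that comparison follows; the function name and the 2 × 2 grid of accuracies are hypothetical and do not reproduce Figure 9.

```python
def sensitivity_ranges(acc, alphas, splits):
    """Largest accuracy range along each axis of a parameter grid.

    acc maps (alpha, min_samples_split) to an accuracy value.
    Returns (max range when only alpha varies,
             max range when only min_samples_split varies).
    """
    over_alpha = max(
        max(acc[(a, s)] for a in alphas) - min(acc[(a, s)] for a in alphas)
        for s in splits
    )
    over_split = max(
        max(acc[(a, s)] for s in splits) - min(acc[(a, s)] for s in splits)
        for a in alphas
    )
    return over_alpha, over_split


# Hypothetical accuracies on a reduced grid (not values from the paper).
grid = {
    (0.01, 2): 0.90, (0.05, 2): 0.91,
    (0.01, 20): 0.87, (0.05, 20): 0.88,
}
range_alpha, range_split = sensitivity_ranges(grid, [0.01, 0.05], [2, 20])
```

In this toy grid the range over $\alpha$ is 1 percentage point while the range over $min\_samples\_split$ is 3 percentage points, mirroring the qualitative pattern described above.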

3.2.4. Comparison to Other Decision Tree Classifiers

Since the proposed GMDT is an improved decision tree, we focus on comparison with other single decision tree classifiers. We present a comprehensive evaluation against the five baseline decision tree models described in Section 3.1.2 using multiple metrics. All implementations used the same 32-dimensional features extracted by MR18; for brevity, the “MR18 +” prefix is omitted in the remainder of this section. ACC and F1 were adopted to evaluate classification performance, while the number of split nodes $N_{sp}$, the number of leaf nodes $N_{lf}$, and the maximum depth $N_{dp}$ were used to reflect the complexity of the decision tree models. Additionally, we report the training time $T_{tr}$ and the prediction time on the entire testing set $T_{pr}$ to directly compare computational efficiency.
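The three complexity measures can be computed with a single traversal of a tree in which a node may have any number of children. The sketch below is illustrative: the `Node` class is a hypothetical stand-in for a fitted tree, and counting the root as depth 0 is an assumption about the depth convention.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; a node without children is a leaf."""
    children: list = field(default_factory=list)


def tree_complexity(root):
    """Return (number of split nodes, number of leaf nodes, maximum depth)."""
    n_split = n_leaf = max_depth = 0
    stack = [(root, 0)]  # assumption: the root sits at depth 0
    while stack:
        node, depth = stack.pop()
        max_depth = max(max_depth, depth)
        if node.children:
            n_split += 1
            stack.extend((child, depth + 1) for child in node.children)
        else:
            n_leaf += 1
    return n_split, n_leaf, max_depth


# A three-way split at the root, one child of which splits again in two.
root = Node([Node(), Node([Node(), Node()]), Node()])
n_sp, n_lf, n_dp = tree_complexity(root)
```

The example shows why a multi-splitting tree can have several leaves per split node: two split nodes already produce four leaves at depth 2.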

We begin with a comparative analysis of classification performance. Table 4 presents the classification accuracy and macro F1-score of all six decision tree models. The proposed GMDT generally achieves the best or nearly the best performance on both metrics among the six decision tree methods, suggesting its potential for better generalization. Notably, GMDT performs better than the other decision tree methods on datasets with many classes, such as “AID” and “PatternNet”. Friedman tests conducted at a significance level of 0.05 on both classification accuracy and macro F1-score indicate statistically significant differences among the six methods. Nemenyi post hoc tests were then conducted at the same significance level. The critical difference and the average ranks are summarized in Figure 10, which shows that GMDT is significantly superior to C4.5, CART, and MDT2 in terms of both classification accuracy and macro F1-score. Although the Nemenyi test does not show GMDT to be significantly superior to all baselines, GMDT attains the highest mean value across all datasets in both ACC (0.9010) and F1 (0.8967), as shown in Table 4.
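The rank-based comparison can be sketched as follows, assuming the standard formulation of the Friedman statistic (without tie correction) and of the Nemenyi critical difference, $CD = q_{\alpha} \sqrt{k(k+1)/(6N)}$ for $k$ methods and $N$ datasets. The $q_{0.05}$ constants are the commonly tabulated Studentized-range-based values; the function names and toy scores are hypothetical, and the values reported in Figure 10 remain authoritative.

```python
import math

# Commonly tabulated Nemenyi constants q_0.05, keyed by the number of methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850}


def average_ranks(scores):
    """scores[i][j] is the metric of method j on dataset i (higher is better).

    Rank 1 is best; tied methods share the mean of the ranks they occupy.
    """
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2 + 1
            for j in order[pos:end + 1]:
                totals[j] += mean_rank
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(avg_ranks, n):
    """Friedman chi-square statistic from average ranks over n datasets."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n):
    """Critical difference in average rank at a significance level of 0.05."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n))


# Six methods on eight datasets, as in this comparison.
cd = nemenyi_cd(6, 8)
# Toy scores for three methods on two datasets (values are made up).
toy_ranks = average_ranks([[0.9, 0.8, 0.7], [0.6, 0.6, 0.5]])
```

Under these assumptions the critical difference for six methods and eight datasets is roughly 2.67, so two methods differ significantly only if their average ranks are at least that far apart.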

We then compare the complexity of the decision tree models. Table 5 displays $N_{sp}$, $N_{lf}$, and $N_{dp}$ for each decision tree model at the parameter value that achieves the highest classification accuracy. C4.5 and CART use a greedy search to select the optimal attribute and cut point based on impurity measures at each node. As a result, they typically require fewer split nodes ($N_{sp}$) and generate fewer leaf nodes ($N_{lf}$) than MDT2 and BDTKS, which rely on unsupervised heuristic splitting. However, the greedy approach in C4.5 and CART can lead to overfitting, which may in turn result in poor classification performance on unseen data. GMDT tends to have smaller $N_{sp}$ values than the other multivariate decision trees and even comparable or smaller $N_{sp}$ than C4.5 and CART. This is primarily attributed to its early separation of pure Gaussian clusters. Owing to its multi-splitting nature, GMDT often generates more leaf nodes per split, yet the total number of leaf nodes $N_{lf}$ remains comparable or even lower in some cases thanks to its reduced $N_{sp}$. Regarding $N_{dp}$, GMDT yields a smaller depth than the five baseline decision tree methods on nearly all of the eight datasets. Since STree combines supervised SVM learning with impurity measures to find the best oblique split at each node, it tends to produce a compact model. Even so, GMDT yields a smaller $N_{dp}$ than STree on six datasets. In summary, compared to the other decision tree models evaluated, GMDT exhibits a compact structure characterized by a shallower maximum depth, fewer split nodes, and a moderate number of leaf nodes.

To further investigate the efficiency of these decision tree methods, we compare their training time $T_{tr}$ and prediction time $T_{pr}$. The recorded times for all six decision tree methods are shown in Table 6. In terms of training time $T_{tr}$, the univariate C4.5 and CART generally require considerably more time than the four multivariate decision trees, particularly when the training sample size is large, as with “PatternNet”. GMDT takes longer to train than BDTKS because it performs multiple runs of K-means at each multi-splitting node. Nevertheless, across all eight datasets, GMDT requires at most about five times the training time of BDTKS. This added training cost is acceptable given the gains in tree compactness and classification accuracy, and it remains far lower than that of the univariate C4.5 and CART on large datasets. Overall, the training time of GMDT remains comparable to that of both BDTKS and STree. As for prediction time $T_{pr}$, the univariate C4.5 and CART hold a natural advantage but at the cost of inferior prediction performance. Similarly, MDT2 needs the least prediction time among the four multivariate decision trees but provides relatively poor classification results. Although GMDT tends to demand a bit more prediction time than BDTKS and STree, even the dataset with the longest prediction time $T_{pr}$ takes less than two seconds, and the differences in prediction time among BDTKS, STree, and GMDT are relatively small.

3.3. Noisy Remote Sensing Hyperspectral Image Classification

3.3.1. HSI Datasets and Preprocessing

We also applied our GMDT classifier to noisy remote sensing hyperspectral image (HSI) classification, a typical multi-class classification scenario. Indian Pines (“IN”) and Pavia University (“UP”) were the two HSI datasets adopted in our experiments. The “IN” dataset includes 16 mutually exclusive land-cover classes with a highly imbalanced number of samples per class. It is an HSI with 145 × 145 pixels and 200 spectral bands ranging from 0.4 to 2.5 µm, covering the visible and infrared spectral ranges. The “UP” dataset includes 9 mutually exclusive land-cover classes with a moderately imbalanced number of samples per class. It is an HSI with 610 × 340 pixels and 103 spectral bands (after removing noisy bands), ranging from 0.43 to 0.86 µm, covering the visible to near-infrared spectral range.

Gaussian white noise with zero mean and a standard deviation of $\sigma = 0$ (no noise), $\sigma = 0.1$, or $\sigma = 0.2$ was added to each raw HSI dataset to form the noisy datasets used in our experiments. A total of 10,249 and 42,776 labeled pixel samples belong to the land-cover classes in the “IN” and “UP” datasets, respectively, and the remaining pixels are background. The non-background pixel samples were divided into training, validation, and testing sets with ratios of 10%, 10%, and 80%. We repeated the stratified random sampling five times to obtain five such splits, one for each of the five experimental runs.

3.3.2. Experimental Settings

We employed an attention-based adaptive spectral–spatial kernel improved ResNet (A²S²K-ResNet) [46] for HSI classification as a baseline. Our HSI classification solution (A²S²K-ResNet + GMDT) applies the proposed GMDT classifier to the features that A²S²K-ResNet outputs immediately before the fully connected layer of its softmax classifier, taking advantage of its ability to capture discriminative spectral–spatial features. First, A²S²K-ResNet was trained with the following settings: 10 epochs, a batch size of 32, a learning rate of 0.001, a spatial patch size of $7 \times 7$, and 8 kernels. Second, the 8-dimensional features of the training and validation sets extracted by A²S²K-ResNet were used to build an optimal GMDT model. Finally, evaluation results were obtained from the optimal GMDT model, the 8-dimensional features that A²S²K-ResNet inferred for the testing set, and the true testing labels. For a more comprehensive comparison, we additionally evaluated RaF, XGBoost [47], and SpectralFormer [48] under the same training, validation, and testing configuration. For both RaF and XGBoost, the validation set was used to tune the number of base learners over $\{50, 100, 200\}$. SpectralFormer was trained for 300 epochs on “IN” and 600 epochs on “UP”, with 3 band patches for both datasets. The spatial patch size was kept at $7 \times 7$ for all aforementioned methods.

3.3.3. Experimental Results

The classification maps for “IN” and “UP” are shown in Figure 11 and Figure 12, including the ground truth and the maps generated in one of the five runs of RaF, XGBoost, SpectralFormer, A²S²K-ResNet, and A²S²K-ResNet + GMDT. The maps indicate that A²S²K-ResNet + GMDT yields better HSI classification results than the other methods.

Table 7 and Table 8 present a comprehensive quantitative comparison of the five methods on the two HSI datasets, reporting overall accuracy (OA), average accuracy (AA), the kappa coefficient (Kappa), the mean training time over five runs (mean $T_{tr}$), and the mean prediction time on the entire testing set over five runs (mean $T_{pr}$).
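The three accuracy measures follow from a confusion matrix in the usual way: OA is the fraction of correctly classified samples, AA is the mean of the per-class recalls, and Kappa corrects OA for chance agreement. The sketch below uses these standard definitions; the function name and the 2 × 2 matrix are hypothetical and unrelated to Tables 7 and 8.

```python
def oa_aa_kappa(cm):
    """Compute (OA, AA, Kappa) from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Assumes at least two classes are predicted, so chance agreement is below 1.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    oa = sum(cm[i][i] for i in range(k)) / n
    # Per-class recall, skipping classes that have no true samples.
    recalls = [cm[i][i] / sum(cm[i]) for i in range(k) if sum(cm[i]) > 0]
    aa = sum(recalls) / len(recalls)
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    # Expected agreement by chance from row and column marginals.
    pe = sum(sum(cm[i]) * col_totals[i] for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Imbalanced toy example: 100 samples of class 0 and 10 samples of class 1.
oa, aa, kappa = oa_aa_kappa([[90, 10], [5, 5]])
```

In the toy example OA is about 0.864 while AA is only 0.7, which illustrates why AA is the more informative measure on imbalanced data such as “IN”.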

A²S²K-ResNet + GMDT achieves the best or near-best accuracy in almost all scenarios, which demonstrates the effectiveness of applying GMDT to deep features. On the “IN” dataset, A²S²K-ResNet + GMDT exhibits the highest OA, AA, and Kappa values among the five methods whether or not Gaussian noise is added. Under clean conditions ($\sigma = 0$), the proposed method outperforms all others, improving OA by 6.99 percentage points over the base A²S²K-ResNet (0.9272 vs. 0.8573) and by 3.67 percentage points over SpectralFormer (0.8905). The gain in AA is even larger (26.72 percentage points over the base A²S²K-ResNet), indicating better per-class performance. Under noisy conditions ($\sigma = 0.1, 0.2$), the proposed method maintains superior performance. While the base A²S²K-ResNet already shows strong resistance to noise, adding GMDT further boosts accuracy. At $\sigma = 0.2$, the proposed method achieves an OA of 0.9019, well ahead of SpectralFormer (0.7262) and XGBoost (0.3790). On the larger “UP” dataset, A²S²K-ResNet + GMDT still achieves the highest scores under clean conditions ($\sigma = 0$), marginally improving over the already excellent A²S²K-ResNet. Under noise ($\sigma = 0.1, 0.2$), both A²S²K-ResNet and our method demonstrate remarkable robustness. However, our method slightly underperforms the base A²S²K-ResNet at $\sigma = 0.1$ (OA, 0.9681 vs. 0.9705) and $\sigma = 0.2$ (OA, 0.9604 vs. 0.9610). The differences are within the standard deviation ranges, so the two methods perform comparably. This indicates that on a larger dataset with weaker class imbalance, GMDT yields diminishing returns without meaningful degradation.

A further advantage of the proposed method is its efficiency, particularly in prediction time. A²S²K-ResNet + GMDT needs slightly more training time than A²S²K-ResNet because extracting the training-set features from the trained A²S²K-ResNet model and then training the GMDT model takes additional time. However, it requires less testing time, even though extra time is still needed to extract features from the testing set. On both the “IN” and “UP” datasets, total training time increases by less than 30 s, a reasonable trade-off for the substantial accuracy gains seen on the “IN” dataset. SpectralFormer, in contrast, has extremely high training times (12,228 s on clean “UP”), making it impractical for large-scale or real-time applications with limited resources. During inference, the proposed method is faster than both SpectralFormer and the base A²S²K-ResNet. The roughly $2 \times$ speedup over the base A²S²K-ResNet suggests that GMDT streamlines the decision process, making the combined model more deployable in latency-sensitive scenarios.

Our A²S²K-ResNet + GMDT effectively mitigates the class imbalance problem, particularly for categories with extremely few samples. Figure 13 illustrates the class distribution and confusion matrices of both A²S²K-ResNet and A²S²K-ResNet + GMDT on the “IN” dataset under varying levels of Gaussian noise ($\sigma = 0, 0.1, 0.2$). The results show that A²S²K-ResNet tends to overlook categories with few samples (such as classes “1”, “4”, “7”, “9”, “13”, and “16”). In contrast, applying GMDT avoids this issue, indicating that GMDT alleviates the class imbalance problem to a certain extent.

4. Discussion

The proposed GMDT introduces an adaptive multi-splitting multivariate decision tree that directly optimizes a multi-class separation criterion via G-means clustering without relying on OvA or OvO decomposition schemes. Our results on synthetic data, remote sensing RGB scene classification, and HSI classification provide several insights into its advantages, boundary conditions, and trade-offs.

4.1. Interpretation of Key Findings

Adaptive multi-splitting of GMDT is effective. Unlike conventional bi-splitting multivariate trees such as BDTKS and MDT2, GMDT can partition a node into multiple child nodes ($k \geq 2$) determined automatically by the G-means algorithm. This allows the model to isolate pure or near-pure Gaussian clusters early in the tree, preventing unnecessary deep branching. The compact tree structures observed (shallower depth, fewer split nodes) directly stem from this property. For example, on the synthetic dataset, GMDT achieved depth 7 versus C4.5’s depth 118.

GMDT mitigates the class imbalance problem. By avoiding OvA/OvO schemes, GMDT does not artificially create binary problems where majority classes dominate. Instead, multi-way splits can separate minority-class clusters as distinct child nodes. This is clearly seen in the HSI results: on the highly imbalanced Indian Pines dataset, GMDT raised AA from 0.5704 to 0.8376 ($\sigma = 0$) and substantially boosted recall on small-sample classes (classes “1”, “4”, “7”, “9”, “13”, and “16”). The minority-class recall gains on RGB scene image datasets (“AID”, “MASATI”, and “RSI-CB256”) also support this.

GMDT is robust to noise and to violations of the Gaussian cluster assumption. GMDT assumes that single-class samples often follow a Gaussian distribution, enabling G-means to identify them. Even when this assumption is violated (for example, by the heavy-tailed or copula distributions in the synthetic data), GMDT remains competitive. The addition of Z-score standardization and the $k_{max} = c$ constraint helped prevent over-fragmentation under noise, as seen in the HSI experiments where GMDT maintained high OA even at $\sigma = 0.2$.

Figure 13. Class distribution and the corresponding confusion matrices of A²S²K-ResNet and A²S²K-ResNet + GMDT when different levels of Gaussian noise ($\sigma = 0, 0.1, 0.2$) are added on the “IN” dataset.

GMDT is computationally efficient. GMDT requires more training time than BDTKS (at most about $5 \times$) but far less than univariate trees on large datasets. More importantly, prediction time is roughly halved compared to the base deep learning model (A²S²K-ResNet) while improving accuracy. This suggests that GMDT possesses an advantage for deployment in resource-constrained or real-time remote sensing systems.

4.2. Comparison with Prior Work

GMDT differs from existing multivariate decision trees in three critical aspects.

GMDT is multi-splitting without decomposition. While trees like OC1 [6], STree [14,15], and LROTree [16] rely on binary splits and OvA/OvO for multi-class problems, GMDT natively supports multi-way splits. This avoids the class imbalance introduced by OvA and the computational overhead of OvO. GMDT significantly outperforms C4.5, CART, and MDT2 according to the Nemenyi test results, and ranks first in mean ACC/F1 across eight datasets—above STree (which uses OvA SVM-based splits) and BDTKS.

GMDT produces an adaptive number of child nodes. The existing multivariate CRUISE [28] uses LDA to split into multiple subnodes, but the number of subnodes is preassigned. In our GMDT, G-means adaptively chooses k based on the Anderson–Darling test, which better reflects the intrinsic cluster structure. This is particularly beneficial when classes have varying numbers of Gaussian components.
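To make the role of the Anderson–Darling test concrete, the following sketch checks whether one-dimensional data look Gaussian, which is the decision G-means makes for each candidate cluster (in the original G-means formulation, after projecting the cluster onto the direction joining two tentative child centers). It is not the GMDT implementation: the function names are hypothetical, and the critical values are the commonly tabulated large-sample values for a normal distribution with estimated mean and variance, assumed here because they match the $\alpha$ grid used in the sensitivity study.

```python
import math
from statistics import NormalDist, mean, stdev

# Assumed large-sample critical values of the corrected statistic,
# keyed by significance level alpha (normal case, estimated mean and variance).
AD_CRITICAL = {0.15: 0.576, 0.10: 0.656, 0.05: 0.787, 0.025: 0.918, 0.01: 1.092}


def anderson_darling_normal(x):
    """Corrected Anderson-Darling statistic for normality of 1D data."""
    n = len(x)
    mu, sd = mean(x), stdev(x)
    phi = NormalDist()
    y = sorted((v - mu) / sd for v in x)
    # Clamp CDF values away from 0 and 1 so the logarithms stay finite.
    cdf = [min(max(phi.cdf(v), 1e-12), 1.0 - 1e-12) for v in y]
    s = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    # Small-sample correction for estimated mean and variance.
    return a2 * (1.0 + 4.0 / n - 25.0 / n ** 2)


def looks_gaussian(x, alpha=0.05):
    """True if normality is not rejected at significance level alpha."""
    return anderson_darling_normal(x) < AD_CRITICAL[alpha]


# Deterministic examples: evenly spaced normal quantiles versus two
# well-separated groups of such quantiles.
quantiles = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
half = [NormalDist().inv_cdf((i + 0.5) / 100) for i in range(100)]
bimodal = [q - 5.0 for q in half] + [q + 5.0 for q in half]
```

A cluster that passes the check is kept as is, while one that fails is a candidate for further splitting; a smaller $\alpha$ makes rejection harder and therefore yields fewer splits.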

GMDT works by integrating with deep features. Unlike earlier multivariate trees that operate directly on raw pixels or handcrafted features, GMDT is designed to work with high-level deep features. This combination (A²S²K-ResNet + GMDT) proved highly robust to Gaussian noise in HSI data, outperforming SpectralFormer [48], RaF [35], and XGBoost [47] by large margins.

4.3. Limitations

Despite its strengths, GMDT has several limitations that must be acknowledged.

First, GMDT depends on an effective feature extractor. GMDT itself does not perform end-to-end feature learning, so classification remains challenging when the raw data are not well represented (for example, when multi-class clusters overlap heavily in the original feature space). It therefore requires a feature extractor that, even if less powerful, is still effective at capturing semantic features.

Second, the Gaussian assumption may fail. While many real-world phenomena approximately follow Gaussian distributions, some classes with multimodal or highly skewed distributions do not. GMDT relies on G-means to decompose non-Gaussian clusters, but some multi-class non-Gaussian clusters may not be adequately split. The bi-splitting fallback (K-means with $k = 2$) only provides a linear partition, which may be insufficient for complex manifolds.
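The linearity of the bi-splitting fallback follows directly from the nearest-center rule. Writing $\mu_1$ and $\mu_2$ for the two K-means centers and $x$ for a feature vector (notation introduced here for illustration only), a sample is assigned to the first child when it is at least as close to $\mu_1$ as to $\mu_2$:

```latex
\begin{aligned}
\|x - \mu_1\|^2 \le \|x - \mu_2\|^2
&\iff -2\mu_1^{\top} x + \|\mu_1\|^2 \le -2\mu_2^{\top} x + \|\mu_2\|^2 \\
&\iff 2(\mu_2 - \mu_1)^{\top} x \le \|\mu_2\|^2 - \|\mu_1\|^2 .
\end{aligned}
```

The quadratic term $\|x\|^2$ cancels, so the boundary is the hyperplane that perpendicularly bisects the segment between the two centers, and no curved boundary can be produced by a single such split.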

Third, GMDT is sensitive to its parameters. Although the significance level $\alpha$ had only a mild impact, $min\_samples\_split$ notably affected performance, especially on small datasets. Too small a value risks overfitting; too large a value may stop splitting prematurely.

Fourth, computational cost on very large datasets is a potential issue. While acceptable in our experiments, the multiple K-means runs within G-means can become expensive when the number of samples at a node is huge. This is less of an issue after deep feature extraction, as feature dimensionality is typically reduced, but remains a consideration for raw high-dimensional HSI data [49].

4.4. Future Directions

Based on these limitations, several future research directions can be further explored.

Lightweight feature extraction. The current pipeline uses a ResNet or A²S²K-ResNet, which is computationally heavy. Integrating GMDT with lightweight architectures such as MobileNet, EfficientNet, or unsupervised feature extractors could enable on-board processing for satellite or drone-based real-time classification.

Ensemble variants. GMDT as a single tree is interpretable but may have higher variance than ensembles. A natural extension is a GMDT-based random forest or boosting method, where each base learner is an adaptive multi-splitting tree.

Handling of extremely overlapping clusters. For nodes where G-means returns only one cluster and the node remains highly impure, the current bi-splitting may fail. Future work could incorporate a hybrid strategy: First attempt multi-splitting with a relaxed Gaussian test, and if that fails, use a non-linear boundary learned from local data to handle overlapping multi-class Gaussian clusters (types iii and v).

5. Conclusions

This paper proposes GMDT, an adaptive multi-splitting multivariate decision tree model designed for multi-class classification tasks. By integrating G-means clustering with Z-score standardization and a user-defined $k_{max}$ (set to the number of classes), GMDT automatically determines the number of child nodes at each internal node, isolating pure or near-pure Gaussian clusters early. A bi-splitting fallback handles multi-class Gaussian clusters that remain after G-means.

Extensive experiments on synthetic data, eight noisy RGB remote sensing scene datasets, and two noisy HSI datasets demonstrated that GMDT has several merits. First, it achieves superior or competitive classification accuracy and macro F1-score compared to state-of-the-art univariate and multivariate decision trees as well as ensemble methods like RaF and XGBoost in several settings. Second, it produces more compact trees (shallower depth, fewer split nodes) than most decision tree baselines, enhancing interpretability and reducing overfitting risk. Third, it mitigates class imbalance without requiring OvA/OvO decomposition, as evidenced by large gains in minority-class recall and average accuracy on highly imbalanced datasets (“AID”, “MASATI”, “RSI-CB256”, and “Indian Pines”). Fourth, it maintains acceptable training time and reduces prediction time compared to deep learning baselines (about $2 \times$ faster than A²S²K-ResNet), making it suitable for latency-sensitive or resource-constrained applications. Fifth, it exhibits robustness to additive Gaussian noise, particularly when combined with deep features extracted from attention-based spectral–spatial kernels in the HSI classification application.

Nonetheless, GMDT has its limitations. It depends on an effective feature extractor, assumes Gaussian-like cluster structures, struggles with highly overlapping multi-class clusters, and requires tuning of $min\_samples\_split$. Future work will focus on lightweight deep feature extractors, ensemble variants, and nonlinear bi-splitting fallback strategies. Overall, GMDT offers a practical, interpretable, and efficient solution for multi-class classification in remote sensing and other domains where class imbalance and model transparency are critical.

Author Contributions

Conceptualization, Q.W. and H.L.; methodology, Q.W., Z.Z. (Zheng Zheng), and H.L.; software, Q.W. and Z.Z. (Zitong Zhang); validation, Q.W., Z.Z. (Zheng Zheng), and H.L.; formal analysis, Q.W.; investigation, Q.W.; resources, Q.W.; data curation, Q.W. and X.Z.; writing—original draft preparation, Q.W.; writing—review and editing, F.N.; visualization, Q.W. and Z.Z. (Zitong Zhang); supervision, F.W. and F.N.; project administration, Q.W. and F.W.; funding acquisition, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 62306231), the Key Research and Development Program of Shaanxi Province (grant number 2025NC-YBXM-224), and the Science and Technology Project of China Huaneng Group Co., Ltd. (grant number HNKJ23-HF67).

Data Availability Statement

The data were derived from public domain resources, including remote sensing RGB scene image datasets (AID: https://captain-whu.github.io/AID/, accessed on 12 March 2026; MASATI: https://www.iuii.ua.es/datasets/masati/index.html, accessed on 29 March 2026; PatternNet: https://www.kaggle.com/datasets/abidhasanrafi/patternnet, accessed on 29 March 2026; RSC11: https://aistudio.baidu.com/datasetdetail/52227, accessed on 29 March 2026; RSI-CB256: https://aistudio.baidu.com/datasetdetail/52487, accessed on 29 March 2026; RSSCN7: https://sites.google.com/site/qinzoucn/download, accessed on 29 March 2026; UCM: https://www.kaggle.com/datasets/abdulhasibuddin/uc-merced-land-use-dataset?resource=download, accessed on 11 March 2026; WHU-RS19: https://captain-whu.github.io/BED4RS/, accessed on 29 March 2026) and HSI datasets (Indian Pines and Pavia University, downloaded from https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, accessed on 1 April 2026). The code supporting this work is available at https://github.com/bhxspring/GMDT (accessed on 1 April 2026).

Acknowledgments

We are grateful to Xuetao Zhang at Xi’an Jiaotong University for suggesting the application of our method to the remote sensing domain. We thank the anonymous reviewers for their constructive comments and suggestions, which helped us to improve the manuscript significantly. During the preparation of this manuscript, the authors used DeepSeek V3 for language polishing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Yu, K.; Lian, J.; Bi, Y.; Liang, J.; Xue, B.; Zhang, M. A genetic programming approach with adaptive region detection to skin cancer image classification. J. Autom. Intell. 2024, 3, 240–249.
2. Ranjan, P.; Ankur; Kumar, R. A Residual Pyramid-GAN Approach for Hyperspectral Image Classification. Ann. Data Sci. 2026.
3. Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106.
4. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers Inc.: San Mateo, CA, USA, 1993.
5. Breiman, L.; Friedman, J.H.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; Chapman & Hall/CRC: Boca Raton, FL, USA, 1984.
6. Murthy, S.K.; Kasif, S.; Salzberg, S. A system for induction of oblique decision trees. J. Artif. Intell. Res. 1994, 2, 1–32.
7. López-Chau, A.; Cervantes, J.; López-García, L.; García-Lamont, F. Fisher’s decision tree. Expert Syst. Appl. 2013, 40, 6283–6291.
8. Li, X.; Sweigart, J.R.; Teng, J.T.C.; Donohue, J.M.; Thombs, L.A.; Wang, S.M. Multivariate decision trees using linear discriminants and tabu search. IEEE Trans. Syst. Man Cybern. Part A 2003, 33, 194–205.
9. Wickramarachchi, D.C.; Robertson, B.L.; Reale, M.; Price, C.J.; Brown, J. HHCART: An oblique decision tree. Comput. Stat. Data Anal. 2016, 96, 12–23.
10. Manwani, N.; Sastry, P.S. Geometric Decision Tree. IEEE Trans. Syst. Man Cybern. Part B 2012, 42, 181–192.
11. Amasyali, M.F.; Ersoy, O. Cline: A new decision-tree family. IEEE Trans. Neural Netw. 2008, 19, 356–363.
12. Engür, E.; Soylu, B. A linear multivariate decision tree with branch-and-bound components. Neurocomputing 2024, 576, 127354.
13. Fei, B.; Liu, J. Binary tree of SVM: A new fast multiclass training and classification algorithm. IEEE Trans. Neural Netw. 2006, 17, 696–704.
14. Montañana, R.; Gámez, J.A.; Puerta, J.M. STree: A Single Multi-class Oblique Decision Tree Based on Support Vector Machines. In Advances in Artificial Intelligence—19th Conference of the Spanish Association for Artificial Intelligence, Malaga, Spain, 22–24 September 2021; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12882, pp. 54–64.
15. Montañana, R.; Gámez, J.A.; Puerta, J.M. ODTE - An ensemble of multi-class SVM-based oblique decision trees. Expert Syst. Appl. 2025, 273, 126833.
16. Rodrigo, E.G.; Alfaro, J.C.; Aledo, J.A.; Gámez, J.A. Label ranking oblique trees. Knowl. Based Syst. 2024, 296, 111882.
17. Xian, J.; Rezvani, S.; Yang, D. A New Decision Tree Based on Intuitionistic Fuzzy Twin Support Vector Machines. IEEE Trans. Intell. Transp. Syst. 2024, 25, 19810–19819.
18. Mangasarian, O.L.; Wild, E.W. Multisurface Proximal Support Vector Machine Classification via Generalized Eigenvalues. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 69–74.
19. Zhang, L.; Suganthan, P.N. Oblique Decision Tree Ensemble via Multisurface Proximal Support Vector Machine. IEEE Trans. Cybern. 2015, 45, 2165–2176.
20. Zhu, H.; Murali, P.; Phan, D.T.; Nguyen, L.M.; Kalagnanam, J. A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020.
21. Blanco, V.; Japón, A.; Puerto, J. Multiclass optimal classification trees with SVM-splits. Mach. Learn. 2023, 112, 4905–4928.
22. Loyola-González, O.; Ramírez-Sáyago, E.; Medina-Pérez, M.A. Towards improving decision tree induction by combining split evaluation measures. Knowl. Based Syst. 2023, 277, 110832.
23. Jin, C.; Li, F.; Ma, S.; Wang, Y. Sampling scheme-based classification rule mining method using decision tree in big data environment. Knowl. Based Syst. 2022, 244, 108522.
24. Guan, X.; Liang, J.; Qian, Y.; Pang, J. A multi-view OVA model based on decision tree for multi-classification tasks. Knowl. Based Syst. 2017, 138, 208–219.
25. Yan, J.; Zhang, Z.; Lin, K.; Yang, F.; Luo, X. A hybrid scheme-based one-vs-all decision trees for multi-class classification tasks. Knowl. Based Syst. 2020, 198, 105922.
26. Wang, F.; Wang, Q.; Nie, F.; Yu, W.; Wang, R. Efficient tree classifiers for large scale datasets. Neurocomputing 2018, 284, 70–79.
27. Wang, F.; Wang, Q.; Nie, F.; Li, Z.; Yu, W.; Ren, F. A linear multivariate binary decision tree classifier based on K-means splitting. Pattern Recognit. 2020, 107, 107521.
28. Kim, H.; Loh, W.Y. Classification trees with unbiased multiway splits. J. Am. Stat. Assoc. 2001, 96, 589–604.
29. Zhao, C.; Wu, D.; Huang, J.; Yuan, Y.; Zhang, H.; Peng, R.; Shi, Z. BoostTree and BoostForest for Ensemble Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8110–8126.
30. Liu, G.; Qiu, J.; Huang, J.; Yuan, Y. GLGF-CR: A Gated Local-Global Fusion approach for cloud removal in real-world remote sensing. Pattern Recognit. 2026, 172, 112319.
31. Hamerly, G.; Elkan, C. Learning the k in k-means. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–13 December 2003; pp. 281–288.
32. Ranjan, P.; Nandal, A.; Agarwal, S.; Kumar, R. A Dive into Generative Adversarial Networks in the World of Hyperspectral Imaging: A Survey of the State of the Art. Remote Sens. 2026, 18, 196.
33. Stephens, M.A. EDF statistics for goodness of fit and some comparisons. J. Am. Stat. Assoc. 1974, 69, 730–737.
34. Fan, R.; Chang, K.; Hsieh, C.; Wang, X.; Lin, C. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
35. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
36. Xia, G.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
37. Gallego, A.J.; Pertusa, A.; Gil, P. Automatic Ship Classification from Optical Aerial Images with Convolutional Neural Networks. Remote Sens. 2018, 10, 511.
38. Zhou, W.; Newsam, S.D.; Li, C.; Shao, Z. PatternNet: A Benchmark Dataset for Performance Evaluation of Remote Sensing Image Retrieval. arXiv 2017, arXiv:1706.03424.
39. Zhao, L.; Tang, P.; Huo, L. Feature significance-based multibag-of-visual-words model for remote sensing image scene classification. J. Appl. Remote Sens. 2016, 10, 035004.
40. Li, H.; Dou, X.; Tao, C.; Wu, Z.; Chen, J.; Peng, J.; Deng, M.; Zhao, L. RSI-CB: A Large-Scale Remote Sensing Image Classification Benchmark Using Crowdsourced Data. Sensors 2020, 20, 1594.
41. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep Learning Based Feature Selection for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
42. Yang, Y.; Newsam, S.D. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279.
43. Xia, G.S.; Yang, W.; Delon, J.; Gousseau, Y.; Sun, H. Structural High-resolution Satellite Image Indexing. In Proceedings of the ISPRS TC VII Symposium—100 Years ISPRS, Vienna, Austria, 5–7 July 2010.
44. Zhang, X.; Li, Y.; Feng, X.; Hua, J.; Yue, D.; Wang, J. Application of Multiple-Optimization Filtering Algorithm in Remote Sensing Image Denoising. Sensors 2023, 23, 7813.
45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
46. Roy, S.K.; Manna, S.; Song, T.; Bruzzone, L. Attention-Based Adaptive Spectral-Spatial Kernel ResNet for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7831–7843.
47. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
48. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image Classification With Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15.
49. Ranjan, P.; Hans, S.; Ismail, S. Next-Gen Imaging: The Power of Hyperspectral Data and Autoencoders. In International Conference on Artificial Intelligence and Speech Technology, New Delhi, India, 27–28 November 2025; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2025; Volume 2390, pp. 29–41.

Figure 1. Illustration of G-means clustering results not conforming to reality on a synthetic dataset with 10 real clusters. (a) Real clusters. (b) Original G-means [31]. (c) G-means incorporating Z-score standardization. (d) Adjusted G-means incorporating Z-score standardization and the $k_{\max} = c$ constraint.

Figure 2. Transformation relationships among cluster types derived from actual class distribution.

Figure 3. Local structures produced by adaptive multi-way node splitting. The Roman numerals refer to the cluster types in Figure 2. (a) The possibly produced local structures for a cluster of type iv. (b) The possibly produced local structures for a cluster of type vi.

Figure 4. An instance of adaptive multi-way node splitting. The stars represent the centers of sub-clusters.

Figure 6. Training set and testing set of 2D synthetic 12-class data.

Figure 7. Incorrect labels predicted by eight classifiers on 2D synthetic 12-class data.

Figure 8. Classification accuracy values of MR18 + GMDT compared with MR18.

Figure 9. GMDT sensitivity heatmaps in classification accuracy and macro F1 score. (a) Accuracy sensitivity for “UCM”. (b) Accuracy sensitivity for “PatternNet”.

Figure 10. Critical difference diagrams on classification accuracy (ACC) and macro F1-score.

Figure 11. Classification maps for the “IN” dataset with Gaussian noise of $σ = 0.1$. (a) Ground truth. (b) RaF. (c) XGBoost. (d) SpectralFormer. (e) A²S²K-ResNet. (f) A²S²K-ResNet + GMDT.

Figure 12. Classification maps for the “UP” dataset with no Gaussian noise ($σ = 0$). (a) Ground truth. (b) RaF. (c) XGBoost. (d) SpectralFormer. (e) A²S²K-ResNet. (f) A²S²K-ResNet + GMDT.

Table 1. Performance comparison under different values of $k_{\max}$. The best results for each metric are highlighted in bold.

Metric	$k_{\max} = \frac{c}{2}$	$k_{\max} = c$	$k_{\max} = 2 c$	$k_{\max} = 4 c$	$k_{\max} = 8 c$	$k_{\max} = 16 c$
ACC	0.8051	0.8231	0.7897	0.8308	0.7923	0.7872
F1	0.8060	0.8310	0.8080	0.8403	0.7936	0.7903
$N_{d p}$	7	7	8	5	7	6
$N_{s p}$	242	215	468	126	475	460
$N_{l f}$	389	389	585	221	575	601

Table 2. Information about the remote sensing scene RGB image datasets used.

Dataset Name	Raw Image Size (pixels)	Classes	Training Samples	Minority Class Size	Majority Class Size	Validation Samples	Testing Samples
AID	600 × 600	30	6999	154	294	1000	2000
MASATI	512 × 512	7	5172	213	1252	739	1478
PatternNet	256 × 256	38	21,278	559	560	3040	6080
RSC11	512 × 512	11	861	56	99	124	247
RSI-CB256	256 × 256	35	17,178	138	932	2454	4909
RSSCN7	400 × 400	7	1960	280	280	280	560
UCM	256 × 256	21	1469	69	70	210	420
WHU-RS19	600 × 600	19	703	35	43	101	201

Table 3. Minority-class recall values on the class imbalance datasets.

Dataset	Majority − Minority	MR18	MR18 + GMDT
AID	140	0.8864	0.9318
MASATI	1039	0.4754	0.5902
RSC11	43	1.0000	0.9375
RSI-CB256	794	0.8750	0.9630
WHU-RS19	8	1.0000	0.9000

Notes: The better results in terms of minority-class recall are in bold.
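As a quick consistency check, the "Majority − Minority" column of Table 3 can be recomputed from Table 2, and the change in minority-class recall can be listed per dataset. The sketch below only copies values from the two tables. It assumes that the Minority and Majority columns of Table 2 are the sample counts of the smallest and largest classes, and the helper name `recall_change` is illustrative.

```python
# Minority/majority class sizes copied from Table 2 (assumed to be sample counts).
class_sizes = {
    "AID": (154, 294),
    "MASATI": (213, 1252),
    "RSC11": (56, 99),
    "RSI-CB256": (138, 932),
    "WHU-RS19": (35, 43),
}

# (Majority - Minority, MR18 recall, MR18 + GMDT recall) copied from Table 3.
table3 = {
    "AID": (140, 0.8864, 0.9318),
    "MASATI": (1039, 0.4754, 0.5902),
    "RSC11": (43, 1.0000, 0.9375),
    "RSI-CB256": (794, 0.8750, 0.9630),
    "WHU-RS19": (8, 1.0000, 0.9000),
}


def recall_change(name):
    """Minority-class recall of MR18 + GMDT minus that of MR18 alone."""
    _, base, with_gmdt = table3[name]
    return round(with_gmdt - base, 4)


for name, (minority, majority) in class_sizes.items():
    gap = majority - minority
    assert gap == table3[name][0]  # the two tables agree on the imbalance gap
    print(f"{name:10s} gap={gap:5d} recall change={recall_change(name):+.4f}")

improved = [name for name in table3 if recall_change(name) > 0]
print(improved)  # ['AID', 'MASATI', 'RSI-CB256']
```

The three datasets with the largest gaps (MASATI, RSI-CB256, and AID) are the ones where minority-class recall improves, while the two datasets with small gaps (RSC11 and WHU-RS19) show a decrease.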

Table 4. Classification accuracy and macro F1-score of decision tree models. The data in the parentheses represent the ranking values.

Metric	Datasets	C4.5	CART	MDT2	BDTKS	STree	GMDT
ACC	AID	0.7510 (5)	0.7400 (6)	0.7720 (4)	0.8270 (3)	0.8325 (2)	0.8585 (1)
	MASATI	0.9039 (4)	0.9046 (3)	0.8999 (6)	0.9134 (2)	0.9229 (1)	0.9019 (5)
	PatternNet	0.9497 (4)	0.9372 (5)	0.9270 (6)	0.9645 (3)	0.9683 (2)	0.9722 (1)
	RSC11	0.8300 (5)	0.8057 (6)	0.8340 (4)	0.8462 (3)	0.8543 (2)	0.8664 (1)
	RSI-CB256	0.9232 (5)	0.9122 (6)	0.9340 (4)	0.9574 (3)	0.9576 (2)	0.9654 (1)
	RSSCN7	0.8554 (5)	0.8286 (6)	0.8589 (4)	0.8679 (3)	0.8875 (1)	0.8696 (2)
	UCM	0.8119 (4)	0.8000 (5)	0.7881 (6)	0.8310 (3)	0.8667 (2)	0.8786 (1)
	WHU-RS19	0.7811 (6)	0.8159 (4)	0.8060 (5)	0.8856 (3)	0.9005 (1)	0.8955 (2)
	mean	0.8508 (4.750)	0.8430 (5.125)	0.8525 (4.875)	0.8866 (2.875)	0.8988 (1.625)	0.9010 (1.750)
F1	AID	0.7501 (5)	0.7388 (6)	0.7721 (4)	0.8254 (3)	0.8294 (2)	0.8584 (1)
	MASATI	0.8614 (5)	0.8722 (3)	0.8586 (6)	0.8816 (2)	0.8925 (1)	0.8670 (4)
	PatternNet	0.9501 (4)	0.9378 (5)	0.9271 (6)	0.9647 (3)	0.9684 (2)	0.9725 (1)
	RSC11	0.8346 (5)	0.8145 (6)	0.8365 (4)	0.8489 (3)	0.8566 (2)	0.8701 (1)
	RSI-CB256	0.9004 (5)	0.8884 (6)	0.9148 (4)	0.9446 (2)	0.9428 (3)	0.9539 (1)
	RSSCN7	0.8557 (5)	0.8296 (6)	0.8593 (4)	0.8677 (3)	0.8885 (1)	0.8699 (2)
	UCM	0.8141 (4)	0.8102 (5)	0.7958 (6)	0.8332 (3)	0.8704 (2)	0.8814 (1)
	WHU-RS19	0.7962 (6)	0.8285 (4)	0.8159 (5)	0.8912 (3)	0.9048 (1)	0.9005 (2)
	mean	0.8453 (4.875)	0.8400 (5.125)	0.8475 (4.875)	0.8822 (2.750)	0.8942 (1.750)	0.8967 (1.625)

Notes: The best evaluation results among the six decision tree models are shown in bold.
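The ranking values and the "mean" row of Table 4 can be reproduced from the accuracy values alone. The following sketch ranks the six models on each dataset (rank 1 is the highest accuracy) and averages the ranks. The function name `ranks` is illustrative, and ties are not handled because none occur in the ACC block.

```python
from statistics import fmean

models = ["C4.5", "CART", "MDT2", "BDTKS", "STree", "GMDT"]

# Accuracy values copied from the ACC block of Table 4 (one row per dataset).
acc = {
    "AID":        [0.7510, 0.7400, 0.7720, 0.8270, 0.8325, 0.8585],
    "MASATI":     [0.9039, 0.9046, 0.8999, 0.9134, 0.9229, 0.9019],
    "PatternNet": [0.9497, 0.9372, 0.9270, 0.9645, 0.9683, 0.9722],
    "RSC11":      [0.8300, 0.8057, 0.8340, 0.8462, 0.8543, 0.8664],
    "RSI-CB256":  [0.9232, 0.9122, 0.9340, 0.9574, 0.9576, 0.9654],
    "RSSCN7":     [0.8554, 0.8286, 0.8589, 0.8679, 0.8875, 0.8696],
    "UCM":        [0.8119, 0.8000, 0.7881, 0.8310, 0.8667, 0.8786],
    "WHU-RS19":   [0.7811, 0.8159, 0.8060, 0.8856, 0.9005, 0.8955],
}


def ranks(row):
    """Rank 1 = highest score; ties are not handled in this sketch."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    result = [0] * len(row)
    for position, j in enumerate(order, start=1):
        result[j] = position
    return result


rank_rows = [ranks(row) for row in acc.values()]
mean_rank = {m: fmean(r[j] for r in rank_rows) for j, m in enumerate(models)}
mean_acc = {m: fmean(row[j] for row in acc.values()) for j, m in enumerate(models)}

for m in models:
    print(f"{m:6s} mean ACC={mean_acc[m]:.4f} mean rank={mean_rank[m]:.3f}")
```

The output matches the "mean" row of the ACC block: GMDT has the highest mean accuracy (0.9010), while STree has a slightly better mean rank (1.625 versus 1.750 for GMDT).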

Table 5. Criteria reflecting the complexity of decision tree models: the number of split nodes ($N_{sp}$), the number of leaf nodes ($N_{lf}$), and the maximum depth ($N_{dp}$).

Criteria	Datasets	C4.5	CART	MDT2	BDTKS	STree	GMDT
$N_{s p}$	AID	389	342	990	936	272	249
	MASATI	121	82	639	337	20	503
	PatternNet	390	308	2198	1063	366	222
	RSC11	30	34	162	34	35	33
	RSI-CB256	484	624	2142	1155	338	348
	RSSCN7	64	47	116	244	21	195
	UCM	40	46	310	176	72	95
	WHU-RS19	22	23	148	60	40	9
$N_{l f}$	AID	390	343	991	937	273	321
	MASATI	122	83	640	338	21	569
	PatternNet	391	309	2199	1064	367	671
	RSC11	31	35	163	35	36	84
	RSI-CB256	485	625	2143	1156	339	1465
	RSSCN7	65	48	117	245	22	288
	UCM	41	47	311	177	73	141
	WHU-RS19	23	24	149	61	41	46
$N_{d p}$	AID	42	21	13	15	15	7
	MASATI	31	19	13	14	7	12
	PatternNet	33	27	15	18	20	9
	RSC11	15	11	10	8	9	5
	RSI-CB256	40	30	15	20	18	8
	RSSCN7	17	14	8	13	9	9
	UCM	18	13	11	13	12	8
	WHU-RS19	9	10	10	10	9	3

Notes: The best evaluation results among the six decision tree models are shown in bold.
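The relation between $N_{sp}$ and $N_{lf}$ in Table 5 distinguishes binary from multi-way trees. In any rooted tree, every node except the root is the child of exactly one split node, so the average number of children per split node is $(N_{sp} + N_{lf} - 1) / N_{sp}$. The sketch below applies this identity to a few rows copied from Table 5; the function name `mean_children` is illustrative.

```python
# (N_sp, N_lf) pairs copied from Table 5 for a few dataset/model combinations.
trees = {
    ("AID", "CART"): (342, 343),
    ("AID", "STree"): (272, 273),
    ("AID", "GMDT"): (249, 321),
    ("PatternNet", "GMDT"): (222, 671),
    ("RSI-CB256", "GMDT"): (348, 1465),
    ("WHU-RS19", "GMDT"): (9, 46),
}


def mean_children(n_split, n_leaf):
    """Average number of children per split node in a rooted tree."""
    # Every node except the root is the child of exactly one split node.
    return (n_split + n_leaf - 1) / n_split


for (dataset, model), (n_sp, n_lf) in trees.items():
    print(f"{dataset:10s} {model:5s} {mean_children(n_sp, n_lf):.2f}")
```

For the binary baselines the value is exactly 2 ($N_{lf} = N_{sp} + 1$), whereas the GMDT rows give values above 2, up to 6 on “WHU-RS19”, which reflects the multi-way splits and explains why GMDT can have more leaves despite a smaller depth.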

Table 6. Training and prediction time of all tested decision tree models on all of the training and testing data.

Metric	Datasets	C4.5	CART	MDT2	BDTKS	STree	GMDT
$T_{t r}$ (s)	AID	$2.99 \times 10^{3}$	$3.44 \times 10^{3}$	$6.14 \times 10^{0}$	$4.25 \times 10^{1}$	$1.31 \times 10^{1}$	$1.73 \times 10^{2}$
	MASATI	$1.26 \times 10^{3}$	$1.96 \times 10^{3}$	$2.27 \times 10^{0}$	$2.44 \times 10^{1}$	$5.92 \times 10^{0}$	$9.34 \times 10^{1}$
	PatternNet	$1.34 \times 10^{4}$	$3.68 \times 10^{4}$	$1.45 \times 10^{1}$	$6.68 \times 10^{1}$	$4.23 \times 10^{1}$	$2.74 \times 10^{2}$
	RSC11	$8.87 \times 10^{1}$	$7.92 \times 10^{1}$	$7.20 \times 10^{- 1}$	$4.63 \times 10^{0}$	$4.33 \times 10^{- 1}$	$1.60 \times 10^{1}$
	RSI-CB256	$8.52 \times 10^{3}$	$2.77 \times 10^{4}$	$1.23 \times 10^{1}$	$6.75 \times 10^{1}$	$3.43 \times 10^{1}$	$3.47 \times 10^{2}$
	RSSCN7	$2.43 \times 10^{2}$	$3.15 \times 10^{2}$	$1.06 \times 10^{0}$	$1.12 \times 10^{1}$	$7.93 \times 10^{- 1}$	$4.37 \times 10^{1}$
	UCM	$2.27 \times 10^{2}$	$2.02 \times 10^{2}$	$1.49 \times 10^{0}$	$9.76 \times 10^{0}$	$1.51 \times 10^{0}$	$3.21 \times 10^{1}$
	WHU-RS19	$6.35 \times 10^{1}$	$4.86 \times 10^{1}$	$7.19 \times 10^{- 1}$	$4.00 \times 10^{0}$	$5.93 \times 10^{- 1}$	$9.45 \times 10^{0}$
$T_{p r}$ (ms)	AID	$2.49 \times 10^{1}$	$6.98 \times 10^{0}$	$2.00 \times 10^{2}$	$2.19 \times 10^{2}$	$3.13 \times 10^{2}$	$2.96 \times 10^{2}$
	MASATI	$1.29 \times 10^{1}$	$3.99 \times 10^{0}$	$1.24 \times 10^{2}$	$1.32 \times 10^{2}$	$2.85 \times 10^{1}$	$1.35 \times 10^{2}$
	PatternNet	$5.29 \times 10^{1}$	$1.99 \times 10^{1}$	$6.29 \times 10^{2}$	$5.82 \times 10^{2}$	$1.13 \times 10^{3}$	$1.23 \times 10^{3}$
	RSC11	$1.97 \times 10^{0}$	$9.97 \times 10^{- 1}$	$1.96 \times 10^{1}$	$1.99 \times 10^{1}$	$1.00 \times 10^{1}$	$2.79 \times 10^{1}$
	RSI-CB256	$4.79 \times 10^{1}$	$1.10 \times 10^{1}$	$5.31 \times 10^{2}$	$4.82 \times 10^{2}$	$9.25 \times 10^{2}$	$1.13 \times 10^{3}$
	RSSCN7	$1.99 \times 10^{0}$	$1.99 \times 10^{0}$	$5.09 \times 10^{1}$	$7.38 \times 10^{1}$	$8.98 \times 10^{0}$	$6.28 \times 10^{1}$
	UCM	$2.96 \times 10^{0}$	$9.97 \times 10^{- 1}$	$3.74 \times 10^{1}$	$4.69 \times 10^{1}$	$3.69 \times 10^{1}$	$6.68 \times 10^{1}$
	WHU-RS19	$9.97 \times 10^{- 1}$	$0.00 \times 10^{0}$	$1.40 \times 10^{1}$	$2.09 \times 10^{1}$	$1.20 \times 10^{1}$	$2.69 \times 10^{1}$

Table 7. Comparison of RaF, XGBoost, SpectralFormer, A²S²K-ResNet, and A²S²K-ResNet + GMDT on the “IN” dataset.

Noise	Method	OA	AA	Kappa	Mean $T_{tr}$ (s)	Mean $T_{pr}$ (s)
$σ = 0$	RaF	0.7883 ± 0.0058	0.6338 ± 0.0097	0.7546 ± 0.0065	19.28	0.00
	XGBoost	0.7747 ± 0.0045	0.6239 ± 0.0126	0.7400 ± 0.0051	557.04	0.00
	SpectralFormer	0.8905 ± 0.0084	0.7958 ± 0.0120	0.8749 ± 0.0098	3463.50	29.15
	A²S²K-ResNet	0.8573 ± 0.0442	0.5704 ± 0.0696	0.8349 ± 0.0521	46.25	14.07
	A²S²K-ResNet + GMDT	0.9272 ± 0.0259	0.8376 ± 0.0509	0.9171 ± 0.0294	46.25 + 28.05 *	7.69
$σ = 0.1$	RaF	0.4453 ± 0.0074	0.1964 ± 0.0048	0.3153 ± 0.0089	22.26	0.00
	XGBoost	0.4787 ± 0.0030	0.2708 ± 0.0060	0.3780 ± 0.0038	689.60	0.00
	SpectralFormer	0.7729 ± 0.0299	0.6542 ± 0.0231	0.7395 ± 0.0347	3382.04	29.26
	A²S²K-ResNet	0.8951 ± 0.0296	0.5946 ± 0.0582	0.8790 ± 0.0346	46.00	14.63
	A²S²K-ResNet + GMDT	0.9203 ± 0.0170	0.8108 ± 0.0334	0.9096 ± 0.0191	46.00 + 16.30 *	7.68
$σ = 0.2$	RaF	0.3677 ± 0.0042	0.1369 ± 0.0042	0.2061 ± 0.0040	20.64	0.00
	XGBoost	0.3790 ± 0.0059	0.1764 ± 0.0080	0.2487 ± 0.0103	718.52	0.00
	SpectralFormer	0.7262 ± 0.0522	0.6150 ± 0.0416	0.6871 ± 0.0574	3413.43	29.16
	A²S²K-ResNet	0.8857 ± 0.0323	0.5785 ± 0.0537	0.8682 ± 0.0377	48.18	15.44
	A²S²K-ResNet + GMDT	0.9019 ± 0.0163	0.7903 ± 0.0441	0.8888 ± 0.0184	48.18 + 18.95 *	8.03

Notes: The best evaluation results among the five methods are in bold. * This means the A²S²K-ResNet training time plus the GMDT training time.
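The gap between OA and AA in Table 7 (for example, A²S²K-ResNet alone reaches an OA of 0.8573 but an AA of 0.5704 at $σ = 0$) follows from how the three metrics weight classes. The sketch below uses the standard definitions of overall accuracy, average accuracy (mean per-class recall), and Cohen's kappa on a small confusion matrix. The matrix is hypothetical and chosen only to show the effect of a poorly recognized rare class; it is not taken from the experiments.

```python
def overall_accuracy(cm):
    """Fraction of all samples on the diagonal (OA)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def average_accuracy(cm):
    """Mean of per-class recall (AA); each class counts equally."""
    return sum(cm[i][i] / sum(cm[i]) for i in range(len(cm))) / len(cm)


def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    p_e = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(len(cm))
    ) / total**2
    return (p_o - p_e) / (1 - p_e)


# Hypothetical confusion matrix: rows are true classes, columns are predictions.
# The third class is rare and mostly misclassified.
cm = [
    [95, 5, 0],
    [4, 36, 0],
    [8, 0, 2],
]

print(f"OA={overall_accuracy(cm):.4f}")
print(f"AA={average_accuracy(cm):.4f}")
print(f"Kappa={cohen_kappa(cm):.4f}")
```

Because OA is dominated by the large classes, it stays high even when the rare class is largely missed, while AA drops sharply. This is why AA is the more informative metric for the class imbalance discussed for the “IN” dataset.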

Table 8. Comparison of RaF, XGBoost, SpectralFormer, A²S²K-ResNet, and A²S²K-ResNet + GMDT on the “UP” dataset.

Noise	Method	OA	AA	Kappa	Mean $T_{tr}$ (s)	Mean $T_{pr}$ (s)
$σ = 0$	RaF	0.9493 ± 0.0026	0.9299 ± 0.0029	0.9321 ± 0.0035	74.07	0.00
	XGBoost	0.9635 ± 0.0009	0.9424 ± 0.0009	0.9513 ± 0.0013	380.63	0.00
	SpectralFormer	0.9794 ± 0.0018	0.9646 ± 0.0020	0.9727 ± 0.0	12,228.53	40.51
	A²S²K-ResNet	0.9897 ± 0.0062	0.9801 ± 0.0119	0.9864 ± 0.0083	132.25	41.40
	A²S²K-ResNet + GMDT	0.9928 ± 0.0020	0.9863 ± 0.0033	0.9904 ± 0.0027	132.25 + 32.70 *	23.50
$σ = 0.1$	RaF	0.7495 ± 0.0009	0.5744 ± 0.0023	0.6405 ± 0.0008	95.90	0.00
	XGBoost	0.7836 ± 0.0012	0.6347 ± 0.0043	0.6975 ± 0.0019	603.26	0.00
	SpectralFormer	0.9264 ± 0.0078	0.8984 ± 0.0078	0.9015 ± 0.0105	11,953.23	40.14
	A²S²K-ResNet	0.9705 ± 0.0041	0.9517 ± 0.0098	0.9608 ± 0.0055	132.70	41.97
	A²S²K-ResNet + GMDT	0.9681 ± 0.0031	0.9429 ± 0.0080	0.9577 ± 0.0041	132.70 + 25.50 *	22.36
$σ = 0.2$	RaF	0.5885 ± 0.0024	0.3089 ± 0.0014	0.3502 ± 0.0051	78.01	0.00
	XGBoost	0.6930 ± 0.0012	0.4564 ± 0.0012	0.5561 ± 0.0015	699.92	0.00
	SpectralFormer	0.8879 ± 0.0105	0.8382 ± 0.0119	0.8498 ± 0.0142	11,330.60	36.82
	A²S²K-ResNet	0.9610 ± 0.0034	0.9373 ± 0.0048	0.9481 ± 0.0046	136.90	43.61
	A²S²K-ResNet + GMDT	0.9604 ± 0.0046	0.9354 ± 0.0071	0.9473 ± 0.0061	136.90 + 25.29 *	23.49

* This means the A²S²K-ResNet training time plus the GMDT training time.
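The prediction-time advantage of the hybrid model can be quantified directly from the "Mean $T_{pr}$" columns of Tables 7 and 8. The sketch below divides the A²S²K-ResNet prediction time by that of A²S²K-ResNet + GMDT for each dataset and noise level; the dictionary layout and variable names are illustrative.

```python
# Mean prediction times in seconds, copied from Tables 7 ("IN") and 8 ("UP"):
# (A2S2K-ResNet, A2S2K-ResNet + GMDT) for each dataset and noise level.
t_pr = {
    ("IN", 0.0): (14.07, 7.69),
    ("IN", 0.1): (14.63, 7.68),
    ("IN", 0.2): (15.44, 8.03),
    ("UP", 0.0): (41.40, 23.50),
    ("UP", 0.1): (41.97, 22.36),
    ("UP", 0.2): (43.61, 23.49),
}

# Speedup = baseline prediction time / hybrid prediction time.
speedup = {key: base / hybrid for key, (base, hybrid) in t_pr.items()}

for (dataset, sigma), value in speedup.items():
    print(f"{dataset} sigma={sigma:.1f} speedup={value:.2f}x")

print(f"range: {min(speedup.values()):.2f}x to {max(speedup.values()):.2f}x")
```

Across the six settings the ratio lies between roughly 1.76 and 1.92, which is consistent with prediction being about twice as fast as A²S²K-ResNet alone.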
