Review

A Survey of Six Classical Classifiers, Including Algorithms, Methodological Characteristics, Foundational Variants, and Recent Advances

by Ali Hussein Alshammari 1,2,*, Gergely Bencsik 1 and Almashhadani Hasnain Ali 2
1 Department of Data Science and Engineering, Faculty of Informatics, Eötvös Loránd University, 1053 Budapest, Hungary
2 Ministry of Higher Education and Scientific Research, Baghdad 10001, Iraq
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(1), 37; https://doi.org/10.3390/a19010037
Submission received: 11 December 2025 / Revised: 23 December 2025 / Accepted: 29 December 2025 / Published: 1 January 2026
(This article belongs to the Special Issue Machine Learning for Pattern Recognition (3rd Edition))

Abstract

Classification is a core supervised learning task in data analysis, and six classical classifier families (k-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Logistic Regression, and Naïve Bayes) remain widely used in practice and underpin many subsequent variants. Although both single-family and multi-classifier surveys exist, there is still a gap for a method-centered study that, within a coherent framework, combines algorithmic representations for training and prediction, methodological characteristics, an explicit methodological comparison of the foundational variants within each family, and method-oriented advances published between 2020 and 2025. The survey is organized around a fixed set of performance-related perspectives, including accuracy, hyperparameter tuning, scalability, class imbalance, behavior in high-dimensional settings, decision-boundary complexity, interpretability, computational efficiency, and multiclass handling. It highlights strengths, weaknesses, and trade-offs across the six families and their variants, helping researchers and practitioners select or extend classification approaches. It also outlines future research directions arising from the limitations across the examined methods.

1. Introduction

In supervised learning, classification refers to learning a mapping from the feature representation of an instance to one class in a finite label set, typically returning a predicted label or class probability that generalizes beyond the training data [1,2]. Six classical approaches, namely k-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), and Naïve Bayes (NB), have long underpinned predictive modelling across a wide range of application domains [3,4]. Despite the prominence of deep learning, this review focuses on these six classical classifiers, which remain highly competitive and widely used, particularly for structured datasets and data-constrained settings [5,6], due to their interpretability [7], computational efficiency, and ease of deployment [8]. They continue to serve as baseline models in healthcare [9], finance [10], natural language processing [11], computer vision [12], cybersecurity [13], geospatial analysis [14], and industrial monitoring [15], and in several of these domains also as components of hybrid pipelines. Each approach rests on its own methodological foundation: KNN performs instance-based learning via distance metrics without explicit training; SVM maximizes a separating margin; DT partitions the feature space by impurity reduction; RF aggregates randomized trees to improve generalization; LR models class probabilities in a parametric framework; and NB applies conditional independence assumptions for simplicity and efficiency. Taken together, these learning principles encode distinct inductive biases across families and lead to complementary failure modes [16,17], motivating targeted refinements that aim to mitigate such weaknesses and extend the applicability of each family. As a group, these six classifiers form a compact, non-redundant panel that spans essential learning paradigms and hypothesis-space geometries in classical supervised learning. 
The survey literature is rich yet fragmented: many reviews study a single family [18,19,20]; others organize methods by application domain or task, often mixing classical and deep-learning approaches [11,13], or by methodological pipeline [21]; and broad ML or DL overviews emphasize taxonomies rather than a uniform methodological analysis with consistent reporting [22,23]. Consequently, there is still a gap for a method-centered survey that, within a coherent scheme, brings together canonical algorithmic representations for training and prediction; methodological characteristics examined under shared perspectives covering accuracy, hyperparameter tuning, scalability, class imbalance, behavior in high-dimensional settings, decision-boundary complexity, interpretability, computational efficiency, and multiclass handling; foundational variants; and method-oriented advances published between 2020 and 2025. Such a survey should also trace how each classifier has adapted to emerging challenges and how these developments redistribute the methodological trade-offs, while enabling explicit methodological comparisons. Overall, the review balances comprehensiveness with depth and provides a unified methodological account that clarifies capabilities, typical behaviors under varied data conditions, and limitations that motivate future research. The paper is organized as follows: Section 2 describes the review methodology; Section 3 presents the six classification approaches, methodological characteristics, and foundational variants; Section 4 reviews research by classifier family within the review window; and Section 5 outlines future research directions.

2. Methodology

This section describes the search and selection procedure used to identify the recent studies analyzed in Section 4. We targeted methodological work on the six classical classifier families considered in this review, published between 2020 and 2025. Studies were eligible if they proposed an extension or variant whose primary aim was to address at least one of the methodological perspectives defined in the introduction. We searched IEEE Xplore, ScienceDirect, SpringerLink, and MDPI. To broaden coverage, we used Google Scholar for backward and forward citation chasing. We retained only articles published in Scopus Q1 or Q2 journals. For each approach and database, we ran all-field queries, pairing each approach with {new, variant, novel, optimization, modification, enhancement}. We aggregated the hit counts across the term-specific queries to obtain the totals shown in Table 1. Because a single article can match multiple term queries or appear in more than one database–field combination, these totals may include overlap and were not treated as identification totals.
After obtaining these aggregated hit sets, we first worked within each database: for each classifier we removed duplicate records and, where necessary, residual non-journal or out-of-window items that had slipped past the database filters. We then carried out a brief title-level pass that removed records whose titles indicated non-classification targets or plainly application-only studies with no method-development signal; ambiguous cases proceeded to screening. For each classifier, we finally merged the cleaned lists from all databases into a single pool and removed any remaining duplicates. Table 2 reports the resulting per-classifier identification pools entering screening.
The screening phase consisted of two stages (title/abstract, then full text) according to the inclusion and exclusion criteria below:
Inclusion: English-language journal article in a Scopus Q1 or Q2 title; a clear methodological contribution to one of the six approaches addressing at least one target perspective; empirical evaluation on named datasets with standard metrics; full text accessible via subscription or legal open access.
Exclusion: conference-only versions; surveys, tutorials, or book chapters; purely empirical applications without methodological novelty; studies not addressing the target perspectives; incremental tweaks without substantive algorithmic change; non-stand-alone hybrids or ensembles lacking a distinct methodological contribution; non-English texts; insufficient methodological detail; studies not centered on the six approaches.
Full-text eligible papers were scored using a 10-point rubric (Novelty 0–3; Methodological soundness 0–2; Evaluation rigor 0–2; Reproducibility/availability 0–1; Clarity/interpretability 0–1; Impact/generality 0–1). To ensure balanced and comparable depth across the six classifier families, while keeping the synthesis feasible within manuscript length, we used an a priori target allocation of 5–6 representative studies per family. This target functions as a coverage-control strategy rather than a quality threshold. All full-text eligible papers were scored; when more than six papers were eligible for a family, we prioritized higher-scoring studies while ensuring coverage of distinct methodological issues. When rubric scores were tied, citation counts were used only as tiebreakers. Applying this rule uniformly prevents families with larger publication volumes from dominating the synthesis and supports consistent per-family reporting, yielding KNN 6, SVM 6, DT 5, RF 5, LR 5, and NB 5 (32 studies total).

3. Classification Approaches

For each classifier family, we first restate the canonical formulation, then summarize salient methodological properties, and finally outline the cornerstone variants that define the family’s development.

3.1. Support Vector Machines (SVM)

In 1992, Boser, Guyon, and Vapnik proposed the modern Support Vector Machine (SVM) formulation [24], building on Vapnik and Chervonenkis' theory. An SVM finds the separating hyperplane that maximizes the margin between two classes, where the margin is the shortest orthogonal distance between two parallel class-bounding hyperplanes that touch the support vectors. The separating hyperplane is

$$\mathbf{w}^\top \mathbf{x} + b = 0 \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^p$ is the weight vector and $b \in \mathbb{R}$ is the bias term. For a new sample $\mathbf{x}$, prediction depends on

$$\hat{y} = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x} + b\right), \qquad \hat{y} \in \{+1, -1\} \tag{2}$$

SVMs address three main data scenarios. Given a training set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{+1, -1\}$, the hard-margin SVM (linearly separable data) maximizes the margin by solving

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{s.t.} \quad y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right) \geq 1, \quad i = 1, \dots, n \tag{3}$$

For overlapping or noisy classes, the soft-margin SVM introduces slack variables $\xi_i \geq 0$ and solves

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n \tag{4}$$

where $C > 0$ is the regularization parameter controlling the trade-off between margin width and violations. To enable nonlinear separation, kernelized SVM replaces dot products with kernel evaluations $K(\mathbf{x}_i, \mathbf{x}_j)$ [25] and optimizes the dual

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \dots, n \tag{5}$$

Here, $\alpha_i \in [0, C]$ are Lagrange multipliers and nonzero values indicate support vectors. Beyond their theoretical appeal, SVMs are known for strong generalization performance, although training can become costly for very large datasets. The SVM procedure is described in Algorithm 1.
Algorithm 1 Support Vector Machine (Kernel Soft-Margin SVM)
Require: Training set $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$, $y_i \in \{+1, -1\}$; regularization parameter $C$; kernel function $K(\cdot, \cdot)$; tolerance $\varepsilon$.
Ensure: Support vectors $S$, their weights $\boldsymbol{\alpha}$, and bias $b$.
1: Function SVM($D$, $C$, $K$)
2: Compute the kernel similarities between all training pairs to form the kernel matrix.
3: Solve the kernelized dual optimization problem in (5) to obtain $\alpha_1, \dots, \alpha_n$.
4: Select support vectors: $S = \{i : \alpha_i > \varepsilon\}$.
5: Compute the bias $b$ using any support vector with $0 < \alpha_i < C$; if several exist, average the bias values.
6: Return $\{\mathbf{x}_i\}_{i \in S}$, $\{\alpha_i\}_{i \in S}$, $b$.
7: End function
Prediction: for a new point $\mathbf{x}$, compute the decision score and assign the label using (2).
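As a concrete illustration of Algorithm 1, the sketch below trains an RBF-kernel soft-margin SVM with scikit-learn, whose SVC estimator solves the kernelized dual internally; the synthetic ring-shaped dataset is an illustrative assumption, not data from the surveyed works.

```python
import numpy as np
from sklearn.svm import SVC

# Toy nonlinearly separable data: class +1 inside a ring, -1 outside.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

# RBF-kernel soft-margin SVM: C trades margin width against slack penalties.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

# The fitted model stores the support vectors and their dual weights
# (alpha_i * y_i); prediction sums kernel evaluations over them only.
n_sv = clf.support_vectors_.shape[0]
print(n_sv, clf.score(X, y))
```

Note that only a subset of the 200 training points ends up as support vectors, which is exactly the sparsity property exploited at prediction time.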

3.1.1. Characteristics

Support Vector Machines are highly sensitive to hyperparameter tuning, especially the regularization parameter C and kernel-specific parameters. The value of C controls the balance between maximizing the margin and tolerating misclassifications, and different kernel parameter settings can change both the decision boundary and the model's predictive ability, underscoring the importance of hyperparameter optimization. In practice, cross-validated grid search (usually coarse-to-fine) is the norm, and although more advanced methods can reduce costs, parameter search can remain time-consuming on large problems [26,27,28,29]. Linear SVMs offer some degree of transparency, because their learned coefficients and bias can be inspected to assess which features contribute most to the classification, whereas kernel SVM predictions depend on the combined influence of many support vectors through the kernel function, making the contribution of individual input features much less direct. Empirical work, particularly in clinical and public-health applications, therefore often treats SVM models with nonlinear kernels as non-interpretable and highlights that this opacity is problematic when decisions must be clearly explained to clinicians, domain experts, and policymakers [30,31]. Another limitation arises under class imbalance: in the standard soft-margin SVM, a single class-independent misclassification cost and a symmetric hinge loss encourage a decision boundary that shifts toward the majority class, effectively treating many minority examples as noise [32,33]. Consequently, minority-class instances concentrated near this skewed boundary are more likely to be misclassified and are vulnerable to label changes under small perturbations of their features. Beyond these issues, the scalability of standard SVM training and inference to large datasets remains a major concern.
The number of support vectors, and thus the number of non-zero Lagrange multipliers, often grows roughly linearly with the size of the training set [34], and because each prediction evaluates the kernel against all support vectors, inference costs increase substantially, restricting efficiency in high-volume or time-sensitive settings. On the positive side, SVMs can handle very high-dimensional feature spaces through the kernel trick: data are implicitly mapped to a high-dimensional feature space while learning depends only on kernel inner products. This improves the chance of separability without making the theoretical generalization bounds depend explicitly on the number of input dimensions; instead, they are controlled by margin-based quantities in the feature space [35]. At the same time, kernel evaluation still incurs computation over the original features, so per-evaluation cost depends on the chosen kernel and input dimension [36]. When data are linearly separable, SVMs construct the maximum-margin hyperplane, while kernelized SVMs allow nonlinear boundaries in higher-dimensional feature spaces, making it easier to separate data in more complex and realistic settings [37,38].
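The cross-validated grid search described above can be sketched as follows; the logarithmic grid over C and the RBF width gamma uses illustrative values, not recommendations drawn from the cited studies.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary problem standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Coarse logarithmic grid over C and gamma; in a coarse-to-fine scheme,
# a second, finer grid would be centered on the best cell found here.
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Each of the 16 parameter combinations is scored by 5-fold cross-validation, which is the main reason grid search becomes expensive on large problems.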

3.1.2. Foundational Variants

Classical Support Vector Machines (SVM) are limited by their lack of interpretability, high sensitivity to class imbalance, and increased computational demands as the number of support vectors grows. Table 3 summarizes key foundational variants introduced in earlier studies to mitigate these limitations.
Taken together, these variants show how early SVM research traded off simplicity, sparsity, and problem-specific tailoring. Some formulations simplify margin control or optimization but lose either separate control of errors and capacity or the sparsity of the standard soft-margin solution; others embed feature selection and asymmetric error costs to improve interpretability or class balance at the cost of extra hyperparameters; and structural changes reshape the optimization for specific regimes, increasing class coupling and overall algorithmic complexity.

3.2. K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) applies the nearest-neighbor decision rule, whose large-sample properties were analyzed in a seminal study by Cover and Hart, under the heuristic that nearby points in feature space tend to share the same class [45]. In its classical form, training consists of storing the labeled instances without fitting explicit model parameters, so most computation is deferred to classification time [46]. For prediction, features are typically scaled (min–max or z-score) [47], distances to all stored points are computed using Euclidean or Manhattan measures for continuous features [48], or Hamming distance for categorical features [49], and the label is assigned by majority vote among the k nearest neighbors [45]. The KNN procedure is described in Algorithm 2.
Algorithm 2 Classical k-Nearest Neighbors (KNN)
Require: Training set $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\}$; neighborhood size $k$; distance measure $d(\cdot, \cdot)$; scaling rule.
Ensure: Predicted class $\hat{y}$ for a query instance $\mathbf{x}$.
1: Function KNN($D$, $\mathbf{x}$, $k$)
2: Store all labeled instances in $D$.
3: Scale/standardize $\mathbf{x}$ and the stored data using the same rule.
4: For each $(\mathbf{x}_i, y_i) \in D$ do
5: Compute distance $d(\mathbf{x}, \mathbf{x}_i)$.
6: End for
7: Select the $k$ smallest distances and their labels.
8: Assign $\hat{y}$ by majority vote among those labels.
9: Return $\hat{y}$
10: End function
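Algorithm 2 condenses into a few lines; a minimal sketch with Euclidean distance, assuming the features are already on a common scale so the scaling step is omitted.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classical KNN (Algorithm 2): Euclidean distances, majority vote.
    Assumes features are pre-scaled."""
    d = np.linalg.norm(X_train - x, axis=1)      # step 5: all distances
    nearest = np.argsort(d)[:k]                  # step 7: k smallest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # step 8: vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))  # → 0
```

The query's three nearest neighbors include both class-0 points, so the majority vote returns class 0; note that all computation happens at prediction time, reflecting the deferred, instance-based design.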

3.2.1. Characteristics

The distinctive features of this algorithm determine how well it performs across different learning situations. KNN decisions depend strongly on the number of neighbors k, the algorithm's main hyperparameter, which makes tuning critical. With too few neighbors, the classifier becomes sensitive to noise and idiosyncrasies in the training data, whereas with too many neighbors, locally meaningful patterns may be lost because the decision boundary is effectively oversmoothed. Because of this sensitivity, tuning techniques are used to select suitable k values, and more recent methods even learn k in a data-driven way that reflects the underlying data distribution [50,51]. Moreover, when KNN is applied to instances described by many features, its classification error can decrease much more slowly with sample size as the dimension grows, so it may underperform in high-dimensional spaces [52]. As dimensionality increases, distances between points become less discriminative and appear nearly equal [53], making the high-dimensional geometry look more uniform as differences between local neighborhoods fade away [54]. Despite this limitation, KNN can produce a decision boundary that wraps around the data flexibly: the class label at each query point is determined by the labels of nearby training points, so the boundary can move locally in the feature space and follow very irregular or nonlinear class borders without assuming a fixed global shape, making the method locally adaptive. With an appropriate k and a sufficiently large sample, theoretical analyses show that KNN can achieve very low misclassification error across a broad range of problems, so these complex boundaries do not necessarily come at the cost of predictive accuracy [55,56].
In addition, KNN is transparent: each classification decision can be traced back to the training instances that produced it, and the final label follows from the votes of those instances, so predictions are directly explainable and do not depend on abstract coefficients [57,58]. Still, the majority-vote rule is heavily influenced by the most frequent class, which can drown out minority classes: in regions of the feature space where sample distributions are uneven, this bias causes KNN to misclassify minority-class instances, because standard KNN decisions are driven by local class frequencies, which under severe imbalance fail to reflect the true positive propensity of the minority class [59,60]. Lastly, KNN is a nonparametric method: it does not impose a fixed functional form on the class distributions or decision boundary and does not rely on a parameter vector whose structure is fixed in advance; instead, predictions are driven directly by the labels of nearby training points. As more samples populate a region of the feature space, the local neighbor votes become more stable and the classifier's empirical error can decrease, reflecting how nonparametric nearest-neighbor rules benefit from larger datasets [55]. At the same time, this instance-based, storage-heavy design makes nearest-neighbor search the main computational bottleneck, especially for large and high-dimensional datasets [61]. These traits reflect both the strengths and weaknesses of KNN, which is why the algorithm continues to be studied, applied, and refined in modern machine learning research.

3.2.2. Foundational Variants

Classical KNN is sensitive to the choice of k and the distance metric, and it requires storing and searching the full training set at prediction time. Foundational KNN variants therefore evolved along three main lines: prototype editing or condensation to shrink or clean the reference set and sharpen boundary behavior, vote reweighting to improve decisions in overlaps by reflecting unequal neighbor influence, and adaptive locality or metric learning to reduce sensitivity to neighborhood size and distance choice by aligning neighborhoods with the data structure. Table 4 presents foundational KNN variants from earlier work that were developed in response to these limitations.
Viewed collectively, these variants show how early KNN work traded simplicity for gains in efficiency and robustness. Editing/condensation reduces the stored reference set and can sharpen boundary behavior, but may remain sensitive to data order and to the chosen k , with worst-case retention still large. Distance-weighted and fuzzy voting improves decisions in overlap regions by giving nearer neighbors greater influence or using membership degrees, at the cost of additional design choices and tuning. Adaptive k selection and metric learning reduce sensitivity to neighborhood size and distance choice, improving discrimination while requiring offline search or costly training.
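The distance-weighted voting line of work summarized above can be sketched with inverse-distance weights; this particular weight function is one common choice used purely for illustration, not the formulation of any single cited variant.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, k=3, eps=1e-12):
    """Distance-weighted KNN vote: nearer neighbors get larger weights
    (inverse-distance weighting; one common choice among several)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    scores = {}
    for i in nearest:
        # Accumulate each neighbor's vote, weighted by 1 / distance.
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + 1.0 / (d[i] + eps)
    return max(scores, key=scores.get)

X_train = np.array([[0.0, 0.0], [2.0, 2.0], [2.1, 2.1]])
y_train = np.array([0, 1, 1])
# Query near class 0: plain 3-NN majority voting would return 1 (two votes),
# but inverse-distance weighting favors the much closer class-0 point.
print(weighted_knn_predict(X_train, y_train, np.array([0.1, 0.1]), k=3))
```

This is exactly the overlap-region behavior discussed above: reweighting lets one very close neighbor outvote two distant ones.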

3.3. Decision Tree

A decision tree is a hierarchical structure generated by recursive partitioning of class-labeled training samples using splitting criteria. At each node, split-quality measures choose the most useful feature and its best threshold to make leaf nodes as pure as possible. The process starts with the root containing the whole training set and continues branching to increase class similarity until a stopping condition is met (pure nodes, maximum depth, or a minimum-sample requirement). The leaf nodes define hyper-rectangular regions in feature space and assign a class label by majority vote. For a new instance, the model traverses from the root by comparing feature values with thresholds until reaching a labeled leaf [68,69]. Foundational induction algorithms include ID3 [70], C4.5 [68], and CART [69], each using a distinct split-quality measure.
ID3 employs Information Gain (IG) based on Shannon entropy. For samples $D$ at node $N$,

$$H(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad p_i = \frac{|D_i|}{|D|} \tag{6}$$

where $m$ is the number of classes and $p_i$ is the class-$i$ proportion. For splitting on feature $A$ with $v$ distinct values,

$$H(D \mid A) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} H(D_j) \tag{7}$$

Thus

$$IG(D, A) = H(D) - H(D \mid A) \tag{8}$$

C4.5 uses the Gain Ratio (GR), which refines IG by normalizing through Split Information:

$$SplitInfo(D \mid A) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \tag{9}$$

$$GainRatio(D, A) = \frac{IG(D, A)}{SplitInfo(D \mid A)} \tag{10}$$

Accordingly, the split with the highest Gain Ratio is selected. CART instead selects the binary split that maximizes impurity reduction using the Gini Index. For a set $D$,

$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2 \tag{11}$$

For a candidate split $A$ producing $D_1$ and $D_2$,

$$Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2) \tag{12}$$

Split $A$ is chosen if it gives the largest decrease between $Gini(D)$ and $Gini(D, A)$ compared to other options. The DT procedure is described in Algorithm 3.
Algorithm 3 Classical Decision Tree Induction (DT)
Require: Training set $D = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$; feature set $J$; split-quality measure $Q$ (IG, GR, or Gini); stopping rules.
Ensure: Decision tree $T$.
1: Function DT($D$, $J$)
2: If $D$ is empty then return a leaf with a default class.
3: If all labels in $D$ are identical then return that leaf.
4: If $J$ is empty or a stopping rule holds then return a leaf with the majority class in $D$.
5: Select the best split (feature/threshold) by $Q$.
6: Create a node and partition $D$ into $\{D_j\}$.
7: For each $D_j$ do attach DT($D_j$, $J \setminus \{\text{feature}\}$).
8: Return the node.
9: End function
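The split-quality measures used in step 5 can be checked numerically; a minimal sketch of entropy, Information Gain, and the Gini Index on a toy partition follows.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(D) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini index of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    """IG = parent entropy minus the size-weighted child entropies."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# Toy split: 8 balanced labels, partitioned into two pure halves.
D = np.array([0, 0, 0, 0, 1, 1, 1, 1])
D1, D2 = D[:4], D[4:]
print(entropy(D), information_gain(D, [D1, D2]), gini(D))
```

For this balanced parent, entropy is 1 bit and Gini is 0.5; the perfect split yields pure children with zero entropy, so the Information Gain equals the full 1 bit.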

3.3.1. Characteristics

For each prediction, decision trees create a unique path from the root to a leaf, turning complex decisions into simple if-then rules and allowing practitioners to see exactly which features and thresholds led to the result and whether the domain's rules were followed. This transparency is central to the trade-off between a tree's simplicity and its predictive power, since larger trees are often more accurate but harder to interpret. Model complexity and interpretability can also be controlled by a small set of structural hyperparameters, such as limits on depth, the minimum number of samples required to split an internal node or form a leaf, or constraints on the number of leaves, which effectively bound the size and shape of the tree. However, tightening these limits produces smaller trees that are less prone to overfitting but may sacrifice some fit to the training data, while loosening them yields larger trees that can improve training accuracy but are harder to interpret and more likely to overfit; in practice, cost-complexity pruning and related regularized impurity-reduction criteria are used to tune this accuracy–complexity trade-off [71,72]. In the feature space, classical decision trees partition the input into axis-aligned rectangular regions by recursively splitting internal nodes on a single feature to increase the purity of the resulting child nodes, with each leaf corresponding to one such region. Taken together, these local regions induce a piecewise, highly nonlinear decision boundary, allowing tree classifiers to approximate complex decision regions that cannot be separated by a single linear hyperplane [73].
In contrast to parametric models, decision trees do not employ a fixed mathematical formula to encapsulate feature interactions; rather, they derive splits from the data to maximize impurity-based purity criteria, so nonlinear interactions are captured by the sequence of splits along each root-to-leaf path instead of through a pre-specified parameter vector [71]. This complexity-based flexibility allows classical axis-aligned decision trees (such as CART and C4.5) to adapt to structural properties of the underlying model, including sparsity, smoothness, and interaction structure [74]. At the same time, growing very deep trees in this purely data-driven way substantially increases the risk of overfitting, so practical implementations rely on strategies to control tree size and improve generalization [75]. High-dimensional datasets are harder for decision trees to handle, because accessing many sparse features at each node is computationally expensive and a large number of non-informative or noisy attributes can degrade predictive performance [76]. As the tree grows deeper, the samples available in individual nodes become sparse, which makes split selection more variable and less reliable [77]. These effects tend to produce unnecessarily large and complex trees whose decision boundaries overfit the training data, so their behavior on unseen examples becomes unstable and less trustworthy [78]. Moreover, because impurity-based split measures naturally favor splits on majority classes that reduce impurity the most, decision trees are especially sensitive to class imbalance, where the gain from isolating a small minority class is limited [79]. Under such imbalance, classical trees often achieve high overall accuracy by focusing on the majority class, while the decision regions for minority classes become small, fragmented, and inaccurate, so rare but important categories are systematically under-detected [80,81].
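In practice, the imbalance sensitivity described above is often mitigated by reweighting classes inside the impurity computation; a minimal sketch using scikit-learn's class_weight option, shown as an illustrative remedy rather than one drawn from the surveyed variants.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Imbalanced binary problem: roughly 95% class 0, 5% class 1.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# class_weight="balanced" rescales sample weights inversely to class
# frequency, so minority-class errors contribute more impurity reduction.
weighted = DecisionTreeClassifier(max_depth=3, class_weight="balanced",
                                  random_state=0).fit(X, y)

# Minority-class (label 1) recall under each weighting scheme.
print(recall_score(y, plain.predict(X)),
      recall_score(y, weighted.predict(X)))
```

Because reweighting makes minority splits worth more impurity reduction, the balanced tree tends to carve out decision regions for the rare class that the unweighted tree would ignore.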

3.3.2. Foundational Variants

Classical decision trees yield transparent rule-based models, but they can become deep and unstable in the presence of many weak or noisy predictors, and their training cost can rise sharply on large or high-dimensional datasets. Foundational decision-tree variants therefore developed along three complementary lines: statistical-test-based splitting with multiway partitions to favor shallower structures, streaming induction to support high-speed or unbounded data via incremental statistics, and disk-resident or parallelizable construction to scale training through presorting and level-wise scans. Table 5 summarizes representative variants and their trade-offs.
Taken together, these variants show how early work extended classical trees by prioritizing different bottlenecks. Significance-based splitting promotes shallower, more interpretable trees, but incurs heavier per-node statistical testing that can become difficult in high-dimensional settings. Streaming-oriented induction maintains incremental counts and uses concentration bounds to trigger splits, enabling continuous learning at speed, yet it depends on heuristic thresholds and can be less effective when dimensionality is high. Disk-resident and parallelizable designs improve scalability by restructuring training around presorted attribute lists and sequential, level-wise scans, but training cost remains essentially linear in the number of attributes, which still constrains very high-dimensional regimes.

3.4. Logistic Regression (LR)

First introduced by Cox in 1958, logistic regression is a statistical technique used mainly for binary classification tasks within the generalized linear model framework [86,87]. It uses the logistic (sigmoid) function to estimate the probability $p$ that a new instance belongs to the target class and assigns class 1 when this probability exceeds a decision threshold, usually 0.5. The model is formulated through log-odds, where the odds are $p/(1-p)$ and their logarithm (logit) yields a linear predictor. For one predictor,

$$\operatorname{logit}(p) = \log \frac{p}{1-p} = \beta_0 + \beta_1 X_1 \tag{13}$$

A general form with multiple predictors is given by

$$\operatorname{logit}(p) = \beta_0 + \sum_{i=1}^{n} \beta_i X_i \tag{14}$$

where $X_i$ are predictors, $\beta_i$ are coefficients, and $\beta_0$ is the intercept. The predicted probability is obtained by the inverse logit,

$$p = \frac{\exp\left(\beta_0 + \sum_{i=1}^{n} \beta_i X_i\right)}{1 + \exp\left(\beta_0 + \sum_{i=1}^{n} \beta_i X_i\right)} \tag{15}$$

Coefficients are estimated iteratively by maximizing the log-likelihood,

$$l(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] \tag{16}$$

where $y_i \in \{0, 1\}$ are labels, $p_i$ are the model probabilities, and $N$ is the number of samples. The binary model was later extended to multiclass problems by using one logit per class and softmax, yielding multinomial logistic regression [88]. The LR procedure is described in Algorithm 4.
Algorithm 4 Binary Logistic Regression (LR)
Require: Training set D = {(x_1, y_1), …, (x_N, y_N)}, y_i ∈ {0, 1}; convergence tolerance ε; decision threshold τ (usually 0.5).
Ensure: Coefficients β .
1: function LR ( D , ε )
2: Initialize β .
3: Repeat
4: Compute p_i for all samples using (15).
5: Evaluate l(β) using (16).
6: Update β to increase l(β).
7: Until the change in β or l(β) is below ε.
8: Return β
9: End function
Prediction: for a new point x, compute p by (15); assign class 1 if p ≥ τ, else class 0.
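As an informal illustration of Algorithm 4, the NumPy sketch below fits a binary logistic model by plain gradient ascent on the log-likelihood; the step size, iteration cap, convergence check, and synthetic two-class data are illustrative assumptions, not part of the classical formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.1, eps=1e-6, max_iter=10000):
    """Binary logistic regression via gradient ascent on the log-likelihood."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X1 @ beta)            # predicted probabilities, eq. (15)
        grad = X1.T @ (y - p) / len(y)    # gradient of eq. (16), averaged
        beta_new = beta + lr * grad
        if np.max(np.abs(beta_new - beta)) < eps:  # convergence on β change
            return beta_new
        beta = beta_new
    return beta

def predict_lr(beta, X, tau=0.5):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(X1 @ beta) >= tau).astype(int)

# Toy, well-separated two-class data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
beta = fit_lr(X, y)
acc = (predict_lr(beta, X) == y).mean()
```

Note that with perfectly separable data the maximum-likelihood coefficients diverge, which is exactly the (quasi-)separation issue mentioned in Section 3.4.1; the iteration cap here acts as an informal stopping rule.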

3.4.1. Characteristics

Logistic regression exhibits a range of advantages and limitations that become evident under different modeling contexts and application scenarios. First, this model does not include any intrinsic regularization hyperparameters. It estimates the weight vector by maximizing the likelihood, and at prediction time it typically applies a fixed classification decision threshold, usually 0.5. The tuning settings primarily affect training stability and speed, and provided the optimizer converges to the same maximum-likelihood solution, do not change the fitted model [89]. However, in practice, stopping optimization early or not driving the loss fully to its minimum can behave similarly to regularization, because it prevents coefficients from becoming too extreme and thus shifts the bias–variance balance. These optimization settings include the optimizer type (either Iterative Reweighted Least Squares or gradient descent), the maximum number of iterations, and the convergence tolerance threshold. Second, the model’s mapping from features to outcomes has a fixed functional form: it uses one coefficient per feature and an intercept. The fixed structure maintains model complexity once the feature set is defined, regardless of the addition of new data. Consequently, this model can achieve stable parameter estimates with relatively modest datasets when there are sufficient outcome events per parameter [90] and no (quasi) separation [91], and it exhibits predictable (excess-risk) convergence under maximum likelihood as the data size increases [92]. However, its rigid linear form introduces bias if the patterns between features are nonlinear [10]. Third, assigning one coefficient per feature becomes problematic as dimensionality grows relative to the number of observations [93]. 
With few observations per coefficient, or when correlated features lead to unstable estimation, such as multicollinearity (strong correlations among predictors that make it difficult for the model to distinguish their individual effects), small data fluctuations can cause large swings in the estimates [94]. Consequently, logistic regression requires sample sizes that increase with the number of parameters to maintain reliable performance; otherwise, variance inflation and overfitting can substantially reduce accuracy [90]. Fourth, fitted coefficients offer transparency in logistic regression. Each coefficient indicates how a one-unit change in its corresponding feature influences the predicted probability via a shift in the log-odds, allowing for the ranking of feature importance [95]. The clear impact of these weights supports transparent mappings from inputs to outputs, and the predicted probabilities can be interpreted as class-membership probabilities when the model is well calibrated [96]; formal uncertainty, however, requires confidence intervals or analogous measures rather than raw probability scores [97]. This simplicity can communicate model reasoning and audit feature effects, but it is based on linear relationships; interpretability may be reduced if the model is extended to capture interactions or curved patterns [98]. Fifth, in class-imbalanced scenarios, the maximum-likelihood fit reflects the base rate primarily through the intercept [99], so with a 0.5 cutoff the model often misses the rare class because few minority instances achieve probabilities above the threshold, while the majority class is easily labeled [100]. The lack of a built-in mechanism to correct this skew highlights the importance of adjusting the decision threshold and/or using class weighting or resampling to achieve balanced performance [101]. 
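The class-weighting remedy mentioned above can be sketched by placing inverse-frequency sample weights inside the likelihood gradient; the toy imbalanced data, optimizer settings, and weighting scheme below are illustrative assumptions, not a prescribed method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_weighted_lr(X, y, w, lr=0.5, n_iter=3000):
    """Logistic regression with per-sample weights w in the likelihood gradient."""
    X1 = np.hstack([np.ones((len(y), 1)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X1 @ beta)
        beta += lr * X1.T @ (w * (y - p)) / w.sum()
    return beta

rng = np.random.default_rng(1)
# Imbalanced toy data: 95 majority vs. 5 minority samples (assumed setup).
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(1.5, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

# Inverse-frequency class weights, mirroring "balanced" reweighting.
w = np.where(y == 1, len(y) / (2 * y.sum()), len(y) / (2 * (y == 0).sum()))

beta_plain = fit_weighted_lr(X, y, np.ones(len(y)))   # standard ML fit
beta_bal = fit_weighted_lr(X, y, w)                   # class-weighted fit

X1 = np.hstack([np.ones((len(y), 1)), X])
recall_plain = ((sigmoid(X1 @ beta_plain) >= 0.5) & (y == 1)).sum() / y.sum()
recall_bal = ((sigmoid(X1 @ beta_bal) >= 0.5) & (y == 1)).sum() / y.sum()
```

With the weights, the fit no longer lets the majority base rate dominate the intercept, so minority recall at the fixed 0.5 threshold typically improves; lowering the threshold itself is the complementary remedy discussed above.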
Finally, the linearity of the decision boundary follows from the fact that logistic regression is based on a linear function of the predictors. Any fixed probability level corresponds to a straight dividing surface in the feature space, and the conventional 0.5 threshold simply selects one such dividing line [102]. The absence of curvature in this boundary makes it effective when classes are separated along a single direction. However, while this linearity can enhance interpretability and facilitate efficient training, it also restricts the variety of decision boundaries the model can represent; when the true class structure is strongly nonlinear, this limited flexibility can reduce achievable classification accuracy relative to more expressive models [10].

3.4.2. Foundational Variants

Classical logistic regression offers a transparent linear logit model, but it lacks built-in regularization, can become unstable with many or highly correlated predictors, is limited under strongly nonlinear effects, and has no dedicated mechanism for rare or heavily imbalanced events. Table 6 presents foundational variants that extend the model to address these limitations in practice.
At a high level, those developments can be grouped into three main directions. Regularization-based extensions stabilize estimation in high-dimensional or collinear settings and can induce sparsity or grouping, at the cost of shrinkage bias and the need to tune penalty strength. Response-structure and bias-reduction extensions broaden the model to ordered or multi-class outcomes and adjust for small samples, separation, and rare events, but add parameters and rely on stronger assumptions about how predictor effects behave across categories or samples. Nonlinear extensions relax the strict linear logit by replacing linear predictors with smooth additive terms, improving flexibility when effects are curved while increasing computational burden and leaving performance sensitive to smoothing choices when the classical model is adequate.
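A minimal sketch of the regularization-based direction, assuming a simple gradient-ascent fit with an L2 (ridge) penalty on the slope coefficients only; the synthetic collinear predictors and the penalty strength are illustrative choices, not part of any specific surveyed variant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr_l2(X, y, lam=0.0, lr=0.2, n_iter=5000):
    """Logistic regression with an optional L2 (ridge) penalty on the slopes.

    lam = 0 recovers plain maximum likelihood; the intercept is not penalized.
    """
    X1 = np.hstack([np.ones((len(y), 1)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X1 @ beta)
        grad = X1.T @ (y - p) / len(y)
        grad[1:] -= lam * beta[1:]      # shrink slope coefficients only
        beta += lr * grad
    return beta

rng = np.random.default_rng(2)
# Two strongly correlated predictors (assumed), the setting where ridge-style
# shrinkage stabilizes otherwise unstable coefficient estimates.
x1 = rng.normal(0, 1, 200)
X = np.column_stack([x1, x1 + rng.normal(0, 0.05, 200)])
y = (x1 + rng.normal(0, 0.5, 200) > 0).astype(int)

beta_mle = fit_lr_l2(X, y, lam=0.0)
beta_ridge = fit_lr_l2(X, y, lam=1.0)
```

The penalty trades a small shrinkage bias for reduced variance: the ridge solution's slope norm is smaller and far less sensitive to which of the two near-duplicate predictors absorbs the effect.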

3.5. Naïve Bayes

The Naïve Bayes classifier is a probabilistic classification method grounded in Bayes’ theorem. Bayes’ theorem updates the probability of a hypothesis given new evidence, and the posterior is given by
P(H | E) = P(E | H) P(H) / P(E)        (17)
Here, P(H | E) is the posterior of hypothesis H given evidence E, P(H) is the prior, and P(E) is a normalization constant. In classification, each observation X ∈ R^n is a feature vector X = (x_1, …, x_n) and belongs to one of m classes {C_1, …, C_m}. For a new instance X, the model computes P(C_i | X) for each class and assigns the class with maximum posterior probability:
C* = argmax_{1 ≤ i ≤ m} P(C_i | X)        (18)
Naïve Bayes assumes features are conditionally independent given the class, so the joint likelihood factorizes into a product of per-feature likelihoods. Although this assumption can be unrealistic, Naïve Bayes often achieves competitive predictive performance [110]. Three widely used variants arise from different feature-distribution assumptions: Gaussian NB for continuous features with class-conditional Gaussian densities [111]; Multinomial NB for discrete count features such as word frequencies in text; and Bernoulli NB for binary features, often outperforming Multinomial NB when presence is more informative than frequency [112]. The NB procedure is described in Algorithm 5.
Algorithm 5 Naïve Bayes (generic classical NB)
Require: Training set D = {(x_1, y_1), …, (x_N, y_N)}, y_i ∈ {C_1, …, C_m}; choice of likelihood model (Gaussian, Multinomial, or Bernoulli).
Ensure: Class priors and feature-wise likelihood parameters.
1: Function NB   ( D )
2: Estimate class priors P ( C i ) from class frequencies.
3: For each class C i , estimate per-feature likelihood parameters P ( x j C i ) using the chosen variant.
4: Return priors and likelihood parameters.
5: End function
Prediction: for a new point X , compute a posterior score for each class using (17) under the independence assumption, then assign the class by (18).
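Algorithm 5 under the Gaussian likelihood can be sketched as follows; the small variance floor and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_gnb(X, y):
    """Gaussian Naive Bayes training: class priors plus per-feature mean and
    variance, i.e., the per-class statistics of Algorithm 5."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    stats = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
             for c in classes}  # tiny variance floor for numerical safety
    return classes, priors, stats

def predict_gnb(model, X):
    classes, priors, stats = model
    preds = []
    for x in X:
        scores = []
        for c in classes:
            mu, var = stats[c]
            # Log-posterior up to a constant: log prior plus the sum of
            # per-feature Gaussian log-likelihoods (independence factorization).
            ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            scores.append(np.log(priors[c]) + ll)
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)

# Toy two-class data (assumed for illustration).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1, (60, 3)), rng.normal(1, 1, (60, 3))])
y = np.array([0] * 60 + [1] * 60)
model = fit_gnb(X, y)
acc = (predict_gnb(model, X) == y).mean()
```

Because scoring works in log-space, the per-feature terms that are summed here are exactly the feature-wise contributions discussed below as the basis of Naïve Bayes' intrinsic interpretability.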

3.5.1. Characteristics

From a training and hyperparameter perspective, the classical Naïve Bayes classifier does not involve extensive tuning over complex combinations of hyperparameters; in its probabilistic form, it is nearly hyperparameter-free, with the main practical adjustment in the discrete variants arising when the association of a particular feature with a class membership is unobserved [113]. In these instances, the model assumes non-zero probabilities by employing standard additive smoothing techniques to mitigate zero-probability issues, such as Laplace or Lidstone smoothing, where a smoothing constant is added to the observed counts and its magnitude controls how much probability mass is shifted toward unseen feature–class combinations [114,115]. In contrast, for Gaussian Naïve Bayes the discrete zero-probability issue does not arise; class-conditional likelihoods are parameterized by per-class means and variances estimated from the data, and in the classical formulation no additional hyperparameters are tuned beyond these distribution parameters [116]. By avoiding explicit feature interactions and computing class scores from per-feature likelihoods and class priors, the classical Naïve Bayes classifier reduces training to estimating class priors and per-feature likelihoods from simple counts, and prediction to summing these contributions for each class. This structure keeps time and memory requirements low and lets the method remain efficient as the number of features and classes grows, because once the per-class statistics have been estimated, predictions use only these stored quantities rather than revisiting the full training set [117]. At the same time, real data often exhibits substantial dependencies among predictors, so the conditional-independence assumption is violated and can lead to poor probability estimates and reduced accuracy in some settings. 
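The additive smoothing described above amounts to a one-line estimator; the smoothing constant and toy counts below are illustrative.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """Additive smoothing for a discrete feature value given a class:
    alpha = 1 gives Laplace smoothing, 0 < alpha < 1 Lidstone smoothing.
    n_values is the number of distinct values the feature can take."""
    return (count + alpha) / (class_total + alpha * n_values)

# A feature value never observed with this class still receives non-zero
# probability, avoiding a zero that would wipe out the whole product.
p_unseen = smoothed_likelihood(count=0, class_total=100, n_values=5)
p_seen = smoothed_likelihood(count=40, class_total=100, n_values=5)
```

Larger values of `alpha` shift more probability mass toward unseen feature–class combinations, which is the knob referred to above.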
Nevertheless, empirical and analytical studies on text and other benchmark classification tasks still find classical Naïve Bayes to be simple and fast to train and test, and often to attain accuracy comparable to more sophisticated classifiers [118,119]. Also, as a simple parametric model with a fixed, finite set of class-prior and class-conditional parameters estimated directly from class–attribute frequency counts, the classical Naïve Bayes classifier can be fitted quickly even on large datasets [120], and it admits explicit finite-sample consistency results and error bounds for its decision rule under the standard conditional-independence assumptions [121]. Building on this efficiency, the same additive form also makes Naïve Bayes transparent: under the conditional-independence assumption, class scores (or log-odds) decompose into per-feature terms, so the contribution of each input feature to a prediction is explicit, and in Gaussian Naïve Bayes these feature-wise log terms can be inspected directly and used as model-intrinsic ground truth when evaluating local explanation methods [122]. Naïve Bayes is therefore considered an intrinsically interpretable classifier, with these per-feature contributions providing a built-in explanation of its predictions [123]. A known limitation concerns class imbalance: as the difference in training frequencies among classes grows, posterior predictions increasingly favor the majority class through the priors, so minority classes can receive persistently low posterior probabilities unless the class priors or misclassification costs are adjusted [124]. Empirical studies and surveys of Naïve Bayes on skewed and imbalanced datasets report that performance on rare classes degrades under such conditions and that rebalancing or cost-sensitive strategies are needed to improve recognition of minority classes [125]. 
Finally, in terms of decision geometry for the discrete Naïve Bayes model, the same factorization implies that the class posterior can be written in log-space as a sum of feature-wise terms plus a class-specific constant, that is, an affine scoring function in the feature space corresponding to linear decision surfaces between classes [126].

3.5.2. Foundational Variants

According to its methodological characteristics, Naïve Bayes is computationally light and almost hyperparameter-free, yet its conditional-independence assumption and simple likelihood forms can misrepresent dependencies among predictors, skewed class frequencies, and complex continuous distributions. Several foundational variants, summarized in Table 7, were proposed to address these limitations while retaining the same basic probabilistic scoring scheme.
Within this family of variants, some approaches introduce limited dependencies or select a subset of predictors to reduce the impact of correlated or redundant features, but risk fragmenting counts and overfitting to the chosen dependency structure or feature subset. Another variant replaces simple parametric likelihoods with more flexible density estimates, better matching irregular continuous data while increasing prediction cost and sensitivity to bandwidth and other tuning choices. A further variant reweights class and feature statistics to counter majority-class dominance in sparse, high-dimensional settings, adding modest computational overhead and moving slightly away from the most straightforward generative interpretation of Naïve Bayes.

3.6. Random Forest (RF)

Breiman first formalized the Random Forest algorithm in 2001 [132]. It is an ensemble learning method that builds a group of decision trees and uses majority voting to produce a more reliable final prediction. Each tree is trained on a bootstrap resample of the data, so the same observation may appear more than once in a tree’s training set. At each internal node, a random subset of features is considered to find the split that gives the most information, typically evaluated by standard impurity measures such as the Gini index or entropy. After training, each tree predicts a class for a new instance, and the class receiving the most votes across trees is selected as output, reflecting the bagging (bootstrap aggregating) strategy. Breiman distinguished two variants: Forest-RI (Random Input Selection), which randomizes the input features at each split, and Forest-RC (Random Combination), which creates synthetic features at each node by linearly combining randomly chosen original features using coefficients in [−1, 1] then splitting on the best of these candidates. The RF procedure is described in Algorithm 6.
Algorithm 6 Random Forest
Require: Training set D ; number of trees T ; number of candidate features per split m ; split criterion Q (Gini or entropy); stopping rules.
Ensure: A forest of T decision trees { h 1 , , h T } .
1: Function RF ( D , T , m , Q )
2: Initialize an empty forest F .
3: For t = 1 to T do
4: Draw a bootstrap sample D_t from D.
5: Grow a decision tree h_t on D_t:
6: At each node, randomly select m features.
7: Select the best split by Q.
8: Repeat splitting until a stopping rule holds.
9: Add h_t to F.
10: End for
11: Return F
12: End function
Prediction: for a new point x , collect votes from all trees in F and output the majority class.
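A compact sketch of Algorithm 6 (bootstrap resampling, per-node random feature subsets, Gini-based splits, majority voting); tree depth, forest size, and the toy data are illustrative assumptions, and the only stopping rules implemented are purity and a depth cap.

```python
import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feat_idx):
    """Search the best (feature, threshold) among a random feature subset."""
    best = (None, None, gini(y))
    for j in feat_idx:
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue
            score = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]

def grow_tree(X, y, m, depth, rng):
    if depth == 0 or len(np.unique(y)) == 1:
        return np.bincount(y).argmax()            # leaf: majority class
    feat_idx = rng.choice(X.shape[1], size=m, replace=False)
    j, t = best_split(X, y, feat_idx)
    if j is None:                                 # no impurity-reducing split
        return np.bincount(y).argmax()
    left = X[:, j] <= t
    return (j, t, grow_tree(X[left], y[left], m, depth - 1, rng),
            grow_tree(X[~left], y[~left], m, depth - 1, rng))

def tree_predict(node, x):
    while isinstance(node, tuple):
        j, t, l, r = node
        node = l if x[j] <= t else r
    return node

def rf_fit(X, y, T=25, m=1, depth=3, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(T):
        idx = rng.integers(0, len(y), len(y))     # bootstrap resample of D
        forest.append(grow_tree(X[idx], y[idx], m, depth, rng))
    return forest

def rf_predict(forest, X):
    votes = np.array([[tree_predict(h, x) for x in X] for h in forest])
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Toy two-class data (assumed for illustration).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.5, 1, (60, 2)), rng.normal(1.5, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
forest = rf_fit(X, y)
acc = (rf_predict(forest, X) == y).mean()
```

This corresponds to the Forest-RI variant: only original features are candidates at each node, with m of them drawn at random.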

3.6.1. Characteristics

Random Forest has unique features that explain both its consistent success in the real world and the problems that make it hard to use in practice. One of its most obvious strengths is that it works well in high-dimensional feature spaces, where the split search at each internal node is done on a random group of candidate features [133]. This random-subspace step reduces the number of feature evaluations required at each node and, in practice, acts as implicit regularization that often reduces variance and the risk of overfitting, even in very high-dimensional situations [134]. However, overall training and prediction cost still scale with the number of trees and with tree size (number of nodes) [135]. In addition to this computational benefit, Random Forest is more flexible because each decision tree makes a piecewise axis-aligned boundary from its feature splits. When these boundaries are combined through majority voting, they yield a highly flexible, data-adaptive decision surface (not restricted to a single global axis-aligned hyperplane) [136]. This flexibility enables Random Forests to capture higher-order feature interactions that a single tree may miss, improving predictive accuracy [137]. However, this predictive strength comes at the cost of interpretability. While each tree in the ensemble can be represented as a set of explicit rules, the aggregation of many trees through majority voting makes it harder to see global decision logic, which limits transparency at the ensemble level [138]. Another important feature is that Random Forest is nonparametric: its impurity-based splitting criteria and tree structure do not impose a fixed functional form between features and labels, but let data-driven partitions determine how the response depends on the inputs, which allows complex nonlinear relationships to be learned without specifying an explicit equation in advance [139]. 
When classes are imbalanced, however, the model can underperform, because impurity criteria and the scarcity of minority examples at deeper nodes bias splits toward the majority, and the default majority-vote decision rule further reinforces this tendency [79]. As a result, the ensemble often favors the dominant class and misses rare but important patterns, unless remedies such as class weights, balanced subsampling, or adjusted decision thresholds are used to correct this bias [140,141]. Lastly, the model’s behavior is strongly influenced by its hyperparameters, including the number of trees in the ensemble, the number of input features randomly considered when choosing each split, the minimum number of samples required in a terminal node, and, in many implementations, explicit limits on maximum depth or maximum number of leaves [142]. Increasing the number of trees typically stabilizes predictions but also increases training and prediction time, with empirical studies showing that runtime grows approximately linearly with the forest size. Finer structure in the data can be captured by growing deeper trees or allowing smaller leaves, but if terminal nodes become too small the resulting partitions become overly specific to the training data and the ensemble becomes more prone to overfitting [19,135]. The size of the feature subset examined at each split controls how much extra randomness is injected into the ensemble: using fewer candidate features produces more random, less correlated trees, whereas using more features produces stronger but more similar trees, reflecting a bias–variance style trade-off [134].
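Assuming scikit-learn is available, the hyperparameters discussed above map directly onto `RandomForestClassifier` arguments; the synthetic imbalanced dataset and all settings below are illustrative, not recommendations.

```python
# Sketch assuming scikit-learn; dataset and hyperparameter values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy problem: roughly 90% majority vs. 10% minority.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,         # more trees: more stable votes, linearly slower
    max_features="sqrt",      # candidate features per split (tree decorrelation)
    min_samples_leaf=5,       # larger leaves guard against overly specific splits
    max_depth=None,           # grow until the stopping rules are met
    class_weight="balanced",  # reweight classes to counter majority bias
    random_state=0,
).fit(X, y)

# Accuracy restricted to minority-class samples, i.e., training-set recall.
minority_recall = rf.score(X[y == 1], y[y == 1])
```

Class weighting here is the built-in remedy for the majority bias described above; balanced subsampling and threshold adjustment on `predict_proba` outputs are the alternatives.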

3.6.2. Foundational Variants

Earlier performance analysis highlighted that classical Random Forest combines strong predictive accuracy in high-dimensional spaces with limitations in computational cost, reliance on axis-aligned impurity-based splits, and difficulty fully exploiting correlated predictors or avoiding split-selection bias. To provide historical context, Table 8 summarizes Random Forest variants that targeted these issues.
These variants point in three directions. Stronger randomization in feature and threshold choice or rotation of the feature space seeks greater ensemble diversity and variance reduction, often improving accuracy but adding preprocessing and leaving key settings tuned heuristically. Conditional-inference splits replace standard impurity-based rules to reduce variable-selection bias, at the cost of extra tests at each node. Oblique-split forests use linear combinations of features to better capture correlation and high-dimensional structure, yielding smoother, often more accurate boundaries while increasing training complexity and still relying on heuristically chosen hyperparameters.

4. Recent Research

This section reviews advances published during 2020–2025 for each classifier family, noting the targeted problem, the methodological contribution, the main results, and limitations. Each subsection concludes with a summary table for quick reference and key outcomes: Table 9 (SVM), Table 10 (KNN), Table 11 (DT), Table 12 (LR), Table 13 (NB), Table 14 (RF).

4.1. Recent Research on Support Vector Machine (SVM)

While standard SVM assigns equal weight to all samples, which biases the decision boundary toward the majority class and weakens minority recognition, the fuzzy support vector machine (FSVM) [147] addresses this by assigning membership weights that down-weight noisy samples and emphasize representative ones. Tao et al. [148] provided a new membership function based on two kernel-derived quantities, affinity and class probability, to improve discrimination under class imbalance. While affinity is measured by the distance of each sample to the center of the compact hypersphere enclosing the class, computed using the Support Vector Data Description (SVDD) model [149], the kernel k-nearest-neighbor (KNN) method [150] estimates class probability from the proportion of same-class neighbors, reflecting local label consistency. Although this integration strengthens classification, it increases sensitivity to hyperparameters including the SVDD regularization constant, the FSVM trade-off parameter, the neighborhood size, and the Gaussian kernel width governing similarity. Kernel-free SVMs avoid kernel selection and tuning by learning nonlinear separation surfaces directly in the feature space. Gao et al. [151] extend this idea with a quartic Double-Well-Potential SVM that captures stronger nonlinearities than quadratic models while remaining kernel-free. The model uses a fourth-degree polynomial created by nesting one quadratic term inside another, producing a more flexible decision surface. Its margin (G-margin) is defined as the sum of the perpendicular distances along the surface normal to the +1 and −1 boundaries, and separation is optimized by minimizing the reciprocal of this margin. The quartic formulation is transformed into a soft-margin quadratic programming problem by expressing the polynomial terms in vector form, then solved with an SMO-type iterative algorithm. 
Computation is simplified by dropping the rank-one constraint, a standard relaxation that improves tractability while retaining competitive empirical accuracy. Reported evidence covers artificial and public benchmarks and modestly sized credit data; training time rises with dataset size, and no tests are reported for sparse, high-dimensional, or truly large-scale data. In Wang et al. [152], the authors define the separating hyperplane of the L0/1-SVM using only the L0/1 support vectors, all of which lie exactly on the support (margin) hyperplanes. In their work, they used a nonconvex and discontinuous soft-margin loss function that counts the number of samples violating the margin. During the iterative optimization, only a dynamically updated active set of samples that are on or close to the margin is employed. The algorithm terminates when the optimization variables reach a state that satisfies the optimality condition of this classifier problem (the P-stationary condition). The resulting method, which achieves a reduced number of support vectors and greater computational efficiency, is applied to binary classification and, in the current work, is restricted to linear decision boundaries, with nonlinear kernel extensions suggested for future research. The Twin Support Vector Machine accelerates training compared to the standard SVM by solving two smaller quadratic programming problems instead of a single large one, yet this efficiency can heighten overfitting when data are noisy or limited. Francis et al. [153] addressed this by incorporating two regularization terms—ambient and intrinsic—into the original formulation. The first limits overall decision-surface complexity in the feature space, while the second preserves neighborhood structure among nearby samples. Together, they yield better generalization through flatter decision boundaries while keeping computation practical.
However, while the model outperforms competing SVM baselines, it uses a one-vs.-one strategy within a kernel-based framework for multiclass classification. To reduce the long training time of SVMs on large-scale datasets, Pimentel et al. [154] proposed two hybrid variants that integrate SphereSVM’s coreset construction [155] and Speed Up SVM’s weak-model selection strategy [156]. SphereSVM iteratively adds violator vectors and updates the enclosing ball to build a representative coreset; it selects the closest weighted point inside and the farthest point outside, and to satisfy optimality, it shifts the center toward the outside point while reducing the weight of the inside point. As a result, it amends a fixed-radius envelope capturing key boundary regions across classes. On the other hand, SU-SVM simplifies scaling by training weak models on small subsets and assigning samples based on how different their predictions are. The first variant, Fusion WSVM, uses SphereSVM coresets to train weak models and retains the most variable points for the final model, whereas CoreWeak SVM (CW SVM) trains directly on the coresets, eliminating the need for additional sampling to improve efficiency. Both variants reduce training time while maintaining accuracy close to that of the standard SVM, and their performance depends on heuristically fixed hyperparameters that may affect the trade-off between efficiency and accuracy. With the aim of improving accuracy through avoiding treating all attributes uniformly, Sowmya et al. [157] introduced the SHiP Vector Machine (Sophisticated High Performance Vector Machine). The model uses the mutual information between each feature and the target class to compute feature weights. Those weights, which indicate the relative importance of each feature, are embedded into the learning process via a weighted kernel transformation, enabling the classifier to focus on informative features while reducing the impact of less relevant ones. 
However, while this reweighted feature space sharpened the decision boundary and improved accuracy, evaluation was limited to small datasets focused on binary Denial-of-Service attacks. Overall, the recent SVM studies address two classical limitations: imbalance/noise sensitivity and computational cost. Imbalance is mitigated by sample weighting or membership mechanisms, but these have been reported to increase hyperparameter sensitivity (regularization, neighborhood, and similarity settings). Efficiency is improved by focusing optimization on margin-near samples, reducing effective support-vector dependence, or training on representative subsets, yet performance is stated to depend on heuristic or additional hyperparameters that govern the efficiency–accuracy trade-off. Kernel-free nonlinear formulations reduce reliance on kernel selection, but reported evidence remains limited for sparse, high-dimensional, and truly large-scale data. Key classical limitations that remain only partially resolved include nonlinear interpretability, multiclass complexity, and scalable validation beyond the studied data regimes. The SVM studies are summarized in Table 9.
Table 9. Recent research on Support Vector Machine (SVM).
Study | Year | Targeted Issue | Key Outcomes
Tao et al. [148] | 2020 | Class imbalance | Top mean ranks over baselines (Friedman–Holm significant); under 10:1 imbalance reaches G-Mean 0.9867, F-Measure 0.9835, AUC 0.9775; robust to outliers, border, and class noise but sensitive to hyperparameters.
Gao et al. [151] | 2021 | Nonlinear decision boundaries | Highest accuracy on artificial data, with the advantage growing in higher dimensions; on benchmarks typically +0.2–1.75 percentage points over the second best; training ≤20 s yet ~1–2 orders of magnitude slower, test time ~10^−4 s per case; top AUC on some credit sets; lower accuracy variance than baselines.
Wang et al. [152] | 2022 | Training efficiency | On 2-D synthetic data with label flips of 0–20%, L0/1-SVM’s test accuracy decreases from ≈97% to ≈78% but remains slightly above competitors; on 14 real datasets it usually achieves the highest accuracy with the fewest support vectors and short training times (~0.57–14.26 s).
Francis et al. [153] | 2022 | Overfitting and generalization | Accuracy about 84.9%, 84.2%, and 86.2%; corresponding precision/recall/F1 about 85.7/84.1/84.9, 85.1/83.3/84.2, and 85.0/86.7/85.9; generally lower or comparable false-positive and false-negative rates than the other evaluated SVM variants.
Pimentel et al. [154] | 2024 | Scalability and training time | On the largest Car Sales sample, training is ~10× faster than full SVM (<2 h vs. 21 h); across simulated and real data, accuracy usually stays within ≤5 percentage points of full SVM while time gains grow with size, with reduced accuracy variance and a competitive time–accuracy trade-off.
Sowmya et al. [157] | 2025 | Improving accuracy | SHiP-RBF achieves 96.44% and 90.12% accuracy on two intrusion-detection tasks.
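As a generic illustration of the membership-weighting idea surveyed in this subsection, the sketch below passes per-sample weights to a standard soft-margin SVM; the inverse-distance-to-centroid weighting (and scikit-learn availability) are assumptions of this sketch, not the affinity/class-probability membership function of the FSVM studies.

```python
# Generic membership-style weighting sketch; assumes scikit-learn is available.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Imbalanced toy data: 80 majority vs. 20 minority samples (assumed setup).
X = np.vstack([rng.normal(-1.5, 1, (80, 2)), rng.normal(1.5, 1, (20, 2))])
y = np.array([0] * 80 + [1] * 20)

# Weight each sample by closeness to its own class centroid, so outliers
# contribute less to the margin; minority samples get an extra frequency boost.
weights = np.empty(len(y))
for c, boost in [(0, 1.0), (1, len(y) / (2 * (y == 1).sum()))]:
    d = np.linalg.norm(X[y == c] - X[y == c].mean(axis=0), axis=1)
    weights[y == c] = boost / (1.0 + d)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y, sample_weight=weights)       # weighted soft-margin fit
minority_recall = (clf.predict(X[y == 1]) == 1).mean()
```

Down-weighted samples incur a smaller slack penalty, so the decision boundary is shaped mainly by representative, centroid-near points, which is the core intuition behind fuzzy membership weighting.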

4.2. Recent Research on K-Nearest Neighbors (KNN)

Although fuzzy KNN improves robustness to boundary noise, its membership degree computation phase significantly increases runtime and memory overhead on large datasets. Among scalable fuzzy KNN approaches, Maillo et al. [158] mitigate these limitations by integrating the Hybrid Spill Tree (HS), which balances accuracy and speed by interleaving exact Metric Tree (MT) splits with approximate Spill Tree (SP) splits. They propose Local HS FKNN (LHS FKNN), in which each Spark partition builds its own HS index, computes fuzzy membership vectors locally, and merges the partial results into a global fuzzy training set. They also present Global Approximate HS FKNN (GAHS FKNN), which constructs an exact “TopTree” from a 0.2% random subsample to estimate the spill-tree overlap parameter. The TopTree is then broadcast to all workers to guide construction of a single HS index over the full dataset. Finally, at classification time both variants build an HS over the fuzzy training set and execute approximate KNN queries to accelerate prediction. Runtime gains were dataset dependent. GAHS can slow in very high dimensions while LHS benefits from more partitions, but overall, both fuzzy HS variants matched or exceeded crisp KNN baselines in accuracy. Gou et al. [159] introduced the representation coefficient-based k-nearest centroid neighbor (RCKNCN), an extension of KNN that selects the k nearest centroid neighbors (actual training samples) using Euclidean distance together with the nearest-centroid neighborhood (NCN) criterion, which accounts for proximity and the spatial distribution of neighbors. They subsequently solved a ridge-regularized least-squares problem to compute representation coefficients quantifying each neighbor’s contribution to reconstructing the test sample. Finally, they used the learned coefficients in a coefficient-weighted voting scheme to infer the sample’s class label. 
Evaluated across different recognition tasks, RCKNCN demonstrates robustness to the choice of k, including in small-sample and noisy settings. However, this comes with a higher per-query cost due to NCN search and solving the coefficient system, resulting in empirically longer runtimes than KNN/WKNN. Ma & Chi [160] introduce PEWM_G KNN, a K-nearest neighbors variant that revises the similarity metric, feature weighting, and voting scheme. First, Euclidean distance, which can be dominated by high-variance attributes, is replaced by the absolute Pearson-correlation similarity, comparing instances based on their linear co-variation. Second, each feature is assigned an entropy-derived weight based on its dispersion, thereby refining its contribution to the similarity measure. Finally, neighbor votes are weighted by a Gaussian kernel, so closer instances carry larger weights while more distant ones taper off smoothly. Overall, the approach outperformed standard KNN in accuracy, with clear gains on larger and more complex datasets, although the K-value is fixed after preliminary experiments. Liu et al. [161] propose a modified KNN classifier that learns the optimal feature-weight vector and neighborhood size k via PL-AOA and applies them within a standardized Euclidean distance. This meta-heuristic enhances parameter tuning by combining a Lévy-flight perturbation strategy, which draws step lengths from a heavy-tailed probability distribution to balance exploration and exploitation, with a parallel computation framework that evolves candidate subpopulations concurrently. Each subgroup evaluates its KNN configurations, then periodically communicates the best solutions among groups to guide the global search. Guided by arithmetic operators and an adaptive control parameter, PL-AOA improves convergence behavior and yields an adaptive KNN variant with higher accuracy in the reported experiments.
However, because each feature adds a weight to be optimized, the search-space dimensionality grows with the number of features, which may hinder scalability on high-dimensional datasets. To reduce KNN sensitivity to noisy or ambiguous data, Kiyak et al. [162] proposed the High-Level KNN (HLKNN) algorithm. In this approach, the neighborhood search is organized as a two-level hierarchy: the first level identifies the nearest neighbors of the query instance, and the second recovers those neighbors' own nearest neighbors (the high-level neighborhood). This framework enables HLKNN to capture both local closeness and more distant relationships among samples. The final classification is performed via majority voting over the aggregated labels from both levels. The approach improved accuracy, at the cost of a computationally more intensive two-level neighborhood search. For binary classification tasks, Lin [163] proposes a KNN variant based on mixed-integer linear programming (MILP) that jointly optimizes feature selection and the odd neighborhood size K to overcome the limitations of sequential tuning. KNN is reformulated as an optimization problem in which binary variables determine which features are active and which neighbors are selected, guided by their contributions to improving accuracy or recall on the training data. The approach is restricted to a predefined set of odd K values, allowing exploration of multiple neighborhood sizes. The MILP identifies the feature–K configuration that maximizes performance using either squared Euclidean distance on normalized data or Hassanat distance on the original scale. Solved once on the training data, it yields an optimized model applied uniformly to testing. Compared with the ensemble-based EA-KNN, which combines several KNNs under the Hassanat distance, the MILP-KNN achieved higher accuracy and recall, though with greater computational cost.
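The two-level neighborhood vote behind HLKNN can be sketched as follows; the brute-force neighbor search and the plain label aggregation are illustrative simplifications of the published algorithm:

```python
import math
from collections import Counter

def knn_indices(X, point, k, exclude=()):
    """Indices of the k training points nearest to `point` (brute force)."""
    order = sorted(
        (math.dist(point, xj), j) for j, xj in enumerate(X) if j not in exclude
    )
    return [j for _, j in order[:k]]

def hlknn_predict(X, y, query, k=3):
    """Two-level (high-level) neighborhood vote, in the spirit of HLKNN.

    Level 1: the query's k nearest neighbors.
    Level 2: each level-1 neighbor's own k nearest neighbors.
    The final label is a majority vote over the labels from both levels.
    """
    level1 = knn_indices(X, query, k)
    labels = [y[j] for j in level1]
    for j in level1:
        level2 = knn_indices(X, X[j], k, exclude={j})
        labels.extend(y[m] for m in level2)
    return Counter(labels).most_common(1)[0][0]
```

The second-level pass is what makes the search computationally heavier than plain KNN: roughly k additional neighbor queries per prediction.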
Taken together, these studies address three classical limitations: scalability of neighbor search, sensitivity to k, and noise/ambiguity in local neighborhoods. Scalability is improved by using distributed or approximate indexing and faster querying; however, reported runtime gains are data-dependent and can degrade in very high dimensions. Sensitivity to k and robustness are improved by revising neighbor selection, similarity metrics, and voting, and by jointly tuning k with feature weighting or feature selection; however, these improvements typically increase per-query computation and can raise overall computational cost, with reduced scalability as the number of features grows. The remaining limitations are only weakly addressed in the reviewed evidence, notably severe class-imbalance bias and the curse of dimensionality, which are partially mitigated rather than resolved. A consolidated per-study summary for KNN appears in Table 10.
Table 10. Recent research on K-nearest neighbors (KNN).
Study | Year | Targeted Issue | Key Outcomes
Maillo et al. [158] | 2020 | Scalability and efficiency | Average accuracy 76.44–77.37% for k = 3, 5, 7 with GAHS-FkNN and LHS-FkNN; on average they outperform crisp kNN variants, with GAHS slowing on high-dimensional data and LHS gaining from increased partitions.
Gou et al. [159] | 2022 | Robustness to K selection | Average accuracy 85.49%, 80.74%, 90.78%, and 87.17% over tabular, time-series, image, and noisy-attribute tasks; RCKNCN shows markedly improved robustness to the choice of k across these domains.
Ma & Chi [160] | 2022 | Similarity measure and feature weighting | Outperforms standard KNN, with accuracy ranges roughly 88–100%, 89–95%, 81–88%, and 96–98%, and greater stability as dataset size increases (accuracy ranges approximate, read from Figures 9–12).
Liu et al. [161] | 2022 | Hyperparameter optimization | Averaged over 30 runs on WSN-DS, kNNPL-AOA attains ACC 99.721%, DR 99.171%, and FPR 6.897%, and it achieves the highest ACC and DR among the four compared kNN-based models.
Kiyak et al. [162] | 2023 | Robustness to noisy/ambiguous data | Average accuracy 81.01% versus 79.76% for standard KNN, with equal or better performance on 26/32 datasets; average precision 0.8129 vs. 0.7611 (>5% improvement) and F-score 0.8111 vs. 0.7779 (>3% improvement).
Lin [163] | 2024 | Feature and K optimization | Achieves higher accuracy than EA-KNN in 6/10 cases and higher or equal recall in 9/10 (with one tie) but is computationally heavier than EA-KNN.

4.3. Recent Research on Decision Tree (DT)

Traditional fuzzy decision trees, such as FRDT, may rely on globally defined fuzzy sets with limited discriminative ability, thereby restricting their capacity to represent the narrower data distributions found in deeper nodes. Cai et al. [164] proposed the fuzzy oblique decision tree (FODT) to address this drawback while preserving the uncertainty-handling capability with more expressive splits. At each layer, combinations of locally recalculated fuzzy sets across multiple features are used to derive rules for the classes represented in that node's data subset. Fuzzy support and fuzzy confidence measures evaluate the candidate rules, retaining the one rule per class with the highest validity. During this iterative process, samples not covered by any rule are passed to an additional node and reprocessed in the next layer. Although the model employs a hyperparameter-controlled, membership-driven rule construction that must be optimized (e.g., via a genetic algorithm) to balance accuracy and tree size, it generally produces more compact trees and demonstrates superior classification accuracy. Wang et al. [165] addressed the challenge of splitting nodes that contain multiple classes in multivariate decision trees. In such cases, existing methods either produce shallow trees by splitting into one child per class or, in binary trees, force classes into two groups that are often not linearly separable. To overcome these limitations, they proposed a binary decision tree (BDTKS) that performs node splitting using K-means clustering to exploit the data's intrinsic structure, and subsequently introduced a centroid-based hyperplane transformation (A-BDTKS) to accelerate classification. Furthermore, within a node, the model uses a condition to determine whether to split and a threshold to control the class proportions. Despite its strong performance, the method can be unstable under certain data distributions owing to clustering initialization. Dhebar & Deb [166] sought to preserve interpretability while maintaining high accuracy through their proposed nonlinear decision tree (NLDT). In this decision tree, each internal node specifies a nonlinear split rule via a hierarchical bilevel optimization framework that leverages Genetic Algorithms at both the structural (upper) and parameter (lower) levels. The upper level evolves polynomial-like structures under complexity constraints, while the lower level adjusts coefficients to maximize node purity. Although methodologically more involved, the resulting binary-splitting classifier achieved accuracy on par with or exceeding CART and SVM, while preserving a compact, interpretable structure. Loyola-González et al. [167] proposed the Voting-Method-Based Decision Tree (VM-DT) to address the fact that no single split-evaluation measure consistently dominates across datasets. In VM-DT, voting methods aggregate the per-measure rankings of candidate splits. The contributing measures are selected greedily: subsets are expanded stepwise and screened with Wilcoxon signed-rank tests against individual measures, while the final comparisons are corroborated through Friedman + Finner and Bayesian analyses. Once the optimal subset of measures is fixed, the authors evaluated three VM-DT variants based on different voting methods: Borda Count, Single Transferable Vote, and Reciprocal Rank. Evaluations conducted on datasets from the UCI and KEEL repositories demonstrated that, within the C4.5 framework, the Borda- and STV-based VM-DT variants achieved the best average AUC and were top-ranked over individual split-evaluation measures, though at the cost of longer training time. Zhang et al. [168] observed that traditional decision trees select branch nodes based on relevance measures such as entropy or Gini index, which capture only the strength of association between features and target.
They instead propose a decision tree that prioritizes causal linkages, growing the tree only when a feature has a statistically significant causal effect on the target. The causal decision tree is built with the Hilbert–Schmidt Independence Criterion (HSIC), chosen for its low estimation bias and robustness to nonlinear interactions at small sample sizes. On several UCI datasets, the causal DT produced trees that were, on average, about 35% shallower while maintaining or slightly improving accuracy. This causality-based design improves interpretability and fairness by preventing splits on sensitive or spurious attributes, but it incurs extra computation because HSIC tests are needed at every node, particularly on large datasets, and a significance threshold must be specified. Collectively, recent decision-tree studies tackle limits in split expressiveness, split-selection reliability, and tree size or overfitting control. They improve representational capacity by using more flexible split rules and locally adapted splitting, often yielding more compact trees and higher accuracy, but at the cost of extra hyperparameters and more complex optimization. To reduce dependence on any single impurity measure, some work aggregates multiple split criteria, improving performance but increasing training time. Other studies prioritize causal rather than purely associative splits to obtain shallower trees, yet this adds node-wise computation and requires setting a significance threshold. Remaining issues not directly targeted in the reviewed studies include high-dimensional and large-scale scalability, instability due to clustering initialization or sparse-node variability, and the classical class-imbalance bias of impurity-based splits. A tabulated digest of the Decision Tree papers is provided in Table 11.
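The node-wise dependence test underlying the causal decision tree can be illustrated with the standard biased empirical HSIC estimator. The Gaussian kernel and fixed bandwidth below are illustrative choices; a full causal tree would additionally run a permutation test against a significance threshold at each candidate node:

```python
import numpy as np

def rbf_kernel(v, gamma=1.0):
    """Gaussian (RBF) kernel matrix of a 1-D sample."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    return np.exp(-gamma * (v - v.T) ** 2)

def hsic(x, y, gamma=1.0):
    """Biased empirical HSIC between two 1-D samples.

    HSIC ~ trace(K H L H) / (n - 1)^2, with centering matrix
    H = I - (1/n) 11^T. Values near 0 indicate (kernel-)independence;
    larger values indicate dependence, including nonlinear dependence.
    """
    n = len(x)
    K, L = rbf_kernel(x, gamma), rbf_kernel(y, gamma)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2
```

In a causal-DT sketch, a feature would be eligible for splitting only if its HSIC with the target (or the associated permutation p-value) clears the user-defined threshold, which is the extra per-node cost discussed above.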
Table 11. Recent research on Decision Tree (DT).
Study | Year | Targeted Issue | Key Outcomes
Cai et al. [164] | 2020 | Nonlinear fuzzy splits and tree compactness | Reduces average MI from 92.88 to 79.64 and TS from 16.41 to 8.54, and against two other oblique trees it lowers MI to 75.24 (vs. 86.79/87.36) at the cost of slightly larger TS (8.13 vs. 5.39/7.03); on 12 datasets, Holm tests show it is significantly more accurate than most standard axis-aligned and fuzzy decision trees but not than C4.5 or an earlier fuzzy tree.
Wang et al. [165] | 2020 | Multiclass splitting under imbalance | Matches or exceeds C4.5 accuracy on all datasets and, compared with CART, is at least two percentage points more accurate on five datasets and at least two points worse only on one ("phishing").
Dhebar & Deb [166] | 2021 | Nonlinear trees and interpretability | Achieves SVM-level accuracy on several benchmarks with far fewer feature appearances and rules, attains single-rule models with top accuracy (including 100% on one engineering task), and confirms scalability by producing compact, interpretable trees even on up to about 500 features.
Loyola-González et al. [167] | 2023 | Split-selection robustness | Borda- and STV-based VM-DT attain average AUC 0.7985 and 0.7962, respectively; Friedman–Finner and Bayesian tests rank VM-DT variants above individual split measures, and their trees have average depths 9.83 and 9.79, node counts 81.89 and 81.09, and training times 14.29 and 13.28 min.
Zhang et al. [168] | 2024 | Causal splitting and interpretability | Produces trees with average depth 5.5 versus 8.6 for the baseline (≈35% reduction) while maintaining accuracy in the range 70.50–97.96% and AUC 63.12–97.97%, improving interpretability and fairness.

4.4. Recent Research on Logistic Regression (LR)

Sheng et al. [169] developed the Subclass-Weighted Logistic Regression (SWLR) model to improve traditional logistic regression while preserving interpretability in high-dimensional, heterogeneous neuroimaging data. After principal component analysis (PCA) removes noise and redundancy, the method forms subclasses via k-means++ clustering. Learning subclass-specific coefficients refines the global logistic regression weights, preserving interpretability while capturing local variations in the data. To stabilize estimation and avoid overfitting, SWLR optimization applies dual L2 regularization to the global and subclass coefficients and minimizes the negative log-likelihood, equivalent to the standard logistic loss under the maximum-likelihood framework. Although hyperparameter sensitivity and the computational cost of clustering affect its performance, SWLR achieved competitive or superior accuracy in Alzheimer's disease classification while maintaining interpretability. Maximum-likelihood logistic regression becomes unstable and extremely sensitive to outliers in real-world image classification, where samples frequently contain sensor noise or occlusions. To address this, Song et al. [170] proposed a doubly robust logistic regression with elastic net regularization (DRLRENR) for robust image classification. The model replaces the MLE with a logistic L2E estimator that minimizes the integrated squared error between empirical and model-based class probabilities, improving robustness to mild outliers. Concurrently, a tensor robust PCA (TRPCA) component preserves the intrinsic tensor structure of the data while recovering its clean low-rank representation and isolating sparse corruptions. Elastic net regularization, combining L1 and L2 penalties to balance sparsity and stability, further stabilizes the classifier.
Parameters are estimated efficiently with the alternating direction method of multipliers (ADMM), which ensures convergence even under the non-convex L2E loss. Despite its effectiveness, the framework remains restricted to binary classification and requires careful tuning of the regularization parameters. Charizanos et al. [171] revisited logistic regression because of the diverging, unstable coefficient estimates that arise when predictor variables nearly perfectly separate the outcome classes (a phenomenon termed data separation), together with biased estimation under class imbalance. The authors developed a probabilistic fuzzy logistic regression framework that combines fuzzified and crisp predictors, outputs, and coefficients, and classifies data using a fuzzy probability threshold. Model coefficients are estimated by a Monte Carlo search that repeatedly samples and evaluates candidate parameter sets, while the binary response is converted into triangular fuzzy numbers. User-set hyperparameters, adjusted through experimentation, define the upper and lower bounds, asymmetry, and degree of fuzziness. Across experiments, the mean absolute error (MAE) proved the most stable optimization criterion. By defuzzifying the coefficients with the center-of-gravity method, the framework delivered consistently good classification performance while keeping the odds ratios interpretable. Training logistic regression with continuous, distance-based cost functions that minimize the gap between predicted probabilities and actual labels does not always improve classification accuracy, given the inherently discrete nature of classification. In response to this limitation, Khashei et al. [172] introduced discrete learning-based logistic regression, whose objective function maximizes the agreement between the sign of the predicted value and the actual class label.
To make the optimization solvable as a mixed-integer linear program, the objective introduces binary variables for each sample representing the two classes and a neutral assignment, constrained so that the binary assignments remain valid. On three benchmark credit-scoring datasets the method was more accurate, though training took longer because discrete optimization is computationally demanding. Genome-wide studies with millions of SNPs and small sample sizes exemplify the curse of dimensionality, destabilizing classical logistic regression because of the many correlated predictors. In Sun [173], the Integrative Functional Logistic Regression (IFLR) framework, a binary functional extension of logistic regression, addresses this challenge by partitioning the genome into consecutive SNP regions and representing each individual's genotypes within a region as a smooth genotype curve using B-spline basis expansions. For each region, a shared effect curve is estimated to describe how genetic variation along that segment influences disease risk. The genotype and effect curves are combined through a functional inner product to compute region-level contributions, which are then aggregated within a logistic regression model to predict phenotype probabilities. Local sparsity and smoothness penalties jointly control dimensionality and keep the estimated effect curves interpretable and biologically coherent. A Newton–Raphson algorithm optimizes the model parameters, and the Bayesian Information Criterion automatically tunes the penalty strength. Although it requires additional computation, IFLR produces strong, interpretable inferences and drastically reduces dimensionality.
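The maximum-likelihood backbone these variants build on, minimizing an L2-penalized negative log-likelihood, can be sketched with plain gradient descent. SWLR's subclass-specific coefficient blocks and the robust or fuzzy objectives above are omitted for brevity; the learning rate and penalty value are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_l2(X, y, lam=0.1, lr=0.5, n_iter=2000):
    """L2-penalized logistic regression by gradient descent.

    Minimizes (1/n) * negative log-likelihood + (lam/2) * ||w||^2,
    the standard regularized logistic loss that SWLR extends with
    additional subclass coefficient blocks (not modeled here).
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        err = p - y                              # gradient of the log-loss
        w -= lr * ((X.T @ err) / n + lam * w)    # penalized weight update
        b -= lr * err.mean()                     # intercept left unpenalized
    return w, b

def predict_proba(X, w, b):
    return sigmoid(X @ w + b)
```

Replacing the log-loss gradient here with a robust (L2E-style) or sign-agreement objective is, at a high level, what the DRLRENR and discrete-learning variants do, at the cost of more involved solvers (ADMM, MILP).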
Recent logistic-regression studies mainly address classical limits in high-dimensional instability/overfitting, outlier sensitivity and separation-related instability, and limitations of the classical formulation under complex or heterogeneous structure, while aiming to retain coefficient interpretability. They stabilize estimation by adding structured representations (subclasses or region-wise effects) with explicit regularization, but report hyperparameter sensitivity and extra computational cost from preprocessing or clustering/partitioning. Robustness is improved by replacing the MLE with robust objectives or by using fuzzy thresholding and search-based estimation, yet these frameworks require careful tuning and are often limited to binary classification. Discrete, classification-aligned optimization can improve accuracy, but it increases training time due to more expensive optimization. Limitations that remain weakly addressed across this set include the classical linear decision boundary and class-imbalance handling beyond thresholding/weighting, along with interpretability trade-offs as added structure and tuning grow. The Logistic Regression studies are collated and summarized in Table 12.
Table 12. Recent research on Logistic Regression (LR).
Study | Year | Targeted Issue | Key Outcomes
Sheng et al. [169] | 2022 | Interpretability in high-dimensional neuroimaging | Attains 89.5–95.8% accuracy, improves classical LR by +6.3–10.4 pp (avg +8.34 pp), yields a much more balanced sensitivity–specificity profile than LR, and achieves best or near-best accuracy compared with recent competing classifiers.
Song et al. [170] | 2023 | High-dimensional image outliers | On a noisy EEG task, DRLRENR attains 83.33% accuracy vs. 66.67–79.17% for baselines; on five face-image tasks with 30%-pixel corruption it achieves 82.81–98.44% accuracy.
Charizanos et al. [171] | 2024 | Separation with class imbalance | On synthetic MAE-optimized models, fuzzy LR averages Spec 0.946, Sens 0.839, F1 0.874, MCC 0.744; on five real datasets, best F1 spans 0.807–0.996 and it drastically reduces Sens = 0/Spec = 1 collapse (45–62.5% vs. 0.9–1.3%) and separation-induced perfect-score runs (35–50% vs. <3%).
Khashei et al. [172] | 2024 | Loss–decision mismatch in LR | Attains 88.07% average accuracy vs. 81.95% for classical LR; on a Japanese dataset it reaches 91.58% vs. 85.33% and generally surpasses recent single and hybrid statistical/intelligent classifiers.
Sun [173] | 2025 | High-dimensional, correlated SNP data in GWAS LR | Lowers misclassification in high-dimensional SNP settings (for example, at n = 5000 its MCR is about 0.19–0.20 vs. 0.32–0.42). In a coronary artery disease GWAS it achieves MCR about 0.29–0.30 vs. 0.46–0.61 while selecting SNPs in genes previously linked to cardiac or metabolic traits.

4.5. Recent Research on Naïve Bayes (NB)

Chen et al. [174] proposed the Selective Naïve Bayes (SNB) classifier to counteract the accuracy loss that occurs when the conditional-independence assumption is violated in practice. The method ranks features by their mutual information with the class, producing a nested sequence of models in which each model adds one feature to its predecessor. An incremental leave-one-out cross-validation procedure then evaluates all nested models in a single extra pass by updating the frequency table as each training tuple is removed, and selects the model with the lowest root-mean-square error (RMSE). Empirical evaluations indicate that SNB generally yields superior predictive accuracy compared to traditional Naïve Bayes. However, because its feature ordering relies only on univariate mutual information with the class, it may retain redundant features that contribute little discriminative power. Zhang et al. [175] present the Attribute- and Instance-Weighted Naïve Bayes (AIWNB), which improves the classic model by combining discriminative attribute and instance weighting. Attribute weights increase when features are strongly correlated with the class but weakly correlated with each other, sharpening the model's ability to discriminate between classes. In the eager version, instance influence is determined during training through an attribute-value frequency heuristic that gives greater importance to instances containing more frequent attribute values. The lazy version instead computes instance weights at prediction time from the number of attribute values a test instance shares with each training instance, functioning as a similarity measure.
By embedding these weights directly into the estimation of class priors and conditional probabilities, AIWNB relaxes the conditional-independence assumption and achieves consistently superior predictive performance over conventional and enhanced Naïve Bayes models, though the lazy variant entails additional computation at classification time. Alizadeh et al. [176] propose the Multi-Independent Latent Component Naïve Bayes (MILC-NB) classifier, which relaxes the conditional-independence assumption of Naïve Bayes while retaining its structural simplicity through component-level independence. An undirected, conditional-mutual-information-weighted feature graph is first constructed and partitioned into non-overlapping clusters. Each cluster is governed by a latent variable that parents its features, forming a component; these components remain mutually independent given the class. Model parameters are estimated via an Expectation–Maximization (EM) procedure that decomposes into independent sub-problems, and latent-state cardinalities are tuned through cross-validated model selection. During inference, each component marginalizes over its latent states when computing class scores, mitigating numerical underflow in high-dimensional spaces. The MILC-NB classifier demonstrates superior predictive accuracy and competitive AUC compared with other Naïve Bayes variants; however, its performance relies on experimentally tuned hyperparameters, and the cross-validation process can be computationally intensive for large-scale data. Because NB struggles with dynamic streams and two-way decisions under uncertainty, Yang et al. [177] proposed a three-way incremental NB (3WD-INB). For continuous features, the method fits ten candidate distributions and selects the one with the minimum Residual Sum of Squares (RSS) to compute class-conditional probabilities. 
In incremental updates, a confidence factor governs whether new instances are added, and two per-class thresholds implement accept/reject with a boundary domain for ambiguous cases. Across seven discrete and eight continuous datasets under fivefold cross-validation, 3WD-INB reported higher F1 and Precision results in most comparisons. Because conventional Naïve Bayes models optimize overall accuracy and treat all features as equally important, their predictions often become biased toward the majority class under imbalance. To overcome this limitation, Kim and Lee [178] proposed the RankOptAUC Naïve Bayes (RNB). Their approach formulates learning as a nonlinear optimization problem, with the objective of maximizing a sigmoid-smoothed approximation of the AUC (area under the ROC curve). The optimizer learns non-negative feature weights and, with a weight-decay hyperparameter, can shrink uninformative weights toward zero; it also oversamples the positive class in the transformed probability-ratio space (using duplication, random oversampling, or SMOTE) to equalize class sizes for the RankOptAUC objective. Across a broad range of imbalanced datasets, RNB on average achieved superior AUC performance compared to NB variants. The current extensions focus on two recurring classical weaknesses: dependence among predictors and class-imbalance bias. Dependence is mitigated through feature selection, attribute/instance weighting, or latent-component grouping, improving accuracy but adding trade-offs such as extra computation at prediction time or computationally intensive tuning/model selection. Other work extends NB for incremental and uncertainty-aware decisions by updating parameters over streams and introducing accept/reject thresholds, at the cost of added thresholding and distribution-selection complexity. 
Imbalance is handled by optimizing AUC-oriented objectives with learned feature weights and resampling, improving minority-sensitive performance but increasing optimization and hyperparameter dependence. Limitations that persist include NB’s typically linear decision surfaces and the fact that stronger dependence-handling often reduces the simplicity that motivates classical NB. A compact summary of the Naïve Bayes papers is given in Table 13.
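The attribute-weighting idea shared by several of these variants reduces, at prediction time, to raising each class-conditional probability to a feature-specific power. A minimal sketch with categorical features and Laplace smoothing follows; the uniform weights in the test below recover plain Naïve Bayes, whereas AIWNB-style methods learn the weights (and may also weight instances):

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """Per-class priors and Laplace-smoothed categorical likelihood tables."""
    n = len(y)
    classes = sorted(set(y))
    priors = {c: sum(1 for t in y if t == c) / n for c in classes}
    # cond[(j, c)] maps a value of feature j to its count under class c
    cond = defaultdict(Counter)
    values = defaultdict(set)
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            cond[(j, c)][v] += 1
            values[j].add(v)
    return classes, priors, cond, values

def weighted_nb_predict(x, classes, priors, cond, values, weights):
    """Attribute-weighted NB score: log P(c) + sum_j w_j * log P(x_j | c).

    All weights equal to 1 gives plain Naive Bayes; down-weighting a
    feature shrinks its influence on the posterior.
    """
    best, best_score = None, -math.inf
    for c in classes:
        score = math.log(priors[c])
        n_c = sum(cond[(0, c)].values())          # class count via feature 0
        for j, v in enumerate(x):
            p = (cond[(j, c)][v] + 1) / (n_c + len(values[j]))  # Laplace
            score += weights[j] * math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best
```

How the weights are learned (correlation-based heuristics, AUC-oriented optimization, etc.) is precisely where the surveyed variants differ.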
Table 13. Recent research on Naïve Bayes (NB).
Study | Year | Targeted Issue | Key Outcomes
Chen et al. [174] | 2020 | Dependence and redundancy among features | Reduces zero–one loss and RMSE more often than it worsens them relative to classical NB (wins/draws/losses 33/14/18 and 36/9/20; p ≈ 0.05), and its average runtime (0.123 s) lies between the fastest and slowest methods.
Zhang et al. [175] | 2021 | Unequal informativeness of attributes and training instances | Lifts mean accuracy to 84.94% and 85.52% (vs. 83.86–84.81% for competing NB variants), never loses in win–tie–loss counts (eager 7–13 wins, 23–29 ties; lazy 7–15 wins, 21–29 ties), and shows clear Wilcoxon dominance with R+ far exceeding R (eager 488.0–555.5 vs. 110.5–142.0; lazy 462.0–591.5 vs. 74.5–168.0).
Alizadeh et al. [176] | 2021 | Correlated high-dimensional features | Average AUC 0.92 and accuracy 0.89, while alternatives stay at or below 0.91 AUC and 0.87 accuracy. It records 98/62 wins–losses in AUC and 124/30 in accuracy, achieves the best Friedman ranks (3.7941, 2.7647; p = 3.08 × 10−6, 0.009), and post hoc tests confirm statistically significant gains.
Yang et al. [177] | 2023 | Uncertain and ambiguous predictions | Attains average F1 of 0.9501/0.9081 and precision of 0.9648/0.9289; versus standard Naïve Bayes on discrete data F1 improves from 0.6364 to 0.9167 and precision from 0.7778 to 1.0000, and versus Gaussian NB on continuous data F1 rises from 0.8036 to 0.9967 and precision from 0.5285 to 0.8850.
Kim & Lee [178] | 2023 | Class imbalance | Achieves average AUC gains over NB of +6.76%, +6.71%, and +6.56%, with the best AUC on eleven of the thirty datasets; Wilcoxon tests report p < 0.05 versus baselines.

4.6. Recent Research on Random Forest (RF)

To enhance robustness and generalization, Gajowniczek et al. [179] proposed a weighted Random Forest in which each tree is assigned a performance-based weight. To this end, they devised a weighted AUC metric that incorporates samples from both inside and outside the bag, places greater importance on observations that were not correctly classified, and includes a tunable parameter to balance generalization error and stability. The ranked trees are then assigned varying weights via an exponential parameter that determines how quickly influence decays across trees. The method's performance was evaluated for alarm classification in life-threatening arrhythmias, though the limited and imbalanced dataset may constrain generalizability. Bi et al. [180] introduced the Clustering Evolutionary Random Forest (CERF), intended for small-sample, high-dimensional scenarios to reduce tree redundancy and insufficient diversity, which frequently lead to overfitting. The primary concept of CERF is to iteratively enhance the ensemble through similarity-based hierarchical clustering of decision trees, retaining from each cluster the tree with the best accuracy on the validation set. The method was evaluated on seventy-two patients from the Alzheimer's Disease Neuroimaging Initiative (ADNI), each possessing thousands of multimodal fusion features. It achieved roughly 90% classification accuracy, although performance depended on the clustering-evolution parameters, such as the number of initial trees, the step size, and the number of evolution rounds. Wan et al. [181] addressed the inefficiency of random forests on large datasets caused by redundant voting behavior among trees. Their improved random forest (IRF) first removes trees with low accuracy and then searches for structural redundancy by identifying similar paths. Paths with the same roots and leaf classes are compared at the node level.
Nodes are considered similar if they split on the same feature within a given tolerance. The overall tree similarity is then computed as a weighted average of the normalized similarity scores of the best-matching paths. Although the model was evaluated only on fault-detection data, it achieved high accuracy and faster prediction. Nonetheless, threshold selection remains heuristic, and there is no established method for feature weighting in high-dimensional data. Jalal et al. [182] enhanced the capabilities of Random Forest in managing high-dimensional and imbalanced text data by introducing the Improved Random Forest for Text Classification (IRFTC). The model iteratively removes less useful features while adaptively adjusting the number of trees. Feature importance is computed by summing each feature's split quality across trees, weighted by out-of-bag error. New trees are added only when the update condition, based on the strength of the forest and the correlation between trees, indicates an accuracy gain. Features outside the top-ranked set form the Bag of Unimportant Features, from which low-scoring features are gradually removed according to a statistical cutoff. Evaluated on several binary and multiclass text datasets, IRFTC outperformed standard Random Forest, SVM, Naïve Bayes, and Logistic Regression, although its pruning relies on a simple heuristic threshold. Tian et al. [183] presented the Graph Random Forest (GRF), engineered for the classification of gene and miRNA expression. This model preserves predictive accuracy while enabling feature selections to form connected subgraphs instead of isolated features, thereby improving interpretability with respect to biological networks.
In the first stage, GRF fits a forest of shallow depth-1 trees and records how often each feature appears as the root; these counts determine the head-splitting nodes and the number of local trees generated in the subsequent stage. Based on a tunable neighborhood size, each head feature is expanded with its neighboring features from the biological graph, encouraging selections that align with the domain’s connectivity structure. Their classifier, tested only on binary datasets, achieved accuracy comparable to the classical Random Forest, yet its added value lies in producing more connected and thus more interpretable feature selections. Recent research efforts on this family target redundancy and overfitting, training and inference efficiency, and robustness in high-dimensional or imbalanced data, with some attention to interpretability. These gains are obtained by reweighting or selecting trees, pruning redundant structure, and feature filtering with adaptive growth, but the reported trade-offs include reliance on heuristic thresholds, added tunable parameters, and limited evidence beyond the tested settings. Persistent limitations include ensemble-level opacity, cost that scales with forest size, and class-imbalance bias unless explicitly corrected. Table 14 provides a consolidated per-study summary of the included Random Forest (RF) variants.
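The rank-based exponential tree weighting described for the weighted Random Forest of [179] can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: plain validation accuracy stands in for their weighted AUC metric, and the decay constant is arbitrary. It only demonstrates how an exponential parameter converts a tree ranking into vote weights.

```python
import math
from collections import Counter

def rank_weights(val_scores, decay=0.5):
    """Exponentially decaying weights by validation rank (rank 0 = best).

    `decay` plays the role of the exponential parameter that controls
    how quickly a tree's influence falls off across the ranking.
    """
    order = sorted(range(len(val_scores)),
                   key=lambda i: val_scores[i], reverse=True)
    weights = [0.0] * len(val_scores)
    for rank, i in enumerate(order):
        weights[i] = math.exp(-decay * rank)
    return weights

def weighted_vote(tree_preds, weights):
    """Weighted majority vote over per-tree class predictions."""
    totals = Counter()
    for label, w in zip(tree_preds, weights):
        totals[label] += w
    return max(totals, key=totals.get)

# Five trees: validation accuracies and their votes on one test sample.
accs = [0.91, 0.72, 0.85, 0.64, 0.88]
preds = ["A", "B", "A", "B", "B"]
w = rank_weights(accs, decay=0.5)
print(weighted_vote(preds, w))  # "A" -- the unweighted majority would say "B"
```

A lower decay flattens the weights toward an ordinary majority vote, while a higher decay concentrates influence in the top-ranked trees, which is the accuracy–stability trade-off the tunable parameter governs.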
Table 14. Recent research on Random Forest (RF).
Study | Year | Targeted Issue | Key Outcomes
Gajowniczek et al. [179] | 2020 | Overfitting and limited generalization in tree ensembles | Yields large, significant score/AUC gains on two tasks (e.g., 61.7 → 80.7 with 0.93 → 0.985 and 30.6 → 86.1 with ≈0.97 → >0.995), but only slight, non-significant score changes (~31.5 → 31.9) and near-baseline AUC (≈0.87–0.99, changes within about ±0.01) on the remaining tasks.
Bi et al. [180] | 2020 | Overfitting in small-sample, high-dimensional data | On AD vs. HC, reaches ~90% accuracy with 340 trees and 7 evolutions, and an RF using the top 290 CERF-selected pairs reaches 91.3%; outperforms unimodal and t-test baselines and generalizes with stable accuracy to EMCI (37/36) and PPMI-PD (55/49) cohorts.
Wan et al. [181] | 2021 | Redundancy and inefficiency in large random forests | Achieves 97.09–98.12% accuracy (≈+0.7–0.8 pp over RF baselines), cuts diagnosis time to 3.1 min (~7× faster), and reduces training and sub-forest optimization times by 51–73% and 49–72% with four workers; it also keeps >91% accuracy at 20% noise and ~80% at 100% noise, and remains slightly better than the RF baselines under imbalance.
Jalal et al. [182] | 2022 | Redundant features in high-dimensional, imbalanced text data | On binary SMS and Hate & Offensive data, IRFTC attains accuracies of 0.922 and 0.940 (≈+2.1 and +5.9 pp over RF), and on multiclass US Airline and Hate Speech it reaches 0.957 and 0.925 (≈+1.4 and +2.3 pp); it also shows the lowest accuracy standard deviations in 10-fold CV.
Tian et al. [183] | 2023 | Interpretability of feature subsets | Matches RF accuracy on NSCLC (0.9457 vs. 0.9483) and hESC (0.9280 vs. 0.9301), while its top-100 features are more connected (NSCLC 20.65 vs. 94.9 components, largest 73.75 vs. 3.7; hESC 31.15 vs. 83.85, largest 67 vs. 7.95) and achieve much higher feature-selection AUC (>0.9 vs. 0.6–0.75) with more connected predictors.

4.7. Summary Across Performance Perspectives

To enhance practical utility and enable rapid method selection, we present a compact cross-family synthesis that compares the six classical classifiers across the performance perspectives used in Section 3 and Section 4. Table 15 summarizes each family using a consistent “baseline → mitigation trend → remaining gap” synopsis, grounded in the methodological characteristics, foundational variants, and recent advances reviewed.

5. Discussion and Future Research Directions

Grounded in the comparative findings for the six families under the shared methodological perspectives, we outline six focused directions:
  • In Random Forest classifiers, performance gains rely on randomness from bootstrap sampling of instances and random subspace selection of features across individual trees. Although Random Forests can be more computationally demanding than simpler models, they do not uniformly outperform alternative classifiers across tasks and datasets [184]. Future research could allocate region-specific ensemble sizes to parts of the feature space (for example, subsets of instances defined by a clear procedure such as out-of-bag margin diagnostics, leaf-neighborhood error, or clustering near class boundaries [132,185,186]), particularly where complex boundaries arise from overlap, imbalance, noise, outliers, or proximity to class margins. In such cases, allocating more trees to difficult regions and fewer to easy ones, under a fixed total-tree budget, may yield a controlled accuracy–compute trade-off [186,187]. We are not aware of any method that, during training within a single forest, explicitly allocates different numbers of trees to distinct regions; existing variants instead adjust capacity globally, reweight or prune trees, fit local forests, or use dynamic ensemble selection at prediction time. Pursuing this direction may yield higher accuracy under fixed training and inference budgets, with complexity that is controllable and justifiable.
  • Hyperparameters are central determinants of whether a classifier underfits, overfits, or generalizes effectively [188]. However, despite the availability of automated tuning methods, they are still often selected heuristically using domain expertise or tuned empirically [189,190]. A promising direction is to regularize these quantities using signals derived directly from training instances rather than relying on decision-boundary alignment or aggregate accuracy metrics. For instance, one could estimate an instance error ratio for each misclassified example by comparing it with a peer instance in the same class that is correctly classified with high confidence. This quantifies each instance’s relative contribution to training and allows these values to directly guide optimization. While related to instance weighting and prototype-based methods, this mechanism differs by offering a boundary-agnostic, exemplar-driven signal that can be incorporated into the training of traditional classifiers. This direction may reduce the number of optimization iterations and the overall fitting time, but computing per-instance signals introduces additional overhead, so the net benefit would need to be demonstrated empirically [191].
  • A large proportion of recent studies benchmark novel classification methods against well-established algorithms, typically demonstrating relative performance improvements using standard metrics such as accuracy, F1-score, and AUC. Yet the influence of preprocessing (e.g., normalization, handling missing values, noise reduction) and feature selection techniques is often overlooked in such evaluations [47,192]. Because feature values and dimensionality directly influence how instances align with a classifier’s decision boundary, preprocessing strategies should be systematically included in performance assessments, particularly under noisy or imbalanced conditions where outcomes can be disproportionately affected [193]. A similar concern applies to dimensionality reduction methods (e.g., PCA, LDA, t-SNE), which can substantially reshape the feature space and thereby alter classification outcomes [194]. Future research should therefore prioritize integrated evaluation frameworks that jointly assess classifiers together with preprocessing, feature selection, and dimensionality-reduction pipelines, enabling benchmarking that more faithfully reflects the complexities of real-world data and the robustness of algorithms across diverse scenarios [195].
  • The perceived contribution and generalizability of numerous proposed methodologies can be restricted by the challenges that real-world datasets frequently present, including high dimensionality, overlapping classes, imbalanced instance ratios, and scalability issues [54,196,197]. One major reason is that these methods are often fine-tuned for performance in specific experimental settings, and the benchmark datasets used to test them may not accurately represent the variety and complexity of real-world situations [195]. To fill this gap, future research could look into modular or hybrid models that can still be understood at the component level. Such models could integrate dedicated mechanisms to handle distinct challenges when shaping the decision boundary. For example, one component might regulate the trade-off between accuracy and overfitting in regions with a relatively simple structure, while another could activate synthetic resampling to mitigate imbalance. An additional adaptive parameter could be designed to adjust the boundary flexibly in highly non-linear regions. With such next-generation modular models, classification methods may achieve greater robustness on real-world datasets and extend their applicability to unforeseen or emerging cases.
  • In many classification methods, explicitly representing the learned decision boundary in the feature space is challenging, especially in high-dimensional settings where interpretability is constrained [198]. For example, ensemble approaches such as random forests induce fragmented, nonparametric boundaries, and probabilistic models such as Naïve Bayes yield class-conditional rules that are not easily visualized in high-dimensional spaces. Linear SVMs can yield explicit hyperplanes, but multiclass extensions and kernelized variants often increase computational complexity. Future research could investigate representing decision boundaries via per-class boundary instances. For each class, the decision boundary is explicitly encoded by a small set of class-specific border instances. These boundary instances extend beyond one-vs.-one oppositions and form a comprehensive per-class boundary structure that governs membership decisions [44,199]. From these sets, compact per-class summaries (for example, a handful of representative border examples or simple local rules) can be derived with parallel procedures, lowering training and refresh cost while yielding more interpretable and defensible decision boundaries. When new classes emerge, models could be updated gradually rather than retrained end to end, with precautions to preserve established boundaries. To ensure stability, adaptive mechanisms should minimize redundant updates and manage boundary drift for additional samples within existing classes [200].
  • The distributional characteristics of instances in the feature space, as indicated by their shapes and regional densities, are crucial for classifier selection and may substantially affect the computational effort required during the learning process [201]. A potentially fruitful research direction is to derive model requirements dynamically from the inherent characteristics of the data distribution, rather than relying solely on experimental parameter tuning to identify the optimal configuration for each distribution [202]. Standard deviation (capturing feature spread), covariance (capturing relationships between variables), and empirically estimated probability density functions could all serve as signals for dynamically guiding the learning process [203]. In other words, a promising path is to reduce reliance on search-heavy tuning in favor of methods that adapt automatically to the data distribution. This is consistent with recent advances in meta-learning and automated machine learning (AutoML) and extends these techniques toward more direct, distribution-driven adaptation [204].
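The exemplar-driven signal proposed in the second direction above can be made concrete with a toy computation. Everything in the sketch below is a hypothetical illustration rather than an established definition of the instance error ratio: the Euclidean distance, the confidence threshold, and the normalization by the average distance to confident peers are all our assumptions.

```python
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def instance_error_ratio(x_mis, peers, correct, conf, conf_min=0.9):
    """Hypothetical per-instance signal for a misclassified example.

    Compares `x_mis` with same-class peers that are correctly classified
    (`correct[i]` is True) with confidence >= `conf_min`, returning the
    distance to the nearest confident peer normalized by the average
    distance to all confident peers. Returns None if no such peer exists.
    """
    confident = [p for p, ok, c in zip(peers, correct, conf)
                 if ok and c >= conf_min]
    if not confident:
        return None
    dists = [euclid(x_mis, p) for p in confident]
    mean_d = sum(dists) / len(dists)
    return min(dists) / mean_d if mean_d > 0 else 0.0

# Toy example: a misclassified instance at (3, 0) and three same-class peers,
# of which only two are correctly classified with high enough confidence.
x_mis = (3.0, 0.0)
peers = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
correct = [True, True, True]
conf = [0.95, 0.80, 0.92]
print(instance_error_ratio(x_mis, peers, correct, conf))  # 0.5
```

A small ratio would indicate a misclassified instance sitting close to confidently correct exemplars of its class, a boundary-agnostic cue that could be fed into training as an instance weight or a tuning signal.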

Author Contributions

Conceptualization, G.B.; methodology, A.H.A. (Ali Hussein Alshammari); formal analysis, A.H.A. (Almashhadani Hasnain Ali); investigation, A.H.A. (Ali Hussein Alshammari); resources, G.B., A.H.A. (Almashhadani Hasnain Ali) and A.H.A. (Ali Hussein Alshammari); writing—original draft preparation, A.H.A. (Ali Hussein Alshammari); writing—review and editing, A.H.A. (Ali Hussein Alshammari) and A.H.A. (Almashhadani Hasnain Ali); supervision, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no potential conflicts of interest.

References

  1. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef] [PubMed]
  2. Bzdok, D.; Krzywinski, M.; Altman, N. Machine learning: Supervised methods. Nat. Methods 2018, 15, 5. [Google Scholar] [CrossRef] [PubMed]
  3. Wu, X.; Kumar, V.; Ross Quinlan, J.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan, G.J.; Ng, A.; Liu, B.; Yu, P.S. Top 10 algorithms in data mining. Knowl. Inf. Syst. 2008, 14, 1–37. [Google Scholar] [CrossRef]
  4. Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160. [Google Scholar] [CrossRef]
  5. Hollmann, N.; Müller, S.; Purucker, L.; Krishnakumar, A.; Körfer, M.; Hoo, S.B.; Schirrmeister, R.T.; Hutter, F. Accurate predictions on small data with a tabular foundation model. Nature 2025, 637, 319–326. [Google Scholar] [CrossRef]
  6. Smith, A.M.; Walsh, J.R.; Long, J.; Davis, C.B.; Henstock, P.; Hodge, M.R.; Maciejewski, M.; Mu, X.J.; Ra, S.; Zhao, S. Standard machine learning approaches outperform deep representation learning on phenotype prediction from transcriptomics data. BMC Bioinform. 2020, 21, 119. [Google Scholar] [CrossRef]
  7. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  8. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  9. Xie, F.; Zhou, J.; Lee, J.W.; Tan, M.; Li, S.; Rajnthern, L.S.O.; Chee, M.L.; Chakraborty, B.; Wong, A.-K.I.; Dagan, A. Benchmarking emergency department prediction models with machine learning and public electronic health records. Sci. Data 2022, 9, 658. [Google Scholar] [CrossRef]
  10. Dumitrescu, E.; Hué, S.; Hurlin, C.; Tokpavi, S. Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects. Eur. J. Oper. Res. 2022, 297, 1178–1192. [Google Scholar] [CrossRef]
  11. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A survey on text classification: From traditional to deep learning. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 1–41. [Google Scholar] [CrossRef]
  12. Wang, P.; Fan, E.; Wang, P. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognit. Lett. 2021, 141, 61–67. [Google Scholar] [CrossRef]
  13. Zhang, C.; Jia, D.; Wang, L.; Wang, W.; Liu, F.; Yang, A. Comparative research on network intrusion detection methods based on machine learning. Comput. Secur. 2022, 121, 102861. [Google Scholar] [CrossRef]
  14. Adugna, T.; Xu, W.; Fan, J. Comparison of random forest and support vector machine classifiers for regional land cover mapping using coarse resolution FY-3C images. Remote Sens. 2022, 14, 574. [Google Scholar] [CrossRef]
  15. Theissler, A.; Pérez-Velázquez, J.; Kettelgerdes, M.; Elger, G. Predictive maintenance enabled by machine learning: Use cases and challenges in the automotive industry. Reliab. Eng. Syst. Saf. 2021, 215, 107864. [Google Scholar] [CrossRef]
  16. Wolpert, D.H. The lack of a priori distinctions between learning algorithms. Neural Comput. 1996, 8, 1341–1390. [Google Scholar] [CrossRef]
  17. Fernández-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 2014, 15, 3133–3181. [Google Scholar]
  18. Cervantes, J.; Garcia-Lamont, F.; Rodríguez-Mazahua, L.; Lopez, A. A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing 2020, 408, 189–215. [Google Scholar] [CrossRef]
  19. Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227. [Google Scholar] [CrossRef]
  20. Mienye, I.D.; Jere, N. A survey of decision trees: Concepts, algorithms, and applications. IEEE Access 2024, 12, 86716–86727. [Google Scholar] [CrossRef]
  21. de Hond, A.A.; Leeuwenberg, A.M.; Hooft, L.; Kant, I.M.; Nijman, S.W.; van Os, H.J.; Aardoom, J.J.; Debray, T.P.; Schuit, E.; van Smeden, M. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: A scoping review. NPJ Digit. Med. 2022, 5, 2. [Google Scholar] [CrossRef] [PubMed]
  22. Woodman, R.J.; Mangoni, A.A. A comprehensive review of machine learning algorithms and their application in geriatric medicine: Present and future. Aging Clin. Exp. Res. 2023, 35, 2363–2397. [Google Scholar] [CrossRef] [PubMed]
  23. Menghani, G. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  24. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. [Google Scholar]
  25. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  26. Wang, J.; Zhao, Z.; Zhu, J.; Li, X.; Dong, F.; Wan, S. Improved support vector machine for voiceprint diagnosis of typical faults in power transformers. Machines 2023, 11, 539. [Google Scholar] [CrossRef]
  27. Kalita, D.J.; Singh, S. SVM Hyper-parameters optimization using quantized multi-PSO in dynamic environment. Soft Comput. Fusion Found. Methodol. Appl. 2020, 24, 1225. [Google Scholar] [CrossRef]
  28. Chang, C.-C.; Lin, C.-J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2011, 2, 1–27. [Google Scholar] [CrossRef]
  29. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
  30. Luo, Y.; Tseng, H.-H.; Cui, S.; Wei, L.; Ten Haken, R.K.; El Naqa, I. Balancing accuracy and interpretability of machine learning approaches for radiation treatment outcomes modeling. BJR Open 2019, 1, 20190021. [Google Scholar] [CrossRef]
  31. Sayeed, M.A.; Rahman, A.; Rahman, A.; Rois, R. On the interpretability of the SVM model for predicting infant mortality in Bangladesh. J. Health Popul. Nutr. 2024, 43, 170. [Google Scholar] [CrossRef]
  32. Rezvani, S.; Pourpanah, F.; Lim, C.P.; Wu, Q. Methods for class-imbalanced learning with support vector machines: A review and an empirical evaluation. arXiv 2024, arXiv:2406.03398. [Google Scholar] [CrossRef]
  33. Iranmehr, A.; Masnadi-Shirazi, H.; Vasconcelos, N. Cost-sensitive support vector machines. Neurocomputing 2019, 343, 50–64. [Google Scholar] [CrossRef]
  34. Steinwart, I. Sparseness of support vector machines. J. Mach. Learn. Res. 2003, 4, 1071–1105. [Google Scholar]
  35. Chapelle, O.; Vapnik, V.; Bousquet, O.; Mukherjee, S. Choosing multiple parameters for support vector machines. Mach. Learn. 2002, 46, 131–159. [Google Scholar] [CrossRef]
  36. Hofmann, T.; Schölkopf, B.; Smola, A.J. Kernel methods in machine learning. Ann. Statist. 2008, 36, 1171–1220. [Google Scholar] [CrossRef]
  37. Azzeh, M.; Elsheikh, Y.; Nassif, A.B.; Angelis, L. Examining the performance of kernel methods for software defect prediction based on support vector machine. Sci. Comput. Program. 2023, 226, 102916. [Google Scholar] [CrossRef]
  38. Piccialli, V.; Sciandrone, M. Nonlinear optimization and support vector machines. Ann. Oper. Res. 2022, 314, 15–47. [Google Scholar] [CrossRef]
  39. Schölkopf, B.; Smola, A.J.; Williamson, R.C.; Bartlett, P.L. New support vector algorithms. Neural Comput. 2000, 12, 1207–1245. [Google Scholar] [CrossRef]
  40. Suykens, J.A.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300. [Google Scholar] [CrossRef]
  41. Khemchandani, R.; Chandra, S. Twin support vector machines for pattern classification. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 905–910. [Google Scholar] [CrossRef]
  42. Zhu, J.; Rosset, S.; Tibshirani, R.; Hastie, T. 1-norm support vector machines. Adv. Neural Inf. Process. Syst. 2003, 16. [Google Scholar]
  43. Veropoulos, K.; Campbell, C.; Cristianini, N. Controlling the sensitivity of support vector machines. In Proceedings of the International Joint Conference on AI, Stockholm, Sweden, 31 July–6 August 1999; p. 60. [Google Scholar]
  44. Crammer, K.; Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2001, 2, 265–292. [Google Scholar]
  45. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
  46. Aha, D.W.; Kibler, D.; Albert, M.K. Instance-based learning algorithms. Mach. Learn. 1991, 6, 37–66. [Google Scholar] [CrossRef]
  47. Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524. [Google Scholar] [CrossRef]
  48. Blanco-Mallo, E.; Morán-Fernández, L.; Remeseiro, B.; Bolón-Canedo, V. Do all roads lead to Rome? Studying distance measures in the context of machine learning. Pattern Recognit. 2023, 141, 109646. [Google Scholar] [CrossRef]
  49. Jia, H.; Cheung, Y.-m.; Liu, J. A new distance metric for unsupervised learning of categorical data. IEEE Trans. Neural Netw. Learn. Syst. 2015, 27, 1065–1079. [Google Scholar] [CrossRef]
  50. Hall, P.; Park, B.U.; Samworth, R.J. Choice of neighbor order in nearest-neighbor classification. Ann. Statist. 2008, 36, 2135–2152. [Google Scholar] [CrossRef]
  51. Zhang, S.; Li, X.; Zong, M.; Zhu, X.; Cheng, D. Learning k for knn classification. ACM Trans. Intell. Syst. Technol. (TIST) 2017, 8, 1–19. [Google Scholar] [CrossRef]
  52. Gadat, S.; Klein, T.; Marteau, C. Classification in general finite dimensional spaces with the k-nearest neighbor rule. Ann. Statist. 2016, 44, 982–1009. [Google Scholar] [CrossRef]
  53. François, D.; Wertz, V.; Verleysen, M. The concentration of fractional distances. IEEE Trans. Knowl. Data Eng. 2007, 19, 873–886. [Google Scholar] [CrossRef]
  54. Radovanovic, M.; Nanopoulos, A.; Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. J. Mach. Learn. Res. 2010, 11, 2487–2531. [Google Scholar]
  55. Györfi, L.; Weiss, R. Universal consistency and rates of convergence of multiclass prototype algorithms in metric spaces. J. Mach. Learn. Res. 2021, 22, 1–25. [Google Scholar]
  56. Döring, M.; Györfi, L.; Walk, H. Rate of Convergence of k-Nearest-Neighbor Classification Rule. J. Mach. Learn. Res. 2018, 18, 1–16. [Google Scholar]
  57. Lu, S.-C.; Swisher, C.L.; Chung, C.; Jaffray, D.; Sidey-Gibbons, C. On the importance of interpretable machine learning predictions to inform clinical decision making in oncology. Front. Oncol. 2023, 13, 1129380. [Google Scholar] [CrossRef]
  58. Chen, G.H.; Shah, D. Explaining the success of nearest neighbor methods in prediction. Found. Trends® Mach. Learn. 2018, 10, 337–588. [Google Scholar] [CrossRef]
  59. Mullick, S.S.; Datta, S.; Das, S. Adaptive learning-based k-nearest neighbor classifiers with resilience to class imbalance. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5713–5725. [Google Scholar] [CrossRef]
  60. Zhang, X.; Li, Y.; Kotagiri, R.; Wu, L.; Tari, Z.; Cheriet, M. KRNN: K rare-class nearest neighbour classification. Pattern Recognit. 2017, 62, 33–44. [Google Scholar] [CrossRef]
  61. Muja, M.; Lowe, D.G. Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2227–2240. [Google Scholar] [CrossRef]
  62. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 2007, 3, 408–421. [Google Scholar] [CrossRef]
  63. Dudani, S.A. The distance-weighted k-nearest-neighbor rule. IEEE Trans. Syst. Man Cybern. 1976, 4, 325–327. [Google Scholar] [CrossRef]
  64. Keller, J.M.; Gray, M.R.; Givens, J.A. A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. 1985, 4, 580–585. [Google Scholar] [CrossRef]
  65. Sun, S.; Huang, R. An adaptive k-nearest neighbor algorithm. In Proceedings of the 2010 Seventh International Conference on Fuzzy Systems And Knowledge Discovery, Yantai, China, 10–12 August 2010; IEEE: New York City, NY, USA, 2010; pp. 91–94. [Google Scholar]
  66. Weinberger, K.Q.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 2009, 10, 207–244. [Google Scholar]
  67. Hart, P. The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 1968, 14, 515–516. [Google Scholar] [CrossRef]
  68. Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
  69. Breiman, L.; Friedman, J.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Chapman and Hall/CRC: Boca Raton, FL, USA, 2017. [Google Scholar]
  70. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
  71. Costa, V.G.; Pedreira, C.E. Recent advances in decision trees: An updated survey. Artif. Intell. Rev. 2023, 56, 4765–4800. [Google Scholar] [CrossRef]
  72. Zhang, G.; Gionis, A. Regularized impurity reduction: Accurate decision trees with complexity guarantees. Data Min. Knowl. Discov. 2023, 37, 434–475. [Google Scholar] [CrossRef]
  73. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674. [Google Scholar] [CrossRef]
  74. Klusowski, J.M.; Tian, P.M. Large scale prediction with decision trees. J. Am. Stat. Assoc. 2024, 119, 525–537. [Google Scholar] [CrossRef]
  75. Lazebnik, T.; Bunimovich-Mendrazitsky, S. Decision tree post-pruning without loss of accuracy using the SAT-PP algorithm with an empirical evaluation on clinical data. Data Knowl. Eng. 2023, 145, 102173. [Google Scholar] [CrossRef]
  76. Liu, W.; Tsang, I.W. Making decision trees feasible in ultrahigh feature and label dimensions. J. Mach. Learn. Res. 2017, 18, 1–36. [Google Scholar]
  77. Loh, W.Y. Fifty years of classification and regression trees. Int. Stat. Rev. 2014, 82, 329–348. [Google Scholar] [CrossRef]
  78. Esposito, F.; Malerba, D.; Semeraro, G.; Kay, J. A comparative analysis of methods for pruning decision trees. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 476–491. [Google Scholar] [CrossRef]
  79. Cieslak, D.A.; Hoens, T.R.; Chawla, N.V.; Kegelmeyer, W.P. Hellinger distance decision trees are robust and skew-insensitive. Data Min. Knowl. Discov. 2012, 24, 136–158. [Google Scholar] [CrossRef]
  80. Zhu, Y.; Li, C.; Dunson, D.B. Classification trees for imbalanced data: Surface-to-volume regularization. J. Am. Stat. Assoc. 2023, 118, 1707–1717. [Google Scholar] [CrossRef]
  81. Gajowniczek, K.; Ząbkowski, T. ImbTreeEntropy and ImbTreeAUC: Novel R packages for decision tree learning on the imbalanced datasets. Electronics 2021, 10, 657. [Google Scholar] [CrossRef]
  82. Kass, G.V. An exploratory technique for investigating large quantities of categorical data. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1980, 29, 119–127. [Google Scholar] [CrossRef]
  83. Domingos, P.; Hulten, G. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 20–23 August 2000; pp. 71–80. [Google Scholar]
  84. Mehta, M.; Agrawal, R.; Rissanen, J. SLIQ: A fast scalable classifier for data mining. In International Conference on Extending Database Technology; Springer: Berlin/Heidelberg, Germany, 1996; pp. 18–32. [Google Scholar]
  85. Shafer, J.; Agrawal, R.; Mehta, M. SPRINT: A Scalable Parallel Classifier for Data Mining; Vldb: San Jose, CA, USA, 1996; pp. 544–555. [Google Scholar]
  86. Cox, D.R. The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B Stat. Methodol. 1958, 20, 215–232. [Google Scholar] [CrossRef]
  87. Nelder, J.A.; Wedderburn, R.W. Generalized linear models. J. R. Stat. Soc. Ser. A Stat. Soc. 1972, 135, 370–384. [Google Scholar] [CrossRef]
  88. Theil, H. A multinomial extension of the linear logit model. In Henri Theil’s Contributions to Economics and Econometrics: Econometric Theory and Methodology; Springer: Berlin/Heidelberg, Germany, 1992; pp. 181–191. [Google Scholar]
  89. Lin, C.-J.; Weng, R.C.; Keerthi, S.S. Trust region Newton method for large-scale logistic regression. J. Mach. Learn. Res. 2008, 9, 627–650. [Google Scholar]
  90. Riley, R.D.; Snell, K.I.; Ensor, J.; Burke, D.L.; Harrell, F.E., Jr.; Moons, K.G.; Collins, G.S. Minimum sample size for developing a multivariable prediction model: PART II-binary and time-to-event outcomes. Stat. Med. 2019, 38, 1276–1296. [Google Scholar] [CrossRef] [PubMed]
  91. Mansournia, M.A.; Geroldinger, A.; Greenland, S.; Heinze, G. Separation in logistic regression: Causes, consequences, and control. Am. J. Epidemiol. 2018, 187, 864–870. [Google Scholar] [CrossRef] [PubMed]
  92. Ostrovskii, D.M.; Bach, F. Finite-sample analysis of M-estimators using self-concordance. Electron. J. Statist. 2021, 15, 326–391. [Google Scholar] [CrossRef]
  93. Sur, P.; Candès, E.J. A modern maximum-likelihood theory for high-dimensional logistic regression. Proc. Natl. Acad. Sci. USA 2019, 116, 14516–14525. [Google Scholar] [CrossRef]
  94. Vatcheva, K.P.; Lee, M.; McCormick, J.B.; Rahbar, M.H. Multicollinearity in regression analyses conducted in epidemiologic studies. Epidemiology 2016, 6, 227. [Google Scholar] [CrossRef]
  95. Norton, E.C.; Dowd, B.E. Log odds and the interpretation of logit models. Health Serv. Res. 2018, 53, 859–878. [Google Scholar] [CrossRef]
  96. Van Calster, B.; McLernon, D.J.; Van Smeden, M.; Wynants, L.; Steyerberg, E.W. Calibration: The Achilles heel of predictive analytics. BMC Med. 2019, 17, 230. [Google Scholar] [CrossRef]
  97. Moons, K.G.; Altman, D.G.; Reitsma, J.B.; Ioannidis, J.P.; Macaskill, P.; Steyerberg, E.W.; Vickers, A.J.; Ransohoff, D.F.; Collins, G.S. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and elaboration. Ann. Intern. Med. 2015, 162, W1–W73. [Google Scholar] [CrossRef]
  98. Murdoch, W.J.; Singh, C.; Kumbier, K.; Abbasi-Asl, R.; Yu, B. Definitions, methods, and applications in interpretable machine learning. Proc. Natl. Acad. Sci. USA 2019, 116, 22071–22080. [Google Scholar] [CrossRef]
  99. King, G.; Zeng, L. Logistic regression in rare events data. Political Anal. 2001, 9, 137–163. [Google Scholar] [CrossRef]
  100. Esposito, C.; Landrum, G.A.; Schneider, N.; Stiefl, N.; Riniker, S. GHOST: Adjusting the decision threshold to handle imbalanced data in machine learning. J. Chem. Inf. Model. 2021, 61, 2623–2640. [Google Scholar] [CrossRef]
  101. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef]
  102. Bewick, V.; Cheek, L.; Ball, J. Statistics review 14: Logistic regression. Crit. Care 2005, 9, 112. [Google Scholar] [CrossRef] [PubMed]
  103. Cessie, S.L.; Houwelingen, J.V. Ridge estimators in logistic regression. J. R. Stat. Soc. Ser. C Appl. Stat. 1992, 41, 191–201. [Google Scholar] [CrossRef]
  104. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  105. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
  106. McCullagh, P. Regression models for ordinal data. J. R. Stat. Soc. Ser. B Methodol. 1980, 42, 109–127. [Google Scholar] [CrossRef]
  107. Firth, D. Bias reduction of maximum likelihood estimates. Biometrika 1993, 80, 27–38. [Google Scholar] [CrossRef]
  108. McFadden, D. Conditional Logit Analysis of Qualitative Choice Behavior; University of California, Berkeley: Berkeley, CA, USA, 1972. [Google Scholar]
  109. Hastie, T.; Tibshirani, R. Generalized additive models. Stat. Sci. 1986, 1, 297–310. [Google Scholar] [CrossRef]
  110. Domingos, P.; Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Mach. Learn. 1997, 29, 103–130. [Google Scholar] [CrossRef]
  111. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4. [Google Scholar]
  112. Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. (CSUR) 2002, 34, 1–47. [Google Scholar] [CrossRef]
  113. Xu, S. Bayesian Naïve Bayes classifiers to text classification. J. Inf. Sci. 2018, 44, 48–59. [Google Scholar] [CrossRef]
  114. Chen, S.F.; Goodman, J. An empirical study of smoothing techniques for language modeling. Comput. Speech Lang. 1999, 13, 359–394. [Google Scholar] [CrossRef]
  115. Wang, W.; Duan, Y.; Cao, L.; Jiang, Z. Application of improved Naive Bayes classification algorithm in 5G signaling analysis. J. Supercomput. 2023, 79, 6941. [Google Scholar] [CrossRef]
  116. Raizada, R.D.; Lee, Y.-S. Smoothness without smoothing: Why Gaussian naive Bayes is not naive for multi-subject searchlight studies. PLoS ONE 2013, 8, e69566. [Google Scholar] [CrossRef]
117. Pajila, P.B.; Sheena, B.G.; Gayathri, A.; Aswini, J.; Nalini, M. A comprehensive survey on naive bayes algorithm: Advantages, limitations and applications. In Proceedings of the 2023 4th International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 20–22 September 2023; IEEE: New York, NY, USA, 2023; pp. 1228–1234. [Google Scholar]
  118. Fang, X. Inference-based naive bayes: Turning naive bayes cost-sensitive. IEEE Trans. Knowl. Data Eng. 2012, 25, 2302–2313. [Google Scholar] [CrossRef]
119. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4–6 August 2001; pp. 41–46. [Google Scholar]
  120. Abellán, J.; Castellano, J.G. Improving the Naive Bayes classifier via a quick variable selection method using maximum of entropy. Entropy 2017, 19, 247. [Google Scholar] [CrossRef]
  121. Berend, D.; Kontorovich, A. A finite sample analysis of the Naive Bayes classifier. J. Mach. Learn. Res. 2015, 16, 1519–1545. [Google Scholar]
  122. Rahnama, A.H.A.; Bütepage, J.; Geurts, P.; Boström, H. Can local explanation techniques explain linear additive models? Data Min. Knowl. Discov. 2024, 38, 237–280. [Google Scholar] [CrossRef]
  123. Nagahisarchoghaei, M.; Nur, N.; Cummins, L.; Nur, N.; Karimi, M.M.; Nandanwar, S.; Bhattacharyya, S.; Rahimi, S. An empirical survey on explainable ai technologies: Recent trends, use-cases, and categories from technical and application perspectives. Electronics 2023, 12, 1092. [Google Scholar] [CrossRef]
  124. Lu, Y.; Cheung, Y.-M.; Tang, Y.Y. Bayes imbalance impact index: A measure of class imbalanced data set for classification problem. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3525–3539. [Google Scholar] [CrossRef] [PubMed]
  125. Blanquero, R.; Carrizosa, E.; Ramírez-Cobo, P.; Sillero-Denamiel, M.R. Constrained Naïve Bayes with application to unbalanced data classification. Cent. Eur. J. Oper. Res. 2022, 30, 1403–1425. [Google Scholar] [CrossRef]
  126. Treder, M.S. MVPA-light: A classification and regression toolbox for multi-dimensional data. Front. Neurosci. 2020, 14, 289. [Google Scholar] [CrossRef] [PubMed]
  127. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian network classifiers. Mach. Learn. 1997, 29, 131–163. [Google Scholar] [CrossRef]
  128. Webb, G.I.; Boughton, J.R.; Wang, Z. Not so naive Bayes: Aggregating one-dependence estimators. Mach. Learn. 2005, 58, 5–24. [Google Scholar] [CrossRef]
  129. Rennie, J.D.; Shih, L.; Teevan, J.; Karger, D.R. Tackling the poor assumptions of naive bayes text classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 616–623. [Google Scholar]
  130. John, G.H.; Langley, P. Estimating continuous distributions in Bayesian classifiers. arXiv 2013, arXiv:1302.4964. [Google Scholar] [CrossRef]
  131. Langley, P.; Sage, S. Induction of selective Bayesian classifiers. In Uncertainty in Artificial Intelligence; Elsevier: Amsterdam, The Netherlands, 1994; pp. 399–406. [Google Scholar]
  132. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  133. Chi, C.-M.; Vossler, P.; Fan, Y.; Lv, J. Asymptotic properties of high-dimensional random forests. Ann. Stat. 2022, 50, 3415–3438. [Google Scholar] [CrossRef]
  134. Mentch, L.; Zhou, S. Randomization as regularization: A degrees of freedom explanation for random forest success. J. Mach. Learn. Res. 2020, 21, 1–36. [Google Scholar]
  135. Wright, M.N.; Ziegler, A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Softw. 2017, 77, 1–17. [Google Scholar] [CrossRef]
  136. Lin, Y.; Jeon, Y. Random forests and adaptive nearest neighbors. J. Am. Stat. Assoc. 2006, 101, 578–590. [Google Scholar] [CrossRef]
  137. Basu, S.; Kumbier, K.; Brown, J.B.; Yu, B. Iterative random forests to discover predictive and stable high-order interactions. Proc. Natl. Acad. Sci. USA 2018, 115, 1943–1948. [Google Scholar] [CrossRef] [PubMed]
  138. Deng, H. Interpreting tree ensembles with intrees. Int. J. Data Sci. Anal. 2019, 7, 277–287. [Google Scholar] [CrossRef]
139. Athey, S.; Tibshirani, J.; Wager, S. Generalized random forests. Ann. Stat. 2019, 47, 1148–1178. [Google Scholar] [CrossRef]
  140. O’Brien, R.; Ishwaran, H. A random forests quantile classifier for class imbalanced data. Pattern Recognit. 2019, 90, 232–249. [Google Scholar] [CrossRef]
  141. He, J.; Cheng, M.X. Weighting methods for rare event identification from imbalanced datasets. Front. Big Data 2021, 4, 715320. [Google Scholar] [CrossRef]
  142. Probst, P.; Wright, M.N.; Boulesteix, A.L. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2019, 9, e1301. [Google Scholar] [CrossRef]
  143. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
  144. Rodriguez, J.J.; Kuncheva, L.I.; Alonso, C.J. Rotation forest: A new classifier ensemble method. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 1619–1630. [Google Scholar] [CrossRef]
  145. Hothorn, T.; Hornik, K.; Zeileis, A. Unbiased recursive partitioning: A conditional inference framework. J. Comput. Graph. Stat. 2006, 15, 651–674. [Google Scholar] [CrossRef]
  146. Menze, B.H.; Kelm, B.M.; Splitthoff, D.N.; Koethe, U.; Hamprecht, F.A. On oblique random forests. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2011; pp. 453–469. [Google Scholar]
  147. Lin, C.-F.; Wang, S.-D. Fuzzy support vector machines. IEEE Trans. Neural Netw. 2002, 13, 464–471. [Google Scholar] [PubMed]
  148. Tao, X.; Li, Q.; Ren, C.; Guo, W.; He, Q.; Liu, R.; Zou, J. Affinity and class probability-based fuzzy support vector machine for imbalanced data sets. Neural Netw. 2020, 122, 289–307. [Google Scholar] [CrossRef] [PubMed]
  149. Tax, D.M.; Duin, R.P. Support vector data description. Mach. Learn. 2004, 54, 45–66. [Google Scholar] [CrossRef]
  150. Yu, K.; Ji, L.; Zhang, X. Kernel nearest-neighbor algorithm. Neural Process. Lett. 2002, 15, 147–156. [Google Scholar] [CrossRef]
  151. Gao, Z.; Fang, S.-C.; Luo, J.; Medhin, N. A kernel-free double well potential support vector machine with applications. Eur. J. Oper. Res. 2021, 290, 248–262. [Google Scholar] [CrossRef]
152. Wang, H.; Shao, Y.; Zhou, S.; Zhang, C.; Xiu, N. Support Vector Machine Classifier via L0/1 Soft-Margin Loss. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7253–7265. [Google Scholar] [CrossRef]
  153. Francis, L.M.; Sreenath, N. Robust scene text recognition: Using manifold regularized twin-support vector machine. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 589–604. [Google Scholar] [CrossRef]
  154. Pimentel, J.S.; Ospina, R.; Ara, A. A novel fusion Support Vector Machine integrating weak and sphere models for classification challenges with massive data. Decis. Anal. J. 2024, 11, 100457. [Google Scholar] [CrossRef]
  155. Strack, R.; Kecman, V.; Strack, B.; Li, Q. Sphere support vector machines for large classification tasks. Neurocomputing 2013, 101, 59–67. [Google Scholar] [CrossRef]
  156. Wang, S.; Li, Z.; Liu, C.; Zhang, X.; Zhang, H. Training data reduction to speed up SVM training. Appl. Intell. 2014, 41, 405–420. [Google Scholar] [CrossRef]
  157. Sowmya, T.; Anita, E.M. A Novel SHiP Vector Machine for Network Intrusion Detection. IEEE Access 2025, 13, 117445–117463. [Google Scholar] [CrossRef]
  158. Maillo, J.; García, S.; Luengo, J.; Herrera, F.; Triguero, I. Fast and scalable approaches to accelerate the fuzzy k-nearest neighbors classifier for big data. IEEE Trans. Fuzzy Syst. 2019, 28, 874–886. [Google Scholar] [CrossRef]
  159. Gou, J.; Sun, L.; Du, L.; Ma, H.; Xiong, T.; Ou, W.; Zhan, Y. A representation coefficient-based k-nearest centroid neighbor classifier. Expert Syst. Appl. 2022, 194, 116529. [Google Scholar] [CrossRef]
  160. Ma, C.; Chi, Y. KNN normalized optimization and platform tuning based on hadoop. IEEE Access 2022, 10, 81406–81433. [Google Scholar] [CrossRef]
  161. Liu, G.; Zhao, H.; Fan, F.; Liu, G.; Xu, Q.; Nazir, S. An enhanced intrusion detection model based on improved kNN in WSNs. Sensors 2022, 22, 1407. [Google Scholar] [CrossRef]
  162. Ozturk Kiyak, E.; Ghasemkhani, B.; Birant, D. High-Level K-Nearest Neighbors (HLKNN): A supervised machine learning model for classification analysis. Electronics 2023, 12, 3828. [Google Scholar] [CrossRef]
  163. Lin, K.Y.C. Optimizing variable selection and neighbourhood size in the K-nearest neighbour algorithm. Comput. Ind. Eng. 2024, 191, 110142. [Google Scholar] [CrossRef]
  164. Cai, Y.; Zhang, H.; He, Q.; Duan, J. A novel framework of fuzzy oblique decision tree construction for pattern classification. Appl. Intell. 2020, 50, 2959–2975. [Google Scholar] [CrossRef]
  165. Wang, F.; Wang, Q.; Nie, F.; Li, Z.; Yu, W.; Ren, F. A linear multivariate binary decision tree classifier based on K-means splitting. Pattern Recognit. 2020, 107, 107521. [Google Scholar] [CrossRef]
  166. Dhebar, Y.; Deb, K. Interpretable rule discovery through bilevel optimization of split-rules of nonlinear decision trees for classification problems. IEEE Trans. Cybern. 2020, 51, 5573–5584. [Google Scholar] [CrossRef]
  167. Loyola-González, O.; Ramírez-Sáyago, E.; Medina-Pérez, M.A. Towards improving decision tree induction by combining split evaluation measures. Knowl. Based Syst. 2023, 277, 110832. [Google Scholar] [CrossRef]
  168. Zhang, S.; Chen, X.; Ran, X.; Li, Z.; Cao, W. Prioritizing causation in decision trees: A framework for interpretable modeling. Eng. Appl. Artif. Intell. 2024, 133, 108224. [Google Scholar] [CrossRef]
  169. Sheng, J.; Wu, S.; Zhang, Q.; Li, Z.; Huang, H. A binary classification study of Alzheimer’s disease based on a novel subclass weighted logistic regression method. IEEE Access 2022, 10, 68846–68856. [Google Scholar] [CrossRef]
  170. Song, Z.; Wang, L.; Xu, X.; Zhao, W. Doubly robust logistic regression for image classification. Appl. Math. Model. 2023, 123, 430–446. [Google Scholar] [CrossRef]
  171. Charizanos, G.; Demirhan, H.; İçen, D. A Monte Carlo fuzzy logistic regression framework against imbalance and separation. Inf. Sci. 2024, 655, 119893. [Google Scholar] [CrossRef]
  172. Khashei, M.; Etemadi, S.; Bakhtiarvand, N. A new discrete learning-based logistic regression classifier for Bankruptcy prediction. Wirel. Pers. Commun. 2024, 134, 1075–1092. [Google Scholar] [CrossRef]
  173. Sun, W. Integrative functional logistic regression model for genome-wide association studies. Comput. Biol. Med. 2025, 187, 109766. [Google Scholar] [CrossRef]
  174. Chen, S.; Webb, G.I.; Liu, L.; Ma, X. A novel selective naïve Bayes algorithm. Knowl. Based Syst. 2020, 192, 105361. [Google Scholar] [CrossRef]
  175. Zhang, H.; Jiang, L.; Yu, L. Attribute and instance weighted naive Bayes. Pattern Recognit. 2021, 111, 107674. [Google Scholar] [CrossRef]
  176. Alizadeh, S.H.; Hediehloo, A.; Harzevili, N.S. Multi independent latent component extension of naive Bayes classifier. Knowl. Based Syst. 2021, 213, 106646. [Google Scholar] [CrossRef]
  177. Yang, Z.; Ren, J.; Zhang, Z.; Sun, Y.; Zhang, C.; Wang, M.; Wang, L. A new three-way incremental naive Bayes classifier. Electronics 2023, 12, 1730. [Google Scholar] [CrossRef]
  178. Kim, T.; Lee, J.-S. Maximizing AUC to learn weighted naive Bayes for imbalanced data classification. Expert Syst. Appl. 2023, 217, 119564. [Google Scholar] [CrossRef]
  179. Gajowniczek, K.; Grzegorczyk, I.; Ząbkowski, T.; Bajaj, C. Weighted random forests to improve arrhythmia classification. Electronics 2020, 9, 99. [Google Scholar] [CrossRef] [PubMed]
  180. Bi, X.-a.; Hu, X.; Wu, H.; Wang, Y. Multimodal data analysis of Alzheimer’s disease based on clustering evolutionary random forest. IEEE J. Biomed. Health Inform. 2020, 24, 2973–2983. [Google Scholar] [CrossRef] [PubMed]
  181. Wan, L.; Gong, K.; Zhang, G.; Yuan, X.; Li, C.; Deng, X. An efficient rolling bearing fault diagnosis method based on spark and improved random forest algorithm. IEEE Access 2021, 9, 37866–37882. [Google Scholar] [CrossRef]
  182. Jalal, N.; Mehmood, A.; Choi, G.S.; Ashraf, I. A novel improved random forest for text classification using feature ranking and optimal number of trees. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 2733–2742. [Google Scholar] [CrossRef]
  183. Tian, L.; Wu, W.; Yu, T. Graph random forest: A graph embedded algorithm for identifying highly connected important features. Biomolecules 2023, 13, 1153. [Google Scholar] [CrossRef]
  184. Shmuel, A.; Glickman, O.; Lazebnik, T. A comprehensive benchmark of machine and deep learning models on structured data for regression and classification. Neurocomputing 2025, 655, 131337. [Google Scholar] [CrossRef]
  185. Shi, T.; Horvath, S. Unsupervised learning with random forest predictors. J. Comput. Graph. Stat. 2006, 15, 118–138. [Google Scholar] [CrossRef]
  186. Yang, F.; Lu, W.-h.; Luo, L.-k.; Li, T. Margin optimization based pruning for random forest. Neurocomputing 2012, 94, 54–63. [Google Scholar] [CrossRef]
  187. Ragab Hassen, H.; Alabdeen, Y.Z.; Gaber, M.M.; Sharma, M. D2TS: A dual diversity tree selection approach to pruning of random forests. Int. J. Mach. Learn. Cybern. 2023, 14, 467–481. [Google Scholar] [CrossRef]
  188. Morales-Hernández, A.; Van Nieuwenhuyse, I.; Rojas Gonzalez, S. A survey on multi-objective hyperparameter optimization algorithms for machine learning. Artif. Intell. Rev. 2023, 56, 8043–8093. [Google Scholar] [CrossRef]
  189. Gong, J.; Chen, T. Deep configuration performance learning: A systematic survey and taxonomy. ACM Trans. Softw. Eng. Methodol. 2024, 34, 1–62. [Google Scholar] [CrossRef]
  190. Kannengiesser, N.; Hasebrook, N.; Morsbach, F.; Zöller, M.-A.; Franke, J.K.; Lindauer, M.; Hutter, F.; Sunyaev, A. Practitioner Motives to Use Different Hyperparameter Optimization Methods. ACM Trans. Comput. Hum. Interact. 2025, 32, 1–33. [Google Scholar] [CrossRef]
  191. Soviany, P.; Ionescu, R.T.; Rota, P.; Sebe, N. Curriculum learning: A survey. Int. J. Comput. Vis. 2022, 130, 1526–1565. [Google Scholar] [CrossRef]
  192. Xie, J.; Wang, M.; Grant, P.W.; Pedrycz, W. Feature selection with discernibility and independence criteria. IEEE Trans. Knowl. Data Eng. 2024, 36, 6195–6209. [Google Scholar] [CrossRef]
  193. Niño-Adan, I.; Landa-Torres, I.; Portillo, E.; Manjarres, D. Influence of statistical feature normalisation methods on K-Nearest Neighbours and K-Means in the context of industry 4.0. Eng. Appl. Artif. Intell. 2022, 111, 104807. [Google Scholar] [CrossRef]
  194. Jia, W.; Sun, M.; Lian, J.; Hou, S. Feature dimensionality reduction: A review. Complex Intell. Syst. 2022, 8, 2663–2693. [Google Scholar] [CrossRef]
  195. Gijsbers, P.; Bueno, M.L.; Coors, S.; LeDell, E.; Poirier, S.; Thomas, J.; Bischl, B.; Vanschoren, J. Amlb: An automl benchmark. J. Mach. Learn. Res. 2024, 25, 1–65. [Google Scholar]
  196. Santos, M.S.; Abreu, P.H.; Japkowicz, N.; Fernández, A.; Santos, J. A unifying view of class overlap and imbalance: Key concepts, multi-view panorama, and open avenues for research. Inf. Fusion 2023, 89, 228–253. [Google Scholar] [CrossRef]
  197. Verbraeken, J.; Wolting, M.; Katzy, J.; Kloppenburg, J.; Verbelen, T.; Rellermeyer, J.S. A survey on distributed machine learning. ACM Comput. Surv. (CSUR) 2020, 53, 1–33. [Google Scholar] [CrossRef]
  198. Sohns, J.T.; Garth, C.; Leitte, H. Decision boundary visualization for counterfactual reasoning. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2023; Volume 42, pp. 7–20. [Google Scholar]
  199. Doǧan, Ü.; Glasmachers, T.; Igel, C. A unified view on multi-class support vector classification. J. Mach. Learn. Res. 2016, 17, 1550–1831. [Google Scholar]
  200. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. (CSUR) 2014, 46, 1–37. [Google Scholar] [CrossRef]
  201. Lorena, A.C.; Garcia, L.P.; Lehmann, J.; Souto, M.C.; Ho, T.K. How complex is your classification problem? A survey on measuring classification complexity. ACM Comput. Surv. (CSUR) 2019, 52, 1–34. [Google Scholar] [CrossRef]
  202. Rivolli, A.; Garcia, L.P.; Soares, C.; Vanschoren, J.; de Carvalho, A.C. Meta-features for meta-learning. Knowl. Based Syst. 2022, 240, 108101. [Google Scholar] [CrossRef]
  203. Alcobaça, E.; Siqueira, F.; Rivolli, A.; Garcia, L.P.; Oliva, J.T.; De Carvalho, A.C. MFE: Towards reproducible meta-feature extraction. J. Mach. Learn. Res. 2020, 21, 1–5. [Google Scholar]
  204. Feurer, M.; Eggensperger, K.; Falkner, S.; Lindauer, M.; Hutter, F. Auto-sklearn 2.0: Hands-free automl via meta-learning. J. Mach. Learn. Res. 2022, 23, 1–61. [Google Scholar]
Table 1. Raw interface counts from IEEE Xplore, ScienceDirect, SpringerLink (Springer Nature), and MDPI.
Source/Model | DT | RF | NB | SVM | LR | KNN
IEEE: Title | 97 | 167 | 23 | 192 | 34 | 79
IEEE: Abstract | 987 | 1550 | 283 | 2194 | 462 | 1227
ScienceDirect | 4133 | 12,558 | 1067 | 5249 | 12,246 | 3342
MDPI: Title | 44 | 121 | 10 | 126 | 21 | 16
MDPI: Abstract | 2853 | 8689 | 849 | 5822 | 4305 | 2058
Springer Nature: Title | 32 | 69 | 8 | 105 | 14 | 33
Table 2. The resulting per-classifier identification pools.
Classifier | Identification Pool Entering Screening
DT | 216
RF | 396
NB | 80
SVM | 581
LR | 279
KNN | 244
Table 3. Foundational variants of SVM.
Variant | Targeted Limit | Methodology | Trade-Off/Limitation
ν-SVM [39] | C is less interpretable as a control of errors and sparsity. | Reparametrizes soft-margin SVM by replacing C with ν ∈ (0,1], which upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors. | Easier to understand but less flexible than C-SVC because errors and complexity are not controlled separately.
Least Squares Support Vector Machine (LS-SVM) [40] | Standard SVM needs QP training. | Uses equality constraints and squared-error loss, reducing training to solving linear equations and yielding dense example weights. | Dense solution, so prediction aggregates most training points.
Twin Support Vector Machine (TWSVM) [41] | One large QP slows training. | Solves two smaller class-specific QPs to learn nonparallel hyperplanes and assigns new points to the class whose hyperplane has the smallest perpendicular distance. | Two QPs and several hyperparameters make it more complex.
1-norm Support Vector Machine (1-norm SVM) [42] | Interpretability in high-dimensional noise. | Replaces the L2 penalty with an L1 penalty that drives many coefficients to zero and uses a solution-path algorithm to trace the piecewise-linear path as features enter or leave the model. | May drop small but relevant effects; sparsity is limited by sample size; path can be heavy on very large datasets.
Sensitivity-controlled Support Vector Machine [43] | Class imbalance. | Uses class-dependent misclassification penalties and class-dependent kernel regularization to bias the margin toward the minority or critical class; hyperparameters are tuned using ROC-based sensitivity–specificity trade-offs. | Extra hyperparameters; still sensitive to kernel choice and data distribution.
Crammer–Singer Multiclass Support Vector Machine [44] | Multiclass handling. | Formulates a single global multiclass objective with one weight vector per class; enforces a margin between true and other class scores and solves it via decomposition over individual training examples. | Global, interdependent optimization increases computational complexity.
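To make the Crammer–Singer row of Table 3 concrete, its per-example loss can be stated in a few lines: the model is penalized whenever any rival class score comes within a unit margin of the true-class score. The following pure-Python sketch is illustrative only (the function name and toy scores are not taken from [44]):

```python
def crammer_singer_loss(scores, y):
    """Crammer-Singer multiclass hinge loss for one example:
    max over r != y of (1 + s_r - s_y), floored at zero, so the
    true class must beat every rival by a margin of at least 1."""
    worst = max(1.0 + s - scores[y] for r, s in enumerate(scores) if r != y)
    return max(0.0, worst)

# True class 0 leads every rival by more than 1 -> no loss.
print(crammer_singer_loss([3.0, 0.5, -1.0], 0))  # 0.0
# True class 1 is outscored by class 0 -> loss 1 + 2.0 - 0.5 = 2.5.
print(crammer_singer_loss([2.0, 0.5, -1.0], 1))  # 2.5
```

Unlike one-vs-rest decompositions, this single objective couples all class scores, which is what makes the optimization in [44] interdependent.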
Table 4. Foundational variants of KNN.
Variant | Targeted Limit | Methodology | Trade-Off/Limitation
Edited Nearest Neighbor (ENN) [62] | Sharpen boundaries and reduce error. | Iteratively removes samples whose label disagrees with the majority of their k nearest neighbors (typically k = 3). | Strongly dependent on the choice of k.
Distance-Weighted k-Nearest Neighbors (DW-KNN) [63] | Improve decisions in overlaps and reduce ties. | Changes the voting rule by weighting neighbor votes by distance, giving nearer neighbors greater influence while keeping the original training set. | Results depend on the chosen distance metric and weighting scheme.
Fuzzy k-Nearest Neighbors (FKNN) [64] | Manage unequal neighbor influence. | Replaces hard votes with distance-based class-membership degrees, so each neighbor can support multiple classes according to its distance and class distribution. | Membership definition and parameter tuning are more complex.
Adaptive k-Nearest Neighbors (AdaNN) [65] | Improve reliability under varying densities. | For each training sample, finds its smallest correct k in the range 1–9; at prediction time, a query adopts the k of its nearest training neighbor, adapting neighborhood size to local data. | Needs offline k-search and still uses the full dataset at test time.
Large Margin Nearest Neighbor (LMNN) [66] | Use a discriminative metric to improve accuracy. | Learns a discriminative Mahalanobis metric that pulls same-class neighbors closer and pushes impostors beyond a margin to enlarge interclass separation. | Training is costly because it optimizes a large weight matrix.
Condensed Nearest Neighbor (CNN) [67] | Speed up classification. | Iteratively traverses the dataset and adds any misclassified samples to the prototype set until no training errors remain. | Sensitive to data order; subset may approach the full set in the worst case.
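The DW-KNN voting rule in Table 4 is simple enough to sketch directly: each of the k nearest neighbors votes with weight inversely proportional to its distance, so near neighbors dominate ties in overlap regions. A minimal pure-Python sketch (function name, training-set layout, and the 1/(d+ε) weighting are illustrative choices, not prescribed by [63]):

```python
import math
from collections import defaultdict

def dw_knn_predict(train, query, k=3, eps=1e-9):
    """Distance-weighted kNN vote: find the k nearest training points
    and let each vote for its label with weight 1/(distance + eps)."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + eps)
    return max(votes, key=votes.get)

train = [((0.0, 0.0), 'a'), ((0.1, 0.0), 'a'),
         ((1.0, 1.0), 'b'), ((1.1, 1.0), 'b')]
print(dw_knn_predict(train, (0.0, 0.1), k=3))  # 'a'
```

With plain majority voting and k = 3, a query equidistant from two classes can flip on a single noisy neighbor; the distance weights make the nearby pair of 'a' points outweigh the distant 'b' point.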
Table 5. Foundational variants of DT.
Variant | Targeted Limit | Methodology | Trade-Off/Limitation
Chi-squared Automatic Interaction Detector (CHAID) [82] | Shallower, more interpretable trees. | Uses chi-square tests and p-values with recursive category merging and multiway splits for nominal predictors, splitting until no statistically significant association remains. | Chi-square calculations become difficult in high-dimensional settings.
Very Fast Decision Tree/Hoeffding Tree (VFDT) [83] | DT induction on infinite/high-speed data streams. | Incrementally updates attribute–class counts in streaming leaves and splits only when the best attribute beats the second-best by the Hoeffding bound; discards weak attributes locally. | Local attribute pruning can be less effective in high-dimensional settings and relies on heuristic parameters such as sample thresholds and tie-breaking rules.
Supervised Learning In Quest (SLIQ) [84] | Training on large, disk-resident datasets. | Global presort of numeric attributes plus a RAM class list and per-attribute lists on disk; breadth-first growth with a single scan per level and Gini-based split retention, followed by MDL pruning. | Training costs still grow linearly with the number of attributes.
Scalable Parallelizable Induction of Decision Trees (SPRINT) [85] | Further improve scalability on large, disk-resident datasets beyond what SLIQ can achieve. | Extends SLIQ by removing the centralized class list; each attribute list directly carries class labels and record IDs and uses running histograms to evaluate all candidate splits in a single pass while keeping breadth-first growth, MDL pruning, and the categorical-split strategy. | Per-level scans over all attribute lists keep training cost linear in the number of attributes, which can limit scalability in very high-dimensional data.
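The VFDT split rule in Table 5 reduces to a one-line bound: with value range R, confidence parameter δ, and n observed examples, ε = √(R² ln(1/δ) / (2n)), and a leaf splits once the best attribute's gain exceeds the runner-up's gain by more than ε. A minimal sketch of that decision (function and parameter names are illustrative):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the observed mean of
    n samples of a quantity with range value_range is within epsilon of
    its true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range, delta, n):
    """VFDT-style test: split only when the best attribute beats the
    second-best by more than the Hoeffding bound epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# After 2000 stream examples a 0.2 gain gap is decisive...
print(should_split(0.3, 0.1, 1.0, 1e-7, 2000))  # True
# ...but a 0.02 gap after only 200 examples is not.
print(should_split(0.30, 0.28, 1.0, 1e-7, 200))  # False
```

Because ε shrinks as n grows, near-ties between attributes resolve automatically once enough stream data has been seen, which is why VFDT needs the extra tie-breaking heuristics the table mentions only for persistent ties.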
Table 6. Foundational variants of LR.
Variant | Targeted Limit | Methodology | Trade-Off/Limitation
Ridge Logistic Regression (L2-penalized LR) [103] | Unstable maximum-likelihood estimates and poor predictive accuracy when predictors are many or strongly correlated (multicollinearity). | Maximizes a penalized log-likelihood that subtracts an L2 penalty on the squared coefficient magnitudes, shrinking them toward zero and stabilizing estimation in the presence of multicollinearity. | The quadratic penalty cannot drive coefficients exactly to zero, so all predictors remain in the model and there is no true sparsity or built-in variable selection.
Lasso Logistic Regression (L1-penalized LR) [104] | Need for sparse logistic models with built-in variable selection when many predictors may be irrelevant. | Maximizes a penalized log-likelihood with an L1 penalty on the absolute coefficient values, driving some coefficients exactly to zero and discarding the corresponding predictors. | With correlated predictors, it often keeps only one and discards others, losing grouped information and stability.
Elastic Net Logistic Regression [105] | Unstable, non-sparse estimates when predictors are many and highly correlated, where ridge or lasso alone are unsatisfactory. | Maximizes a penalized log-likelihood mixing L1 and L2; the corrected version rescales coefficients to avoid double shrinkage. | The naïve elastic net over-shrinks coefficients; the corrected form alleviates this, but the original work did not treat multinomial extensions.
Proportional-Odds Logistic Regression (Ordinal LR) [106] | Information loss when ordinal outcomes are binarized or assigned arbitrary numeric scores. | Models the cumulative probability of being in a given or lower category as a logistic function of the predictors, with a single set of slopes and category-specific thresholds estimated across ordered classes. | Relies on the proportional-odds/parallel-slopes assumption; may misfit if effects vary across thresholds.
Jeffreys-prior Penalized/Bias-Reduced Logistic Regression [107] | Small or sparse data (including imbalance or perfect prediction) causes biased or infinite MLE estimates. | Adds a Jeffreys-prior penalty to the likelihood and maximizes the resulting penalized likelihood iteratively with weight updates, applying data-adaptive shrinkage that stabilizes the estimates. | Bias–variance trade-off; penalty can reduce precision depending on model complexity and data distribution.
Multinomial Logistic Regression (Random-Utility/Softmax LR) [108] | Binary LR is insufficient for multi-class outcomes. | Uses class-specific utilities with Gumbel noise to yield closed-form multinomial probabilities; coefficients estimated jointly by MLE. | Independence of irrelevant alternatives (IIA) may fail for similar classes; higher computational cost from joint estimation.
GAM-Logit (Semiparametric Logistic Regression) [109] | Linear logit assumption cannot capture complex nonlinear patterns. | Replaces linear terms with smooth functions (typically spline or basis expansions) controlled by smoothing parameters; fitted by local scoring/backfitting, which is equivalent to maximizing a penalized log-likelihood with a roughness penalty. | Additive assumption remains; sensitive to smoothing choice; higher computational cost than classical LR.
Rare-Events Logistic Regression (ReLogit) [99] | Standard LR underestimates probabilities in rare-event data. | Post-estimation bias correction for coefficients, probability adjustment using variance–covariance, and case–control sampling with intercept/weight correction. | Adds methodological complexity; tailored to binary rare-event contexts, less applicable to multiclass or balanced settings.
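The penalized variants at the top of Table 6 all minimize the same quantity with different penalty terms: the logistic negative log-likelihood plus λ·‖w‖² for ridge (or λ·‖w‖₁ for lasso). A minimal pure-Python sketch of the ridge objective, with illustrative names and toy data (this evaluates the objective only; the cited work also covers how it is optimized):

```python
import math

def penalized_nll(w, X, y, lam):
    """Ridge-penalized logistic objective: negative log-likelihood of
    labels y in {0, 1} under the logistic model, plus lam * ||w||^2."""
    nll = 0.0
    for xi, yi in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, xi))   # linear predictor
        p = 1.0 / (1.0 + math.exp(-z))              # sigmoid probability
        nll -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return nll + lam * sum(wj * wj for wj in w)

X, y = [[1.0, 0.0], [0.0, 1.0]], [1, 0]
# At w = 0 every prediction is 0.5, so the objective is 2*ln(2) ~ 1.3863.
print(penalized_nll([0.0, 0.0], X, y, 0.1))
```

Swapping the quadratic penalty for `lam * sum(abs(wj) for wj in w)` gives the lasso objective; the table's contrast between the two (shrinkage everywhere vs. exact zeros) comes entirely from this one term.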
Table 7. Foundational variants of NB.
Variant | Targeted Limit | Methodology | Trade-Off/Limitation
Tree Augmented Naïve Bayes (TAN) [127] | Independence assumption limits accuracy. | Adds a tree-structured dependency: each feature has the class as a parent and at most one extra feature parent chosen via conditional mutual information, learned as a maximum spanning tree. | Conditioning on class and another feature enlarges probability tables, causing data fragmentation and unreliable estimates in high-dimensional or small datasets.
Averaged One-Dependence Estimators (AODE) [128] | Bias from the independence assumption while keeping NB’s efficiency. | Forms an ensemble of one-dependence models by letting each attribute act as a “super-parent” with the class; averages their class-probability estimates, falling back to NB for rare values. | Prediction is slower; gains may diminish in very high-dimensional data with sparse super-parent values.
Complement Naïve Bayes (CNB) [129] | Multinomial NB biasing decisions under class imbalance. | Reweights features using counts from the complement of each class (all other classes), then smooths and normalizes those complement statistics. | Departs from the standard generative form and requires extra complement-statistic computation, making prediction slightly slower than multinomial NB.
Kernel-Density Naïve Bayes (KDE-NB)/Flexible Bayes [130] | Gaussian NB’s rigid normality assumption fails for multimodal, skewed, or irregular continuous feature distributions. | Replaces Gaussian likelihoods with kernel density estimation: each training point contributes a Gaussian kernel with shared bandwidth, and class densities are averages of these kernels. | Prediction must consider all training instances, is bandwidth-sensitive, noise-vulnerable, and still assumes independence across features.
Selective Bayesian Classifier (SBC) [131] | NB can overweight correlated or redundant features, reducing accuracy. | Keeps NB estimation but adds greedy forward feature selection (no backtracking), retaining features that improve or do not reduce training accuracy. | Selected subset is not guaranteed optimal; independence is still assumed among kept features; training-accuracy selection can overfit.
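The CNB row of Table 7 can be made concrete: for each class c, feature weights are estimated from the pooled counts of every class except c, smoothed, and normalized. A minimal pure-Python sketch following the weight-normalization idea of [129] (the dict-of-count-vectors layout and function name are illustrative assumptions):

```python
import math

def cnb_weights(counts, alpha=1.0):
    """Complement NB weights: for each class c, pool feature counts from
    all OTHER classes, smooth by alpha, take log-probabilities, and
    L1-normalize the logs as in Rennie et al.'s weight normalization."""
    classes = list(counts)
    n_features = len(next(iter(counts.values())))
    weights = {}
    for c in classes:
        comp = [alpha + sum(counts[o][j] for o in classes if o != c)
                for j in range(n_features)]
        total = sum(comp)
        logs = [math.log(v / total) for v in comp]
        norm = sum(abs(v) for v in logs)
        weights[c] = [v / norm for v in logs]
    return weights

# Toy word counts per class: the minority class 'pos' borrows statistics
# from the much better-populated complement ('neg').
w = cnb_weights({'pos': [3, 1], 'neg': [0, 4]})
```

Because a minority class's weights come from the (larger) complement, the estimates are less starved for data than standard multinomial NB estimates under imbalance, which is exactly the bias the table says CNB targets.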
Table 8. Foundational variants of RF.
Table 8. Foundational variants of RF.
| Variant | Targeted Limit | Methodology | Trade-Off/Limitation |
|---|---|---|---|
| Extremely Randomized Trees (Extra-Trees) [143] | Improve accuracy and reduce training time | Increases randomness by selecting random split thresholds (no greedy threshold search) in addition to random feature subsets; uses the whole training set per tree. | Each tree is more biased; tuning is still needed for features-per-node, minimum node size, and number of trees. |
| Rotation Forest [144] | Increase base-learner diversity to improve accuracy | Trains each tree on a rotated feature space: randomly splits features into subsets, applies PCA on each subset (using a 75% bootstrap and a random class subset), keeps all components, and builds a rotation matrix for data transformation. | Higher computational complexity from repeated PCA and no standard hyperparameter-tuning mechanism. |
| Conditional Inference Forest (cforest) [145] | Remove split-selection bias | Uses conditional-inference trees: variable selection via permutation-based conditional independence tests (smallest p-value chosen), then the split point is determined within that variable; forests are grown on bootstraps with random feature subspacing. | More computation due to statistical testing at each node. |
| Oblique Random Forest (Oblique RF-Ridge) [146] | High-dimensional or correlated data | Replaces axis-aligned splits with oblique splits using weighted linear combinations of features; ridge regression at each node finds the split direction, then samples are projected and thresholded by Gini. | Slower training; hyperparameters (trees, features per split) remain heuristically chosen. |
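The Extra-Trees split rule in Table 8 can be illustrated at a single node with a short NumPy sketch (an illustrative sketch, not the reference implementation of [143]; `extra_trees_split` and its arguments are our own names): instead of a greedy threshold search, each candidate feature receives one uniformly random threshold, and the candidate with the lowest weighted Gini impurity wins.

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def extra_trees_split(X, y, n_candidate_features, rng):
    # Extra-Trees node split: for each feature in a random subset, draw ONE
    # uniformly random threshold between that feature's min and max (no greedy
    # threshold search), then keep the candidate with the lowest weighted Gini.
    n, d = X.shape
    features = rng.choice(d, size=min(n_candidate_features, d), replace=False)
    best = None
    for j in features:
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:  # constant feature cannot be split
            continue
        t = rng.uniform(lo, hi)
        left = X[:, j] <= t
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n
        if best is None or score < best[0]:
            best = (score, j, t)
    return best  # (weighted_gini, feature_index, threshold), or None
```

Skipping the exhaustive threshold search is what buys the training-time reduction in the table; the random threshold is also why individual trees are more biased and must be averaged over a larger forest.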
Table 15. Cross-family comparison across performance perspectives.
| Classifier | Accuracy and Boundary | Hyperparameters, High Dimensionality and Scalability | Class Imbalance, Multiclass Classification, Interpretability and Speed |
|---|---|---|---|
| SVM | Margin-based; kernels enable complex boundaries but accuracy depends on C and kernel settings → recent work uses membership/weighting and kernel-free nonlinear formulations → evidence is still limited for truly large-scale and sparse high-dimensional regimes. | Tuning burden is dominated by C and kernel parameters; cost grows with support vectors → mitigations via margin-near active sets, coreset/weak-model selection, and SV reduction → adds heuristic/tunable settings governing the efficiency–accuracy trade-off. | Soft-margin can bias toward the majority class → class-dependent costs and membership weighting address imbalance → multiclass relies on decompositions; nonlinear models remain less interpretable and SV-heavy prediction can be slow. |
| KNN | Local nonlinear; accuracy sensitive to k, distance choice, and noisy neighbors → recent work strengthens neighbor selection/metric/voting → high-dimensional distance degradation persists. | Few hyperparameters (mainly k), but storage and neighbor search dominate → mitigation via prototype reduction and approximate/distributed indexing; sometimes joint k plus feature weighting/selection → added preprocessing/optimization cost scales with features. | Naturally multiclass and decisions are traceable to retrieved neighbors → majority-vote bias under imbalance and slow querying on large data remain → often needs explicit rebalancing/cost-sensitive handling. |
| DT | Piecewise nonlinear, axis-aligned; deep trees fit complex structure but overfit and become unstable → mitigations via more expressive/local splits and improved split selection (including shallower stopping) → instability and weak scalability evidence in very high-dimensional/large-scale settings persist. | Depth/node-size/leaves and pruning control complexity; many noisy/sparse attributes make split search costly and unreliable → mitigations via statistical-test splitting, streaming induction, and disk/parallel level-wise construction → added per-node computation/heuristic thresholds and linear dependence on attributes still constrain very high-dimensional regimes. | Interpretable path rules when shallow and supports multiclass, but impurity splits favor the majority class and large trees reduce clarity and speed → shallower/test- or causality-guided growth improves compactness while adding computation → classical imbalance bias remains largely unaddressed in the reviewed variants. |
| LR | Linear boundary; strong on near-linear structure but biased under nonlinearity → recent work adds structured effects, robust objectives, and discrete training → added complexity, and the boundary remains linear unless nonlinear terms are introduced. | Few intrinsic hyperparameters, but high dimensionality/collinearity and (quasi-)separation destabilize MLE → regularization and structured representations stabilize and reduce effective dimensionality → tuning and added preprocessing/optimization can increase cost. | Interpretable coefficients, but imbalance plus fixed thresholds can miss the rare class → mitigations via thresholding/weighting and bias-reduction ideas → interpretability can weaken as added structure and tuning grow. |
| NB | Often competitive, but independence and simple likelihoods can misfit dependencies/complex continuous structure → variants select/weight features, add limited dependencies or latent components, and use flexible densities → added structure/tuning and dependence effects remain limiting. | Nearly hyperparameter-free (mainly smoothing) and efficient with many features/classes → dependence-relaxing and flexible-density variants add tuning and can raise prediction or model-selection cost (e.g., cross-validated structure/latent size, bandwidth-like choices) → heavy tuning/CV can be costly at scale. | Interpretable additive per-feature terms and fast prediction; multiclass via class posteriors → skewed priors bias toward majority classes → imbalance-oriented NB variants reweight/optimize AUC and may use resampling, but add optimization and hyperparameter dependence. |
| RF | Strong accuracy; flexible nonlinear boundary via many trees → variants reweight/select trees, reduce redundancy, and use oblique/transform splits to exploit correlations → evidence is still limited and gains can depend on heuristic choices. | Key settings (trees, features-per-split, leaf/depth) drive bias–variance; random subspaces help high dimensionality, but cost scales with forest size → mitigations via extra randomization, redundancy pruning, and structure-based simplification → added tunables/heuristic thresholds and preprocessing or per-node tests can raise cost. | Majority voting supports multiclass, but impurity and vote bias favor the majority class; the ensemble is less transparent than single trees → some work improves diagnostics/feature-structure interpretability (graph-guided selection) → imbalance sensitivity and ensemble opacity largely persist without explicit correction. |
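Several rows above cite decision-threshold tuning as an imbalance mitigation, for example against LR's fixed 0.5 cut-off missing the rare class. The following is a minimal sketch of that idea, assuming predicted positive-class probabilities are already available; the function name and the F1 selection criterion are illustrative choices, not prescribed by the surveyed work.

```python
import numpy as np

def best_f1_threshold(y_true, p_pos, thresholds=None):
    # Sweep decision thresholds over the predicted positive-class
    # probabilities and return the one maximizing F1, instead of classifying
    # with a fixed 0.5 cut-off that can miss the rare class under imbalance.
    if thresholds is None:
        thresholds = np.unique(p_pos)
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        pred = (p_pos >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

In practice the threshold should be chosen on a validation split rather than on the evaluation data, and the same sweep applies unchanged to any classifier that outputs class probabilities, not only LR.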