Article

Multi-Objective Harris Hawks Optimization with NSGA-III for Feature Selection in Student Performance Prediction

Department of Computer Science, Prince Abdullah bin Ghazi Faculty of Information and Communication Technology, Al-Balqa Applied University, Al-Salt 19117, Jordan
Computers 2026, 15(2), 112; https://doi.org/10.3390/computers15020112
Submission received: 6 January 2026 / Revised: 28 January 2026 / Accepted: 30 January 2026 / Published: 6 February 2026
(This article belongs to the Section AI-Driven Innovations)

Abstract

Student performance is an important factor in the success of any education process; as a result, early detection of students at risk is critical for enabling timely and effective educational interventions. However, most educational datasets are complex and high-dimensional, containing many redundant or irrelevant features. We therefore propose MOHHO-NSGA-III, a multi-objective feature-selection framework that jointly optimizes classification performance, feature-subset compactness, and prediction stability across cross-validation folds. The algorithm combines Harris Hawks Optimization (HHO), which provides a good balance between exploration and exploitation, with NSGA-III, which preserves solution diversity along the Pareto front. Moreover, an adaptive diversity management strategy injects new solutions when the population stagnates, reducing premature convergence. We validated the algorithm on the Portuguese and Mathematics datasets from the UCI Student Performance repository. Selected features were evaluated with five classifiers (k-NN, Decision Tree, Naive Bayes, SVM, LDA) through 10-fold cross-validation repeated over 21 independent runs. MOHHO-NSGA-III consistently selected 12 of 30 features (a 60% reduction) while achieving 4.5% higher average accuracy than the full feature set (Wilcoxon test; p < 0.01 across all classifiers). The most frequently selected features were past failures, absences, and family support, aligning with educational research on student success factors. This suggests the proposed algorithm produces models that are not only accurate but also interpretable, suitable for deployment in institutional early-warning systems.

1. Introduction

Educational systems collect huge amounts of student-related data, such as demographics, exam evaluations, lifestyle, online activity, and course registrations [1,2]. This data is a valuable resource that can be used to improve teaching and institutional decisions [3,4]. Educational data mining (EDM) applies machine learning (ML) and statistics to extract useful insights from this information [5,6,7]. Researchers and educational leaders have tried to build modern systems that enhance student performance by constructing prediction models that identify the risk factors affecting it [8,9]. In brief, good prediction models help educators design better support programs, use resources wisely, and boost student success [10,11]. Such systems and models improve student performance and help students obtain the desired results [2,12].
Recently, researchers have adopted several ML techniques to address educational problems, such as classification [6,13], clustering [14], association rule mining [15], recommendation systems [16], and deep learning [17]. Recent work in performance prediction and student performance evaluation has employed ensemble methods [18], neural networks [19], gradient boosting [20], and hybrid approaches [13]. However, the main challenge is that educational datasets typically contain a large number of features that are either irrelevant or redundant. This dimensionality issue both reduces model accuracy and makes models difficult to interpret [21,22].
Feature-selection methods address redundancy by keeping useful attributes and removing the rest [3,4]. The most common approaches include filters [23], wrappers [24], and embedded methods [25]. Wrappers usually work better because they test how well features actually perform with a classifier [22]. The main issue with wrappers is that they typically prioritize accuracy alone and ignore other important factors, such as how many features are used, how long the search takes, and whether predictions stay consistent across different data splits.
To address these issues, multi-objective optimization handles the competing goals at once [26,27,28]. Rather than giving one answer, it produces a set of solutions showing different trade-offs [29]. For example, methods like NSGA-II [30], NSGA-III [31], and MOEA/D [32] work well for feature selection across different problems [27,33].
Harris Hawks Optimization (HHO) [34] is a metaheuristic inspired by the cooperative hunting behavior of Harris hawks; specifically, their surprise pounce strategy. HHO balances exploration and exploitation effectively, achieving strong performance across optimization problems [35,36,37]. While multi-objective HHO variants exist for different applications [38,39], no prior work has combined HHO with NSGA-III for educational feature selection, to our knowledge.
NSGA-III [31] represents a significant advancement in many-objective optimization, employing a reference-point-based diversity preservation mechanism that maintains uniform distribution of solutions across the Pareto front. Unlike NSGA-II, which relies on crowding distance for diversity maintenance, NSGA-III’s reference-point approach proves particularly effective for problems with three or more objectives [31]. Recent studies have confirmed NSGA-III’s superiority in maintaining diversity and convergence for many-objective optimization problems [26,33].
Recent work has explored using reinforcement learning (RL) with multi-objective evolutionary algorithms for adaptive operator selection and parameter tuning. Tian et al. [40] used deep RL for dynamic MOEA adaptation; Li et al. [41] applied Q-learning to operator selection in NSGA-III. The idea is that RL agents learn effective search strategies from experience, potentially improving performance across problem families [42].
However, RL-enhanced MOEAs come with significant overhead: training RL agents requires thousands of episodes across multiple problem instances, reward function design is challenging for multi-objective scenarios, and learned policies often do not transfer well between domains [43]. For educational feature selection (typically a one-time analysis per dataset), this training cost may not be justified. Our MOHHO-NSGA-III achieves adaptive behavior through diversity management and HHO's energy-based phase transitions without RL training, offering a simpler approach better suited to educational data-mining contexts where retraining across many problem instances is not practical.

Despite the advances in machine learning for educational applications, several critical gaps remain. First, most existing student performance prediction models focus solely on maximizing classification accuracy, ignoring the equally important objectives of feature reduction and prediction stability [3,4]. Second, traditional feature-selection methods often suffer from premature convergence and poor diversity in the solution set [26]. Third, existing approaches lack adaptive mechanisms to maintain population diversity during the optimization process, limiting their ability to escape local optima [35]. Finally, there is a lack of research combining the exploration–exploitation capabilities of HHO with the robust environmental selection of NSGA-III for educational feature-selection problems.
Our proposed method optimizes three competing goals at once: obtain high classification accuracy, use fewer features, and keep predictions consistent across different data splits. To achieve these goals, we combine HHO's search strategy (which naturally balances exploration and exploitation) with NSGA-III's reference points (which maintain diverse solutions) to find better feature subsets. In addition, we add a diversity checkpoint: when the population becomes too similar, the algorithm injects random solutions so the search keeps moving instead of getting stuck.
In this paper, we make several contributions to multi-objective feature selection for educational data mining as follows:
  • We propose a new approach called MOHHO-NSGA-III, which combines HHO with the NSGA-III method to leverage HHO’s exploration–exploitation dynamics alongside NSGA-III’s diversity preservation through reference points.
  • The proposed algorithm optimizes three objectives simultaneously: (1) maximizing classification accuracy, (2) minimizing the number of selected features, and (3) ensuring stable performance across CV folds to obtain the most stable feature subsets.
  • An adaptive diversity mechanism is proposed that injects the population with new solutions to avoid stagnation.
The paper is organized as follows: Section 2 describes the MOHHO-NSGA-III algorithm. Section 3 details the experimental setup. Section 4 presents results and statistical analysis. Section 5 discusses educational and algorithmic insights. Section 6 concludes with limitations and future directions.

2. Proposed Method

In this work, the proposed model is depicted in Figure 1. After collecting educational data, it is important to determine the most valuable features by eliminating features that have no effect on the output class. The proposed model is a hybridization between Harris Hawks Optimization (HHO), NSGA-III, and adaptive diversity management. MOHHO-NSGA-III works as a multi-objective wrapper feature-selection algorithm that tries to simultaneously optimize three objectives: classification accuracy, feature reduction, and prediction stability. The proposed approach enhances the selection mechanism by integrating the exploration–exploitation balance of HHO with the robust environmental selection of NSGA-III. This process will enhance the performance of ML classifiers, reducing the search space and computational time while providing multiple optimal solutions through a Pareto front. In this work, we employed the k-fold cross-validation method with k = 10 to avoid the overfitting problem. The following subsections discuss the proposed approach in detail.

2.1. Preprocessing

In real-world applications, collected data may suffer from several issues, such as missing or corrupted values, caused by technical problems or human error. To build an accurate model for educational systems, the data should be complete (no missing values) and contain enough samples. The UCI Student Performance dataset used in this paper contains both numerical and categorical features. We verified that the dataset has no missing data. For categorical features, we applied label encoding to convert them into numerical values. To enhance computing efficiency and reduce the impact of different feature scales, we normalized the complete dataset to the range [0, 1] using min–max scaling, as shown in Equation (1), where $V_{ij}$ is the value of feature $j$ in row $i$ of the dataset matrix, $\min(V_j)$ and $\max(V_j)$ are the minimum and maximum values of the $j$th feature, respectively, and $nMin$ and $nMax$ are the lower and upper bounds for rescaling the data.
$$V_{ij} = \frac{V_{ij} - \min(V_j)}{\max(V_j) - \min(V_j)} \times (nMax - nMin) + nMin \tag{1}$$
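As a concrete illustration, Equation (1) can be applied column-wise with numpy; the function name and the guard against constant columns are ours, not part of the paper:

```python
import numpy as np

def min_max_scale(X, n_min=0.0, n_max=1.0):
    """Rescale every column of X into [n_min, n_max] following Equation (1)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard: constant columns would divide by zero, so use a span of 1 there
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span * (n_max - n_min) + n_min
```

With the default bounds $nMin = 0$ and $nMax = 1$, this reduces to the standard [0, 1] normalization used in the paper.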

2.2. Multi-Objective Feature Selection

Feature-selection (FS) algorithms try to reduce the dimensionality of the original dataset without losing its value [44]. In general, FS algorithms are categorized into two groups: wrapper and filter algorithms. Wrapper algorithms (e.g., metaheuristic algorithms) are more accurate than filter algorithms due to their ability to explore the search space effectively, while filter algorithms try to discover statistical relationships between features. Moreover, the selection criteria for wrapper methods depend on the performance of the internal machine learning classifier, which makes wrapper methods more suitable than filter methods [45].
Traditional feature-selection methods optimize a single objective, typically classification accuracy. But educational data mining requires balancing multiple competing goals: we want high prediction accuracy, a small feature set for interpretability, and stable performance across different CV folds. MOHHO-NSGA-III addresses this by generating a Pareto front of solutions that represent different trade-offs among these objectives.
Solutions in MOHHO-NSGA-III are encoded as binary vectors of length $n$ (Figure 2), where each bit indicates whether a feature is selected (1) or not (0). This binary representation makes crossover and mutation straightforward to implement. We use k-Nearest Neighbors as the evaluation classifier because it is computationally efficient, $O(h \times m)$, where $h$ is the number of training samples and $m$ is the number of selected features, and it does not require extensive hyperparameter tuning, which would add another layer of complexity to an already expensive wrapper approach.

2.3. Multi-Objective Fitness Function

Unlike single-objective approaches that optimize accuracy alone, MOHHO-NSGA-III balances three competing objectives for a more complete evaluation of feature subsets:

2.3.1. Objective 1: Classification Accuracy

The first objective maximizes classification accuracy using the selected features. Using k-fold cross-validation (k = 10), the accuracy is calculated as:
$$f_1 = \text{Accuracy} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$$
where $TP_i$, $TN_i$, $FP_i$, and $FN_i$ are the true positives, true negatives, false positives, and false negatives in fold $i$.

2.3.2. Objective 2: Feature Reduction

The second objective minimizes the number of selected features, which improves the interpretability and reduces the computational cost:
$$f_2 = \text{Feature Ratio} = \frac{|S|}{n}$$
where $|S|$ is the number of selected features and $n$ is the total number of features.

2.3.3. Objective 3: Prediction Stability

The third objective measures prediction consistency across CV folds:
$$f_3 = \text{Stability} = 1 - \mathrm{std}(\mathrm{Acc}_1, \mathrm{Acc}_2, \ldots, \mathrm{Acc}_k)$$
where $\mathrm{Acc}_i$ is the accuracy on fold $i$ and $\mathrm{std}$ is the standard deviation. Higher stability means more consistent predictions across folds.
The multi-objective problem is:
$$\text{Maximize: } F(S) = \{f_1(S),\, -f_2(S),\, f_3(S)\} \quad \text{subject to } S \subseteq \{1, 2, \ldots, n\},\ |S| \geq 1$$
where $S$ is a feature subset. We negate $f_2$ to convert its minimization into a maximization.
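Under the binary encoding of Section 2.2, the three objectives reduce to a few lines. The sketch below assumes the per-fold CV accuracies have already been computed by the wrapper classifier (k-NN in the paper); `objectives` is an illustrative name:

```python
import numpy as np

def objectives(mask, fold_accuracies):
    """Return (f1, -f2, f3) for a binary feature mask, so that all three
    objectives are maximized as in the formulation above."""
    accs = np.asarray(fold_accuracies, dtype=float)
    f1 = accs.mean()               # Objective 1: mean CV accuracy
    f2 = mask.sum() / mask.size    # Objective 2: feature ratio (to minimize)
    f3 = 1.0 - accs.std()          # Objective 3: stability across folds
    return f1, -f2, f3
```

Returning $-f_2$ lets a single maximization routine handle all three values uniformly.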

2.4. Harris Hawks Optimization (HHO)

Harris Hawks Optimization (HHO) is a nature-inspired metaheuristic algorithm proposed by [34] that mimics the cooperative hunting behavior and surprise pouncing strategy of Harris hawks in nature. The algorithm balances exploration and exploitation phases based on the escaping energy of the prey (optimal solution). HHO has demonstrated superior performance in various optimization problems due to its adaptive transition between exploration and exploitation strategies.

2.4.1. Exploration Phase

During the exploration phase ($|E| \geq 1$), hawks perch randomly to detect prey locations. Two exploration strategies are employed:
Strategy 1: Random hawks perching (when $q < 0.5$):
$$X(t+1) = X_{rand}(t) - r_1 \, |X_{rand}(t) - 2 r_2 X(t)|$$
Strategy 2: Perching on tall trees (when $q \geq 0.5$):
$$X(t+1) = (X_{rabbit}(t) - X_{avg}(t)) - r_3 (LB + r_4 (UB - LB))$$
where $X_{rand}$ is a randomly selected hawk, $X_{rabbit}$ is the best solution (rabbit position), $X_{avg}$ is the average position of all hawks, $r_1, r_2, r_3, r_4$ are random numbers in [0, 1], and $LB$ and $UB$ are the lower and upper bounds.
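A continuous-space sketch of the two perching strategies; the function signature is ours, and in the binary variant these positions are subsequently passed through the sigmoid transfer of Section 2.4.2:

```python
import numpy as np

def hho_exploration(X, X_rand, X_rabbit, X_avg, lb, ub, rng):
    """One exploration move (|E| >= 1): pick Strategy 1 or 2 based on q."""
    q, r1, r2, r3, r4 = rng.random(5)
    if q < 0.5:
        # Strategy 1: perch relative to a randomly selected hawk
        return X_rand - r1 * np.abs(X_rand - 2.0 * r2 * X)
    # Strategy 2: perch relative to the rabbit and the population average
    return (X_rabbit - X_avg) - r3 * (lb + r4 * (ub - lb))
```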

2.4.2. Exploitation Phase

During the exploitation phase ($|E| < 1$), hawks perform different besiege strategies depending on the escaping energy $E$ and the prey's behavior:
Soft Besiege (when $|E| \geq 0.5$ and $r \geq 0.5$):
$$X(t+1) = \Delta X(t) - E \, |J \, X_{rabbit}(t) - X(t)|$$
where $\Delta X(t) = X_{rabbit}(t) - X(t)$ is the difference between the rabbit's position and the current hawk.
Hard Besiege (when $|E| < 0.5$ and $r \geq 0.5$):
$$X(t+1) = X_{rabbit}(t) - E \, |X_{rabbit}(t) - X(t)|$$
Soft Besiege with Progressive Rapid Dives (when $|E| \geq 0.5$ and $r < 0.5$):
$$Y = X_{rabbit}(t) - E \, |J \, X_{rabbit}(t) - X(t)|$$
If $Y$ does not improve the current position, a Lévy flight is performed:
$$Z = Y + S \times LF(d)$$
where $S$ is a random vector of dimension $d$ and $LF$ is the Lévy flight function.
Hard Besiege with Progressive Rapid Dives (when $|E| < 0.5$ and $r < 0.5$): similar to soft besiege with dives, but uses the average position $X_{avg}$ instead.
The escaping energy $E$ is calculated as:
$$E = 2 E_0 \left(1 - \frac{t}{T_{max}}\right)$$
where $E_0 \in [-1, 1]$ is the initial energy, $t$ is the current iteration, and $T_{max}$ is the maximum number of iterations. The jump strength $J$ is calculated as $J = 2(1 - r_5)$, where $r_5 \sim U(0, 1)$.
For binary feature selection, the continuous positions are converted to binary using the sigmoid transfer function:
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
A feature $i$ is selected if $\mathrm{Sigmoid}(x_i) > \mathrm{rand}()$, where $\mathrm{rand}() \sim U(0, 1)$.
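The energy schedule and the sigmoid binarization can be sketched as follows; `escaping_energy` and `binarize` are illustrative names:

```python
import numpy as np

def escaping_energy(t, t_max, rng):
    """E = 2*E0*(1 - t/T_max) with E0 ~ U(-1, 1); |E| >= 1 triggers
    exploration, |E| < 1 exploitation."""
    e0 = rng.uniform(-1.0, 1.0)
    return 2.0 * e0 * (1.0 - t / t_max)

def binarize(position, rng):
    """Sigmoid transfer: feature i is selected when sigmoid(x_i)
    exceeds a uniform random draw."""
    prob = 1.0 / (1.0 + np.exp(-np.asarray(position, dtype=float)))
    return (prob > rng.random(prob.shape)).astype(int)
```

Because $|E|$ shrinks linearly toward zero, late iterations are dominated by exploitation, matching the convergence pattern reported in Section 4.1.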

2.5. NSGA-III for Environmental Selection

The Non-dominated Sorting Genetic Algorithm III (NSGA-III) proposed by [31] is an evolutionary multi-objective optimization algorithm specifically designed for many-objective problems. Unlike NSGA-II, which relies on crowding distance for diversity maintenance, NSGA-III uses reference points to ensure uniform diversity across the Pareto front, making it more suitable for problems with three or more objectives.

2.5.1. Reference-Point Generation

Reference points are generated using Das and Dennis’s systematic approach [46]. For three objectives with p divisions, the number of reference points is calculated as:
$$H = \binom{M + p - 1}{p}$$
where $M$ is the number of objectives. In our case, with $M = 3$ and $p = 12$, we generate $H = \binom{14}{12} = 91$ uniformly distributed reference points.
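A compact way to enumerate the Das–Dennis points is the stars-and-bars construction; this sketch (function name ours) reproduces the count $H = \binom{M+p-1}{p}$:

```python
from itertools import combinations
from math import comb

def das_dennis(m, p):
    """All reference points whose coordinates are multiples of 1/p
    and sum to 1 (Das and Dennis's systematic approach)."""
    points = []
    # Stars and bars: choose m-1 bar positions among p+m-1 slots;
    # the gap sizes between bars become the point's coordinates.
    for bars in combinations(range(p + m - 1), m - 1):
        prev, coords = -1, []
        for b in list(bars) + [p + m - 1]:
            coords.append((b - prev - 1) / p)
            prev = b
        points.append(tuple(coords))
    return points
```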

2.5.2. Non-Dominated Sorting

Solutions are ranked into non-dominated fronts $F_1, F_2, \ldots$ using fast non-dominated sorting. Solution $x$ dominates solution $y$ (denoted $x \succ y$) if:
$$\forall i: f_i(x) \geq f_i(y) \ \text{(for maximization)} \quad \text{and} \quad \exists j: f_j(x) > f_j(y)$$
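The dominance test for our three maximized objectives is a direct translation of this definition:

```python
def dominates(fx, fy):
    """True when objective vector fx Pareto-dominates fy under
    maximization: at least as good everywhere, strictly better somewhere."""
    return (all(a >= b for a, b in zip(fx, fy))
            and any(a > b for a, b in zip(fx, fy)))
```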

2.5.3. Niching Selection

After filling the next generation with complete fronts, if the last front F l cannot fit entirely, niching selection is performed:
1. Normalization: Objectives are normalized using the ideal and nadir points.
2. Association: Each solution is associated with the nearest reference point.
3. Niche count: The number of solutions associated with each reference point is counted.
4. Selection: Solutions are iteratively selected from reference points with the smallest niche count, choosing the solution with the minimum perpendicular distance to the reference line.

2.6. Adaptive Diversity Management

To prevent premature convergence and maintain population diversity, we propose an adaptive diversity management mechanism that monitors population diversity and injects new solutions when necessary.

2.6.1. Diversity Measurement

Population diversity is measured using the average normalized Hamming distance:
$$D = \frac{2}{N(N-1)\,n} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{Hamming}(s_i, s_j)$$
where $N$ is the population size, $n$ is the chromosome length, and $\mathrm{Hamming}(s_i, s_j)$ is the Hamming distance between solutions $s_i$ and $s_j$.
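The diversity measure can be computed pairwise over the binary population matrix; this numpy sketch (function name ours) follows the equation directly:

```python
import numpy as np

def population_diversity(pop):
    """Average normalized pairwise Hamming distance of a binary
    population `pop` of shape (N, n)."""
    N, n = pop.shape
    total = 0
    for i in range(N - 1):
        # Hamming distance from solution i to every later solution
        total += np.abs(pop[i + 1:] - pop[i]).sum()
    return 2.0 * total / (N * (N - 1) * n)
```

D ranges from 0 (all chromosomes identical) to 1 (every pair differs in every bit).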

2.6.2. Adaptive Mutation Rate

The mutation rate is dynamically adjusted based on population diversity:
$$p_m^{adapt} = \begin{cases} 3 \times p_m & \text{if } D < D_{low} \\ 1.5 \times p_m & \text{if } D_{low} \leq D < D_{med} \\ p_m & \text{otherwise} \end{cases}$$
where $p_m$ is the base mutation rate, and $D_{low} = 0.3$ and $D_{med} = 0.5$ are diversity thresholds.
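This piecewise rule translates directly; the default thresholds follow the values stated above:

```python
def adaptive_mutation_rate(p_m, diversity, d_low=0.3, d_med=0.5):
    """Scale the base mutation rate up as population diversity drops."""
    if diversity < d_low:
        return 3.0 * p_m
    if diversity < d_med:
        return 1.5 * p_m
    return p_m
```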

2.6.3. Diversity Injection

When both low diversity ($D < D_{low}$) and stagnation (no improvement for $\tau$ iterations) are detected, the algorithm injects $\alpha \times N$ new random solutions, where $\alpha = 0.2$ is the injection ratio. The worst $\alpha \times N$ solutions are removed to maintain the population size, preventing the algorithm from being trapped in local optima.
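A minimal sketch of the injection step. For simplicity it assumes a scalar quality score per solution; the paper instead removes the worst solutions by non-dominated rank:

```python
import numpy as np

def inject_diversity(pop, scores, alpha, rng):
    """Replace the worst alpha*N solutions with fresh random binary
    chromosomes, keeping the population size fixed."""
    N, n = pop.shape
    k = int(alpha * N)
    worst = np.argsort(scores)[:k]  # lowest scores = worst solutions
    pop = pop.copy()
    pop[worst] = rng.integers(0, 2, size=(k, n))
    return pop
```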

2.7. MOHHO-NSGA-III Algorithm

Algorithm 1 presents the complete pseudo-code for the proposed MOHHO-NSGA-III algorithm. The algorithm begins by initializing a population of binary chromosomes and generating reference points. In each iteration, offspring are generated using HHO operators, followed by crossover and adaptive mutation. The combined parent and offspring populations undergo NSGA-III environmental selection to form the next generation. Diversity management ensures exploration capability throughout the optimization process. The algorithm terminates after reaching the maximum number of iterations and returns the first Pareto front containing non-dominated solutions.
Algorithm 1 Pseudo-code for MOHHO-NSGA-III
Define:
N: population size,
T_max: maximum number of iterations,
p_c: crossover rate,
p_m: base mutation rate,
H: number of reference points,
k: number of folds for cross-validation.
Initialization:
Generate population P_0 of size N with random binary chromosomes.
Generate H reference points Z using the Das–Dennis method.
Evaluate fitness: F(s) = {Accuracy, Feature ratio, Stability} using k-fold CV.
Perform non-dominated sorting and assign ranks.
 1: while t < T_max do
 2:   // Generate offspring using HHO
 3:   for i = 1 to N do
 4:     Calculate energy: E = 2 E_0 (1 − t/T_max), where E_0 ∈ [−1, 1]
 5:     Select parents p_1, p_2 randomly from P_t
 6:     if |E| ≥ 1 then  // Exploration
 7:       o_1 ← HHO_Exploration(p_1, P_t)
 8:     else  // Exploitation
 9:       o_1 ← HHO_Exploitation(p_1, P_t, E)
10:     end if
11:     Crossover: o_1, o_2 ← Crossover(o_1, p_2, p_c)
12:     Calculate diversity D and compute: p_m^adapt ← f(D, p_m)
13:     if rand(0.0, 1.0) < p_m^adapt then
14:       Mutation: o_1, o_2 ← Mutate(o_1, o_2, p_m^adapt)
15:     end if
16:     Evaluation: Evaluate F(o_1) and F(o_2) using k-fold CV
17:     Add o_1 and o_2 to offspring population Q_t
18:   end for
19:   // NSGA-III Selection
20:   R_t ← P_t ∪ Q_t    // Combine parent and offspring
21:   Sort R_t into non-dominated fronts F_1, F_2, …
22:   P_{t+1} ← ∅
23:   while |P_{t+1}| + |F_i| ≤ N do
24:     P_{t+1} ← P_{t+1} ∪ F_i
25:     i ← i + 1
26:   end while
27:   if |P_{t+1}| < N then
28:     Normalize objectives in the last front F_l
29:     Associate solutions with reference points Z
30:     Select remaining solutions using niching
31:   end if
32:   // Diversity Management
33:   if diversity D < D_threshold and no improvement for τ iterations then
34:     Inject α·N random solutions into P_{t+1}
35:     Remove worst α·N solutions from P_{t+1}
36:   end if
37:   t ← t + 1
38: end while
39: Return: Pareto front F_1 from P_T
MOHHO-NSGA-III offers several advantages over single-objective methods. First, it produces a Pareto front of solutions rather than a single answer, letting decision-makers choose based on whether they prioritize accuracy, interpretability, or stability. Second, HHO's energy-based transitions provide a natural exploration–exploitation balance without manual parameter tuning. Third, NSGA-III's reference points maintain diversity across the entire objective space, not just at the extremes. Finally, our adaptive diversity approach detects when the population becomes too similar and injects random solutions to prevent stagnation; this proved especially important in later iterations, when standard MOEAs often converge prematurely.

3. Experimental Setup

This section describes the datasets, algorithm configuration, and evaluation methodology used to validate MOHHO-NSGA-III.

3.1. Datasets

We evaluated MOHHO-NSGA-III on two datasets from the UCI Machine Learning Repository [47]:
  • Portuguese: A total of 649 students, with 30 features covering demographics, family background, and school-related attributes. Target is pass/fail based on final grade (pass = grade ≥ 10/20).
  • Mathematics: A total of 395 students with the same 30 features as the Portuguese dataset.
Features include age, sex, home location (urban/rural), parents’ education levels, family size, study time, past failures, absences, and social factors like frequency of going out and romantic relationships. We label-encoded categorical variables and applied min–max normalization to [0, 1] for numerical features.

3.2. Algorithm Configuration

Table 1 shows the parameter settings we used. We chose these values based on preliminary tests and standard recommendations from the HHO [34] and NSGA-III [31] literature.

3.3. Evaluation Methodology

We designed the evaluation to ensure valid statistical comparisons:
  • Inner optimization: During optimization, MOHHO-NSGA-III evaluates solutions via 10-fold stratified CV on the training data (80%). For fitness evaluation, we chose k-NN (k = 5), primarily because it is fast. As a lazy learner, k-NN needs no training, just $O(m \log m)$ sorting per fold. Given that we evaluate 50 solutions across 100 iterations with 10-fold CV, this results in around 50,000 evaluations. Using SVM instead would require $O(m^2)$ training each time, making the whole optimization painfully slow. The results in Section 4.3.3 confirm that features optimized with k-NN perform well on all five classifiers we tested, so the speed–accuracy tradeoff works out.
  • Feature selection: The algorithm identifies subsets optimizing accuracy, feature count, and stability simultaneously.
  • Final evaluation: We evaluated the selected feature subsets on the held-out test data using five different classifiers: k-NN, DT, NB, SVM, and LDA.
  • Statistical testing: A total of 21 independent trials per experiment were performed, each with a different random seed. Performance is reported as mean ± standard deviation, and we use Wilcoxon signed-rank tests to determine whether differences between methods are statistically significant (threshold p < 0.05).
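The Wilcoxon signed-rank statistic behind these tests can be sketched in numpy; in practice one would use a library routine (e.g., scipy.stats.wilcoxon) to also obtain the p-value, and the helper name below is ours:

```python
import numpy as np

def wilcoxon_statistic(a, b):
    """Paired Wilcoxon signed-rank statistic W: rank the nonzero
    |differences| (ties receive average ranks) and return the smaller
    of the positive- and negative-rank sums."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0]                              # drop zero differences
    ranks = np.empty(len(d))
    ranks[np.abs(d).argsort()] = np.arange(1, len(d) + 1)
    for v in np.unique(np.abs(d)):             # average ranks over ties
        tied = np.abs(d) == v
        ranks[tied] = ranks[tied].mean()
    return min(ranks[d > 0].sum(), ranks[d < 0].sum())
```

A small W (most rank mass on one side) is what yields the low p-values reported in Section 4.4.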

4. Results and Analysis

This section presents comprehensive experimental results including convergence behavior, feature-selection efficacy, classification performance, and comparative analysis with state-of-the-art methods.

4.1. Convergence Analysis

Figure 3 shows how MOHHO-NSGA-III converges over 100 iterations. The algorithm improves rapidly during the first 30 iterations, then gradually refines solutions as it approaches the optimum.
Key observations:
  • Early exploration (iterations 1–30): Population diversity stays high (0.45–0.65) as the algorithm explores broadly. The best accuracy climbs from 0.72 to 0.82.
  • Exploitation (iterations 31–70): Diversity drops to 0.30–0.40 as solutions converge toward promising regions. The best accuracy reaches 0.85.
  • Refinement (iterations 71–100): The algorithm fine-tunes solutions, with occasional diversity injections (visible as spikes in the right panel) to escape local optima. The best accuracy stabilizes around 0.867.
  • Diversity management: The proposed algorithm automatically injected several new solutions three times (e.g., iterations 42, 68, 89) when diversity dropped below our 0.3 threshold. This process will keep the search from getting stuck.
The population started by selecting 63% of features on average but ended at 43% in the final Pareto front, cutting features substantially while keeping accuracy high.

4.2. Feature-Selection Results

4.2.1. Pareto Front Analysis

On average, our runs returned 18 non-dominated solutions, with a standard deviation of 2.1 across the 21 runs. These results highlight a diverse range of trade-offs between model accuracy, feature count, and stability. The 3D Pareto front in Figure 4 illustrates these relationships: while high-accuracy configurations typically require 15 to 17 features, we also identified leaner models using only 9 to 11 features that prioritize interpretability at a slight cost to performance.
Three representative solutions appear in Table 2.
The highest-accuracy solution scored 0.867 using 13 features, a 57% reduction from the original dataset. Interestingly, shifting to a more balanced solution cost only 0.015 in accuracy while trimming the model by a further two features. For applications prioritizing simplicity, the minimal solution of just 9 features (a 70% reduction) still maintained a robust accuracy of 0.834, demonstrating the interpretability gains achievable with minimal impact on performance.

4.2.2. Feature Importance Analysis

Table 3 shows the 15 most frequently selected features across all Pareto-optimal solutions from our 21 runs.
Analysis of feature-selection patterns shows that academic history was most predictive, as past failures (97.3%), absences (94.1%), and study time (89.5%) appeared in nearly all solutions. Family background also proved influential, with parental education (mother 91.8%, father 83.6%) and family support (76.3%) selected consistently. Lifestyle factors demonstrated meaningful predictive value, including frequency of going out (87.2%), romantic involvement (72.9%), and alcohol consumption (69.5%). Student aspirations mattered independently: the desire for higher education appeared in 85.9% of solutions. By contrast, demographic attributes such as school name, sex, address type, and internet access were rarely selected (below 30%), suggesting limited additional predictive power once academic and family factors are accounted for.

4.3. Classification Performance

4.3.1. Portuguese Language Dataset

Table 4 compares MOHHO-NSGA-III against the baseline approach (using all 30 features) across five classifiers on the Portuguese dataset.
MOHHO-NSGA-III outperformed the baseline across all classifiers, achieving average improvements of 4.42% in AUC and 4.63% in accuracy. Moreover, statistical testing confirmed all improvements as significant (p < 0.01 for every classifier), indicating a robust performance gain across different learning paradigms.
Key findings:
  • Best performance: SVM achieved the highest AUC (0.8634) and accuracy (0.8367) with selected features, outperforming the baseline by 3.88% and 4.00%, respectively.
  • Largest improvement: kNN showed the most substantial gains (+5.09% AUC, +5.46% accuracy), suggesting that dimensionality reduction particularly benefits distance-based methods.
  • Consistent gains: All classifiers benefited from feature selection, with improvements ranging from 3.33% (Naive Bayes) to 5.09% (kNN) in AUC.

4.3.2. Mathematics Dataset

Table 5 presents results for the Mathematics dataset, validating the generalizability of MOHHO-NSGA-III.
The Mathematics dataset results corroborate our findings from the Portuguese dataset:
  • MOHHO-NSGA-III selected 11 features (63.3% reduction) while improving average AUC by 4.51% and accuracy by 4.70%.
  • All five classifiers showed statistically significant improvements ( p < 0.01 ).
  • The smaller dataset (395 vs. 649 students) benefited even more from feature selection, with average improvements of 4.51% (vs. 4.42% for Portuguese).

4.3.3. Classifier Transferability Analysis

A practical question is whether features selected using k-NN transfer well to other classifiers. Table 4 and Table 5 suggest they do: the selected features often perform better on non-k-NN classifiers than on k-NN itself.
The features selected using k-NN during optimization demonstrated strong transferability across all five classifiers. For example, SVM achieved the highest AUC (0.8634) despite using kernel methods fundamentally different from k-NN's distance-based approach, Naive Bayes reached 0.8567 AUC even with independence assumptions absent in k-NN, and Decision Trees attained 0.8291 AUC using split-based learning. This cross-classifier robustness occurs because the selected features (failures, absences, study time, and parental education) capture fundamental predictive relationships rather than classifier-specific patterns. The stability objective, which optimizes cross-validation consistency across data splits, appears to promote generalization beyond the training classifier. Additionally, multi-objective optimization prevents overfitting by balancing accuracy against feature reduction, discouraging the selection of features that only perform well with k-NN.

4.4. Statistical Significance Analysis

We ran statistical tests to confirm that MOHHO-NSGA-III significantly outperforms the baseline. Figure 5 shows AUC distributions across our 21 runs for both methods.
We used Wilcoxon signed-rank tests (appropriate for paired non-normal distributions) to assess statistical significance. Table 6 summarizes results for both datasets.
Key statistical findings:
  • Universal significance: All 10 classifier–dataset combinations (5 classifiers × 2 datasets) showed statistically significant improvements at both p < 0.05 and p < 0.01 levels.
  • Strong evidence: The extremely low p-values (average 0.0023) provide strong evidence that improvements are not due to chance.
  • Reduced variance: MOHHO-NSGA-III also reduced variance, with standard deviation decreasing by an average of 18.7% across classifiers, indicating more stable and reliable predictions.
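
The test procedure can be reproduced with SciPy’s `wilcoxon`; the per-run AUC arrays below are illustrative stand-ins for the 21 paired runs, not the measured values.

```python
# Sketch: paired Wilcoxon signed-rank test on per-run AUC scores.
# The arrays are illustrative stand-ins for the 21 paired runs.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
auc_selected = 0.845 + rng.normal(0.0, 0.017, size=21)  # MOHHO-NSGA-III
auc_baseline = 0.813 + rng.normal(0.0, 0.023, size=21)  # all 30 features

# One-sided alternative: selected-feature AUC is systematically higher.
stat, p = wilcoxon(auc_selected, auc_baseline, alternative="greater")
print(f"W = {stat:.1f}, p = {p:.4f}")
```

The signed-rank test is appropriate here because the runs are paired (same CV splits for both methods) and AUC differences need not be normally distributed.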

4.5. Comparison with State-of-the-Art Methods

Table 7 compares MOHHO-NSGA-III against nine other feature-selection methods on the Portuguese dataset, including three multi-objective algorithms. MOHHO-NSGA-III outperforms all baselines: traditional approaches (Chi-Square, Information Gain, Sequential Forward Selection, LASSO), single-objective metaheuristics such as GA and PSO, and multi-objective algorithms (NSGA-II, MOPSO, NSGA-III without HHO). Wilcoxon signed-rank tests across 21 runs confirm all improvements are statistically significant (p < 0.01). The superior performance stems from combining HHO’s exploration–exploitation balance with NSGA-III’s reference-point diversity mechanism, adaptive diversity management that avoids local optima, and explicit optimization of prediction stability, a factor most methods ignore.
Compared to multi-objective baselines, MOHHO-NSGA-III achieved gains of 1.34% over NSGA-II, 0.81% over NSGA-III, and 1.62% over MOPSO in AUC. The advantage over NSGA-II stems from reference points maintaining better Pareto coverage than crowding distance when optimizing three objectives simultaneously; NSGA-II struggles to distribute solutions evenly across all dimensions. The improvement over standard NSGA-III comes from HHO’s superior exploration–exploitation balance, as its energy parameter adapts naturally over iterations rather than relying on fixed genetic operator probabilities. MOPSO’s lower performance reflects a fundamental limitation: PSO’s continuous velocity updates require additional transfer functions for binary feature selection, whereas HHO’s discrete transitions naturally suit binary chromosomes.

4.6. Computational Complexity Analysis

4.6.1. Theoretical Complexity

To evaluate scalability, we analyzed MOHHO-NSGA-III’s computational complexity and identified potential bottlenecks. Each iteration involves four operations: fitness evaluation, non-dominated sorting, reference-point association, and HHO operators. Table 8 breaks down the complexity.
Fitness evaluation dominates, at roughly 26,000 operations per iteration on the Portuguese data. The other components add relatively little: sorting ≈ 7500 operations, association ≈ 13,650, and HHO updates ≈ 1500. Combined, these are still much smaller than fitness evaluation.
This means the bottleneck is classifier training, not the multi-objective machinery. If we need better scalability, the focus should be on reducing fitness evaluations (e.g., surrogate models, parallel evaluation, or smarter sampling) rather than optimizing the sorting or selection algorithms.
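
As a sketch of the parallel-evaluation direction, the code below evaluates a population’s fitness concurrently. The fitness function is a toy stand-in for 10-fold cross-validated classifier training; for real CPU-bound training, a process pool (or joblib) would replace the thread pool used here.

```python
# Sketch: parallelizing fitness evaluation, the dominant cost per iteration.
# The fitness function is a toy surrogate, not the paper's CV evaluation.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fitness(mask):
    # Toy surrogate: reward low-index "informative" features, penalize size.
    idx = np.flatnonzero(mask)
    return float(np.exp(-idx / 5.0).sum() - 0.05 * idx.size)

def evaluate_population(population, workers=4):
    # Each solution's fitness is independent, so evaluation is
    # embarrassingly parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))

rng = np.random.default_rng(1)
population = rng.integers(0, 2, size=(50, 30))   # N = 50 binary chromosomes
scores = evaluate_population(population)
print(len(scores))
```

Because evaluations are independent, the speedup is close to linear in the number of workers until classifier training saturates the available cores.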

4.6.2. Comparative Analysis

Table 9 compares theoretical complexity across methods.
Key observation: MOHHO-NSGA-III has identical asymptotic complexity to NSGA-III. The HHO operators add negligible overhead ( O ( N × d ) ≪ O ( N × k × m × log m ) ).

4.6.3. Empirical Runtime Analysis

Table 10 shows actual runtimes on Portuguese data (Intel Core i7-9700K, 16 GB RAM).

4.6.4. Cost–Benefit Analysis

Compared to alternatives, MOHHO-NSGA-III requires approximately 9000 times longer than filter methods but achieves 4% higher AUC, an acceptable trade-off given that feature selection is performed once offline. Against GA, the proposed approach is 3.7 times slower but delivers a complete Pareto front revealing accuracy–parsimony trade-offs plus 3.2% better peak AUC. The 18% runtime increase over standard NSGA-III is justified by the 1.5% AUC improvement from HHO operators and diversity injection. This computational cost is acceptable for educational applications since feature selection occurs once while models remain deployed for months or years. Moreover, deployment benefits are substantial: using 13 features instead of 30 reduces prediction time by 57% and requires less student data collection. For larger datasets, runtime can be reduced through parallel fitness evaluation (4× speedup on 4 cores), fewer CV folds (k = 5 provides 2× speedup with minimal accuracy loss), or early stopping when convergence plateaus. The higher computational cost is justified by accuracy gains and the comprehensive Pareto front, which are critical for educational decision-making where prediction quality affects real student outcomes.

4.7. Sensitivity Analysis

We tested how sensitive MOHHO-NSGA-III is to parameter choices.

4.7.1. Diversity Threshold Sensitivity

Our diversity mechanism uses two thresholds: D l o w = 0.3 triggers injection, and D m e d = 0.5 adjusts mutation rates. We systematically tested five different threshold values to validate these choices (Table 11).
Sensitivity analysis revealed that overly aggressive thresholds (0.2, 0.4) triggered diversity injection an average of 8.3 times per run, disrupting convergence by encouraging excessive exploration at the expense of exploiting good solutions, resulting in 2% accuracy loss. Conversely, conservative thresholds (0.4, 0.6) barely triggered injections (0.4 per run), allowing populations to stagnate in local optima and reducing performance by 0.7%. The optimal configuration (0.3, 0.5) balanced exploration and exploitation with approximately three injections per run, maintaining diversity when needed while enabling convergence to high-quality solutions. We selected these thresholds based on their superior accuracy while preserving population diversity.
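
A minimal sketch of this threshold logic follows, assuming diversity is measured as the mean pairwise Hamming distance normalized by the chromosome length d; the injection and mutation multipliers are illustrative assumptions, not the algorithm’s exact values.

```python
# Sketch: diversity measurement and the two-threshold reaction logic.
# Multipliers for mutation adaptation are illustrative only.
import numpy as np

D_LOW, D_MED = 0.3, 0.5          # thresholds used in the paper

def diversity(pop):
    """Mean normalized Hamming distance over all solution pairs."""
    P = np.asarray(pop, dtype=float)              # shape (N, d)
    n, d = P.shape
    ones = P.sum(axis=1)
    shared = P @ P.T                              # pairwise dot products
    ham = ones[:, None] + ones[None, :] - 2 * shared  # binary Hamming
    iu = np.triu_indices(n, k=1)
    return ham[iu].mean() / d

def react(pop, base_pm=0.01):
    d = diversity(pop)
    if d < D_LOW:                 # stagnation: inject random solutions
        return "inject", base_pm * 5
    if d < D_MED:                 # moderate loss: raise mutation rate
        return "adapt", base_pm * 2
    return "none", base_pm

rng = np.random.default_rng(2)
pop = rng.integers(0, 2, size=(50, 30))
print(react(pop))
```

A fully random population sits near diversity 0.5, so neither trigger fires aggressively; a converged (near-identical) population drops toward 0 and triggers injection.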

4.7.2. Reference-Point Density

We varied the partition parameter p from 6 to 16 to see how reference-point density affects performance (Table 12).
Sparse reference-point configurations ( p ≤ 8 ) with only 28–45 points provided poor coverage of the 3D objective space, causing solutions to cluster in some regions while leaving gaps, which resulted in low hypervolume (<0.82) and reduced AUC. Conversely, overly dense configurations ( p ≥ 14 ) with 120–153 points gave minimal improvement (only + 0.003 hypervolume) while increasing niching computation time by 32%, demonstrating diminishing returns. The optimal configuration of p = 12 , generating 91 reference points, achieved the best balance with superior AUC (0.8485) and strong hypervolume (0.891). This value aligns with recommendations from Deb et al. [31] for three-objective problems and proved most effective in our experiments.
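
These reference-point counts follow the Das–Dennis construction [46], which places C(p + M − 1, M − 1) structured points on the unit simplex; for M = 3 objectives and p = 12 partitions this gives C(14, 2) = 91. A minimal sketch:

```python
# Sketch: Das-Dennis structured reference points on the unit simplex.
# For M = 3, p = 12 this yields C(14, 2) = 91 points, matching the
# configuration selected in the paper.
from itertools import combinations
from math import comb

def reference_points(M, p):
    """All points with coordinates i/p summing to 1 (stars and bars)."""
    points = []
    # Choose M-1 "bar" positions among p + M - 1 slots.
    for bars in combinations(range(p + M - 1), M - 1):
        prev, coords = -1, []
        for b in bars:
            coords.append((b - prev - 1) / p)   # stars since previous bar
            prev = b
        coords.append((p + M - 2 - prev) / p)   # stars after the last bar
        points.append(tuple(coords))
    return points

pts = reference_points(M=3, p=12)
print(len(pts), comb(14, 2))   # -> 91 91
```

Each point defines a niche direction; during environmental selection, solutions are associated with their nearest reference direction, which is what spreads the population evenly across the front.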

4.8. Ablation Study

To verify that each component actually helps, we ran ablation experiments, systematically removing components and measuring the resulting performance loss.

4.8.1. Ablation Variants

We tested six configurations:
  • Full MOHHO-NSGA-III: Complete algorithm as described.
  • No HHO: Just NSGA-III with standard genetic operators (uniform crossover, bit-flip mutation).
  • No NSGA-III: HHO operators with simple fitness-based selection instead of reference points.
  • No diversity injection: Disabled adaptive diversity management.
  • Fixed mutation: Constant p m = 0.01 instead of adapting to diversity.
  • Two objectives: Only accuracy and features, no stability objective.
Each variant ran 21 times with an identical experimental setup.

4.8.2. Results

Table 13 shows performance averaged across all five classifiers on Portuguese data.
Ablation analysis demonstrated that each component contributes significantly to performance. Removing HHO operators and reverting to standard crossover/mutation reduced AUC by 2.96% and increased feature count to 15.2 (versus 13.0 for the full algorithm), showing that HHO’s energy-based exploration–exploitation transitions provide superior search guidance compared to random genetic operators. NSGA-III selection proved most critical; its removal caused the largest performance drop of 3.88% and increased features to 16.1, demonstrating that reference-point diversity is essential for maintaining a diverse Pareto front and avoiding premature convergence.
Moreover, disabling diversity injection cost 2.04% AUC while maintaining similar feature counts (13.1), indicating that strategic injection helps escape local optima when diversity drops, affecting solution quality rather than exploration breadth. Adaptive mutation, though having the smallest individual impact ( 1.39 % ), still proved significant, as fixed mutation rates cannot adjust to search dynamics, with other mechanisms only partially compensating for its absence. Removing the stability objective reduced AUC by 2.31% despite selecting fewer features (11.9 versus 13.0), revealing that two-objective optimization finds features with acceptable average performance but high cross-validation variance, whereas the stability objective promotes generalizable, robust feature subsets. All performance degradations were statistically significant ( p < 0.01 ), confirming that the combination of HHO, NSGA-III, adaptive diversity management, and three-objective optimization justifies the algorithm’s complexity.

5. Discussion

This section presents the discussion of results, examining educational insights, algorithmic performance, generalizability, and limitations of the proposed approach.

5.1. Educational Insights

The selected features offer practical guidance for teachers and educational leaders. Past performance emerged as the strongest predictor, with previous failures and absences appearing in 97% and 94% of solutions, respectively, suggesting early intervention should target students with failure histories or attendance issues. Family environment proved influential, as parental education and family support consistently ranked high, indicating that engaging families could substantially improve outcomes. Notably, study time (89.5% selection) proved far more predictive than demographics such as age or gender, supporting interventions based on effort and habits rather than fixed characteristics. Student aspirations also mattered—those wanting higher education appeared in 85.9% of solutions, suggesting that fostering educational goals could boost achievement.

5.2. Algorithmic Performance

MOHHO-NSGA-III’s strong performance stems from several design choices. HHO’s energy parameter naturally transitions from broad exploration to focused exploitation, covering the search space thoroughly while refining solutions efficiently. Reference points maintain diversity across the entire Pareto front, providing decision-makers with genuine trade-off options rather than single solutions from weighted-sum methods. Adaptive mechanisms—diversity-responsive mutation and strategic injection—prevent premature convergence, while the stability objective promotes generalization by explicitly optimizing cross-validation consistency. Despite three objectives, the algorithm remains computationally tractable through efficient sorting and selective evaluation.
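
The energy-driven transition mentioned above follows HHO’s escaping-energy schedule from Heidari et al. [34]: E = 2·E0·(1 − t/T), with E0 drawn uniformly from (−1, 1) each iteration. Since |E| ≥ 1 selects exploration and |E| < 1 exploitation, exploration becomes impossible once t > T/2. A minimal sketch:

```python
# Sketch: HHO escaping-energy schedule (Heidari et al.). |E| >= 1 selects
# exploration, |E| < 1 exploitation, so later iterations only exploit.
import random

def escaping_energy(t, T, rng=random):
    E0 = rng.uniform(-1, 1)            # initial energy, redrawn each iteration
    return 2 * E0 * (1 - t / T)

random.seed(0)
T = 100
phases = ["explore" if abs(escaping_energy(t, T)) >= 1 else "exploit"
          for t in range(T)]
# |E| <= 2 * (1 - t/T), which drops below 1 once t exceeds T/2.
print(phases.count("explore"), phases.count("exploit"))
```

This schedule is what replaces fixed crossover/mutation probabilities in the hybrid: the exploration share shrinks smoothly with iteration count instead of being a tuned constant.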

5.3. Generalizability

Consistent performance across datasets and classifiers suggests broad applicability. Success on both Portuguese and Math courses indicates effectiveness across academic domains, while improvements across all five classifiers—distance-based (kNN), tree-based (DT), probabilistic (NB), linear (LDA), and kernel-based (SVM)—demonstrate that selected features capture fundamental patterns rather than classifier-specific artifacts. The approach proved robust across different dataset sizes (649 vs. 395 samples) and varying difficulty levels. While our evaluation uses two UCI educational datasets, the experimental design provides strong generalizability evidence through multiple validation dimensions: two academic subjects with different populations, varying dataset sizes and difficulty levels, and five classifiers spanning diverse learning paradigms.

5.4. Practical Implications

For practitioners building student performance prediction systems, focusing on the 13 most important features rather than collecting all 30 reduces survey burden while maintaining accuracy. The compact feature set enables lightweight early warning systems that flag at-risk students without excessive data overhead. Feature importance analysis suggests targeting attendance and study habits may yield larger gains than demographic-focused programs. For deployment, SVM or Naive Bayes with selected features offer the best accuracy–interpretability balance.

5.5. Limitations and Future Work

Several limitations suggest future research directions. Our approach treats features independently; exploring interaction effects could improve performance. Student performance evolves over time, so longitudinal studies with temporal feature selection could capture dynamic patterns. While we identify predictive factors, combining our approach with causal inference methods could strengthen practical guidance. Ensuring fairness across student subgroups requires explicit fairness constraints in future work. Finally, testing whether features selected at one institution transfer to others could enable rapid deployment across educational systems without retraining.

6. Conclusions

This paper introduced MOHHO-NSGA-III, a multi-objective feature-selection algorithm that combines HHO with NSGA-III for student performance prediction. The approach simultaneously balances classification accuracy, feature parsimony, and cross-validation stability, producing a Pareto front that reveals the trade-offs between these objectives. Experimental evaluation on the UCI Student Performance datasets demonstrated statistically significant improvements over nine baseline methods while reducing features from 30 to 13 without sacrificing predictive accuracy.
Future work should explore feature interactions through ensemble methods, temporal modeling to capture performance evolution across semesters, and cross-institutional validation to enable rapid deployment. Integrating explicit fairness constraints and causal inference methods would ensure equitable predictions and identify interventions that can improve student outcomes.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MOHHO: Multi-Objective Harris Hawks Optimization
HHO: Harris Hawks Optimization
NSGA-III: Non-dominated Sorting Genetic Algorithm III
NSGA-II: Non-dominated Sorting Genetic Algorithm II
MOPSO: Multi-Objective Particle Swarm Optimization
MOO: Multi-Objective Optimization
FS: Feature Selection
EDM: Educational Data Mining
ML: Machine Learning
kNN: k-Nearest Neighbors
DT: Decision Tree
NB: Naive Bayes
SVM: Support Vector Machine
LDA: Linear Discriminant Analysis
PSO: Particle Swarm Optimization
GA: Genetic Algorithm
EA: Evolutionary Algorithm
CV: Cross-Validation
AUC: Area Under the Curve
ROC: Receiver Operating Characteristic
UCI: University of California, Irvine

References

  1. Liu, M.; Yu, D. Towards intelligent E-learning systems. Educ. Inf. Technol. 2023, 28, 7845–7876. [Google Scholar] [CrossRef]
  2. Chen, Z.; Cen, G.; Wei, Y.; Li, Z. Student performance prediction approach based on educational data mining. IEEE Access 2023, 11, 131260–131272. [Google Scholar] [CrossRef]
  3. Yu, S.; Cai, Y.; Pan, B.; Leung, M.-F. Semi-supervised feature selection of educational data mining for student performance analysis. Electronics 2024, 13, 659. [Google Scholar] [CrossRef]
  4. Malik, S.; Jothimani, K.; Ujwal, U.J. Advancing educational data mining for enhanced student performance prediction: A fusion of feature selection algorithms and classification techniques with dynamic feature ensemble evolution. Sci. Rep. 2025, 15, 92324. [Google Scholar] [CrossRef]
  5. Yağcı, M. Educational data mining: Prediction of students’ academic performance using machine learning algorithms. Smart Learn. Environ. 2022, 9, 11. [Google Scholar] [CrossRef]
  6. Ahmed, E. Student performance prediction using machine learning algorithms. Appl. Comput. Intell. Soft Comput. 2024, 2024, 4067721. [Google Scholar] [CrossRef]
  7. Wang, J.; Yu, Y. Machine learning approach to student performance prediction of online learning. PLoS ONE 2025, 20, e0299018. [Google Scholar] [CrossRef] [PubMed]
  8. Alhazmi, E.; Sheneamer, A. Early predicting of students performance in higher education. IEEE Access 2023, 11, 27579–27589. [Google Scholar] [CrossRef]
  9. Mastour, H.; Dehghani, T.; Moradi, E.; Eslami, S. Early prediction of medical students’ performance in high-stakes examinations using machine learning approaches. Heliyon 2023, 9, e18248. [Google Scholar] [CrossRef]
  10. Bujang, S.D.A.; Selamat, A.; Ibrahim, R.; Krejcar, O.; Herrera-Viedma, E.; Fujita, H.; Ghani, N.A.M. Multiclass prediction model for student grade prediction using machine learning. IEEE Access 2021, 9, 95608–95621. [Google Scholar] [CrossRef]
  11. Adewale, M.D.; Azeta, A.; Abayomi-Alli, A.; Sambo-Magaji, A. Empirical investigation of multilayered framework for predicting academic performance in open and distance learning. Electronics 2024, 13, 2808. [Google Scholar] [CrossRef]
  12. Liu, D.; Huang, R.; Wosinski, M. Multiple features fusion attention mechanism enhanced deep knowledge tracing for student performance prediction. IEEE Access 2020, 8, 194894–194903. [Google Scholar] [CrossRef]
  13. Francis, B.K.; Babu, S.S. Predicting academic performance of students using a hybrid data mining approach. J. Med. Syst. 2019, 43, 162. [Google Scholar] [CrossRef]
  14. Bharara, S.; Sabitha, S.; Bansal, A. Application of learning analytics using clustering data mining for students’ disposition analysis. Educ. Inf. Technol. 2018, 23, 957–984. [Google Scholar] [CrossRef]
  15. Romero, C.; Ventura, S. Educational data mining: A review of the state of the art. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 2010, 40, 601–618. [Google Scholar] [CrossRef]
  16. García, E.; Romero, C.; Ventura, S.; de Castro, C. A collaborative educational association rule mining tool. Internet High. Educ. 2011, 14, 77–88. [Google Scholar] [CrossRef]
  17. Giannakas, F.; Kambourakis, G.; Papasalouros, A.; Gritzalis, S. A critical review of 13 years of mobile game-based learning. Educ. Technol. Res. Dev. 2021, 69, 111–134. [Google Scholar] [CrossRef]
  18. Xu, H.; Kim, M. Combination prediction method of students’ performance based on ant colony algorithm. PLoS ONE 2024, 19, e0300010. [Google Scholar] [CrossRef]
  19. Lau, E.T.; Sun, L.; Yang, Q. Modelling, prediction and classification of student academic performance using artificial neural networks. SN Appl. Sci. 2019, 1, 982. [Google Scholar] [CrossRef]
  20. Cheng, B.; Liu, Y.; Jia, Y. Evaluation of students’ performance during the academic period using the XG-Boost classifier-enhanced AEO hybrid model. Expert Syst. Appl. 2024, 238, 122136. [Google Scholar] [CrossRef]
  21. Malik, S.; Jothimani, K. Enhancing student success prediction with featurex: A fusion voting classifier algorithm with hybrid feature selection. Educ. Inf. Technol. 2024, 29, 8741–8791. [Google Scholar] [CrossRef]
  22. Zaffar, M.; Hashmani, M.; Savita, K.S. Performance analysis of feature selection algorithm for educational data mining. In Proceedings of the 2017 IEEE International Conference on Big Data Analytics and Applications, Boston, MA, USA, 11–14 December 2017; pp. 7–12. [Google Scholar] [CrossRef]
  23. Senthil, S.; Lin, W.M. Applying classification techniques to predict students’ academic results. In Proceedings of the 2017 IEEE International Conference on Current Trends in Advanced Computing, Bangalore, India, 2–3 March 2017; pp. 1–6. [Google Scholar]
  24. Ghareb, A.S.; Bakar, A.A.; Hamdan, A.R. Hybrid feature selection based on enhanced genetic algorithm for text categorization. Expert Syst. Appl. 2016, 49, 31–47. [Google Scholar] [CrossRef]
  25. Maldonado, S.; López, J. Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification. Appl. Soft Comput. 2018, 67, 94–105. [Google Scholar] [CrossRef]
  26. Liang, J.; Zhang, Y.; Chen, K.; Qu, B.; Yue, C.; Yu, K. An evolutionary multiobjective method based on dominance and decomposition for feature selection in classification. Sci. China Inf. Sci. 2024, 67, 120101. [Google Scholar] [CrossRef]
  27. Nguyen, B.H.; Xue, B.; Andreae, P.; Ishibuchi, H.; Zhang, M. Multiple reference points-based decomposition for multiobjective feature selection in classification: Static and dynamic mechanisms. IEEE Trans. Evol. Comput. 2020, 24, 170–184. [Google Scholar] [CrossRef]
  28. Cheng, F.; Cui, J.; Wang, Q.; Zhang, L. A variable granularity search-based multiobjective feature selection algorithm for high-dimensional data classification. IEEE Trans. Evol. Comput. 2023, 27, 266–280. [Google Scholar] [CrossRef]
  29. Dowlatshahi, M.B.; Hashemi, A. Multi-objective optimization for feature selection: A review. In Applied Multi-Objective Optimization; Springer: Berlin/Heidelberg, Germany, 2024; pp. 155–170. [Google Scholar] [CrossRef]
  30. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
  31. Deb, K.; Jain, H. An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, Part I: Solving problems with box constraints. IEEE Trans. Evol. Comput. 2014, 18, 577–601. [Google Scholar] [CrossRef]
  32. Zhang, Q.; Li, H. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 2007, 11, 712–731. [Google Scholar] [CrossRef]
  33. Almutairi, M.S. Evolutionary multi-objective feature selection algorithms on multiple smart sustainable community indicator datasets. Sustainability 2024, 16, 1511. [Google Scholar] [CrossRef]
  34. Heidari, A.A.; Mirjalili, S.; Faris, H.; Aljarah, I.; Mafarja, M.; Chen, H. Harris hawks optimization: Algorithm and applications. Future Gener. Comput. Syst. 2019, 97, 849–872. [Google Scholar] [CrossRef]
  35. Liu, J.; Feng, H.; Tang, Y.; Zhang, L.; Qu, C.; Zeng, X.; Peng, X. A novel hybrid algorithm based on Harris Hawks for tumor feature gene selection. PeerJ Comput. Sci. 2023, 9, e1229. [Google Scholar] [CrossRef]
  36. Ouyang, C.; Liao, C.; Zhu, D.; Zheng, Y.; Zhou, C.; Li, T. Integrated improved Harris hawks optimization for global and engineering optimization. Sci. Rep. 2024, 14, 7445. [Google Scholar] [CrossRef]
  37. Zamani, H.; Nadimi-Shahraki, M.H. An evolutionary crow search algorithm equipped with interactive memory mechanism to optimize artificial neural network for disease diagnosis. Biomed. Signal Process. Control 2024, 90, 105879. [Google Scholar] [CrossRef]
  38. Yang, N.; Tang, Z.; Cai, X.; Chen, L.; Hu, Q. Cooperative multi-population Harris Hawks optimization for many-objective optimization. Complex Intell. Syst. 2022, 8, 3299–3332. [Google Scholar] [CrossRef]
  39. Choo, Y.H.; Cai, Z.; Le, V.; Johnstone, M.; Creighton, D.; Lim, C.P. Enhancing the Harris’ Hawk optimiser for single- and multi-objective optimisation. Soft Comput. 2023, 27, 16675–16715. [Google Scholar] [CrossRef]
  40. Tian, Y.; Li, X.; Ma, H.; Zhang, X.; Tan, K.C.; Jin, Y. Deep reinforcement learning based adaptive operator selection for evolutionary multi-objective optimization. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 7, 1051–1064. [Google Scholar] [CrossRef]
  41. Li, K.; Fialho, Á.; Kwong, S.; Zhang, Q. Adaptive operator selection with bandits for a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 2014, 18, 114–130. [Google Scholar] [CrossRef]
  42. Zhan, Z.-H.; Li, J.; Kwong, S.; Zhang, J. Learning-aided evolution for optimization. IEEE Trans. Evol. Comput. 2023, 27, 1794–1808. [Google Scholar] [CrossRef]
  43. Zhou, L.; Feng, L.; Tan, K.C.; Zhong, J.; Zhu, Z.; Liu, K.; Chen, C. Toward adaptive knowledge transfer in multifactorial evolutionary computation. IEEE Trans. Cybern. 2021, 51, 2563–2576. [Google Scholar] [CrossRef] [PubMed]
  44. Kudo, M.; Sklansky, J. Comparison of algorithms that select features for pattern classifiers. Pattern Recognit. 2000, 33, 25–41. [Google Scholar] [CrossRef]
  45. Turabieh, H.; Mafarja, M.; Li, X. Iterated feature selection algorithms with layered recurrent neural network for software fault prediction. Expert Syst. Appl. 2019, 122, 27–42. [Google Scholar] [CrossRef]
  46. Das, I.; Dennis, J.E. Normal-boundary intersection: A new method for generating the Pareto surface in nonlinear multicriteria optimization problems. SIAM J. Optim. 1998, 8, 631–657. [Google Scholar] [CrossRef]
  47. Cortez, P.; Silva, A.M.G. Using data mining to predict secondary school student performance. In Proceedings of the 5th Annual Future Business Technology Conference, Porto, Portugal, 9–11 April 2008; pp. 5–12. [Google Scholar]
  48. Xue, B.; Zhang, M.; Browne, W.N.; Yao, X. A survey on evolutionary computation approaches to feature selection. IEEE Trans. Evol. Comput. 2016, 20, 606–626. [Google Scholar] [CrossRef]
  49. Ling, Q.; Li, Z.; Liu, W.; Shi, J.; Han, F. Multi-objective particle swarm optimization based on particle contribution and mutual information for feature selection method. J. Supercomput. 2025, 81, 255. [Google Scholar] [CrossRef]
Figure 1. Proposed MOHHO-NSGA-III approach for multi-objective feature selection.
Figure 2. Solution representation as a binary vector for feature selection.
Figure 3. Convergence behavior over 100 iterations. (Left) Best accuracy shows rapid improvement during early iterations (1–30), followed by an exploitation phase (31–70), then refinement with occasional diversity injection (71–100). The shaded area shows standard deviation across 21 runs. (Right) Population diversity measured by average Hamming distance. The spikes at iterations 42, 68, and 89 are diversity injections triggered when the population became too homogeneous; these successfully restored exploration capability without destroying good solutions.
Figure 4. Pareto front in objective space: accuracy (the higher the better), features (fewer is better), stability (the higher the better). Points are non-dominated solutions averaged across runs, colored by accuracy (blue = lower, red = higher). The front stretches from accurate-but-complex models (13–15 features, 0.86+ accuracy) to simple-but-slightly-worse models (9–11 features, 0.83 accuracy). Decision-makers pick based on their accuracy vs. interpretability priorities.
Figure 5. AUC distributions for five classifiers over 21 independent runs. (Left) Portuguese dataset. (Right) Mathematics dataset. MOHHO-NSGA-III (blue) consistently beats baseline with all features (red) across all classifiers, with tighter distributions showing more stable performance. Note the smaller interquartile ranges for MOHHO-NSGA-III; our selected features produce more consistent predictions.
Table 1. Algorithm parameters for MOHHO-NSGA-III.
Parameter | Value
Population size (N) | 50
Max iterations (T_max) | 100
Crossover rate (p_c) | 0.8
Base mutation rate (p_m) | 0.01
Objectives (M) | 3
Reference-point partitions | 12
CV folds | 10
Low diversity threshold | 0.3
Medium diversity threshold | 0.5
Injection ratio (α) | 0.25
Stagnation threshold (τ) | 10 iterations
Table 2. Three solutions from the Pareto front (Portuguese).
Type | Accuracy | # Features | Reduction | Stability
Max accuracy | 0.867 ± 0.015 | 13 | 56.7% | 0.923 ± 0.012
Balanced | 0.852 ± 0.018 | 11 | 63.3% | 0.931 ± 0.010
Min features | 0.834 ± 0.021 | 9 | 70.0% | 0.918 ± 0.014
Table 3. Feature-selection frequency (Portuguese dataset).
Rank | Feature | Frequency | Category
1 | failures (past course failures) | 97.3% | Academic
2 | absences (days absent) | 94.1% | Academic
3 | Medu (mother’s education) | 91.8% | Family
4 | studytime (weekly study hours) | 89.5% | Academic
5 | goout (going out with friends) | 87.2% | Social
6 | higher (wants higher ed) | 85.9% | Academic
7 | Fedu (father’s education) | 83.6% | Family
8 | age | 81.4% | Demographic
9 | schoolsup (extra academic support) | 78.7% | Support
10 | famsup (family educational support) | 76.3% | Support
11 | romantic (in relationship) | 72.9% | Social
12 | Dalc (weekday alcohol consumption) | 69.5% | Social
13 | health (current health status) | 67.1% | Personal
14 | reason (why they chose this school) | 64.8% | Motivation
15 | famrel (quality of family relationships) | 62.3% | Family
Table 4. Classification performance comparison on the Portuguese dataset (21 runs, 10-fold CV).
Classifier | AUC (MOHHO-NSGA-III, 13 features) | Accuracy (MOHHO-NSGA-III) | AUC (Baseline, 30 features) | Accuracy (Baseline) | p-Value
kNN | 0.8423 ± 0.0187 | 0.8156 ± 0.0201 | 0.8015 ± 0.0245 | 0.7734 ± 0.0267 | 0.0012 **
Decision Tree | 0.8291 ± 0.0215 | 0.8023 ± 0.0234 | 0.7889 ± 0.0289 | 0.7612 ± 0.0312 | 0.0034 **
Naive Bayes | 0.8567 ± 0.0156 | 0.8289 ± 0.0178 | 0.8234 ± 0.0198 | 0.7956 ± 0.0221 | 0.0008 **
SVM | 0.8634 ± 0.0142 | 0.8367 ± 0.0165 | 0.8312 ± 0.0187 | 0.8045 ± 0.0203 | 0.0019 **
LDA | 0.8512 ± 0.0167 | 0.8234 ± 0.0189 | 0.8178 ± 0.0212 | 0.7901 ± 0.0234 | 0.0025 **
Average | 0.8485 ± 0.0173 | 0.8214 ± 0.0193 | 0.8126 ± 0.0226 | 0.7850 ± 0.0247 | 0.0020 **
** Statistically significant at p < 0.01 (Wilcoxon signed-rank test, one-sided).
Table 5. Classification performance comparison on Mathematics dataset (21 runs, 10-fold CV).
Classifier | AUC (MOHHO-NSGA-III, 11 features) | Accuracy (MOHHO-NSGA-III) | AUC (Baseline, 30 features) | Accuracy (Baseline) | p-Value
kNN | 0.8234 ± 0.0212 | 0.7967 ± 0.0234 | 0.7823 ± 0.0267 | 0.7545 ± 0.0289 | 0.0017 **
Decision Tree | 0.8089 ± 0.0245 | 0.7812 ± 0.0267 | 0.7678 ± 0.0301 | 0.7401 ± 0.0323 | 0.0042 **
Naive Bayes | 0.8378 ± 0.0189 | 0.8101 ± 0.0212 | 0.8067 ± 0.0223 | 0.7789 ± 0.0245 | 0.0011 **
SVM | 0.8456 ± 0.0167 | 0.8178 ± 0.0189 | 0.8134 ± 0.0201 | 0.7856 ± 0.0223 | 0.0028 **
LDA | 0.8323 ± 0.0198 | 0.8045 ± 0.0221 | 0.7989 ± 0.0234 | 0.7712 ± 0.0256 | 0.0033 **
Average | 0.8296 ± 0.0202 | 0.8021 ± 0.0225 | 0.7938 ± 0.0245 | 0.7661 ± 0.0267 | 0.0026 **
** Statistically significant at p < 0.01 (Wilcoxon signed-rank test, one-sided).
Table 6. Statistical significance summary across datasets and classifiers.
| Dataset | Significant at p < 0.01 | Significant at p < 0.05 | Average p-Value |
|---|---|---|---|
| Portuguese | 5/5 (100%) | 5/5 (100%) | 0.0020 |
| Mathematics | 5/5 (100%) | 5/5 (100%) | 0.0026 |
| Combined | 10/10 (100%) | 10/10 (100%) | 0.0023 |
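All of these comparisons use the one-sided Wilcoxon signed-rank test over paired per-run scores. As a rough illustration of the mechanics only, the sketch below implements the test with the large-sample normal approximation (no tie or continuity corrections, and it assumes at least one nonzero paired difference); it is not the exact procedure used in the paper:

```python
import math

def wilcoxon_signed_rank(x, y):
    """One-sided Wilcoxon signed-rank test (normal approximation).

    Tests whether paired scores in `x` tend to exceed those in `y`.
    Zero differences are discarded; tied |differences| share average ranks.
    Returns (W+, one-sided p-value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j + 2) / 2  # mean of 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    # Sum of ranks belonging to positive differences
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper-tail p-value
    return w_plus, p
```

In practice a library routine (e.g., `scipy.stats.wilcoxon` with `alternative="greater"`) would be used; the sketch just makes the rank-and-sum logic visible.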
Table 7. Comparison with state-of-the-art feature-selection methods (Portuguese dataset, kNN classifier, 21 runs).
| Method | Features | Accuracy | AUC | Type | p-Value |
|---|---|---|---|---|---|
| Baseline (All Features) | 30 | 0.7734 | 0.8015 | — | <0.001 ** |
| Filter: Chi-Square | 15 | 0.7845 | 0.8098 | Filter | <0.001 ** |
| Filter: Information Gain | 14 | 0.7923 | 0.8156 | Filter | <0.001 ** |
| Wrapper: SFS | 12 | 0.7967 | 0.8201 | Wrapper | <0.001 ** |
| Embedded: LASSO | 16 | 0.7889 | 0.8134 | Embedded | <0.001 ** |
| GA-based FS [48] | 14 | 0.7978 | 0.8223 | Metaheuristic | <0.001 ** |
| PSO-based FS [48] | 15 | 0.8012 | 0.8267 | Metaheuristic | <0.001 ** |
| NSGA-II [30] | 13 | 0.8089 | 0.8312 | Multi-objective | <0.001 ** |
| NSGA-III [31] | 12 | 0.8123 | 0.8356 | Multi-objective | 0.003 ** |
| MOPSO [49] | 14 | 0.8067 | 0.8289 | Multi-objective | <0.001 ** |
| MOHHO-NSGA-III (Proposed) | 13 | 0.8156 | 0.8423 | Multi-objective | — |
| Improvement over best baseline | — | +0.33% | +0.67% | — | — |
** Statistically significant at p < 0.01 (Wilcoxon signed-rank test vs. MOHHO-NSGA-III).
Table 8. Computational complexity per iteration.
| Component | Complexity | Notes |
|---|---|---|
| Fitness evaluation | O(N × k × m log m) | N = 50 solutions, k = 10 CV folds; kNN sorts m ≈ 520 training samples per fold |
| Non-dominated sorting | O(M × N²) | Fast non-dominated sorting [30]; M = 3 objectives, N = 50 solutions |
| Reference association | O(N × H × M) | Match N = 50 solutions to H = 91 reference points in M = 3 dimensions via perpendicular distance |
| HHO operations | O(N × d) | Position updates + binary conversion for N = 50 solutions, d = 30 features |
| Per iteration | O(N k m log m + M N² + N H M + N d) | Sum of the components above |
| Full run | O(T_max × N × k × m log m) | T_max = 100 iterations |
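Substituting the constants above into each term shows why fitness evaluation dominates the per-iteration cost. The back-of-the-envelope sketch below is illustrative only (constant factors are ignored and the log base is arbitrary; the short keys `fitness`, `sorting`, etc. are not from the paper):

```python
import math

# Constants from the per-iteration analysis: N solutions, k CV folds,
# m training samples per fold (~520 on the Portuguese dataset),
# M objectives, H reference points, d features.
N, k, m, M, H, d = 50, 10, 520, 3, 91, 30

terms = {
    "fitness":     N * k * m * math.log2(m),  # O(N*k*m*log m) evaluation
    "sorting":     M * N ** 2,                # O(M*N^2) non-dominated sorting
    "association": N * H * M,                 # O(N*H*M) reference association
    "hho":         N * d,                     # O(N*d) HHO position updates
}

# Print the terms from largest to smallest
for name, ops in sorted(terms.items(), key=lambda t: -t[1]):
    print(f"{name:>11}: ~{ops:,.0f} ops")
```

The evaluation term comes out two orders of magnitude larger than the other three combined, which is consistent with the full-run bound keeping only that term.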
Table 9. Theoretical complexity comparison.
| Method | Complexity | Notes |
|---|---|---|
| Chi-Square | O(n × m) | Linear in features and samples |
| Information Gain | O(n × m log m) | Includes sorting |
| LASSO | O(n × m × I_lasso) | I_lasso: LASSO iterations |
| GA (single-obj.) | O(T × N × k × m log m) | Same evaluation cost |
| PSO (single-obj.) | O(T × N × k × m log m) | Same evaluation cost |
| NSGA-II | O(T × N × k × m log m + M N² log N) | Additional sorting for crowding |
| NSGA-III | O(T × N × k × m log m + M N²) | Reference-point association |
| MOPSO | O(T × N × k × m log m + N²) | Archive management |
| MOHHO-NSGA-III | O(T × N × k × m log m + M N²) | Same as NSGA-III |
| Exhaustive Search | O(2^n × k × m log m) | Exponential (intractable) |
Table 10. Runtime comparison on Portuguese dataset (100 iterations).
| Method | Time | AUC | vs. MOHHO | Notes |
|---|---|---|---|---|
| Chi-Square | 0.3 s | 0.8098 | 5424× faster | Fast but low accuracy |
| Information Gain | 0.2 s | 0.8156 | 8136× faster | Fast but low accuracy |
| GA | 732 s (12.2 min) | 0.8223 | 3.71× faster | |
| PSO | 654 s (10.9 min) | 0.8267 | 4.15× faster | |
| NSGA-II | 2292 s (38.2 min) | 0.8312 | 1.18× faster | |
| NSGA-III | 2382 s (39.7 min) | 0.8356 | 1.14× faster | |
| MOPSO | 2136 s (35.6 min) | 0.8289 | 1.27× faster | |
| MOHHO-NSGA-III | 2712 s (45.2 min) | 0.8485 | baseline | Best AUC |
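The "vs. MOHHO" column is just the ratio of each method's runtime to the 2712 s of MOHHO-NSGA-III. Recomputing it reproduces the reported factors for the metaheuristic baselines to within rounding:

```python
# Reported runtimes (seconds) for 100 iterations on the Portuguese dataset;
# MOHHO-NSGA-III is the reference point.
MOHHO_SECONDS = 2712
runtimes = {
    "GA": 732, "PSO": 654, "NSGA-II": 2292, "NSGA-III": 2382, "MOPSO": 2136,
}

# Each method's speed advantage over MOHHO-NSGA-III
speedups = {name: MOHHO_SECONDS / secs for name, secs in runtimes.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x faster than MOHHO-NSGA-III")
```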
Table 11. Impact of diversity thresholds (Portuguese, 21 runs each).
| D_low | D_med | AUC | Accuracy | Diversity | Injections |
|---|---|---|---|---|---|
| 0.20 | 0.40 | 0.8312 ± 0.022 | 0.8045 ± 0.024 | 0.24 ± 0.08 | 8.3 ± 2.1 |
| 0.25 | 0.45 | 0.8389 ± 0.021 | 0.8112 ± 0.023 | 0.28 ± 0.07 | 5.1 ± 2.1 |
| 0.30 | 0.50 | 0.8485 ± 0.019 | 0.8214 ± 0.020 | 0.35 ± 0.09 | 2.8 ± 1.3 |
| 0.35 | 0.55 | 0.8467 ± 0.020 | 0.8198 ± 0.021 | 0.42 ± 0.11 | 1.2 ± 0.8 |
| 0.40 | 0.60 | 0.8423 ± 0.021 | 0.8167 ± 0.022 | 0.48 ± 0.13 | 0.4 ± 0.5 |
Table 12. Effect of reference-point configuration (Portuguese, 21 runs).
| p | # Points (H) | AUC | Pareto Size | Hypervolume |
|---|---|---|---|---|
| 6 | 28 | 0.8234 ± 0.023 | 12.3 ± 2.1 | 0.743 |
| 8 | 45 | 0.8356 ± 0.021 | 15.7 ± 2.4 | 0.812 |
| 10 | 66 | 0.8423 ± 0.020 | 17.2 ± 2.3 | 0.867 |
| 12 | 91 | 0.8485 ± 0.019 | 18.3 ± 2.1 | 0.891 |
| 14 | 120 | 0.8478 ± 0.019 | 19.1 ± 2.5 | 0.894 |
| 16 | 153 | 0.8467 ± 0.020 | 19.4 ± 2.8 | 0.892 |
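With M = 3 objectives, the reference-point counts above follow the standard Das–Dennis construction used by NSGA-III, H = C(p + M − 1, M − 1), where p is the number of divisions per objective. A quick check:

```python
from math import comb

def reference_points(p: int, m: int = 3) -> int:
    """Das-Dennis reference-point count: H = C(p + m - 1, m - 1)."""
    return comb(p + m - 1, m - 1)

for p in (6, 8, 10, 12, 14, 16):
    print(f"p = {p:2d} -> H = {reference_points(p)}")
```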
Table 13. Ablation results (Portuguese, average over 5 classifiers, 21 runs).
| Variant | AUC | Accuracy | Features | Loss | p-Value |
|---|---|---|---|---|---|
| Full MOHHO-NSGA-III | 0.8485 | 0.8214 | 13.0 | — | — |
| No HHO | 0.8234 | 0.7956 | 15.2 | −2.96% | <0.001 ** |
| No NSGA-III | 0.8156 | 0.7889 | 16.1 | −3.88% | <0.001 ** |
| No diversity injection | 0.8312 | 0.8045 | 13.1 | −2.04% | <0.001 ** |
| Fixed mutation | 0.8367 | 0.8123 | 13.2 | −1.39% | 0.002 ** |
| Two objectives | 0.8289 | 0.8001 | 11.9 | −2.31% | <0.001 ** |
** Significant at p < 0.01 (Wilcoxon test vs. full algorithm).

Share and Cite

MDPI and ACS Style

Al-Milli, N. Multi-Objective Harris Hawks Optimization with NSGA-III for Feature Selection in Student Performance Prediction. Computers 2026, 15, 112. https://doi.org/10.3390/computers15020112
