Article

EHCFE: Enhanced Hierarchical Clustering with Feature Engineering for Automating Labeling of Student Performance and Dropout Prediction

by
Nusaybah Alghanmi
Department of Information Technology, College of Computing and Information Technology at Khulais, University of Jeddah, Jeddah 21959, Saudi Arabia
Electronics 2026, 15(6), 1265; https://doi.org/10.3390/electronics15061265
Submission received: 21 January 2026 / Revised: 8 March 2026 / Accepted: 13 March 2026 / Published: 18 March 2026
(This article belongs to the Special Issue AI-Driven Data Analytics and Mining)

Abstract

Educational success is a critical component of societal development, yet increasing student dropout rates present ongoing challenges. While supervised learning models are commonly used for dropout prediction, they rely on manually labeled data, a process that is time-consuming and dependent on expert annotation. Unsupervised learning models, such as clustering approaches, have been explored as an alternative; however, existing methods typically group students based on activity patterns without generating binary outcome labels such as dropout or success. Furthermore, their effectiveness often depends heavily on the quality of the selected features, and most current solutions utilize only limited or pre-structured subsets of institutional data. This paper addresses these challenges by proposing EHCFE (Enhanced Hierarchical Clustering with Feature Engineering) to automatically generate binary labels from unlabeled educational datasets. EHCFE applies feature engineering by generating new features from the top-ranked features identified during feature selection while retaining the original feature set, thereby improving the quality of the labeling outcomes. The approach is evaluated on three datasets and compared with current and state-of-the-art models using several evaluation metrics, including the F1 score, the area under the receiver operating characteristic curve (AUC), and the silhouette coefficient. Experimental results show that EHCFE achieves the highest F1 scores (0.709 and 0.28) and AUC values (0.766 and 0.81) on two datasets. A ranking analysis across six evaluation metrics demonstrates that EHCFE outperforms existing models, achieving the highest average ranks of 1.50 and 1.83 on two datasets and a competitive rank of 1.92 on the third.

1. Introduction

Academic success is a crucial component of future-proofing for both individuals and governments, informing job opportunities and economic growth. Dropout is considered one of the many challenges facing institutions and universities. Dropout refers to students who leave an institution without obtaining a degree, and can be further defined by timing as either late or early [1]. Dropout can also be considered from a micro- or macro-perspective, where the former includes changes of field and/or institution, and the latter considers only students who leave the education system without a degree [2]. Many factors influence students' dropout decisions, with a primary cause often correlating with socioeconomic status [3]. Even in the highest-performing countries, such as Denmark, only 80% of students graduate [3].
Predicting student performance and dropout rates is crucial at the higher education level. It can assist institutions and universities in improving educational outcomes, reducing financial costs, and even providing personalized support and interventions. A considerable amount of data on student performance is generated and made available, which can be challenging to analyze manually. Educational data mining therefore offers a valuable method to analyze and predict student performance and behavior [4]. It utilizes machine learning algorithms, which first take students' data with the required features or attributes as input and then apply a preprocessing step, such as feature selection. Features are selected either based on domain knowledge or using methods such as variance thresholds. A machine learning model is then applied, which can either be trained on labeled data to predict students' performance statuses, known as supervised learning, or group similar students together based on unlabeled data, known as unsupervised learning.
Current works primarily focus on supervised models that aim to improve classification performance, such as [5,6,7,8]. Supervised learning models classify data based on historically labeled training data, which may not always be available and requires an expert to label it, a process that is time-consuming and costly. Thus, unsupervised learning models have gained attention from many authors for this task due to their focus on unlabeled data. Current works are based either on single unsupervised models, such as clustering algorithms, or on combinations with other approaches, such as supervised learning models. During preprocessing, feature selection may require an expert to rate feature importance. It is vital to emphasize key features and allow new features to emerge to train the model, a practice known as feature engineering [9].
This work proposes enhanced hierarchical clustering with feature engineering to label and group student performance data as either dropout or successful (e.g., graduated or continuing). The proposed EHCFE approach first selects the top t important features using random forest and variance, and then derives new features from these top t features using a simple ratio. Finally, a hierarchical clustering algorithm is applied to label the student data as either dropout or successful.

Contribution

Recent research primarily focuses on categorizing students based on engagement patterns [10] or grouping them into behavioral clusters using performance and activity features [11,12]. For instance, students are often grouped as active versus non-active learners depending on their assignment submission behaviors. While such clustering is useful for describing engagement styles, these approaches do not directly support a binary identification task that distinguishes between students who ultimately dropout and those who succeed. In this regard, manual labeling of student outcomes is labor-intensive and often impractical. Furthermore, existing studies employ feature selection for clustering tasks [10,11], which may require a domain expert to determine and classify features by importance, potentially overlooking relationships essential for accurate dropout/success labeling.
To address these limitations, this work introduces an Enhanced Hierarchical Clustering with Feature Engineering (EHCFE) approach that automatically labels students as dropout or successful without requiring manual annotation or expert-driven feature selection, thereby resolving key challenges present in prior studies. In this approach, engineered features are integrated with hierarchical clustering to enable binary outcome labeling and to ensure that the most relevant features associated with student status are effectively utilized. This integration provides a structured framework for transforming unlabeled educational datasets into labeled datasets with binary labels, offering a capability not supported by existing clustering methods and improving the identification of at-risk students. The main contributions of this paper are as follows:
  • The proposed EHCFE approach automates the labeling of students' performance data to facilitate performance and dropout prediction;
  • Two feature selection methods, random forest and variance, are applied to identify the top t important features;
  • Feature engineering is incorporated into EHCFE to create new features from the top t important features, improving the labeling process;
  • EHCFE, which employs hierarchical clustering, is applied to three datasets and compared against current works and state-of-the-art techniques using several evaluation metrics that measure prediction performance and clustering quality.
In the remainder of this paper, Section 2 reviews related work on student performance and dropout prediction, while Section 3 introduces the proposed model. Section 4 outlines the experimental setup, including descriptions of the datasets and evaluation metrics. Section 5 presents and discusses the results, and Section 6 provides an ablation study. Finally, Section 7 concludes the paper.

2. Related Works

This section presents previous works employing single unsupervised approaches or hybrid approaches that combine supervised and unsupervised methods.

2.1. Single Unsupervised Approaches

This section reviews single unsupervised approaches, such as those using only a clustering algorithm, applied to student performance and dropout prediction. Oeda et al. [13] aimed to predict which students would complete a programming lesson by identifying outlier groups. They used UNIX command logs, a form of time-series data, to classify 39 students in a programming class. They combined K-medoids and K-means++ clustering with dynamic time warping (DTW) to measure time-series similarity.
Palani et al. [14] proposed a clustering-based model to label low-engagement students who did not complete a course. They utilized data from a UK e-learning system known as Open University Learning Analytics (OULA). Three different algorithms were used: k-Prototype, hierarchical clustering, and Gaussian Mixture (GM). The number of clusters was determined by the gap statistic, which identified three clusters as optimal. Different preprocessing steps were applied to handle missing data and encode categorical data. The results showed that k-Prototype outperformed the other algorithms.
Valles-Coral et al. [15] applied three clustering methods, density-based spatial clustering of applications with noise (DBSCAN), K-means, and hierarchical DBSCAN (HDBSCAN), to group data based on psychological tests of the dropout risk level. The models were applied to 670 undergraduate students from the National University of San Martín. Various preprocessing steps were applied, including normalization. The results showed that HDBSCAN was the best model, based on internal metrics, such as silhouette, and compared to DBSCAN and K-means.

2.2. Hybrid Approaches

This section presents studies that combine supervised and unsupervised methods, either by integrating clustering or other unsupervised algorithms with a classification algorithm, or by analyzing data separately with both approaches. Nafuri et al. [11] used clustering methods to group students into five clusters based on performance indicators, such as activities or CGPA. The first step involved preprocessing and feature selection. They then used three clustering methods: K-means, DBSCAN, and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH). The class labels generated by K-means were used to train three classification algorithms: artificial neural network (ANN), decision tree, and random forest. The data was collected from the Ministry of Higher Education and comprised 53 features and 248,568 rows. The results determined that K-means with supervised feature extraction performed best, while ANN produced the best classification results compared to the other methods.
Pecuchova et al. [12] proposed a clustering model to group a large dataset into four clusters based on students' activities. The dataset was collected from a Moodle Learning Management System. They applied several preprocessing steps, including dimensionality reduction using principal component analysis (PCA). For the clustering task, different algorithms, such as GM, DBSCAN, and BIRCH, were used. The results found that BIRCH was able to classify the students based on their activities. Furthermore, they developed supervised machine learning models to predict student cluster membership, enabling early interventions.
Ghosh et al. [16] proposed a model to track and monitor the educational landscape at the state and county levels. They used a dataset of student dropout rates by state for the years 2018–2019 and 2020–2021 [17]. They then applied two steps: first, they trained a recurrent neural network (RNN) on sequence data to define dropout labels and identify cluster-wise reasons for dropout and rejoining; after this, they used agglomerative hierarchical clustering to merge the results of the different clusters.
Alameri [10] applied hierarchical clustering with average linkage (HC-Avg) to group data into three clusters (according to the elbow method) based on engagement profiles, such as attendance type. Additionally, she applied five classifiers with feature engineering to both binary and multiclass tasks. She used basic feature engineering by aggregating semester-level academic metrics (e.g., average grades, approved units), and advanced feature engineering by creating derived features that incorporated the behavioral, sociodemographic, institutional, and macroeconomic contexts of the students, such as efficiency ratios and parental academic weightings. This study utilized data of various types, including academic background and demographics.
Kim et al. [18] proposed a hybrid model designed to predict those university students at risk of dropping out with high precision and recall. Feature selection was performed using SHAP (SHapley Additive Explanations) and permutation importance, followed by data preprocessing techniques to handle data imbalances. Additionally, the model combined different machine learning algorithms, specifically XGBoost and CatBoost. The authors used data from Gyeongsang National University, South Korea, with 27 features selected from an initial pool of 40. The proposed model achieved the highest precision and recall compared to ensemble methods or single methods, such as ANN, logistic regression, and gradient boosting. Furthermore, PCA and K-means clustering were employed to categorize reasons for dropouts into four categories: Employment, Did Not Register, Personal Issues, and Admitted to Other University.

3. The Proposed Enhanced Hierarchical Clustering with Feature Engineering Approach (EHCFE)

This work proposes EHCFE to label and group students based on their academic status, across three phases, as shown in Figure 1. In the first step, the top t features are selected based on variance and random forest, then these features are used to derive new features with feature engineering, using a ratio. Finally, hierarchical clustering is employed to label and cluster the students’ data, classifying it into either dropout or success. The input to the EHCFE model is unlabeled student data in numeric form, and the output is the labeling of students as success or dropout. The details for each phase are explained as follows.
  • Finding the top t important features: The first step in the EHCFE approach is to select the top t important features by combining two methods, variance and random forest (RF), where t ranges from 3 to 5 features. Variance measures the variability of the data around the mean: low variance indicates that attribute values are close to each other, while high variance means that the feature is informative and has high cardinality [11]. The variance results are ranked from highest to lowest, where rank 1 denotes the most important feature.
    RF is also applied; it works as an ensemble of decision trees to make predictions with greater accuracy [19]. Two measures are used to assess feature importance in RF: mean decrease in accuracy and the Gini index. A rank is computed for each measure, and the average of the two ranks is then taken. The Gini index measures each decision tree's degree of node impurity, while mean decrease in accuracy measures the drop in performance when a feature's values are randomly shuffled in the dataset.
    RF is a supervised machine learning model that requires labeled data for training [19], and the data inputted into EHCFE is unlabeled; thus, K-means clustering is used to generate synthetic labels. K-means is selected because it is a simple and widely used clustering algorithm. Combined with RF, this strategy enables the identification of the most relevant features within an unlabeled dataset by leveraging task-specific information provided through the synthetic labels. Z-score normalization is applied prior to clustering, as shown in the following equation, as it is more robust to outlier values [20].
    $v_i' = \dfrac{v_i - \bar{X}}{\sigma_X}$
    where $v_i'$ is the normalized value of $v_i$, and $\bar{X}$ and $\sigma_X$ are the mean and standard deviation of feature $X$.
    By combining the ranks obtained from RF and variance, the top t important features are selected to derive new features, where rank 1 denotes the most important feature. Further analysis is performed in the ablation study (Section 6) to examine the effect of using only the top t important features when training the clustering algorithm.
  • Deriving new features for feature engineering: New features are derived by selecting the top-ranked t features from the previous step, where multiple methods exist for deriving new features [9]. In this work, the simple ratio is selected as it is a widely used and well-established approach in feature engineering [9]. A simple ratio is effective in revealing the relationship between two features while also being computationally inexpensive and easy to interpret. For example, a ratio such as “Feature A divided by Feature B” provides a straightforward proportional comparison between the two variables and remains easy to understand. To construct these ratio-based features, one important feature is divided by another important feature from the selected top-ranked set. The general form of a ratio-based feature is
    $f_{dn} = \dfrac{f_i}{f_j}$
    where $f_{dn}$ is the new feature, $f_i$ is the more important feature, and $f_j$ is a less important feature that follows $f_i$ in the ranking order. The feature $f_j$ may be any feature in the selected set, including $f_t$, the last feature in the ordered set of top-ranked features, subject to the requirement that $f_j \neq f_i$.
    Only a fixed number of ratio-based features is retained to avoid unnecessary feature expansion; therefore, the selection is limited to five newly created ratio-based features. Any ratio-based feature whose denominator contains many zero values is excluded to prevent unstable or uninformative outputs. Further analysis is conducted on these new features to determine whether they provide unique information; if not, they are considered highly correlated and removed. To determine this, a pairwise correlation matrix is computed using Pearson correlation, and a newly created feature is removed if its correlation coefficient with another feature exceeds the threshold of 0.95. Another analysis is performed in the ablation study (Section 6) to assess the importance of the feature engineering step in the proposed model.
  • Employing Hierarchical Clustering: An unsupervised learning model, hierarchical clustering (HC), is then applied to the data. HC establishes the similarity among data points and organizes the data in a tree or hierarchical form. It starts by treating each data point as a single cluster, then repeatedly merges each cluster with its closest cluster until one cluster (or K clusters) remains. There are several methods to measure the distance between clusters, known as linkage methods, such as complete or average linkage. In this work, Ward linkage is applied, which differs from other linkages by measuring the variance of clusters [21]: at each step, it merges the pair of clusters whose fusion produces the smallest increase in total within-cluster variance.
    In this step, all features in the dataset and the newly derived features from the previous step are used for hierarchical clustering. Before applying HC, the data is normalized using min–max normalization. Min–max normalization adjusts feature values to a common range, ensuring that all features contribute uniformly to the distance calculations used in machine learning and data analysis [20,22]. This step helps in preserving the relationships among the original feature values [23]. Min–max normalization is calculated as follows:
    $v_i' = \dfrac{v_i - X_{min}}{X_{max} - X_{min}} \cdot (X'_{max} - X'_{min}) + X'_{min}$
    where $v_i'$ is the normalized value of $v_i$, $X_{min}$ and $X_{max}$ are the minimum and maximum values of feature $X$, and $X'_{min}$ and $X'_{max}$ are the minimum and maximum values of the new range of feature $X$, which is $[0, 1]$.
    The similarity between data points $d = (d_1, d_2, \ldots, d_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is calculated using Euclidean distance, one of the most popular distance measures [24], as shown in the following equation:
    $dist(d, y) = \sqrt{\sum_{i=1}^{n} (d_i - y_i)^2}$
    where $d_1, d_2, \ldots, d_n$ and $y_1, y_2, \ldots, y_n$ represent the feature values of data points $d$ and $y$, respectively.
  • The algorithmic form of the proposed EHCFE approach is presented in Algorithm 1, which accepts unlabeled students' performance data as input and generates labeled students' performance data as output. The computational cost of EHCFE is influenced by both the number of records and the number of features in the dataset. The method includes feature selection and feature engineering steps that introduce additional processing before hierarchical clustering is applied. These steps increase the workload compared to using hierarchical clustering alone, which requires no such preprocessing. This effect is also reflected in the execution time results reported in Section 5.2.
Algorithm 1 EHCFE: Enhanced Hierarchical Clustering using Feature Engineering Algorithm.
1: Input: $U$ (unlabeled student performance data)
2: Output: $L$ (labeled student performance data)
3: Rank the important features as $V_D = \text{Variance}(U)$
4: Normalize the data $U$ as $S$ ▹ use z-score normalization, Equation (1)
5: Generate synthetic labeled data $S'$ from $S$ using K-means
6: Use $S'$ to rank the important features as $R_D = \text{RandomForest}(S')$
7: Combine the feature ranks from $V_D$ and $R_D$ to find the top $t$ important features:
8:     $F_N = \dfrac{V_D + R_D}{2}$ ▹ find the average rank for each feature
9: Apply the ratio to the top $t$ important features $F_N$:
10:     $F'_N = \text{Ratio}(F_N)$ ▹ use the simple ratio, Equation (2)
11: Combine all original features $F$ and new features $F'_N$ as $F_D$
12: Normalize the data $F_D$ as $F'_D$ ▹ use min–max normalization, Equation (3)
13: Employ hierarchical clustering on $F'_D$:
14:     $L = \text{HierarchicalClustering}(F'_D)$
15: Return: $L$
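The paper's released implementation is in R (see Section 4.2); purely for illustration, the phases of Algorithm 1 can be sketched in Python as follows. The function name, the default t = 3, the 10% zero-denominator cutoff, and the use of scikit-learn estimators are assumptions of this sketch, not details taken from the paper's code.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.ensemble import RandomForestClassifier

def ehcfe_label(X, t=3, max_ratios=5, corr_thresh=0.95, seed=0):
    """Illustrative sketch of Algorithm 1 on a numeric matrix X (rows = students)."""
    # Step 1a: rank features by variance (rank 1 = most informative).
    var_rank = np.argsort(np.argsort(-X.var(axis=0))) + 1
    # Step 1b: z-score normalize (Equation (1)), create synthetic labels
    # with K-means, then rank features by random-forest importance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    y_syn = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(Z)
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y_syn)
    rf_rank = np.argsort(np.argsort(-rf.feature_importances_)) + 1
    # Average the two rankings and keep the top-t features.
    top = np.argsort((var_rank + rf_rank) / 2)[:t]
    # Step 2: ratio features f_i / f_j (Equation (2)) between top-ranked
    # features, skipping mostly zero denominators (assumed 10% cutoff).
    ratios = []
    for a in range(t):
        for b in range(a + 1, t):
            denom = X[:, top[b]]
            if np.mean(denom == 0) > 0.1:
                continue
            ratios.append(X[:, top[a]] / np.where(denom == 0, 1e-9, denom))
    # Cap at max_ratios and drop ratios highly correlated with a kept one.
    kept = []
    for r in ratios[:max_ratios]:
        if all(abs(np.corrcoef(r, k)[0, 1]) <= corr_thresh for k in kept):
            kept.append(r)
    F = np.column_stack([X] + kept) if kept else X.copy()
    # Step 3: min-max normalize to [0, 1] (Equation (3)), then Ward-linkage
    # hierarchical clustering with Euclidean distance.
    F = (F - F.min(axis=0)) / (F.max(axis=0) - F.min(axis=0) + 1e-12)
    return AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(F)
```

The returned cluster indices still have to be mapped to the "dropout"/"success" labels, e.g., by inspecting the dominant status-related feature values in each cluster.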

4. Experimental Setting

This section presents the datasets used and the evaluation metrics employed to evaluate the proposed model.

4.1. Datasets

Three datasets, both synthetic and real and all containing ground truth labels, are used to evaluate the proposed model. The details of each dataset are as follows:
  • Joint Entrance Examination (JEE) dataset: This is a synthetic dataset that simulates the academic and behavioral aspects of students to determine whether they will drop out after class 12 [25]. The dataset consists of 15 attributes spanning numeric and categorical variables, such as JEE scores (numeric) or family income (categorical, ranging from low to high). The target variable describes the student's state, with a value of 1 indicating that the student dropped out after class 12 and 0 indicating that the student is continuing. The distribution of the data is presented in Figure 2a, with the majority of students continuing after class 12.
  • Academic success and dropout (ASD) dataset: This dataset was collected from different sources and consists of 35 numeric attributes describing different properties of students, such as academic engagement, demographic, and academic data [2,26,27]. It was collected for the academic years 2008/2009 to 2018/2019 and covers different majors, such as nursing and technology. The dataset consists of 4424 records, with the target values "dropout", "enrolled", and "graduate". In this paper, only the "dropout" and "graduate" class labels are considered, as enrolled students' statuses can later become either "dropout" or "graduate". The distribution of the data is presented in Figure 2b, with the majority of students having successfully completed their studies and graduated.
  • The Engineering and Educational Sciences (EES) dataset: This dataset was collected in 2019 from the Faculty of Engineering and the Faculty of Educational Sciences [28,29]. It comprises 145 records with 33 features covering various personal, educational, and family information about the students. The original target value is a grade, so the target has been modified for a binary task by retaining the class "fail" and categorizing the remaining grades as "pass". The distribution of the data is presented in Figure 2c, with the majority of students having successfully completed and passed their courses.
As the proposed model and other machine learning algorithms require data to be in numerical form, all categorical or nominal variables must be mapped to numeric values prior to clustering and classification [12,30]. Accordingly, categorical variables in the JEE and EES datasets were converted to numerical representations, whereas the ASD dataset was already provided in numeric form. In the remainder of this paper, the classes “pass”, “graduate” or “continuing” are indicated by the “success” class label, whereas “fail” or “dropout” are denoted by the “dropout” class label.
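As a minimal sketch of this categorical-to-numeric mapping, an ordinal variable such as family income can be encoded as follows; the category names and ordering below are hypothetical, not taken from the actual datasets.

```python
# Hypothetical ordinal mapping for a categorical attribute such as family
# income in the JEE dataset (the category names here are assumptions).
income_map = {"low": 0, "medium": 1, "high": 2}

def encode_column(values, mapping):
    """Map categorical strings to integers so clustering can use them."""
    return [mapping[v] for v in values]

encoded = encode_column(["low", "high", "medium"], income_map)  # [0, 2, 1]
```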

4.2. Evaluation Metrics

The EHCFE algorithm was implemented, and the results were generated using the R language (version 4.4.2), where the code is available at the following GitHub link (https://github.com/nrghanmi/EHCFE (accessed on 12 March 2026)). The proposed approach is evaluated using clustering evaluation metrics to measure prediction performance and quality of clustering, either dependent on ground truth or independent of ground truth.

4.2.1. Ground Truth Dependent Metrics

The prediction performance and clustering quality are evaluated by comparing the clustering results with ground truth labels, which measures how well the clusters correspond to the predefined labels. The F1 score, the area under the receiver operating characteristic (ROC) curve (AUC), and the adjusted Rand index (ARI) are used to evaluate the clustering results.
  • F1 score: This measures the agreement between the labels produced by the clustering algorithm and the ground truth, as the harmonic mean of recall and precision. Precision measures the proportion of correctly predicted dropouts among all students predicted to have dropout status. Recall measures the proportion of actual dropouts that were correctly predicted; higher values are better. The F1 score, precision, and recall are calculated as follows:
    $F1 = \dfrac{2 \cdot Recall \cdot Precision}{Recall + Precision}$
    $Precision = \dfrac{TP}{TP + FP}$
    $Recall = \dfrac{TP}{TP + FN}$
    where TP is the number of students with actual dropout status who were predicted as dropout. FP is the number of students predicted as dropout who actually have successful status, while FN is the number of actual dropouts who were predicted as having successful status.
  • Area under the receiver operating characteristic (ROC) curve (AUC): The ROC is a probability curve that shows the trade-off between recall (on the y-axis) and the false positive rate (FPR) (on the x-axis). AUC measures the area under this curve, ranging over [0, 1], where higher values are better. FPR measures the proportion of students incorrectly predicted as dropout whose actual status is successful:
    $FPR = \dfrac{FP}{FP + TN}$
  • Adjusted Rand index (ARI): This adjusts the Rand index (RI) for chance agreement [31], where RI calculates the proportion of agreement or similarity between clusterings, i.e., it measures the correct decisions made by the clustering method compared to the ground truth. ARI ignores differences in label names between the predicted and actual labels. Its range is [−1, 1], where values near −1 indicate disagreement, 0 corresponds to random labeling, and 1 indicates perfect agreement. RI and ARI are calculated as follows:
    $RI = \dfrac{TP + TN}{TP + TN + FP + FN}$
    $ARI = \dfrac{RI - ERI}{\max(RI) - ERI}$
    where TN is the number of students with successful status who were predicted as successful, ERI is the expected value of RI under random clustering, and max(RI) is the maximum value of RI, which is usually 1.
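The counts and formulas above can be sketched in a few lines of Python (the function names are illustrative; the paper's own implementation is in R):

```python
def confusion_counts(y_true, y_pred, positive="dropout"):
    """Count TP, FP, FN, TN, treating the dropout class as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def f1_score(y_true, y_pred, positive="dropout"):
    """Harmonic mean of precision and recall, as defined above."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def rand_index(y_true, y_pred, positive="dropout"):
    """Proportion of agreement (RI), following the paper's formulation."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + tn + fp + fn)
```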

4.2.2. Ground Truth Independent Metrics

These metrics measure clustering quality by examining the intrinsic properties of the data and the clustering results without using external or ground truth labels; they are known as internal validation metrics. They usually measure compactness and separation, combined as either a ratio or a summation [32]. Compactness measures how similar and close data points are to one another within their own cluster, whereas separation measures how distinct the data points in a cluster are from those belonging to other clusters.
The Calinski–Harabasz index, silhouette coefficient, and Davies–Bouldin index are used to evaluate the clustering results; these are well-known methods used widely in the literature [11,12,14]. One reason for these selections is that the silhouette coefficient can handle noise that may be present in the data [32]. The Calinski–Harabasz index, however, is affected by skewed distributions, i.e., when some clusters have far fewer points than others [32].
  • The Calinski–Harabasz index (CHI) [33] (higher is better) measures the ratio of between-cluster separation to within-cluster compactness, where a higher value indicates better-defined and well-separated clusters.
  • The silhouette coefficient (SC) [34] (higher is better) also measures how well data points belong to their own cluster (compactness) compared to other clusters (separation) by taking the pairwise difference, and has a range of [−1, 1], where −1 indicates that data points are mismatched to their own cluster, and 1 indicates that data points are well matched to their own cluster.
  • The Davies–Bouldin index (DBI) [35] (lower is better) is found by calculating, for each cluster, its similarity to every other cluster and assigning the highest such value to that cluster. The DBI is obtained by averaging these values over all clusters, where a lower value indicates better-defined and well-separated clusters.
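As a sanity check of these three internal metrics, assuming scikit-learn is available, two tight and well-separated synthetic clusters should score well on all of them:

```python
import numpy as np
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic clusters of 30 points each.
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)

sc = silhouette_score(X, labels)          # higher is better, range [-1, 1]
dbi = davies_bouldin_score(X, labels)     # lower is better
chi = calinski_harabasz_score(X, labels)  # higher is better
```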

5. Results and Discussion

The proposed model, EHCFE, selected the top five features for the ASD and JEE datasets, from which it created 2 and 5 new features, respectively, while for the EES dataset, the top three features were selected and three new features were created. Table 1 shows the number of features in the original dataset, the features derived using the simple ratio, and the total number of features used for training the model.
The proposed model is compared to other existing models. Different clustering models are applied in the literature: K-means is used in [11], and HC is applied with average linkage in [10], referred to as HC-Avg. While the current models utilize some features from their own datasets, our datasets differ from theirs. Therefore, all features are used in K-means and HC-Avg, where K-Means-All and HC-Avg-All denote application to all features, while HC-Avg-Subset is applied only to the subset of features in the ASD dataset selected in [10]. PAM is also applied to all features, as it is one of the most popular clustering methods [36]; this is called PAM-All.
The following section (Section 5.1) presents the evaluation results of EHCFE in comparison with the existing models. Section 5.2 then provides additional discussion and insights based on the performance patterns observed across the three datasets.

5.1. Evaluation of the Proposed Model

For the JEE dataset, Table 2 shows that both K-Means-All and EHCFE obtain identical AUC and F1 score values (0.508 and 0.298). PAM-All records the highest ARI (0.02). K-Means-All produces the lowest DBI, highest SC, and highest CHI values (2.536, 0.134, and 776.377), ranking first on these measures. EHCFE records 2.572, 0.13, and 754.659 for DBI, SC, and CHI, ranking second. HC-Avg-All and PAM-All occupy the lower ranks, producing lower values across most evaluation metrics.
For the ASD dataset, Table 3 shows that EHCFE obtains the highest values in ARI, AUC, and F1 score, with 0.333, 0.766, and 0.709, respectively. HC-Avg-Subset records the lowest DBI and highest SC values (1.258 and 0.401), but its labeling performance is low, with ARI, AUC, and F1 scores of −0.001, 0.499, and 0.001. PAM-All achieves the highest CHI value (555.536) and competitive values on several other metrics. EHCFE records 0.159 in SC and 505.395 in CHI, ranking second on these metrics. K-Means-All records low values in ARI, AUC, and F1 score (0.019, 0.357, and 0.111), placing it at or near the bottom for most labeling metrics.
The results for the EES dataset (Table 4) show that EHCFE produces the highest AUC, F1 score, and CHI values (0.81, 0.28, and 17.416). HC-Avg-All achieves the highest scores in ARI, DBI, and SC (0.201, 0.674, and 0.205). Thus, each of these two methods attains the highest value in three metrics. HC-Avg-All, however, records low values in the remaining metrics, including F1 score, AUC, and CHI. EHCFE records values of 0.131, 2.635, and 0.13 for ARI, DBI, and SC, which place it second on these metrics relative to the other methods. K-Means-All ranks second in AUC, F1 score, and CHI (0.773, 0.233, and 17.023) but produces the lowest values in ARI, DBI, and SC. PAM-All consistently ranks third out of the four models across all evaluation metrics.
Inspired by the Friedman test [37], the average rank of each model [38] is calculated over the evaluation metrics to determine which model ranks highest overall, where a rank of one indicates the best performance. EHCFE ranks first on the EES and ASD datasets, with PAM-All, HC-Avg-All, and K-Means-All occupying the remaining positions in dataset-dependent order. On the JEE dataset, however, K-Means-All attains the first rank, with EHCFE (the proposed model) ranked second. Overall, EHCFE provides the best balance between alignment with the true labels and cluster quality. K-Means-All and PAM-All produce broadly similar results, while HC-Avg-All and HC-Avg-Subset fail to produce clusters that are meaningful relative to the ground truth.
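The average-rank computation can be reproduced with pandas; the sketch below uses the ASD-dataset scores from Table 3, reversing the ranking direction for DBI since lower is better there, and averaging ties as in the tables.

```python
import pandas as pd

# Scores per metric (rows) and model (columns), taken from Table 3 (ASD).
scores = pd.DataFrame(
    {"K-Means-All":   [0.019, 0.357, 0.111, 2.471, 0.109, 428.679],
     "PAM-All":       [0.020, 0.564, 0.492, 2.552, 0.141, 555.536],
     "HC-Avg-Subset": [-0.001, 0.499, 0.001, 1.258, 0.401, 21.112],
     "EHCFE":         [0.333, 0.766, 0.709, 2.586, 0.159, 505.395]},
    index=["ARI", "AUC", "F1", "DBI", "SC", "CHI"])

ranks = scores.copy()
for metric in scores.index:
    # rank 1 = best; DBI is "lower is better", so it ranks ascending
    ranks.loc[metric] = scores.loc[metric].rank(
        ascending=(metric == "DBI"), method="average")

avg_rank = ranks.mean()
print(avg_rank.round(2))  # EHCFE attains the lowest (best) average rank
```

Running this reproduces the "Avg rank" row of Table 3 (K-Means-All 3.17, PAM-All 2.17, HC-Avg-Subset 2.83, EHCFE 1.83).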

5.2. Discussion

The performance of EHCFE across the evaluation metrics reveals several consistent patterns when compared with other algorithms, including K-Means-All, PAM-All, and HC-Avg. EHCFE achieves the best results on the ASD and EES datasets in F1 score and AUC, which is expected, as these metrics directly measure the agreement between the generated labels and the actual dropout and success outcomes. On the third dataset, JEE, both K-Means-All and EHCFE demonstrate the best performance in F1 score and AUC.
Although EHCFE does not always achieve the best performance on ground-truth-independent metrics (e.g., DBI, SC, and CHI), it maintains competitive performance across most internal measures on all datasets. This is reasonable, as these metrics evaluate structural properties such as compactness and separation rather than correspondence with the true labels. When examined individually, EHCFE consistently achieves top or near-top performance across all metrics, including AUC, F1 score, ARI, DBI, SC, and CHI.
When examining the performance of PAM-All and K-Means-All, it is notable that both methods record the lowest values in several metrics. For instance, PAM-All shows weak performance on SC and DBI in at least two datasets (EES and JEE), despite achieving the best CHI score in some cases. Similarly, K-Means-All produces the lowest ARI, AUC, or F1 score values in datasets such as ASD and EES, even though it achieves high CHI or SC values in the JEE dataset. These inconsistencies suggest that both PAM-All and K-Means-All may struggle to produce stable or meaningful cluster structures across different datasets.
A similar pattern is observed with HC-Avg, which applies average linkage and demonstrates noticeable instability. It ranks as low as fourth on several ground-truth-independent metrics and shows reduced performance on the ground-truth-dependent metrics, particularly for the ASD and JEE datasets. This suggests inconsistent or random-like cluster assignments. In contrast, EHCFE, which employs Ward linkage, produces more stable results and typically ranks first or second across the evaluated metrics.
The distribution of classes is visualized using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) [39], which helps when comparing the models against the original dataset with ground truth. The cluster labels of the current models and the proposed EHCFE model are assigned based on the number of data points in each cluster: the cluster with more data points is taken to consist of students with a successful status, while the cluster with fewer data points is taken to consist of students with a dropout status. In the ASD dataset, EHCFE produces a distribution similar to the class labels in the original dataset, as shown in Figure 3a,e. In contrast, K-Means-All and PAM-All separate the dataset into restricted groups that are distant from the original distribution (Figure 3b,c), and HC-Avg-Subset does not group the data in a manner similar to the original dataset (Figure 3d).
For the JEE dataset (Figure 4), most dropout cases in the original dataset are located on the left. K-Means-All and the proposed model (Figure 4b,e) recover two groups similar to the original dataset (Figure 4a), whereas PAM-All and HC-Avg-All (Figure 4c,d) split the data into separate groups, with only one group shared with the original dataset in the case of PAM-All. Finally, for the EES dataset, the distribution of pass and fail classes in the original dataset is somewhat similar to that of K-Means-All, PAM-All, and the proposed model (Figure 5a–c,e); this is not the case for HC-Avg-All, which groups the data at a distance from the original distribution (Figure 5d).
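The size-based rule for turning cluster ids into outcome labels can be sketched as follows; UMAP itself (via the third-party umap-learn package) is used only for the 2-D visualization and is omitted here so the sketch stays self-contained.

```python
import numpy as np

def clusters_to_labels(cluster_ids):
    """Assign outcome labels by cluster size, as described above:
    the larger cluster is labeled 'success', the smaller 'dropout'."""
    ids, counts = np.unique(cluster_ids, return_counts=True)
    majority = ids[np.argmax(counts)]
    return np.where(cluster_ids == majority, "success", "dropout")

# Toy binary clustering: cluster 0 is the majority, hence 'success'
labels = clusters_to_labels(np.array([0, 0, 0, 1, 0, 1]))
print(labels)
```

This heuristic assumes that dropouts form the minority cluster, which holds for the class distributions of the three datasets studied here (Figure 2).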
In terms of execution time, EHCFE is compared with K-Means-All and PAM-All, as shown in Figure 6. Each algorithm is executed five times, and the average execution time is reported. The results indicate that EHCFE requires more execution time than the other algorithms: EHCFE reaches approximately 4 s, whereas the remaining algorithms complete in less than 1 s. This difference is expected because EHCFE involves two computationally intensive phases, a feature selection stage using random forest and variance analysis followed by a feature engineering stage, both of which substantially increase the overall processing time. It is also observed that when the dataset contains around 5000 records, the execution time decreases due to the smaller number of features, indicating that execution time is affected by both the number of records and the number of features. In contrast, K-Means-All and PAM-All do not include any preprocessing or feature selection steps and are, therefore, influenced primarily by data size, allowing them to execute significantly faster than the proposed EHCFE model.
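The timing protocol (five runs, average reported) can be reproduced with a sketch like the following; the data shape and the two scikit-learn baselines are illustrative stand-ins for the actual experimental setup.

```python
import time
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

def avg_runtime(fit_predict, X, runs=5):
    """Average wall-clock time of a clustering call over several runs,
    mirroring the five-run timing protocol described above."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fit_predict(X)
        times.append(time.perf_counter() - t0)
    return sum(times) / runs

# Synthetic data standing in for a mid-sized student dataset
X = np.random.default_rng(0).normal(size=(1000, 20))
t_km = avg_runtime(lambda d: KMeans(n_clusters=2, n_init=10).fit_predict(d), X)
t_hc = avg_runtime(lambda d: AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(d), X)
print(f"KMeans {t_km:.4f}s  HC-ward {t_hc:.4f}s")
```

Using `time.perf_counter` rather than `time.time` avoids clock-adjustment artifacts when timing sub-second calls.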

6. Ablation Study

6.1. Selection of Top-t Important Features

The selection of the top-t important features, the first step of the EHCFE approach, is studied to establish the effect of using a subset of features, as previously examined in other works [10,11]. In this work, these features are chosen as the top t features ranked as important by random forest and variance analysis. The resulting models are called HC-ward-5F and HC-ward-10F, meaning that the selection is based on the top 5 or top 10 important features, respectively. HC with the Ward linkage method is then applied only to these feature subsets.
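The top-t selection step can be sketched as follows. Since a random forest needs supervision, this sketch uses pseudo-labels from a preliminary k-means run to obtain importances, which is an assumption for illustration; the paper's exact importance-ranking procedure may differ.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # synthetic stand-in for student features

# Pseudo-labels from a preliminary k-means run supply the supervision
# the random forest needs (an assumption for this sketch)
pseudo = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
importance = RandomForestClassifier(random_state=0).fit(X, pseudo).feature_importances_

t = 5
top_t = np.argsort(importance)[::-1][:t]  # indices of the top-t features

# Ward-linkage hierarchical clustering on the selected subset only,
# as in the HC-ward-5F baseline
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X[:, top_t])
print(len(top_t), np.unique(labels))
```

Changing `t` from 5 to 10 yields the HC-ward-10F variant of the ablation.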
The results show that, in terms of the prediction measures F1 score and AUC, the proposed model outperforms the HC-ward-5F and HC-ward-10F models, as shown in Table 5 for the ASD dataset. With regard to clustering quality under the internal validation metrics, however, selecting five features in HC-ward-5F delivers the best performance, while the internal validation scores drop sharply, by roughly one-half or more, when the number of features is increased in HC-ward-10F. This means that a subset of features needs to be selected carefully, as performance decreases dramatically when the number of features increases.

6.2. Feature Engineering: Features Derived

The EHCFE approach derives new features from the important features, and this step is studied here to determine the contribution of the feature engineering stage. Hierarchical clustering is applied with all features but without the derived features, termed HC-ward-All. The results shown in Table 6 reveal that using derived features improves all metrics except AUC, for which HC-ward-All is slightly higher, while the two models remain competitive with each other in both AUC and F1 score. For the EES dataset, as shown in Table 7, the proposed model delivers better performance in terms of ARI, F1 score, and AUC, whereas HC-ward-All performs better on DBI. This suggests that the number of features plays a role when labeling students, and that the newly created features do not work independently of the full feature set.

7. Conclusions

Predicting student performance and dropout is vital to administrators in the education sector, as it can save money and improve educational outcomes. This work sought to enhance clustering methods using the feature engineering EHCFE approach to automate the labeling process for students, without relying on a labeled dataset or expert annotation. EHCFE was compared with other models and achieved a high rank in terms of both prediction measures and clustering quality evaluation metrics, providing the best balance between alignment with the true labels and well-defined clusters. This study shows that using only a subset of features may not improve the results of the labeling and prediction process. Additionally, it demonstrates the importance of feature engineering, as compared with training the model on all original features alone.
The proposed EHCFE provides educational administrators with a systematic approach to identifying students’ ongoing and potential academic status early in their educational journey, thereby facilitating prompt interventions to address the potential factors contributing to underperformance or dropout risk. As a concrete direction for future research, subsequent studies should ideally focus on scaling the EHCFE approach to accommodate large-scale and high-dimensional educational datasets, as well as evaluating its effectiveness across diverse academic contexts and populations. As future work, the ethical implications of automated dropout labeling should be examined, particularly the potential for biased outcomes among students with non-traditional learning patterns, which will require access to suitably diverse and representative datasets. Furthermore, the investigation of other feature engineering methods, such as interaction terms and logarithmic ratios, should also be explored.

Funding

This research received no external funding.

Data Availability Statement

The dataset analyzed during this study is publicly available at the following repository: https://doi.org/10.24432/C51G82 (accessed on 21 September 2025), https://www.kaggle.com/datasets/jayaantanaath/simulated-dataset-jee-dropout-after-class-12/data (accessed on 21 September 2025) and https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention (accessed on 21 September 2025).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Kehm, B.M.; Larsen, M.R.; Sommersel, H.B. Student dropout from universities in Europe: A review of empirical literature. Hung. Educ. Res. J. 2019, 9, 147–164. [Google Scholar] [CrossRef]
  2. Realinho, V.; Machado, J.; Baptista, L.; Martins, M.V. Predicting Student Dropout and Academic Success. Data 2022, 7, 146. [Google Scholar] [CrossRef]
  3. Quinn, J. Drop-out and completion in higher education in Europe among students from under-represented groups. In An Independent Report Authored for the European Commission; European Commission: Brussels, Belgium, 2013. [Google Scholar]
  4. Ahuja, R.; Jha, A.; Maurya, R.; Srivastava, R. Analysis of Educational Data Mining. In Proceedings of the Harmony Search and Nature Inspired Optimization Algorithms; Yadav, N., Yadav, A., Bansal, J.C., Deep, K., Kim, J.H., Eds.; Springer: Singapore, 2019; pp. 897–907. [Google Scholar]
  5. Llanos, J.; Bucheli, V.A.; Restrepo-Calle, F. Early prediction of student performance in CS1 programming courses. PeerJ Comput. Sci. 2023, 9, e1655. [Google Scholar] [CrossRef]
  6. Al-Ahmad, B.I.; Alzaqebah, A.; Alkhawaldeh, R.; Al-Zoubi, A.; Lo, H.; Ali, A. Predicting academic performance for students’ university: Case study from Saint Cloud State University. PeerJ Comput. Sci. 2025, 11, e3087. [Google Scholar] [CrossRef] [PubMed]
  7. Baniata, L.H.; Kang, S.; Alsharaiah, M.A.; Baniata, M.H. Advanced deep learning model for predicting the academic performances of students in educational institutions. Appl. Sci. 2024, 14, 1963. [Google Scholar] [CrossRef]
  8. Shou, Z.; Xie, M.; Mo, J.; Zhang, H. Predicting student performance in online learning: A multidimensional time-series data analysis approach. Appl. Sci. 2024, 14, 2522. [Google Scholar] [CrossRef]
  9. Heaton, J. An empirical analysis of feature engineering for predictive modeling. In Proceedings of the SoutheastCon 2016; IEEE: New York, NY, USA, 2016; pp. 1–6. [Google Scholar] [CrossRef]
  10. Alameri, F. Predicting Student Dropout Risk using Machine Learning; Rochester Institute of Technology: Rochester, NY, USA, 2025. [Google Scholar]
  11. Mohamed Nafuri, A.F.; Sani, N.S.; Zainudin, N.F.A.; Rahman, A.H.A.; Aliff, M. Clustering Analysis for Classifying Student Academic Performance in Higher Education. Appl. Sci. 2022, 12, 9467. [Google Scholar] [CrossRef]
  12. Pecuchova, J.; Drlik, M. Enhancing the Early Student Dropout Prediction Model Through Clustering Analysis of Students’ Digital Traces. IEEE Access 2024, 12, 159336–159367. [Google Scholar] [CrossRef]
  13. Oeda, S.; Hashimoto, G. Log-Data Clustering Analysis for Dropout Prediction in Beginner Programming Classes. Procedia Comput. Sci. 2017, 112, 614–621. [Google Scholar] [CrossRef]
  14. Palani, K.; Stynes, P.; Pathak, P. Clustering Techniques to Identify Low-engagement Student Levels. In Proceedings of the 13th International Conference on Computer Supported Education—Volume 2: CSEDU; INSTICC; SciTePress: Setúbal, Portugal, 2021; pp. 248–257. [Google Scholar] [CrossRef]
  15. Valles-Coral, M.A.; Salazar-Ramírez, L.; Injante, R.; Hernandez-Torres, E.A.; Juárez-Díaz, J.; Navarro-Cabrera, J.R.; Pinedo, L.; Vidaurre-Rojas, P. Density-Based Unsupervised Learning Algorithm to Categorize College Students into Dropout Risk Levels. Data 2022, 7, 165. [Google Scholar] [CrossRef]
  16. Ghosh, P.; Charit, A.; Banerjee, H.; Bandhu, D.; Ghosh, A.; Pal, A.; Goto, T.; Sen, S. DropWrap: A Neural Network Based Automated Model for Managing Student Dropout. Int. J. Networked Distrib. Comput. 2025, 13, 17. [Google Scholar] [CrossRef]
  17. UDISE+ Data Dashboard Report. 2024. Available online: https://dashboard.udiseplus.gov.in/#/reportDashboard/sReport (accessed on 10 August 2024).
  18. Kim, S.; Choi, E.; Jun, Y.K.; Lee, S. Student Dropout Prediction for University with High Precision and Recall. Appl. Sci. 2023, 13, 6275. [Google Scholar] [CrossRef]
  19. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  20. Kim, Y.S.; Kim, M.K.; Fu, N.; Liu, J.; Wang, J.; Srebric, J. Investigating the impact of data normalization methods on predicting electricity consumption in a building using different artificial neural network models. Sustain. Cities Soc. 2025, 118, 105570. [Google Scholar] [CrossRef]
  21. Murtagh, F.; Legendre, P. Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? J. Classif. 2014, 31, 274–295. [Google Scholar] [CrossRef]
  22. Ali, P.J.M. Investigating the Impact of min-max data normalization on the regression performance of K-nearest neighbor with different similarity measurements. ARO-Sci. J. Koya Univ. 2022, 10, 85–91. [Google Scholar]
  23. Han, J.; Kamber, M.; Pei, J. 3—Data Preprocessing. In Data Mining, 3rd ed.; Han, J., Kamber, M., Pei, J., Eds.; The Morgan Kaufmann Series in Data Management Systems; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 83–124. [Google Scholar] [CrossRef]
  24. Han, J.; Kamber, M.; Pei, J. 2—Getting to Know Your Data. In Data Mining: Concepts and Techniques, 3rd ed.; Han, J., Kamber, M., Pei, J., Eds.; The Morgan Kaufmann Series in Data Management Systems; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 39–82. [Google Scholar] [CrossRef]
  25. Nath, J. Simulated Dataset: JEE Dropout After Class 12. 2023. Available online: https://www.kaggle.com/datasets/jayaantanaath/simulated-dataset-jee-dropout-after-class-12/data (accessed on 21 September 2025).
  26. Predict Students’ Dropout and Academic Success. 2023. Available online: https://www.kaggle.com/datasets/thedevastator/higher-education-predictors-of-student-retention (accessed on 21 September 2025).
  27. Realinho, V.; Machado, J.; Baptista, L.; Martins, M.V. Predict Students’ Dropout and Academic Success. 2021. Available online: https://zenodo.org/records/5777340 (accessed on 21 September 2025).
  28. Yılmaz, N.; Sekeroglu, B. Student Performance Classification Using Artificial Intelligence Techniques. In Proceedings of the 10th International Conference on Theory and Application of Soft Computing, Computing with Words and Perceptions—ICSCCW-2019; Aliev, R.A., Kacprzyk, J., Pedrycz, W., Jamshidi, M., Babanli, M.B., Sadikoglu, F.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 596–603. [Google Scholar]
  29. Yilmaz, N.; Şekeroğlu, B. Higher Education Students Performance Evaluation; UCI Machine Learning Repository: Irvine, CA, USA, 2019. [Google Scholar] [CrossRef]
  30. Hafzan, M.Y.N.N.; Safaai, D.; Asiah, M.; Saberi, M.M.; Syuhaida, S.S. Review on Predictive Modelling Techniques for Identifying Students at Risk in University Environment. MATEC Web Conf. 2019, 255, 03002. [Google Scholar] [CrossRef]
  31. Hubert, L.J.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
  32. Liu, Y.; Li, Z.; Xiong, H.; Gao, X.; Wu, J. Understanding of Internal Clustering Validation Measures. In Proceedings of the 2010 IEEE International Conference on Data Mining; IEEE: New York, NY, USA, 2010; pp. 911–916. [Google Scholar] [CrossRef]
  33. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27. [Google Scholar] [CrossRef]
  34. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  35. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
  36. Kaufman, L.; Rousseeuw, P.J. Partitioning Around Medoids (Program PAM). In Finding Groups in Data; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 1990; Volume Chapter 2, pp. 68–125. [Google Scholar] [CrossRef]
  37. Neave, H.R.; Worthington, P.L. Distribution-Free Tests; Routledge: London, UK, 1992. [Google Scholar]
  38. Brazdil, P.B.; Soares, C. A Comparison of Ranking Methods for Classification Algorithm Selection. In Proceedings of the Machine Learning: ECML 2000; López de Mántaras, R., Plaza, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2000; pp. 63–75. [Google Scholar]
  39. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
Figure 1. The workflow of the proposed EHCFE approach to label students based on their academic status, where a red student icon represents “Dropout” status, and a blue student represents “Success” (e.g., Graduated or Pass) status.
Figure 2. Distribution of class labels across our selected datasets with the frequency of each class. (a) Joint Entrance Examination (JEE) dataset. (b) Academic Success and Dropout (ASD) dataset. (c) Engineering and Educational Sciences (EES) dataset.
Figure 3. The visualization of clustering results using UMAP for all models for the ASD dataset against the original dataset. For class labels by clustering models, red denotes cluster 1 for students who have a dropout status, and blue denotes cluster 2 for students who have successful status. (a) Original label dataset. (b) K-Means-All [11]. (c) PAM-All [36]. (d) HC-Avg-Subset [10]. (e) EHCFE (the proposed model).
Figure 4. The visualization of clustering results using UMAP for all models for the JEE dataset against the original dataset. For class labels by clustering models, red denotes cluster 1 for students who have a dropout status, and blue denotes cluster 2 for students who have successful status. (a) Original label dataset. (b) K-Means-All [11]. (c) PAM-All [36]. (d) HC-Avg-All [10]. (e) EHCFE (the proposed model).
Figure 5. The visualization of clustering results using UMAP for all models for the EES dataset against the original dataset. For class labels by clustering models, red denotes cluster 1 for students who have a dropout status, and blue denotes cluster 2 for students who have successful status. (a) Original label dataset. (b) K-Means-All [11]. (c) PAM-All [36]. (d) HC-Avg-All [10]. (e) EHCFE (the proposed model).
Figure 6. Average execution time across different data sizes.
Table 1. The number of features across different datasets, where features represents the current features in the dataset, new features represents the new features that are created from the important features, and the total features is the number of features and new features combined.
Datasets | Features | New Features | Total Features
JEE      | 14       | 5            | 19
ASD      | 34       | 2            | 36
EES      | 32       | 3            | 35
Table 2. The prediction measures F1 score, AUC, and the measurement of quality clustering ARI, DBI, SC, and CHI for the JEE dataset for the proposed EHCFE approach against K-Means-All [11], PAM-All [36], and HC-Avg-All [10]. The value between parentheses is the rank of the value of the metric, whereas bold values indicate the best performance for each metric. “Avg rank” refers to the average rank of the evaluation metrics for each model.
Metric   | K-Means-All  | PAM-All    | HC-Avg-All  | EHCFE (The Proposed Model)
ARI      | 0.0003 (2.5) | 0.02 (1)   | −0.015 (4)  | 0.0003 (2.5)
AUC      | 0.508 (1.5)  | 0.382 (3)  | 0.313 (4)   | 0.508 (1.5)
F1 score | 0.298 (1.5)  | 0.181 (3)  | 0.052 (4)   | 0.298 (1.5)
DBI      | 2.536 (1)    | 2.957 (4)  | 2.899 (3)   | 2.572 (2)
SC       | 0.134 (1)    | 0.103 (3)  | 0.1 (4)     | 0.13 (2)
CHI      | 776.377 (1)  | 571.64 (3) | 545.72 (4)  | 754.659 (2)
Avg rank | 1.42         | 2.83       | 3.83        | 1.92
Table 3. The prediction measures F1 score, AUC, and the measurement of quality clustering ARI, DBI, SC, and CHI for the ASD dataset for the proposed EHCFE approach against K-Means-All [11], PAM-All [36], and HC-Avg-Subset [10]. The value between parentheses is the rank of the value of the metric, whereas bold values indicate the best performance for each metric. “Avg rank” refers to the average rank of the evaluation metrics for each model.
Metric   | K-Means-All | PAM-All     | HC-Avg-Subset | EHCFE (The Proposed Model)
ARI      | 0.019 (3)   | 0.02 (2)    | −0.001 (4)    | 0.333 (1)
AUC      | 0.357 (4)   | 0.564 (2)   | 0.499 (3)     | 0.766 (1)
F1 score | 0.111 (3)   | 0.492 (2)   | 0.001 (4)     | 0.709 (1)
DBI      | 2.471 (2)   | 2.552 (3)   | 1.258 (1)     | 2.586 (4)
SC       | 0.109 (4)   | 0.141 (3)   | 0.401 (1)     | 0.159 (2)
CHI      | 428.679 (3) | 555.536 (1) | 21.112 (4)    | 505.395 (2)
Avg rank | 3.17        | 2.17        | 2.83          | 1.83
Table 4. The prediction measures F1 score, AUC, and the measurement of quality clustering ARI, DBI, SC, and CHI for the EES dataset for the proposed EHCFE approach against K-Means-All [11], PAM-All [36], and HC-Avg-All [10]. The value between parentheses is the rank of the value of the metric, whereas bold values indicate the best performance for each metric. “Avg rank” refers to the average rank of the evaluation metrics for each model.
Metric   | K-Means-All | PAM-All     | HC-Avg-All | EHCFE (The Proposed Model)
ARI      | 0.074 (4)   | 0.095 (3)   | 0.201 (1)  | 0.131 (2)
AUC      | 0.773 (2)   | 0.696 (3)   | 0.562 (4)  | 0.81 (1)
F1 score | 0.233 (2)   | 0.222 (3.5) | 0.222 (3.5)| 0.28 (1)
DBI      | 2.813 (4)   | 2.636 (3)   | 0.674 (1)  | 2.635 (2)
SC       | 0.114 (4)   | 0.119 (3)   | 0.205 (1)  | 0.13 (2)
CHI      | 17.023 (2)  | 15.81 (3)   | 2.169 (4)  | 17.416 (1)
Avg rank | 3           | 3.08        | 2.42       | 1.5
Table 5. Prediction measures F1 score, AUC, and the measurement of quality clustering ARI, DBI, SC, and CHI for the ASD dataset, evaluated for the EHCFE, HC-ward-5F, and HC-ward-10F models. Bold values indicate the best performance for each metric.
Models                     | ARI   | AUC   | F1 Score | DBI   | SC    | CHI
HC-ward-5F                 | 0.337 | 0.747 | 0.668    | 0.636 | 0.688 | 7937.597
HC-ward-10F                | 0.337 | 0.747 | 0.668    | 1.271 | 0.378 | 1684.447
EHCFE (the proposed model) | 0.333 | 0.766 | 0.709    | 2.586 | 0.159 | 505.395
Table 6. Prediction measures F1 score, AUC, and the measurement of quality clustering ARI, DBI, SC, and CHI for the ASD dataset, evaluated for the EHCFE, and HC-ward-All models. Bold values indicate the best performance for each metric.
Models                     | ARI   | AUC   | F1 Score | DBI   | SC    | CHI
HC-ward-All                | 0.309 | 0.767 | 0.717    | 3.256 | 0.118 | 338.614
EHCFE (the proposed model) | 0.333 | 0.766 | 0.709    | 2.586 | 0.159 | 505.395
Table 7. Prediction measures F1 score, AUC, and the measurement of quality clustering ARI, DBI, SC, and CHI for the EES dataset, evaluated for the EHCFE, and HC-ward-All models. Bold values indicate the best performance for each metric.
Models                     | ARI   | AUC   | F1 Score | DBI   | SC   | CHI
HC-ward-All                | 0.107 | 0.703 | 0.233    | 2.478 | 0.13 | 17.22
EHCFE (the proposed model) | 0.131 | 0.81  | 0.28     | 2.635 | 0.13 | 17.416
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
