1. Introduction
Despite the central role of education in societal development, high school remains one of the educational levels most vulnerable to student dropout in México. Although dropout rates have trended slightly downward in recent years, the problem persists. According to data from México’s Secretariat of Public Education (SEP), the dropout rate at this level fell from 11.2% in 2023 to 10.8% in 2024 (
SEP, 2024). The decrease, although positive, fails to offset the cumulative effects of educational dropout, especially in contexts marked by adverse socioeconomic conditions, low institutional coverage, and academic weaknesses. In
Ledesma et al. (
2023), the authors identify conditions such as low family income, dysfunctional family structures, limited access to recreational spaces, and a lack of institutional attention to students’ socioeconomic environments as key triggers of school dropout. Additionally, the authors emphasize academic factors, including inadequate prior student preparation, limited assessment methods, and teachers with limited pedagogical training. Social factors related to school adaptation, as well as emotional situations such as anxiety or disinterest, are also mentioned. Moreover,
Martínez and Carmona (
2024) noted that the main obstacles in higher education are economic, institutional, and personal, with economic factors being the most important in explaining dropout.
Given the complexity and multicausality of school dropout, artificial intelligence-based models have emerged as an effective way to integrate multiple variables, process large volumes of student data, and identify risk factors that may not be immediately apparent. Several authors have confirmed the usefulness of artificial intelligence in addressing school dropout across educational levels by identifying complex patterns in student records and building accurate predictive models. For example, in
Ajibade et al. (
2022) data from a learning management system (LMS) were analyzed to predict academic performance, a relevant indicator to prevent school dropout. The authors conclude that the behavioral features are key indicators in e-learning environments. Another example is the paper of
Song et al. (
2023), which presented a model based on historical student data, applicable across all academic years of a college, to identify students at risk of dropout. They show that dropout does not occur only in the first year but can arise at any point in a student’s trajectory. Additionally, the authors demonstrate that using longitudinal averages and the LightGBM technique significantly improves predictive performance, achieving 81% accuracy and an F1 Score of 0.79 when using data from multiple semesters. The study also identified economic variables, such as the average number of scholarships and tuition fees, as key factors for dropout.
Similarly, regarding dropout,
Sifuentes et al. (
2023) proposed a model for a high school in Latin America using the CRISP-DM method. They used data from 1374 students and algorithms like Random Forest and cost-sensitive classification. The authors conclude that balancing the data makes models considerably more accurate and that data mining is a useful tool for institutions when making decisions about student retention. In turn, to identify students who dropped out,
Hernández et al. (
2023) proposed a two-phase approach: (1) segmentation of students based on their academic performance using a k-means algorithm, and (2) dropout prediction through classification techniques, including Random Forest, J48, and logistic regression. The authors achieved approximately 80% accuracy, demonstrating that prior segmentation significantly improves the model’s predictive performance. Moreover, beyond the student dropout rate,
Sansores et al. (
2023) proposed a model based on an artificial neural network of three fully connected layers to predict failure rates at the UNAM Faculty of Medicine. They used data from approximately 2000 students enrolled in the Biomedical Informatics I course, collected via the Moodle platform between 2019 and 2021. The proposed model achieved a mean absolute error of 0.0158 and a kappa coefficient of 0.947. Additionally, the authors implemented a “traffic light” type visual system, allowing instructors to identify at-risk students early and facilitate timely pedagogical interventions.
On the other hand, some studies demonstrate that techniques such as decision tree-based models provide an effective and interpretable approach for predicting academic performance and identifying at-risk students at an early stage. Hence, by integrating academic, motivational, family, and socioeconomic variables, these approaches enable the discovery of meaningful patterns across diverse educational contexts and support evidence-based educational decision-making. For example, in the paper of
Castrillón et al. (
2020), the authors proposed a decision tree-based model to predict academic performance in a college by integrating educational, family, socioeconomic, and study-habit-related factors. The model was applied in a public university in Colombia and achieved an accuracy of 91.7%, enabling the classification of students into five performance levels and the early identification of those at academic risk. Similarly, in
Matzavela and Alepis (
2021), the authors analyzed student performance using decision trees, incorporating social and family context variables. They found that a higher academic performance is commonly associated with students from households with higher incomes and with parents who have greater educational attainment. In contrast, the authors observe that lower performance is associated with the need to work while studying and with less favorable socioeconomic environments. Furthermore,
Rojas et al. (
2023) proposed a decision-tree-based strategy to help with the dropout problem at the University of Quindío in Colombia. The authors used academic records for 10,705 students from different faculties and achieved an accuracy of 89%. The authors conclude that cumulative GPA, motivation, socioeconomic level, and employment status are important factors contributing to the dropout problem, which reaffirms that dropout is a multifactorial phenomenon that varies across institutional contexts.
As with decision tree-based models, some studies have demonstrated the effectiveness of the Naive Bayes classifier for predicting student dropout and academic performance in educational contexts. An example is the paper of
Alturki and Alturki (
2021), which assessed various techniques for predicting final academic performance among undergraduate students. The authors compared Naive Bayes with other supervised algorithms using variables such as cumulative GPA, failed courses, and performance in key subjects. The authors demonstrate the efficiency, interpretability, and suitability of Naive Bayes for educational environments with limited computational resources. Similarly, to address the problem of identifying students at risk of dropout,
Páez and Ramírez (
2022) evaluated diverse techniques using data from engineering students at a public college in México. The authors achieved approximately 65% accuracy with Naive Bayes, outperforming all other evaluated classification techniques. Furthermore, the authors conclude that GPA is the most relevant variable for identifying at-risk students. Another example is the paper of
Martinez et al. (
2024) which applied algorithms such as Naive Bayes, KNN, SVM, logistic regression, and decision trees to predict student dropout at California State University, Fullerton, where Naive Bayes achieved an accuracy above 89% using engagement, demographic, and academic performance data collected through the Canvas platform, even with relatively small datasets.
Academic performance has also been addressed with other artificial intelligence techniques. For example,
Adnan et al. (
2021) focused on identifying at-risk students on online learning platforms by comparing multiple machine learning and deep learning algorithms. Their results showed that Random Forest consistently outperformed other models, achieving average performance values above 90%. Similarly, in the paper of
Palacios et al. (
2021), data mining and supervised machine learning techniques were applied to predict student retention in high school education. The authors compared several classifiers and found that Random Forest achieved the highest effectiveness. Another example is presented in the paper of
Ekhosuehi and Iduseri (
2022) which applied Random Forest with conditional inference trees to predict university dropout in Germany. By incorporating data from the first academic semesters, the proposed approach achieved an AUC of 0.86. The study identified key predictors, including high school final grades, student satisfaction, and academic self-concept.
In turn,
Hussain and Khan (
2023) conducted a study to predict academic performance using data from the Board of Intermediate and Secondary Education across seven regions of Pakistan. The authors considered 30 attributes related to the students’ academic history, trained a logistic regression model to estimate grades, and applied a classification process to categorize students into different performance levels. They conclude that the approach could improve educational planning and teaching quality for educational environments with limited computational resources. In the same way,
Khairy et al. (
2024) also proposed a model to predict academic performance using artificial intelligence. They used a dataset spanning 2016 to 2021, comprising 830 instances and six academic attributes on first-year undergraduate students in the Computer Science Department at Damietta University. The authors employed the Linear Discriminant Analysis (LDA) technique for feature extraction and assessed five algorithms in their proposal. The results showed that adding LDA improved the models by about 15%, with 250 out of 253 test cases correctly classified. Furthermore, in the paper of
Psyridou et al. (
2024) a longitudinal approach using classifiers such as Balanced Random Forest, Easy Ensemble, RSBoost, and Bagging Decision Tree was presented. They used a dataset spanning up to 13 years from kindergarten to ninth grade. The authors found that early academic performance, particularly in reading and mathematics, is a strong predictor of future dropout.
Based on the state of the art, artificial intelligence techniques are effective for developing systems that predict dropout and enable early intervention by educational institutions and authorities. Notably, most existing approaches rely on classification algorithms, such as decision trees, ensemble methods, neural networks, and distance-based classifiers, to identify vulnerable students and support retention strategies. In recent years, the growing availability of educational data and advances in artificial intelligence have driven the use of predictive models to support early identification of students at risk of dropout. As a result, research in Educational Data Mining has increasingly focused on discovering patterns associated with academic disengagement and on building data-driven early warning systems to assist institutional decision-making.
Some studies have shown that a combination of academic, demographic, and socioeconomic factors influences dropout. For example, in the paper of
Dasi and Kanakala (
2022), the authors reported that decision-tree and ensemble-based methods achieved superior results in identifying students at risk of dropout. Another example is the paper by
da Silva et al. (
2022), in which the authors identify that in a Portuguese secondary school, age and performance in highly demanding subjects were among the most influential variables for predicting dropout risk, even when only limited academic information was available.
Furthermore, some authors have shown that historical academic records alone are often insufficient for dropout prediction. For example,
Cho et al. (
2023) analyzed academic records from 20,050 students using integrated institutional databases and found a reduced subset of highly correlated variables for training predictive models. The authors demonstrated that class-imbalance strategies such as SMOTE and ADASYN decreased the performance of almost all considered models, except Random Forest, whose improvement was not significant, and suggested applying more robust strategies to improve model performance. On the other hand, in the paper of
Quispe et al. (
2024), the authors highlighted the relevance of decision trees and ensemble methods for strengthening early warning systems by enabling the timely detection of vulnerable student profiles. Moreover, in the paper of
Delogu et al. (
2024), the authors demonstrated that Random Forest and Gradient Boosting Machine models achieved strong predictive performance using administrative data from undergraduate students in Italy, identifying first-year credit accumulation, prior academic achievement, family income, and previous school background as critical risk factors.
Despite the demonstrated effectiveness of artificial intelligence-based models for identifying students at risk of dropout, the performance of most models remains highly dependent on the quality and geometric structure of the data after preprocessing. In particular, distance-based classifiers are sensitive to feature scaling and representation, which can significantly affect neighborhood relationships and class separability. To reduce this sensitivity, we propose a Supervised Feature-Weighted Metric (SFWM) to improve the effectiveness of a low-cost, distance-based classifier, providing educational institutions with a practical, scalable tool for data-driven decision-making to support early identification of dropout risk among high school students. This supervised metric-learning strategy assigns non-negative feature weights that adapt the contribution of each feature according to its relevance to class discrimination, reshaping the feature space and enhancing class separability for distance-based classification. We assess the proposed metric on academic records from a high school in Ciudad Madero, Tamaulipas, Mexico, analyzing its effect on class separability and predictive performance. The proposed approach is intended to support educational early-warning systems by providing a computationally low-cost, interpretable alternative that may facilitate the timely identification of students at academic risk and assist institutional decision-making.
2. Materials and Methods
This section seeks to achieve the following: (1) describe the dataset used in this paper, (2) describe the preprocessing steps applied to construct the final data representation, and (3) describe the proposed methodology for supervised metric learning. First, we describe the characteristics and temporal structure of the academic dataset. Then, we detail the preprocessing procedures adopted to ensure data anonymization, consistency, and interpretability. Finally, we formally define the problem addressed in this work and introduce the proposed Supervised Feature-Weighted Metric (SFWM), including its optimization strategy and objective function.
2.1. Dataset Description
To test the efficiency of the algorithms used in this paper, we used a dataset consisting of students’ academic records from a public high school institution in Ciudad Madero, Tamaulipas. The dataset contains academic records spanning 11 academic years, from 2014 to 2025. In Mexico, each high school year consists of two periods: the first, from July of the current year to January of the following year, and the second, from February to June of the following year (
Table 1).
The academic records provided by the educational institution initially consisted of students’ personal data and institutional information, along with attributes such as semester, group, shift, partial grades, and attendance for each student in each course, among others. The dataset covers all technical careers offered by the institution, as summarized in
Table 2.
Furthermore, we decided to integrate the institutional “Propedeutic Component” into the dataset, since all students take it during the first semester, before they begin the modules associated with their chosen technical career. This component was included because the highest dropout rate at the institution occurs during the first academic semester.
Before analyzing the dataset, we preprocessed it to: (1) remove confidential information, (2) ensure data consistency, and (3) address incomplete records. Preprocessing was conducted to improve both the interpretability and the analytical reliability of the dataset.
2.1.1. Data Analysis
Academic instances are commonly presented in tabular form, as this format allows for a clearer understanding of the student’s educational trajectory (
Namoun & Alshanqiti, 2020). Among the attributes frequently considered for the instances, the students’ academic program is a relevant attribute, as each program offers different levels of rigor and academic workload, which can influence student retention or dropout (
Zou et al., 2025). Similarly, the semester attribute is relevant because it indicates the stage a student is at in relation to their academic progress within their program, facilitating the identification of the phase (initial, middle, or advanced) with the highest risk of dropping out (
Awedh & Mueen, 2025).
On the other hand, the subject name, academic period, and partial grades allow for identifying the content a student is learning, the moment of the academic year, and an approximate measure of their proficiency in the subject, facilitating the detection of potential difficulties in specific courses. For example,
Raftopoulos et al. (
2024) incorporate academic period and grades as predictors to identify early warning signs of risk. Additionally, including demographic variables such as age, sex, and disability status enriches the interpretation of predictive models and facilitates the design of more efficient interventions in the higher education context (
Lin et al., 2024).
Moreover,
Rastrollo-Guerrero et al. (
2020) suggest that characterizing student performance also relies on indicators such as attendance percentage, understood as the proportion of classes a student attends relative to the total scheduled. The attendance percentage allows for the evaluation of a student’s consistency and academic commitment, both of which are closely related to their retention in the education system.
Additionally,
Al-Alawi et al. (
2023) have demonstrated that incorporating students’ academic status with indicators such as their enrollment status or whether they are at risk, along with annual and semester performance indicators, is beneficial for studies. Similarly,
Guanin-Fajardo et al. (
2024) consider variables such as the number of courses enrolled, which reflects the academic workload; final grades, which represent the result obtained upon completion of a course and serve as a direct indicator of performance; and semester progression, understood as the actual progress in credits or courses passed in relation to the curriculum (
Fernandez-Garcia et al., 2021).
Therefore, based on the above and the analysis of the academic record data provided by the high school institution, we selected the attributes that constitute the instance used in this paper. We also derived the attributes is_regular, is_re-enrolled, and attendance percentage from the information present in the academic record.
For the treatment of the instance, as evidenced in
Buzducea et al. (
2026), the use of data mining and machine learning techniques in the educational environment is beneficial because it allows large volumes of data to be analyzed, including performance variables and other factors that influence students’ academic trajectories, which contributes to better-informed decision-making by educational institutions. Machine learning techniques enable a better understanding of students’ academic situations and, consequently, can help reduce adverse outcomes such as low performance and school dropout. Thus, in this paper, we use machine learning techniques for the treatment of the academic instance, particularly classification techniques, given the nature of the instance and the fact that each record describes the academic history of a student who took a subject at the institution.
2.1.2. Preprocessing
As an initial preprocessing step, all confidential information related to the institution and students was removed, including institutional identifiers, student names, student IDs, and CURPs (Unique Population Registry Codes). Once these attributes were removed, the dataset consisted of the attributes shown in
Table 3.
Next, records containing missing values were identified and analyzed. The missing values in the attributes Partial Grades and Partial Attendance were set to 0. Additionally, records with missing values in any other attribute were removed to avoid introducing noise into the dataset.
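The missing-value rule described above can be sketched as follows. This is a minimal illustration using only the Python standard library; it assumes each record is a dictionary, that missing entries are stored as `None`, and that the partial grade and attendance columns carry the hypothetical names listed in `FILL_WITH_ZERO` (the actual attribute names are those of Table 3).

```python
# Hypothetical column names for the partial grades and partial attendance.
FILL_WITH_ZERO = ("grade_p1", "grade_p2", "grade_p3", "att_p1", "att_p2", "att_p3")


def clean_missing(records):
    """Set missing partial grades/attendance to 0; drop records missing anything else."""
    cleaned = []
    for rec in records:
        fixed = dict(rec)  # copy so the caller's record is left untouched
        for field in FILL_WITH_ZERO:
            if fixed.get(field) is None:
                fixed[field] = 0
        if any(value is None for value in fixed.values()):
            continue  # some other attribute is missing: discard the record
        cleaned.append(fixed)
    return cleaned


records = [
    {"semester": 1, "grade_p1": 8, "grade_p2": None, "grade_p3": 7,
     "att_p1": 10, "att_p2": 9, "att_p3": None},
    {"semester": None, "grade_p1": 6, "grade_p2": 6, "grade_p3": 6,
     "att_p1": 8, "att_p2": 8, "att_p3": 8},
]
cleaned = clean_missing(records)
```

In this toy input, the first record is kept with its two gaps replaced by 0, while the second is removed because its semester is unknown.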
Once we addressed the missing records, we decided to combine the attributes Generation, Academic Year, and Academic Period to generate the attributes is_regular (IS_REG) and is_re-enrolled (IS_REEN), indicating whether the student is regular or irregular in each period and whether the student is a re-enrollment student. A student is regular if the cycle and period they are attending fall within the expected timeframe for completing their studies, given that all technical careers at the educational institution last 3 years (6 periods). On the other hand, a student is a re-enrollment student if they have not enrolled in at least one consecutive period.
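One possible reading of these two definitions is sketched below. It is an assumption-laden illustration, not the institution’s rule: it assumes that Generation is the calendar year in which the student entered, that Academic Year is the year in which the cycle starts, that Academic Period is 1 or 2, and that a student counts as re-enrolled from the first period that follows a skipped one. The function names are hypothetical.

```python
PROGRAM_PERIODS = 6  # 3 years x 2 periods, as stated in the text


def period_index(generation, academic_year, period):
    """1-based position of a period counted from the student's entry year (assumed encoding)."""
    return 2 * (academic_year - generation) + period


def is_regular(generation, academic_year, period):
    """True while the attended period falls inside the expected 6-period window."""
    return 1 <= period_index(generation, academic_year, period) <= PROGRAM_PERIODS


def reenrollment_flags(generation, enrolled):
    """Map each attended period index to True once a skipped period precedes it."""
    indices = sorted({period_index(generation, y, p) for y, p in enrolled})
    flags, gap_seen, previous = {}, False, None
    for idx in indices:
        if previous is not None and idx - previous > 1:
            gap_seen = True  # at least one period was skipped before this one
        flags[idx] = gap_seen
        previous = idx
    return flags


# A 2018 entrant who attended both periods of 2018, skipped 2019-1, and returned in 2019-2.
flags = reenrollment_flags(2018, [(2018, 1), (2018, 2), (2019, 2)])
```

Under these assumptions, a 2018 entrant attending period 1 of the 2021 cycle is in their seventh period and is therefore flagged as irregular.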
Partial attendance attributes were aggregated into a single Attendance Percentage (ATT_PCT) value for each subject. Because professors do not always take attendance daily due to events, meetings, or holidays, we decided to sum the student attendance across the three partial evaluations and consider the student with the highest attendance in each group as 100% for each subject in each technical career. Based on this calculation, we estimate the attendance percentage for the remaining students in each group. After calculating student attendance percentages, we removed the group attribute because we did not consider it to influence student retention or dropout rates.
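The attendance normalization described above can be illustrated with a short sketch. It assumes dictionary records with hypothetical keys (`att_p1`, `att_p2`, `att_p3`, `career`, `group`, `subject`) and treats a group in which nobody has recorded attendance as 0% for every student, which is a choice made here rather than one stated in the text.

```python
from collections import defaultdict


def attendance_percentage(records):
    """Add ATT_PCT: summed attendance relative to the best-attending student in the same group."""
    totals = [r["att_p1"] + r["att_p2"] + r["att_p3"] for r in records]
    best = defaultdict(int)
    for r, total in zip(records, totals):
        key = (r["career"], r["group"], r["subject"])
        best[key] = max(best[key], total)  # highest attendance defines 100%
    result = []
    for r, total in zip(records, totals):
        top = best[(r["career"], r["group"], r["subject"])]
        pct = 100.0 * total / top if top > 0 else 0.0
        result.append({**r, "ATT_PCT": pct})
    return result


base = {"career": "Programming", "group": "A", "subject": "Algebra"}
rows = attendance_percentage([
    {**base, "att_p1": 10, "att_p2": 10, "att_p3": 10},
    {**base, "att_p1": 5, "att_p2": 5, "att_p3": 5},
    {**base, "att_p1": 0, "att_p2": 0, "att_p3": 0},
])
```

Normalizing by the group maximum, rather than by the number of scheduled sessions, makes the percentage robust to days on which attendance was not taken, at the cost of depending on the most assiduous student in each group.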
As the next step, subjects were grouped into academic departments (AD) according to the institutional classification, facilitating analysis and interpretation.
Table 4 presents the academic departments considered for this paper.
In
Figure 1, we illustrate the hierarchical organization of the dataset, showing how we grouped individual subjects into academic departments according to the institution’s guidelines.
For clarity, only representative subjects are shown, while the complete mapping is summarized at the academic department level.
Finally, one-hot encoding was applied to the nominal attributes Technical Career and Academic Department to facilitate numerical representation and model compatibility.
Table 5 shows a sample of the preprocessed dataset.
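For readers unfamiliar with one-hot encoding, the following sketch shows the transformation on dictionary records. The field name `career` and the category values are placeholders, not the institution’s actual labels.

```python
def one_hot(records, field):
    """Replace a nominal field with one 0/1 column per observed category."""
    categories = sorted({r[field] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != field}
        for cat in categories:
            row[f"{field}_{cat}"] = 1 if r[field] == cat else 0
        encoded.append(row)
    return encoded


encoded = one_hot([{"career": "A", "x": 1}, {"career": "B", "x": 2}], "career")
```

Because every category becomes its own numeric column, each of these columns later receives its own weight in a feature-weighted distance.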
Due to the original dataset composition, the class associated with each record was the final grade attribute.
Table 6 shows the distribution across classes for the final grade attribute, providing a better understanding of the preprocessed dataset comprising 264,882 records.
Moreover, it is important to highlight that, although this paper focuses on dropout-risk identification, we use a supervised class associated with the student’s final grade rather than observed dropout events. We consider academic performance classification relevant to early dropout warning systems, as the final grade can serve as a feasible proxy for dropout risk in the institutional context. This is especially important in educational institutions where, due to internal policies, there are no formal historical records of school dropout events, as is the case with the institution considered in this paper.
To obtain a more reliable estimate of how well the proposed method performs, we use a stratified cross-validation scheme rather than a single hold-out split. Cross-validation lets us use each instance of the dataset for both training and testing across multiple iterations (
Kesgin et al., 2025). Thus, the performance estimate depends less on a single partition and has lower variance. We use a stratified k-fold scheme that keeps approximately the same class proportions in each fold, so the estimate remains representative even when the classes are unbalanced. This method indicates how the model performs across different partitions of the data, thereby improving the statistical stability of the results.
Furthermore, it is also important to note that each student in the dataset has more than one record, one for each course they took in a given semester. For this reason, when we perform stratified cross-validation, we check the class distribution to ensure each partition is balanced. In the same way, we maintain the distribution of technical careers and academic departments across the different subsets to avoid biases arising from certain record concentrations. We also ensured that records for the same student remained within the same fold, avoiding potential dependencies between training and validation sets and reducing the risk of overestimating model performance.
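A group-aware stratified split of this kind can be sketched with a simple greedy heuristic. The code below is an illustration only, not the partitioning procedure used in the paper: it keeps all records of a student in one fold and balances class counts across folds, but it does not balance technical careers or academic departments. The function name and the greedy criterion are assumptions.

```python
from collections import Counter, defaultdict


def grouped_stratified_folds(student_ids, labels, n_folds=5):
    """Assign each student (with all of their records) to one fold, balancing class counts."""
    per_student = defaultdict(Counter)
    for sid, label in zip(student_ids, labels):
        per_student[sid][label] += 1
    total = Counter(labels)
    ideal = {c: total[c] / n_folds for c in total}  # target count of each class per fold
    fold_counts = [Counter() for _ in range(n_folds)]
    assignment = {}
    # Place students with the most records first; smaller ones then even out the folds.
    order = sorted(per_student, key=lambda s: (-sum(per_student[s].values()), str(s)))
    for sid in order:
        mine = per_student[sid]

        def cost_increase(fold):
            # Change in squared deviation from the ideal counts if the student joins `fold`.
            counts = fold_counts[fold]
            return sum(
                (counts[c] + mine[c] - ideal[c]) ** 2 - (counts[c] - ideal[c]) ** 2
                for c in ideal
            )

        best = min(range(n_folds), key=cost_increase)
        assignment[sid] = best
        fold_counts[best].update(mine)
    return assignment


student_ids = [s for s in range(10) for _ in range(2)]  # two records per student
labels = ["pass" if s < 5 else "fail" for s in range(10) for _ in range(2)]
folds = grouped_stratified_folds(student_ids, labels, n_folds=5)
```

Since the fold is chosen per student rather than per record, no student can appear on both sides of a train/validation split, which is the leakage the paragraph above is guarding against.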
We use a 5-fold cross-validation split as a compromise between computational cost and robustness of the performance estimate. Cross-validation is a common method for assessing machine learning models, especially for comparing classifiers across datasets. It reduces the dependence of the estimate on a single train-test split (
Demšar, 2006). In studies of educational data mining with large datasets, 5-fold or 10-fold cross-validation is often used to find the best trade-off between cost and reliability (
Baker & Inventado, 2014;
Romero & Ventura, 2010).
The results show a mean accuracy of 0.9852 with a standard deviation of 0.0065 and a range of 0.0155 across folds. These results indicate that the proposed method maintains consistent performance under different data partitions.
2.2. Problem Definition
Let $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ denote a labeled dataset, where $N$ is the number of instances, $\mathbf{x}_i \in \mathbb{R}^{d}$ is the feature vector of the $i$-th instance, $d$ is the number of input features, and $y_i \in \mathcal{C}$ denotes its corresponding class label, with $\mathcal{C}$ representing the set of possible classes. Distance-based classifiers, such as K-Nearest Neighbors (KNN), determine class membership based on a distance function. KNN is a supervised learning algorithm that assigns class labels based on the classes of the k nearest neighbors under a distance metric. It is characterized by its low training cost, as it does not require an explicit learning process. Thus, although its prediction cost increases as the dataset grows, it remains viable in moderately sized educational scenarios.
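As a reminder of how the KNN decision rule operates, a minimal sketch with the plain (unweighted) Euclidean distance is given below. The function name and the toy data are illustrative; ties in the vote are resolved in favor of the class of the closest neighbor among the tied classes, which is a choice of this sketch.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: math.dist(train_X[i], query),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = ["a", "a", "b", "b"]
label_near_origin = knn_predict(train_X, train_y, (0.2, 0.3), k=3)
label_far = knn_predict(train_X, train_y, (5.5, 5.0), k=3)
```

There is no training step: all the work happens at prediction time, when every stored instance is compared against the query.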
Several authors have used the KNN algorithm to solve classification problems. For example, in the paper of
Nugroho et al. (
2020), the authors applied KNN to determine whether student complaints and their prior performance could predict their timely graduation, evaluating different values of
k. Their results showed adequate accuracy and demonstrated that students with fewer complaints are more likely to complete their studies on time, suggesting its usefulness in reducing student dropout rates. Likewise, in the paper of
Contreas-Bravo et al. (
2023), the authors evaluated several classification models to predict university academic performance in 2023, finding that KNN achieved the best results in most semesters, with an accuracy exceeding 77.5%, especially in even-numbered semesters.
KNN has proved to be an efficient, low-cost alternative for academic decision-making and the early monitoring of at-risk students. However, the performance of such classifiers strongly depends on the geometry of the feature space.
Standard preprocessing techniques, including normalization and dimensionality reduction, do not explicitly optimize class separability under a distance metric. To address this limitation, we propose a Supervised Feature-Weighted Metric (SFWM) that learns feature weights using label information to define a discriminative distance function.
2.3. Supervised Feature-Weighted Metric
This paper proposes a Supervised Feature-Weighted Metric (SFWM) to enhance distance-based classification. Given a labeled dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, with $\mathbf{x}_i \in \mathbb{R}^{d}$, SFWM learns a non-negative weight vector $\mathbf{w} \in \mathbb{R}^{d}_{\geq 0}$ that defines a supervised distance metric by reweighting the original feature space.
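SFWM defines a weighted distance metric by assigning a non-negative weight to each feature. Given a weight vector $\mathbf{w}$, where:

$$\mathbf{w} = (w_1, w_2, \dots, w_d), \qquad w_m \geq 0, \quad m = 1, \dots, d,$$

the weighted representation of a data instance $\mathbf{x}_i$ is given by:

$$\tilde{\mathbf{x}}_i = \mathbf{w} \odot \mathbf{x}_i,$$

where ⊙ denotes element-wise (Hadamard) multiplication.

Thus, the distance between two instances $\mathbf{x}_i$ and $\mathbf{x}_j$ is computed as

$$d_{\mathbf{w}}(\mathbf{x}_i, \mathbf{x}_j) = \left\lVert \mathbf{w} \odot \mathbf{x}_i - \mathbf{w} \odot \mathbf{x}_j \right\rVert_2 = \sqrt{\sum_{m=1}^{d} w_m^{2} \left(x_{im} - x_{jm}\right)^{2}},$$

where $d_{\mathbf{w}}$ denotes the weighted Euclidean distance induced by the feature-weight vector $\mathbf{w}$.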
This formulation preserves the original dimensionality of the data while allowing the model to emphasize or attenuate individual features based on their relevance to class discrimination.
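To make the effect of the weights concrete, the weighted Euclidean distance can be sketched as a small function. The name `weighted_distance` is illustrative, and the example weights are arbitrary.

```python
import math


def weighted_distance(x_i, x_j, w):
    """Euclidean distance after scaling every feature by its non-negative weight."""
    if any(weight < 0 for weight in w):
        raise ValueError("feature weights must be non-negative")
    return math.sqrt(sum((wm * (a - b)) ** 2 for wm, a, b in zip(w, x_i, x_j)))


x_i, x_j = (1.0, 0.0), (4.0, 4.0)
d_plain = weighted_distance(x_i, x_j, (1.0, 1.0))    # ordinary Euclidean distance
d_first = weighted_distance(x_i, x_j, (1.0, 0.0))    # second feature switched off
d_scaled = weighted_distance(x_i, x_j, (2.0, 0.0))   # first feature emphasized
```

With unit weights the two points are 5 apart; setting the second weight to zero leaves only the difference along the first feature (3), and doubling the first weight doubles that distance (6). A weight of zero therefore removes a feature from the neighborhood computation without changing the dimensionality of the data.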
To better understand the proposed SFWM, we present its learning process in Algorithm 1. The algorithm has six steps described as follows:
As the first step, the algorithm initializes the population (Line 1), where each individual represents a potential metric that assigns different levels of relevance to individual features.
For each individual, a weighted representation of the dataset is constructed by multiplying each feature vector element by its corresponding weight (Lines 2–3). This process reshapes the geometry of the feature space while preserving the original dimensionality.
Next, the algorithm evaluates all initial individuals (Line 4) and selects the best-performing weight vector as the current solution (Line 6). We use the classification accuracy of a k-nearest neighbors (KNN) classifier to evaluate the individuals. This evaluation directly reflects how well the learned metric supports distance-based decision rules.
Later, the optimization process is executed for a fixed number of iterations (Line 7). At each iteration:
- (a) A new candidate weight vector is generated using a population-based, gradient-free optimization mechanism (Line 8). In this paper, Harmony Search fulfills this role; however, the formulation of SFWM remains independent of the specific optimizer.
- (b) The candidate is evaluated by reconstructing the weighted dataset and computing the corresponding KNN accuracy (Lines 9–10).
- (c) If the candidate improves upon the current best solution, the algorithm updates the optimal weight vector and replaces the worst individual in the population with the new individual (Lines 11–13). With the update strategy, we ensure a progressive refinement of the learned metric.
After completing the optimization process, the algorithm returns the optimal feature weight vector $\mathbf{w}^{*}$, which defines the SFWM metric (Line 16). SFWM does not generate synthetic samples or modify class labels; instead, it reshapes the geometry of the original feature space to improve class separability under distance-based decision rules. SFWM focuses on learning a supervised distance metric through feature reweighting rather than performing dimensionality reduction or data normalization.
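| Algorithm 1 Supervised Feature-Weighted Metric Learning (SFWM) |
- Require: Labeled dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, with feature vectors $x_i \in \mathbb{R}^{d}$
- Require: Population size P, maximum iterations T, number of neighbors k
- Ensure: Optimal feature weight vector $w^{*}$
- 1: Initialize a population of weight vectors $W = \{w_1, \dots, w_P\}$, with $w_p \in \mathbb{R}^{d}$
- 2: for each $w_p \in W$ do
- 3: Construct weighted samples $\tilde{x}_i = w_p \odot x_i$
- 4: Evaluate objective function $f(w_p)$
- 5: end for
- 6: Set $w^{*} = \arg\max_{w_p \in W} f(w_p)$
- 7: for $t = 1$ to T do
- 8: Generate a new candidate weight vector $w'$ using a population-based optimizer
- 9: Construct weighted samples $\tilde{x}_i = w' \odot x_i$
- 10: Compute $f(w')$
- 11: if $f(w') > f(w^{*})$ then
- 12: Update $w^{*} \leftarrow w'$
- 13: Replace the worst candidate in $W$ with $w'$
- 14: end if
- 15: end for
- 16: return $w^{*}$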
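The loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names `knn_accuracy` and `sfwm`, the leave-one-out evaluation of the KNN accuracy, the [0, 1] weight range, the Gaussian perturbation used as a stand-in for the population-based optimizer, and the toy data are all assumptions introduced here.

```python
import random

def knn_accuracy(X, y, w, k):
    """Leave-one-out KNN accuracy on the weighted samples (w * x, element-wise)."""
    Xw = [[wj * xj for wj, xj in zip(w, x)] for x in X]
    hits = 0
    for i, xi in enumerate(Xw):
        # Squared Euclidean distance from sample i to every other sample.
        neighbors = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), y[j])
            for j, xj in enumerate(Xw) if j != i
        )[:k]
        votes = [label for _, label in neighbors]
        hits += max(sorted(set(votes)), key=votes.count) == y[i]
    return hits / len(X)

def sfwm(X, y, P=10, T=40, k=3, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    # Line 1: random initial population (the [0, 1] range is an assumption).
    W = [[rng.random() for _ in range(d)] for _ in range(P)]
    # Lines 2-5: weight the data and evaluate each individual.
    scores = [knn_accuracy(X, y, w, k) for w in W]
    # Line 6: keep the best-performing weight vector.
    best = max(range(P), key=scores.__getitem__)
    w_star, f_star = list(W[best]), scores[best]
    for _ in range(T):  # Line 7
        # Line 8: placeholder optimizer, perturbs a random population member.
        parent = rng.choice(W)
        cand = [min(1.0, max(0.0, v + rng.gauss(0.0, 0.2))) for v in parent]
        f_cand = knn_accuracy(X, y, cand, k)  # Lines 9-10
        if f_cand > f_star:  # Line 11
            w_star, f_star = list(cand), f_cand  # Line 12
            worst = min(range(P), key=scores.__getitem__)
            W[worst], scores[worst] = cand, f_cand  # Line 13
    return w_star, f_star  # Line 16

# Toy data: feature 0 carries the class, feature 1 is large-scale noise.
data_rng = random.Random(1)
X = [[label + data_rng.gauss(0.0, 0.3), data_rng.gauss(0.0, 5.0)]
     for label in (0, 1) for _ in range(20)]
y = [label for label in (0, 1) for _ in range(20)]
w_star, acc = sfwm(X, y)
print(w_star, acc)
```

Because the best solution is replaced only when a candidate strictly improves on it, the returned accuracy can never fall below that of the best initial individual.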
2.4. Objective Function
The objective of SFWM is to learn a feature weight vector $w$ that maximizes the performance of a distance-based classifier under the induced metric. In this paper, we define the objective function using the classification accuracy of a KNN classifier. Formally, let
$$f(w) = \mathrm{Acc}_{\mathrm{KNN}}(D_w; k)$$
denote the KNN classification accuracy obtained on the weighted dataset $D_w = \{(w \odot x_i, y_i)\}$ using $k$ nearest neighbors.
The optimization problem can then be expressed as:
$$w^{*} = \arg\max_{w} f(w)$$
This objective directly aligns the learned metric with the decision rule of KNN, ensuring that the optimization process explicitly targets improved neighborhood structure and class separability.
2.5. Optimization Strategy
The optimization problem defined above is non-convex and does not admit a closed-form solution. Moreover, the objective function depends on discrete classification outcomes, which makes gradient-based optimization unsuitable.
To address these challenges, we employ a population-based, gradient-free optimization strategy to search the space of feature weight vectors. Because SFWM is independent of the specific optimizer, any such method can be used without altering the metric’s formulation. In this paper, we use Harmony Search as the optimization mechanism due to its simplicity, robustness, and low computational cost.
The optimization process iteratively refines candidate weight vectors by evaluating their induced classification performance and retaining the most discriminative solutions.
2.5.1. Harmony Search Configuration
The Harmony Search algorithm operates on a population of candidate solutions commonly called the Harmony Memory (HM). Each candidate is called a harmony and represents a feature-weight vector $w$, where each component corresponds to the importance assigned to a particular feature.
The Harmony Memory is initialized randomly using a uniform distribution in the interval . Therefore, each initial harmony explores a different weighting configuration of the feature space. During the optimization process, new candidate solutions are generated by combining existing harmonies with stochastic adjustments according to the Harmony Search operators.
In this study, the algorithm was configured with a harmony memory size of 20, a maximum of 100 iterations, a harmony memory consideration rate , a pitch adjustment rate , and a bandwidth parameter . After each update, the weights are constrained to remain within the interval to ensure valid feature importance values.
This configuration allows the algorithm to explore different candidate metrics while progressively refining the feature weights to improve class separability under the proposed supervised metric learning framework.
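The way Harmony Search builds a new candidate from the Harmony Memory can be illustrated as follows. This is a sketch of the standard improvisation step, not the exact code used in this study: the function name `improvise` is hypothetical, the [0, 1] bounds are an assumption, and the values passed for `hmcr`, `par`, and `bw` in the usage lines are placeholders rather than the settings of the experiments. Only the memory size of 20 is taken from the configuration described above.

```python
import random

def improvise(memory, hmcr, par, bw, rng, low=0.0, high=1.0):
    """Build one new harmony (weight vector) from the harmony memory."""
    d = len(memory[0])
    new = []
    for j in range(d):
        if rng.random() < hmcr:
            # Memory consideration: reuse component j of a stored harmony.
            value = rng.choice(memory)[j]
            if rng.random() < par:
                # Pitch adjustment: small shift bounded by the bandwidth.
                value += rng.uniform(-bw, bw)
        else:
            # Random consideration: draw a fresh value from the interval.
            value = rng.uniform(low, high)
        # Keep the weight inside the admissible interval.
        new.append(min(high, max(low, value)))
    return new

rng = random.Random(42)
# Harmony memory of 20 weight vectors for a hypothetical 5-feature dataset.
memory = [[rng.uniform(0.0, 1.0) for _ in range(5)] for _ in range(20)]
# Placeholder parameter values, chosen only for illustration.
candidate = improvise(memory, hmcr=0.9, par=0.3, bw=0.05, rng=rng)
print(candidate)
```

The candidate produced by this step plays the role of the new weight vector in Line 8 of Algorithm 1.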
2.5.2. Metric Learning and Evaluation Scheme
The SFWM metric is learned independently within each training fold of the cross-validation process. Specifically, for each fold, the feature-weight vector is optimized using only the training partition, while the validation fold remains completely unseen during the metric learning stage.
After learning the feature-weight vector, both the training and validation partitions are transformed using the fold-specific weights. The KNN classifier is then trained on the transformed training data and evaluated on the transformed validation data.
This procedure ensures that the reported performance estimates are not affected by information leakage between training and validation sets.
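The fold-wise protocol can be sketched as below. The code is an illustration under stated assumptions: the helper names (`stratified_folds`, `reweight`, `knn_accuracy`, `cross_validate`) are hypothetical, `placeholder_learner` merely stands in for the SFWM optimizer and returns fixed weights, and the data are synthetic. The point of the sketch is the order of operations: the weights are obtained from the training partition only and then applied to both partitions.

```python
import random
from collections import defaultdict

def stratified_folds(y, n_folds, rng):
    """Split sample indices into folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds

def reweight(rows, w):
    """Multiply every feature by its weight."""
    return [[wj * v for wj, v in zip(w, row)] for row in rows]

def knn_accuracy(X_tr, y_tr, X_va, y_va, k):
    """Accuracy of a majority-vote KNN that uses the training rows as neighbors."""
    hits = 0
    for x, target in zip(X_va, y_va):
        nearest = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, z)), label)
            for z, label in zip(X_tr, y_tr)
        )[:k]
        votes = [label for _, label in nearest]
        hits += max(sorted(set(votes)), key=votes.count) == target
    return hits / len(X_va)

def cross_validate(X, y, learn_weights, n_folds=5, k=3, seed=0):
    folds = stratified_folds(y, n_folds, random.Random(seed))
    scores = []
    for f, val in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        X_va, y_va = [X[i] for i in val], [y[i] for i in val]
        # The weights are learned from the training partition only.
        w = learn_weights(X_tr, y_tr)
        # The same fold-specific weights transform both partitions.
        scores.append(
            knn_accuracy(reweight(X_tr, w), y_tr, reweight(X_va, w), y_va, k)
        )
    return scores

seen_sizes = []

def placeholder_learner(X_tr, y_tr):
    """Hypothetical stand-in for the SFWM optimizer: returns fixed weights."""
    seen_sizes.append(len(X_tr))
    return [1.0, 0.05]

data_rng = random.Random(1)
X = [[label + data_rng.gauss(0.0, 0.3), data_rng.gauss(0.0, 5.0)]
     for label in (0, 1) for _ in range(20)]
y = [label for label in (0, 1) for _ in range(20)]
fold_scores = cross_validate(X, y, placeholder_learner)
print(fold_scores, seen_sizes)
```

With 40 samples and five folds, the learner is called five times and sees 32 training samples on each call, so the 8 validation samples of a fold never influence the weights used to evaluate it.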
2.6. Computational Considerations
Although the optimization process involves multiple evaluations of the objective function, the dimensionality of the weight vector remains fixed at the number of features in the original dataset. Each evaluation amounts to one KNN assessment on the reweighted data, so SFWM does not introduce a significant increase in computational complexity beyond that inherent in the KNN classifier.
In addition, for large-scale datasets, the evaluation of the geometric properties, such as intra- and inter-class distances or silhouette scores, can be performed using stratified sampling without altering the learned metric. This strategy ensures scalability while preserving the relative structure of the data.
Overall, SFWM learns a supervised distance metric by reweighting features to enhance class separability under distance-based decision rules. Unlike normalization or dimensionality reduction techniques, SFWM explicitly incorporates label information into the metric learning process. The method does not generate synthetic data, alter class labels, or reduce dimensionality. Instead, it reshapes the geometry of the original feature space to enhance discrimination for distance-based classification.
3. Results
This section presents the assessment of the proposed Supervised Feature-Weighted Metric (SFWM) and analyzes the impact of the metric on distance-based classification. We compared SFWM with commonly used data representations, including the original feature space, standardized data, and PCA-based dimensionality reduction. To achieve a robust evaluation, we focus on classification performance, statistical significance, and geometric properties of the induced metric space.
For the development of the proposal presented in this paper, we used an HP ENVY Laptop 13-ad0xx, with an Intel Core i7-7500U CPU @ 2.70GHz (2.094 GHz), 8 GB DDR3 SDRAM, and Microsoft Windows 10 Home Single Language, version 10.0.18363. Moreover, all experiments were implemented in Python 3.13.1 using PyCharm Community Edition 2024.3.1.
All data representations were assessed using the k-nearest neighbors (KNN) classifier, because SFWM explicitly modifies the distance structure of the feature space. A five-fold stratified cross-validation scheme was employed, with a fixed number of neighbors $k$. Classification accuracy is used as the main performance metric.
For the SFWM representation, the feature-weight vector was learned independently within each training fold. The validation partition was kept unseen during metric optimization, and the learned fold-specific weights were then applied to both the training and validation partitions before evaluating KNN.
To confirm the statistical significance of the results, we applied Friedman’s nonparametric test across all representations at the 95% confidence level. In addition, class separability was examined through intra-class and inter-class distance analysis and silhouette scores.
3.1. Classification Performance with KNN
Table 7 reports the mean classification accuracy obtained by the KNN classifier for each data representation.
Figure 2 visually summarizes the observed performance differences.
SFWM achieved a mean accuracy of 0.9852 with a standard deviation of 0.0065 across the five folds, indicating stable performance.
Compared with the original dataset, SFWM increases classification accuracy by more than 11%.
The improvement reaches approximately 14% relative to standardized data and remains substantial when compared with PCA.
These results indicate that the performance gains achieved by SFWM are not incremental but reflect a substantial enhancement of distance-based classification performance.
To provide a broader comparison with commonly used machine learning models, we evaluated several classifiers on the original preprocessed dataset, including Logistic Regression, Random Forest, and XGBoost. The results in
Table 8 show that ensemble methods achieve strong predictive performance, with XGBoost obtaining the highest accuracy. However, the KNN classifier operating under the proposed SFWM metric achieves comparable performance while maintaining the simplicity and interpretability of distance-based methods.
The obtained results indicate that the primary contribution of the proposed SFWM lies in improving the representation of the feature space rather than increasing classifier complexity.
In addition to predictive performance, we evaluated the computational cost of the proposed method. The average training time of KNN + SFWM was 19.53 s per fold, including the metric learning stage and the KNN fitting phase, as shown in
Table 9.
In comparison, the training time of Logistic Regression and Random Forest was considerably lower because these models do not require iterative optimization of feature weights. XGBoost required slightly longer training time due to the ensemble construction process.
Although SFWM introduces an additional optimization stage, this process is performed independently within each training fold. This increases the computational cost compared with classifiers that do not require metric learning, but it ensures a valid evaluation protocol in which the validation partition remains unseen during optimization.
To ensure a fair comparison, we measured the training time of all evaluated models under identical computational conditions.
Table 9 reports the average training time obtained using a five-fold cross-validation scheme.
The results show that traditional classifiers such as Logistic Regression and Random Forest require significantly less training time, as they do not involve an iterative optimization process. In contrast, the proposed SFWM method introduces an additional optimization stage for metric learning, which increases computational cost.
However, this cost is incurred only during the metric learning phase. Within each fold, once the feature-weight vector is obtained from the training partition, the transformed representation can be reused for inference with negligible additional computational cost.
To better understand which variables contribute most to the performance improvements obtained with SFWM, we conducted an ablation study in which different groups of features were selectively removed from the dataset.
Five scenarios were evaluated: (1) the complete feature set, (2) removal of partial grades (PG-1, PG-2, PG-3), (3) removal of attendance percentage (ATT_PCT), (4) demographic and structural variables only (AGE, SEX, SEM, IS_REG, IS_REEN), and (5) an early prediction scenario using only the first partial grade (PG-1).
In
Table 10, we present the results of the analysis, which show that the partial grade attributes (PG-1, PG-2, PG-3) constitute the primary predictive signal for the students’ outcomes, because their removal leads to a substantial reduction in classification accuracy. The attendance percentage (ATT_PCT), in turn, provides additional predictive information, as its removal moderately affects the accuracy.
It is important to distinguish between the two complementary perspectives analyzed in this study. The ablation analysis measures how removing feature groups affects predictive performance, whereas the learned feature weights show the importance of each feature in the optimized metric when all features are used together.
Furthermore, we observe the lowest accuracy when using only demographic and structural attributes, suggesting that their use in isolation is insufficient to accurately predict student performance. Finally, the early prediction scenario using only the first partial grade (PG-1) achieves moderate predictive performance, indicating that meaningful signals about final academic outcomes are already present in early evaluation stages.
To better understand which attributes contribute most to the learned metric, we analyzed the feature-weight vector obtained from the SFWM optimization. The learned weights indicate the relative importance assigned to each feature when constructing the supervised distance metric.
In
Table 11, we present the features with the highest model weights. The results show that academic performance attributes, particularly partial grades (PG-1, PG-2, and PG-3), receive the highest importance values, followed by the attributes associated with the academic domain of the subject (AD_*), indicating that the curricular context contributes to the discrimination of student performance patterns.
Additionally, we observed that the enrollment status attribute (IS_REG), which is related to the student academic trajectory, has lower relative importance than the academic performance attributes, suggesting that its predictive signal is partially captured by more strongly correlated attributes.
Hence, the analysis of the learned feature weights highlighted the factors associated with student academic performance and potential dropout risk.
The highest weights correspond to academic performance attributes, particularly partial grades (PG-1, PG-2, PG-3), indicating that early academic evaluation results are the strongest indicators of final outcomes.
In the same way, attributes associated with the academic domain of the subject (AD_*) also received relatively high importance values, suggesting that the curricular context influences student performance patterns.
Moreover, we observed that the enrollment attribute (IS_REG) provides complementary information about the academic trajectory of the students, although its contribution is limited once academic performance attributes are included.
These findings align with previous studies in educational data mining, where academic performance and course progression are consistently reported as key predictors of dropout risk.
These results are complemented by the geometric analysis of the learned metric presented in the following section, where we examine how the optimized feature weighting modifies the separability between student performance classes.
3.2. Statistical Significance Analysis
To verify whether the observed differences are statistically significant, we applied the nonparametric Friedman test across all evaluated representations. The test yielded a Friedman statistic of 15.0 with a p-value of 0.0018.
This result rejects the null hypothesis at the 0.05 significance level, confirming the presence of statistically significant differences among the representations.
Given the consistent dominance of SFWM across all cross-validation folds, the statistical analysis supports the conclusion that the proposed feature-weighted metric yields a reliable and meaningful improvement in distance-based classification.
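The reported statistic can be checked with a short worked example. Assuming the comparison involves the four representations described above (original, standardized, PCA, and SFWM) evaluated on five folds, the Friedman statistic is bounded by n(k − 1) = 5 × 3 = 15, a value reached only when the ranking of the representations is identical in every fold. The sketch below uses placeholder scores, not the fold accuracies of this study, and a closed-form upper tail of the chi-square distribution with three degrees of freedom; the function names are hypothetical.

```python
import math

def friedman_statistic(blocks):
    """Friedman chi-square statistic; assumes no ties within a block."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0] * k
    for scores in blocks:
        order = sorted(range(k), key=scores.__getitem__)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 * sum_sq / (n * k * (k + 1)) - 3.0 * n * (k + 1)

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

# Placeholder scores (not the paper's fold accuracies): four representations,
# five folds, and the same ordering of the representations in every fold.
blocks = [[2, 1, 3, 4] for _ in range(5)]
stat = friedman_statistic(blocks)
p_value = chi2_sf_3df(stat)
print(stat, round(p_value, 4))
```

Under these assumptions the sketch yields a statistic of 15.0 and a p-value of approximately 0.0018, which agrees with the reported values and with a ranking of the representations that does not change from fold to fold.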
To further explain the performance gains observed with KNN, we analyzed the geometric structure induced by each representation. Specifically, we estimated intra-class and inter-class distances using stratified random sampling.
Table 12 summarizes the results.
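The distance analysis on a stratified sample can be illustrated with the following sketch. It is not the code used to produce Table 12: the helper names, the sample size per class, the synthetic data, and the hand-picked weight vector are assumptions made for illustration only.

```python
import math
import random
from itertools import combinations

def stratified_sample(y, per_class, rng):
    """Draw up to `per_class` sample indices from every class."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    picked = []
    for label in sorted(by_class):
        members = by_class[label]
        picked += rng.sample(members, min(per_class, len(members)))
    return picked

def intra_inter(X, y, idx):
    """Mean pairwise distance within classes and between classes."""
    intra, inter = [], []
    for i, j in combinations(idx, 2):
        (intra if y[i] == y[j] else inter).append(math.dist(X[i], X[j]))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy data: feature 0 follows the class, feature 1 is class-independent noise.
y = [i % 2 for i in range(20)]
X = [[y[i] + 0.01 * (i % 3), (7 * i) % 11 - 5] for i in range(20)]
weights = [1.0, 0.01]  # hand-picked for illustration, not learned
Xw = [[w * v for w, v in zip(weights, row)] for row in X]

idx = stratified_sample(y, per_class=5, rng=random.Random(0))
intra_raw, inter_raw = intra_inter(X, y, idx)
intra_w, inter_w = intra_inter(Xw, y, idx)
print(inter_raw / intra_raw, inter_w / intra_w)
```

In this toy setting, shrinking the noisy feature raises the inter-class-to-intra-class ratio, which is the kind of change the geometric analysis is designed to detect.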
Since standardization only rescales individual features without incorporating label information, it does not modify the relative discriminative structure of the data.
Consequently, including the standardized representation in the geometric analysis provides limited insight into class separability beyond scale normalization (
Bellet et al., 2013;
Demšar, 2006). For this reason, we restrict the intra-class, inter-class, and silhouette analyses to representations that explicitly reshape the geometry of the feature space.
We further evaluated class separability using the silhouette score, computed on stratified samples to ensure computational feasibility.
Table 13 reports the results.
The results of the silhouette score analysis demonstrate that: (1) the original representation obtained a negative silhouette score, which indicates a substantial class overlap; (2) PCA slightly improves the silhouette score of the original representation but remains close to zero, indicating weak separation; and (3) SFWM achieves a positive silhouette score, indicating the formation of more coherent and well-separated class clusters.
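The transition from a negative to a positive silhouette score can be reproduced on a small example. The sketch below is illustrative only: the `silhouette` function is a plain implementation of the mean silhouette coefficient that uses the class labels as the grouping and assumes at least two samples per class, and the six data points and the weight vector are made up for this purpose rather than taken from the dataset.

```python
import math

def silhouette(X, labels):
    """Mean silhouette coefficient, using the class labels as the grouping.

    Assumes every class contains at least two samples.
    """
    classes = sorted(set(labels))
    total = 0.0
    for i, xi in enumerate(X):
        mean_dist = {}
        for c in classes:
            d = [math.dist(xi, xj) for j, xj in enumerate(X)
                 if labels[j] == c and j != i]
            mean_dist[c] = sum(d) / len(d)
        a = mean_dist[labels[i]]  # cohesion: mean distance to own class
        b = min(v for c, v in mean_dist.items() if c != labels[i])  # nearest other class
        total += (b - a) / max(a, b)
    return total / len(X)

# Feature 0 separates the classes; feature 1 is large-scale noise.
X = [[0.0, 0.0], [0.0, 10.0], [0.2, 5.0],
     [1.0, 0.0], [1.0, 10.0], [1.2, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
weights = [1.0, 0.01]  # hand-picked for illustration, not learned
Xw = [[w * v for w, v in zip(weights, row)] for row in X]
print(silhouette(X, labels), silhouette(Xw, labels))
```

On the unweighted points the score is negative because the noisy feature dominates the distances, whereas down-weighting that feature yields a clearly positive score.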
Based on the experimental results, we observed that SFWM consistently outperforms traditional preprocessing and dimensionality reduction techniques in distance-based classification. In this sense, the results demonstrate that SFWM effectively learns a supervised metric that enhances class separability rather than merely normalizing features or reducing dimensionality.
3.3. Data-Driven Indicators of Dropout Risk
Based on the ablation study and the learned feature weights, we identify the main factors that influence student academic outcomes, showing that academic performance attributes, particularly partial grades, dominate the predictive signal for outcomes.
Therefore, partial grades (PG-1, PG-2, PG-3) obtained the highest relative relevance. A consistent pattern was observed between increases in partial-grade performance and final grade (FG), confirming that early and sustained academic performance is a key indicator associated with dropout risk. This finding suggests that the progressive decline in performance acts as a cumulative signal of academic vulnerability.
The above observation is consistent with previous research in Educational Data Mining, which has identified early academic performance indicators as the most reliable predictors of final academic outcomes and dropout risk (
Baker & Inventado, 2014;
Rojas et al., 2023). Thus, we consider that monitoring partial grades throughout the academic cycle may provide valuable signals for early intervention systems that identify students at risk of academic failure.
Hence, our proposed SFWM aims to improve classification performance while also identifying the attributes that most strongly influence academic outcomes. Both are essential to a better understanding of student performance dynamics.
The SFWM enabled us to estimate the relative contributions of each attribute in distinguishing between levels of academic performance and indicators of school dropout risk. The results show a clear hierarchical structure in the importance of the attributes, with academic attributes having the greatest direct impact. In contrast, structural and demographic attributes exhibit complementary but significant effects.
Besides the partial grades, the attendance percentage (ATT_PCT) had an intermediate but stable weight, indicating that academic commitment, as measured by class attendance, is a complementary factor in student performance. Its contribution indicates that low performance is not an isolated occurrence; instead, it frequently coexists with trends of progressive disengagement by the students.
On the other hand, although the sex attribute (SEX) had a lower weight than the academic performance attributes, this attribute showed a consistent trend across performance levels. The above behavior suggests structural differences in the distribution of academic performance by gender. In this sense, although the sex attribute does not dominate the model, we observed that its inclusion provides additional information beyond academic attributes, providing evidence that certain groups are more vulnerable, possibly related to contextual, sociocultural, or differentiated academic expectations.
Moreover, the shift attribute (SHF) contributed moderately but consistently. We observed a relationship between certain shifts and performance levels, which suggests that organizational conditions, resource availability, external workload, or characteristics of the school environment can indirectly influence dropout risk. Thus, although the effect of the shift is small, it indicates that the institutional context is structurally relevant.
For its part, the semester attribute (SEM) showed progressive variation across performance levels, possibly due to a maturation or academic filtering effect. This suggests that the risk associated with this attribute is not homogeneous across the academic trajectory and that the first semesters may be associated with greater vulnerability, particularly when low performance manifests early. For example, students in advanced semesters have already passed through implicit selection processes, which could explain a positive association with higher performance levels.
The age attribute (AGE) showed little variation across performance levels. Although age does not emerge as a determining attribute, its impact may manifest indirectly or through interactions with other attributes, particularly in the early stages of the academic trajectory, so it could still be helpful in some situations. For example, older students might have different academic trajectories (e.g., prior academic delays, readmission) beyond what is already captured by their partial grades.
Similarly, the attributes is_regular (IS_REG) and is_re-enrolled (IS_REEN) had weights close to zero. This result indicates that, after accounting for academic and contextual attributes, the administrative status does not significantly contribute to the risk of school dropout.
Therefore, the results reveal that partial grades and, to a lesser extent, attendance percentage explain the main risk associated with low academic performance; however, attributes such as sex, shift, and semester also contribute to shaping the student vulnerability profile. Thus, we can conclude that academic risk is multidimensional, with these contextual attributes acting as modulators that interact with academic performance.
4. Discussion
In educational data, traditional preprocessing techniques are often insufficient to reveal meaningful separations among student profiles, because academic, behavioral, and contextual factors frequently overlap. In this sense, beyond the choice of classifier, the geometric organization of the feature space plays a decisive role in accurately identifying students at risk of dropout. This observation is consistent with previous studies in Educational Data Mining, where the representation of educational data has been identified as a critical factor influencing predictive performance and the effectiveness of early-warning systems (
Baker & Inventado, 2014;
Romero & Ventura, 2010).
For that reason, we proposed the SFWM approach, which, based on the experimental results, shows promise for analyzing complex educational data on school dropout and improves classification accuracy over the baseline representations. These findings are aligned with prior studies that demonstrated the usefulness of artificial intelligence techniques for identifying students at academic risk through the analysis of complex educational patterns (
Ajibade et al., 2022;
Sifuentes et al., 2023;
Song et al., 2023). However, unlike most previous approaches, which mainly focused on comparing classification algorithms, our proposal focuses on improving the geometry of the feature space itself through supervised metric learning.
The high predictive performance is particularly relevant in the context of early-warning systems, where accurate identification of at-risk students is crucial for timely institutional intervention. In this sense, our results are comparable to those reported for ensemble and tree-based models in the literature, such as Random Forest, XGBoost, and decision-tree approaches, which have shown strong predictive performance in educational environments (
Adnan et al., 2021;
Castrillón et al., 2020;
Rojas et al., 2023). Nevertheless, SFWM achieves competitive performance while preserving the simplicity and interpretability of a distance-based classifier. Although the magnitude of the improvement over the baseline representations may appear large, it can be explained by the role of supervised metric learning in explicitly reshaping the geometry of the feature space.
Traditional preprocessing techniques, such as normalization or principal component analysis (PCA), commonly operate independently of class labels and therefore do not directly optimize class separability. In studies such as
Khairy et al. (
2024), the authors have found that feature transformation techniques improved predictive performance by enhancing class discrimination. However, the proposed SFWM incorporates label information during optimization, allowing the learned metric to emphasize features most discriminative of the target outcome. Thus, unlike dimensionality reduction approaches such as LDA or PCA, SFWM preserves the original data dimensionality while adapting feature relevance to the predictive objective. In this sense, the dominance of academic attributes, such as partial grades, can be interpreted as an expected yet meaningful outcome, as these attributes directly reflect students’ academic performance and progression over time, making them strong predictors of dropout risk. This finding is also coherent with the observations of
Psyridou et al. (
2024),
Páez and Ramírez (
2022) and
Ekhosuehi and Iduseri (
2022), who identified academic performance indicators and prior grades as some of the strongest predictors associated with dropout risk.
Thus, the outperformance of SFWM over normalization and principal component analysis suggests that incorporating label information directly into the metric learning process is critical for capturing patterns associated with dropout risk in the feature space. In this way, SFWM yields distance relationships that more accurately reflect students’ academic trajectories, rather than relying solely on variance preservation or scale normalization. The above is particularly important in education for early-warning systems, where misclassification can lead to missed opportunities for timely intervention. Similarly,
Quispe et al. (
2024) highlighted the importance of predictive models capable of strengthening institutional early-warning systems through timely identification of vulnerable students. In our case, the proposed supervised metric contributes to this objective by improving the separability between performance-related student profiles. Additionally, this result highlights the importance of aligning data representation techniques with the predictive objective, rather than relying exclusively on unsupervised transformations.
Moreover, the geometric analysis shows that increases in the inter-class-to-intra-class distance ratio (
Table 12) and the transition from negative to positive silhouette values (
Table 13) indicate that the learned metric produces a more structured representation of the data. Furthermore, the improvement in KNN accuracy reflects a more discriminative organization of the feature space rather than an increase in model complexity. This result differs from many previous studies in the literature, where improvements were primarily associated with the use of more sophisticated classifiers such as ensemble methods or neural networks (
Delogu et al., 2024;
Palacios et al., 2021). Instead, our findings suggest that substantial performance improvements can also emerge from improving the data representation while maintaining a computationally simple classifier. In this sense, the improvement in accuracy is mostly attributable to a better representation of the instances rather than to more complex classification algorithms.
By reshaping the metric space, SFWM produces more compact structures, facilitating the identification of at-risk and non–at-risk students and providing a better way to understand how groups of students relate to one another in terms of dropout risk. We consider that this structural clarity may also support decision-making processes by providing more interpretable groupings of students based on their academic trajectories. This characteristic is particularly relevant in educational environments, where interpretability and transparency are important for supporting institutional decision-making and intervention planning. In this sense, our findings complement previous work that incorporated interpretable predictive approaches, such as decision trees and Naive Bayes models, for educational monitoring (
Alturki & Alturki, 2021;
Matzavela & Alepis, 2021).
It is important to note that, despite the encouraging results obtained in this study, it was based on a single high school dataset. For this reason, the generalizability of our findings may be limited by the specific institutional context, as well as by institutional, cultural, and educational system differences that can influence student behavior and dropout patterns. This limitation is also acknowledged in related literature, where several studies emphasize that dropout behavior varies considerably depending on institutional and socioeconomic conditions (
Martínez & Carmona, 2024;
Rojas et al., 2023). However, our approach is general and scalable, with a potential applicability across diverse educational environments, provided that it is validated with data from multiple institutions and contexts.
Regarding the development of our approach, we face two limitations: (1) Our study focuses on a Mexican high school, which may limit the generalization of results to other regions. (2) Due to institutional restrictions, we only use academic historical records, without incorporating socioeconomic or psychological attributes that are commonly associated with student dropout.
In this sense, although the absence of socioeconomic or psychological attributes may limit the model’s explanatory power, as dropout is a multifactorial phenomenon influenced by both academic and non-academic factors, the findings of our approach could be interpreted as data-driven indicators derived from available academic information rather than as comprehensive or causal risk factors of dropout. This observation is coherent with previous studies that identified socioeconomic, motivational, and family-related variables as important complementary factors for understanding dropout behavior (
Castrillón et al., 2020;
Delogu et al., 2024;
Matzavela & Alepis, 2021). In this sense, the results provide a partial but valuable perspective centered on academic performance as a key dimension of analysis.
Furthermore, it is important to emphasize that improved classification accuracy does not necessarily imply causal inference, but rather enhanced discrimination capability within the feature space. This distinction is essential to avoid overinterpreting the predictive performance of the algorithms as evidence of causal relationships.
We consider that integrating socioeconomic or psychological attributes could provide a more comprehensive understanding of the dropout risk and further validate the effectiveness of our proposed approach. In this sense, for future work, we would like to extend the proposed approach by incorporating additional contextual attributes and assessing its robustness and scalability across other institutions.
5. Conclusions
School dropout represents a latent risk to the progress of society and the globalized world. In this regard, institutions are increasingly concerned with making decisions that will allow them to reduce the number of students who abandon their studies. Therefore, to address this problem, we propose a supervised metric-learning approach to improve distance-based classification. Hence, our approach focuses on enhancing class separability in educational datasets, enabling the data-driven identification of students at risk of school dropout. More precisely, in this paper, we focused on students with different academic performance levels that may be associated with potential dropout risk, without relying on general-purpose feature transformations.
The Supervised Feature-Weighted Metric (SFWM) proposed achieved a statistically significant improvement in classification accuracy when applied to KNN for academic records, where, due to the original dataset composition, the class labels corresponded to students’ final grade rather than directly observed dropout events.
Unlike standardization and principal component analysis, which either degrade performance or yield only limited improvements, SFWM reshapes the feature space so that distance-based decision boundaries align better with the class structure. Additionally, SFWM learns a discriminative distance metric that better captures the latent structure of student performance patterns and dropout risk profiles. Thus, SFWM has the potential to detect early indicators of school dropout.
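To make the idea of a feature-weighted distance concrete, the following minimal Python sketch shows how supervised per-feature weights can change the neighbors that KNN selects. It is an illustration under stated assumptions, not the SFWM formulation itself: the weighting rule (a Fisher-style ratio of between-class to within-class variance), the function names, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def fisher_weights(X, y, eps=1e-12):
    """Hypothetical supervised weights: between-class / within-class variance per feature."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    weights = []
    for j in range(len(X[0])):
        overall_mean = sum(row[j] for row in X) / len(X)
        between = within = 0.0
        for rows in by_class.values():
            class_mean = sum(r[j] for r in rows) / len(rows)
            between += len(rows) * (class_mean - overall_mean) ** 2
            within += sum((r[j] - class_mean) ** 2 for r in rows)
        weights.append(between / (within + eps))
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to 1

def weighted_distance(a, b, weights):
    """Euclidean distance with one non-negative weight per feature."""
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)))

def knn_predict(X_train, y_train, query, weights, k=3):
    """Majority vote among the k nearest training points under the weighted metric."""
    ranked = sorted(zip(X_train, y_train),
                    key=lambda pair: weighted_distance(pair[0], query, weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (assumed): feature 0 tracks the grade level, feature 1 is
# uninformative but has a much larger numeric range.
X_train = [[9.0, 10.0], [8.5, 90.0], [9.2, 50.0],
           [5.0, 12.0], [4.5, 88.0], [5.2, 49.0]]
y_train = ["high", "high", "high", "low", "low", "low"]
query = [8.8, 11.0]

weights = fisher_weights(X_train, y_train)
pred_weighted = knn_predict(X_train, y_train, query, weights)
pred_uniform = knn_predict(X_train, y_train, query, [0.5, 0.5])
print(pred_weighted, pred_uniform)
```

On this toy data the weighted metric assigns the query to the "high" group, whereas uniform weights assign it to "low" because the uninformative feature dominates the unweighted distance. The weights are computed in closed form from class statistics, so no gradient information is involved.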
To better explain the obtained results, we conducted a geometric analysis, which shows that SFWM significantly increases inter-class separation while compacting intra-class structures. This is particularly relevant for preventing school dropout, as distance-based classifiers rely heavily on neighborhood relationships to distinguish between students with different academic trajectories. In other words, SFWM groups students with similar academic trajectories more tightly while increasing the separation between distinct performance-based groups, which can be interpreted as differentiated dropout risk profiles.
Furthermore, although the silhouette analysis shows a moderate absolute value, as expected in large-scale, multi-class educational problems, the transition from negative to positive values indicates a meaningful improvement in class structure. Hence, the positive silhouette score obtained for SFWM indicates a more organized and interpretable feature space than the original dataset, which exhibits substantial class overlap, and than PCA, which provides only marginal improvements.
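The sign change in the silhouette score can be illustrated with a small Python sketch. For each sample, the silhouette coefficient compares the mean distance to its own class (intra-class compactness, a) with the smallest mean distance to any other class (inter-class separation, b) as (b - a) / max(a, b). The toy data and the feature weights below are assumptions chosen for illustration only; they are not the paper's dataset or the weights learned by SFWM.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with one non-negative weight per feature."""
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)))

def mean_silhouette(X, y, weights):
    """Mean silhouette coefficient of labelled points under a weighted metric."""
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from point i to every other point, grouped by class label.
        dists = {}
        for j, (xj, yj) in enumerate(zip(X, y)):
            if i != j:
                dists.setdefault(yj, []).append(weighted_distance(xi, xj, weights))
        own = dists.pop(yi, [])
        if not own or not dists:
            scores.append(0.0)  # convention for a singleton class or a single class
            continue
        a = sum(own) / len(own)                            # intra-class compactness
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other class
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy data (assumed): feature 0 separates the classes, feature 1 does not
# but spans a much larger numeric range.
X = [[9.0, 10.0], [8.5, 90.0], [9.2, 50.0],
     [5.0, 12.0], [4.5, 88.0], [5.2, 49.0]]
y = ["high", "high", "high", "low", "low", "low"]

raw_score = mean_silhouette(X, y, [1.0, 1.0])            # unweighted feature space
weighted_score = mean_silhouette(X, y, [0.999, 0.001])   # hypothetical learned weights
print(raw_score, weighted_score)
```

In this toy setting the unweighted score is negative because same-class points lie farther apart than points of different classes, and down-weighting the uninformative feature makes the score positive, mirroring the negative-to-positive transition described above.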
The results indicate that the proposed SFWM introduces minimal additional model complexity, does not rely on gradient information, and remains computationally efficient, making it suitable for large-scale academic datasets. From a pedagogical perspective, educational institutions could use our approach to develop early warning systems that support timely academic interventions, allowing educators to identify students who may require additional support based on their performance patterns rather than waiting for terminal outcomes such as dropout.
We conclude that SFWM is a lightweight, effective, robust, and scalable supervised metric-learning strategy for educational analytics. By improving the reliability of distance-based classification, SFWM provides practical support for the early, data-driven identification of students at risk of school dropout, leveraging academic performance as an operational proxy within the available institutional data and enabling more timely, targeted intervention strategies.
Preserving supervised metric information during dimensionality reduction constitutes a promising direction for further research. For future work, we would like to extend the application of SFWM to other distance-sensitive educational tasks, such as student clustering, academic trajectory retrieval, and early anomaly detection, as well as to incorporate datasets with explicitly observed dropout events to further validate the relationship between performance-based classifications and actual dropout behavior.