Addressing Class Imbalance in Predicting Student Performance Using SMOTE and GAN Techniques
Abstract
1. Introduction
2. Related Work
2.1. Prediction of Student Performance
2.2. Class-Imbalance Problem in Educational Settings
3. Coursera Dataset
4. Experimental Design and Evaluation Measures
- Experiment A (Balancing using SMOTE): Several SMOTE variants were applied to balance the Coursera dataset, retaining the same input features and target variable as the original data. Machine learning models were then trained on each balanced version and their performance compared against the baseline.
- Experiment B (Synthetic Data using GANs): Generative adversarial networks were used to augment the Coursera dataset with synthetic samples, and the impact of this synthetic data was evaluated by training the same set of machine learning models.
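The SMOTE variants used in Experiment A all build on the same core idea: synthesize new minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. A minimal stdlib-only sketch of that interpolation step (the function name and toy data are illustrative; the experiments themselves would use library implementations of the individual variants):

```python
import random
import math

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between a random minority sample and one of its k nearest
    minority-class neighbours (vanilla SMOTE, Chawla et al. 2002)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nn)))
    return synthetic

# Toy imbalanced data: a small "fail" class to be oversampled.
fail = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2)]
new_samples = smote(fail, n_new=5)
print(len(new_samples))  # 5 synthetic minority samples
```

The variants in the results tables differ mainly in which base samples they interpolate from (e.g., Borderline-SMOTE restricts to samples near the class boundary) or in a post-hoc cleaning step (e.g., SMOTE-Tomek removes Tomek links after oversampling).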
5. Results
5.1. Experiment A—Class Balancing Using SMOTE Techniques
5.2. Experiment B—Synthetic Data Using GANs
6. Discussions, Limitations and Future Opportunities
7. EduPredictor: Revolutionizing VLE Through Predictive Analytics
8. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Background to Sampling Techniques and Generative Models
Appendix A.1. Synthetic Minority Over-Sampling Technique (SMOTE)
Appendix A.2. Generative Adversarial Networks (GANs)
GAN Implementation Details
Appendix B. Evaluation Metrics
References











| Feature | Description |
|---|---|
| hits_count | Total number of interactions or clicks performed by a learner |
| partic_count | Number of participatory actions (e.g., forum or discussion activity) |
| video_duration | Total duration of video content available or consumed |
| assessment_type_id_6 | Binary indicator for participation in assessment type 6 |
| assessment_type_id_7 | Binary indicator for participation in assessment type 7 |
| video_count | Total number of course videos accessed by the learner |
| quiz_count | Number of quizzes attempted by the learner |
| total_quiz_grade | Aggregate score obtained across all quizzes |
| Course Passed | Binary target variable (0 = Fail, 1 = Pass) |
| Classifier | Hyperparameter | Search Space |
|---|---|---|
| MLP | activation | identity, logistic, tanh, relu |
| | solver | lbfgs, sgd, adam |
| | learning_rate_init | 0.1, 0.01, 0.001 |
| Decision Tree | splitter | best, random |
| | criterion | gini, entropy |
| | min_samples_split | 2, 10, 30, 50, 100 |
| | min_samples_leaf | 1, 10, 50, 100 |
| kNN | n_neighbors | 5, 10, 50, 100 |
| | weights | uniform, distance |
| | algorithm | auto, ball_tree, kd_tree, brute |
| | leaf_size | 10, 30, 50, 100 |
| Random Forest | n_estimators | 10, 30, 50, 100 |
| | criterion | gini, entropy |
| | min_samples_split | 2, 10, 30, 50, 100 |
| | min_samples_leaf | 1, 10, 50, 100 |
| XGBoost | booster | gbtree, gblinear, dart |
| | max_depth | 3, 10, 50, 100 |
| | learning_rate | 0.1, 0.01, 0.001 |
| | n_estimators | 2, 10, 100 |
| CatBoost | iterations | 150 |
| | learning_rate | 0.1, 0.01, 0.001 |
| | depth | 4, 5 |
| | l2_leaf_reg | 0.5, 1 |
| SVM | degree | 1, 3, 5 |
| | gamma | scale, auto |
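An exhaustive grid search over a table like the one above evaluates every combination in the Cartesian product of the per-parameter value lists. A stdlib-only sketch of how the candidate grids are enumerated (search spaces transcribed from two rows of the table; parameter names follow scikit-learn conventions):

```python
from itertools import product

# Search spaces for two of the classifiers above (illustrative subset).
search_spaces = {
    "MLP": {
        "activation": ["identity", "logistic", "tanh", "relu"],
        "solver": ["lbfgs", "sgd", "adam"],
        "learning_rate_init": [0.1, 0.01, 0.001],
    },
    "kNN": {
        "n_neighbors": [5, 10, 50, 100],
        "weights": ["uniform", "distance"],
        "algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
        "leaf_size": [10, 30, 50, 100],
    },
}

def grid(space):
    """Enumerate every hyperparameter combination (exhaustive grid search)."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*space.values())]

for name, space in search_spaces.items():
    print(name, len(grid(space)))  # MLP 36, kNN 128 candidate configurations
```

In practice a utility such as scikit-learn's `GridSearchCV` performs this enumeration and fits each candidate under cross-validation; the sketch only shows how quickly the candidate count grows with each added parameter.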
| Model | Hyperparameters |
|---|---|
| MLP | activation: logistic, learning_rate_init: 0.001, solver: sgd, max_iter: 500 |
| DT | criterion: entropy, min_samples_leaf: 100, min_samples_split: 2, splitter: random |
| KNN | algorithm: auto, leaf_size: 10, n_neighbors: 100, weights: uniform |
| RF | criterion: gini, min_samples_leaf: 100, min_samples_split: 10, n_estimators: 100 |
| XGBoost | booster: gbtree, learning_rate: 0.1, max_depth: 3, n_estimators: 10 |
| CatBoost | depth: 4, iterations: 150, l2_leaf_reg: 1, learning_rate: 0.001 |
| SVC | degree: 1, gamma: scale |
| Models | Accuracy | F1 Macro | F1 Positive | F1 Negative | Precision | Recall | Sensitivity | Specificity | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | |||||||||
| MLP | 0.81 | 0.76 | 0.66 | 0.86 | 0.80 | 0.77 | 0.87 | 0.66 | 0.55 |
| DT | 0.77 | 0.72 | 0.62 | 0.83 | 0.74 | 0.72 | 0.87 | 0.50 | 0.47 |
| KNN | 0.79 | 0.74 | 0.63 | 0.85 | 0.77 | 0.74 | 0.87 | 0.62 | 0.51 |
| RF | 0.78 | 0.73 | 0.62 | 0.85 | 0.79 | 0.74 | 0.88 | 0.61 | 0.51 |
| XGBoost | 0.80 | 0.74 | 0.61 | 0.86 | 0.79 | 0.74 | 0.90 | 0.57 | 0.52 |
| CATBoost | 0.79 | 0.74 | 0.64 | 0.85 | 0.79 | 0.75 | 0.86 | 0.64 | 0.53 |
| SVC | 0.71 | 0.65 | 0.54 | 0.76 | 0.74 | 0.69 | 0.75 | 0.62 | 0.40 |
| Borderline SMOTE | |||||||||
| MLP | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.78 | 0.80 | 0.76 | 0.56 |
| DT | 0.79 | 0.78 | 0.79 | 0.78 | 0.80 | 0.79 | 0.70 | 0.81 | 0.57 |
| KNN | 0.76 | 0.75 | 0.77 | 0.74 | 0.77 | 0.76 | 0.68 | 0.83 | 0.52 |
| RF | 0.78 | 0.78 | 0.80 | 0.78 | 0.79 | 0.78 | 0.73 | 0.84 | 0.58 |
| XGBoost | 0.79 | 0.79 | 0.80 | 0.78 | 0.80 | 0.79 | 0.73 | 0.85 | 0.59 |
| CATBoost | 0.80 | 0.79 | 0.80 | 0.78 | 0.80 | 0.80 | 0.74 | 0.85 | 0.60 |
| SVC | 0.74 | 0.74 | 0.75 | 0.72 | 0.75 | 0.74 | 0.67 | 0.81 | 0.49 |
| SMOTE | |||||||||
| MLP | 0.79 | 0.79 | 0.79 | 0.80 | 0.79 | 0.79 | 0.81 | 0.77 | 0.58 |
| DT | 0.76 | 0.76 | 0.78 | 0.76 | 0.77 | 0.76 | 0.68 | 0.82 | 0.55 |
| KNN | 0.77 | 0.77 | 0.79 | 0.75 | 0.78 | 0.77 | 0.69 | 0.85 | 0.55 |
| RF | 0.79 | 0.78 | 0.80 | 0.79 | 0.79 | 0.79 | 0.73 | 0.86 | 0.60 |
| XGBoost | 0.80 | 0.80 | 0.79 | 0.81 | 0.80 | 0.80 | 0.85 | 0.75 | 0.60 |
| CATBoost | 0.80 | 0.80 | 0.81 | 0.79 | 0.80 | 0.80 | 0.75 | 0.85 | 0.60 |
| SVC | 0.74 | 0.74 | 0.75 | 0.72 | 0.75 | 0.74 | 0.89 | 0.80 | 0.49 |
| SMOTE NN | |||||||||
| MLP | 0.88 | 0.87 | 0.88 | 0.86 | 0.88 | 0.88 | 0.83 | 0.91 | 0.75 |
| DT | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.88 | 0.86 | 0.89 | 0.76 |
| KNN | 0.88 | 0.88 | 0.87 | 0.88 | 0.88 | 0.88 | 0.88 | 0.87 | 0.76 |
| RF | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 | 0.89 | 0.85 | 0.92 | 0.78 |
| XGBoost | 0.88 | 0.88 | 0.88 | 0.89 | 0.88 | 0.88 | 0.89 | 0.87 | 0.78 |
| CATBoost | 0.89 | 0.89 | 0.89 | 0.90 | 0.89 | 0.89 | 0.91 | 0.87 | 0.78 |
| SVC | 0.86 | 0.86 | 0.86 | 0.86 | 0.87 | 0.86 | 0.85 | 0.87 | 0.73 |
| SMOTE Tomek | |||||||||
| MLP | 0.80 | 0.80 | 0.80 | 0.80 | 0.80 | 0.80 | 0.82 | 0.79 | 0.61 |
| DT | 0.79 | 0.79 | 0.81 | 0.78 | 0.81 | 0.80 | 0.72 | 0.87 | 0.60 |
| KNN | 0.79 | 0.79 | 0.80 | 0.77 | 0.80 | 0.79 | 0.73 | 0.85 | 0.58 |
| RF | 0.81 | 0.81 | 0.81 | 0.80 | 0.82 | 0.81 | 0.76 | 0.85 | 0.62 |
| XGBoost | 0.81 | 0.81 | 0.82 | 0.80 | 0.82 | 0.81 | 0.76 | 0.86 | 0.63 |
| CATBoost | 0.82 | 0.82 | 0.82 | 0.81 | 0.82 | 0.82 | 0.78 | 0.86 | 0.64 |
| SVC | 0.76 | 0.76 | 0.77 | 0.74 | 0.77 | 0.76 | 0.70 | 0.81 | 0.53 |
| SVM SMOTE | |||||||||
| MLP | 0.79 | 0.79 | 0.77 | 0.79 | 0.79 | 0.79 | 0.81 | 0.76 | 0.57 |
| DT | 0.78 | 0.77 | 0.79 | 0.76 | 0.78 | 0.78 | 0.70 | 0.85 | 0.56 |
| KNN | 0.77 | 0.77 | 0.78 | 0.76 | 0.78 | 0.77 | 0.72 | 0.83 | 0.55 |
| RF | 0.78 | 0.78 | 0.81 | 0.79 | 0.79 | 0.78 | 0.76 | 0.83 | 0.60 |
| XGBoost | 0.79 | 0.79 | 0.80 | 0.79 | 0.80 | 0.79 | 0.76 | 0.83 | 0.59 |
| CATBoost | 0.79 | 0.79 | 0.80 | 0.79 | 0.80 | 0.79 | 0.75 | 0.84 | 0.60 |
| SVC | 0.75 | 0.74 | 0.76 | 0.73 | 0.76 | 0.75 | 0.69 | 0.81 | 0.50 |
| Models | Accuracy | F1 Macro | F1 Positive | F1 Negative | Precision | Recall | Sensitivity | Specificity | MCC |
|---|---|---|---|---|---|---|---|---|---|
| GAN-Simulated Coursera Dataset | |||||||||
| MLP | 0.87 | 0.80 | 0.68 | 0.91 | 0.84 | 0.79 | 0.93 | 0.65 | 0.62 |
| DT | 0.82 | 0.74 | 0.63 | 0.90 | 0.77 | 0.74 | 0.89 | 0.59 | 0.55 |
| KNN | 0.85 | 0.77 | 0.64 | 0.90 | 0.83 | 0.77 | 0.92 | 0.62 | 0.58 |
| RF | 0.84 | 0.76 | 0.62 | 0.90 | 0.83 | 0.76 | 0.92 | 0.60 | 0.56 |
| XGBoost | 0.86 | 0.78 | 0.68 | 0.92 | 0.83 | 0.79 | 0.93 | 0.65 | 0.62 |
| CATBoost | 0.86 | 0.77 | 0.64 | 0.91 | 0.84 | 0.76 | 0.94 | 0.59 | 0.59 |
| SVC | 0.86 | 0.78 | 0.65 | 0.91 | 0.84 | 0.79 | 0.93 | 0.66 | 0.61 |
| Original + GAN-Simulated Coursera Dataset | |||||||||
| MLP | 0.80 | 0.72 | 0.59 | 0.88 | 0.79 | 0.71 | 0.93 | 0.51 | 0.52 |
| DT | 0.82 | 0.75 | 0.62 | 0.88 | 0.78 | 0.74 | 0.90 | 0.58 | 0.52 |
| KNN | 0.80 | 0.71 | 0.56 | 0.87 | 0.77 | 0.71 | 0.91 | 0.51 | 0.47 |
| RF | 0.82 | 0.74 | 0.60 | 0.88 | 0.79 | 0.74 | 0.91 | 0.55 | 0.52 |
| XGBoost | 0.83 | 0.77 | 0.56 | 0.88 | 0.82 | 0.76 | 0.92 | 0.60 | 0.49 |
| CATBoost | 0.81 | 0.74 | 0.60 | 0.88 | 0.78 | 0.73 | 0.92 | 0.55 | 0.51 |
| SVC | 0.80 | 0.71 | 0.55 | 0.87 | 0.77 | 0.71 | 0.92 | 0.50 | 0.46 |
| SMOTENN-GAN Coursera Dataset | |||||||||
| MLP | 0.92 | 0.92 | 0.91 | 0.89 | 0.92 | 0.92 | 0.89 | 0.95 | 0.82 |
| DT | 0.88 | 0.88 | 0.90 | 0.87 | 0.89 | 0.88 | 0.83 | 0.92 | 0.78 |
| KNN | 0.92 | 0.92 | 0.93 | 0.91 | 0.93 | 0.92 | 0.87 | 0.97 | 0.85 |
| RF | 0.92 | 0.91 | 0.92 | 0.90 | 0.93 | 0.91 | 0.86 | 0.96 | 0.83 |
| XGBoost | 0.91 | 0.91 | 0.90 | 0.87 | 0.92 | 0.91 | 0.85 | 0.97 | 0.80 |
| CATBoost | 0.91 | 0.90 | 0.91 | 0.90 | 0.91 | 0.91 | 0.90 | 0.91 | 0.81 |
| SVC | 0.90 | 0.90 | 0.91 | 0.89 | 0.91 | 0.90 | 0.88 | 0.93 | 0.81 |
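All of the metrics reported in the tables above derive from the four cells of a binary confusion matrix (Appendix B gives the formal definitions). A minimal sketch computing them from raw counts; the counts used here are illustrative only, not taken from the paper's experiments:

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Evaluation metrics for a binary classifier from confusion-matrix counts."""
    precision_pos = tp / (tp + fp)
    recall_pos = tp / (tp + fn)       # sensitivity (recall of the positive class)
    specificity = tn / (tn + fp)      # recall of the negative class
    precision_neg = tn / (tn + fn)
    f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)
    f1_neg = 2 * precision_neg * specificity / (precision_neg + specificity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Matthews correlation coefficient: robust to class imbalance
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "f1_macro": (f1_pos + f1_neg) / 2,
        "f1_positive": f1_pos,
        "f1_negative": f1_neg,
        "sensitivity": recall_pos,
        "specificity": specificity,
        "mcc": mcc,
    }

# Illustrative counts: 80 true passes, 90 true fails correctly identified.
m = binary_metrics(tp=80, fp=10, fn=20, tn=90)
print(round(m["accuracy"], 2), round(m["mcc"], 2))  # → 0.85 0.7
```

MCC is the headline comparison metric here because, unlike accuracy, it only approaches 1 when the classifier does well on both the pass and fail classes, which is exactly what balancing is meant to achieve.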
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Alnassar, F.M.; Blackwell, T.; Homayounvala, E.; Yee-king, M. Addressing Class Imbalance in Predicting Student Performance Using SMOTE and GAN Techniques. Appl. Sci. 2026, 16, 3274. https://doi.org/10.3390/app16073274

