Exploring Predictive Insights on Student Success Using Explainable Machine Learning: A Synthetic Data Study
Abstract
1. Introduction
2. State of the Art
3. Materials and Methods
3.1. Proposed Framework
3.2. Dataset
3.3. Data Analysis
3.4. Data Preprocessing
3.4.1. Null Values Treatment
3.4.2. Categorical Variables Encoding
3.5. Machine Learning Models
3.6. Explainability Methods
4. Results
4.1. Model Performance
4.2. Model Explainability
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Categorical Variable Encodings
Variable | Encoding Mapping |
---|---|
Parental_Involvement | Low = 0, Medium = 1, High = 2 |
Access_to_Resources | Low = 0, Medium = 1, High = 2 |
Motivation_Level | Low = 0, Medium = 1, High = 2 |
Family_Income | Low = 0, Medium = 1, High = 2 |
Teacher_Quality | Low = 0, Medium = 1, High = 2 |
Parental_Education_Level | High School = 0, College = 1, Postgraduate = 2 |
Distance_from_Home | Near = 0, Moderate = 1, Far = 2 |
Peer_Influence | Negative = 0, Neutral = 1, Positive = 2 |
School_Type | Private = 0, Public = 1 |
Extracurricular_Activities | No = 0, Yes = 1 |
Internet_Access | No = 0, Yes = 1 |
Learning_Disabilities | No = 0, Yes = 1 |
Gender | Female = 0, Male = 1 |
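The mappings in Appendix A can be applied programmatically. The following is a minimal sketch in plain Python, assuming each student record is a dictionary keyed by the variable names in the table; the helper `encode_record` and the sample row are hypothetical and only illustrate the lookup, not the authors' actual preprocessing code.

```python
# Ordinal and binary mappings copied from the Appendix A table.
LOW_MED_HIGH = {"Low": 0, "Medium": 1, "High": 2}
NO_YES = {"No": 0, "Yes": 1}

ENCODINGS = {
    "Parental_Involvement": LOW_MED_HIGH,
    "Access_to_Resources": LOW_MED_HIGH,
    "Motivation_Level": LOW_MED_HIGH,
    "Family_Income": LOW_MED_HIGH,
    "Teacher_Quality": LOW_MED_HIGH,
    "Parental_Education_Level": {"High School": 0, "College": 1, "Postgraduate": 2},
    "Distance_from_Home": {"Near": 0, "Moderate": 1, "Far": 2},
    "Peer_Influence": {"Negative": 0, "Neutral": 1, "Positive": 2},
    "School_Type": {"Private": 0, "Public": 1},
    "Extracurricular_Activities": NO_YES,
    "Internet_Access": NO_YES,
    "Learning_Disabilities": NO_YES,
    "Gender": {"Female": 0, "Male": 1},
}


def encode_record(record):
    """Return a copy of `record` with categorical labels replaced by integer codes.

    Columns not listed in ENCODINGS (the numeric ones) pass through unchanged.
    Missing values (None) stay None; an unknown label raises KeyError.
    """
    encoded = {}
    for column, value in record.items():
        mapping = ENCODINGS.get(column)
        if mapping is None or value is None:
            encoded[column] = value
        else:
            encoded[column] = mapping[value]
    return encoded


# Hypothetical student row, for illustration only.
sample = {
    "Hours_Studied": 20,
    "Motivation_Level": "High",
    "School_Type": "Public",
    "Teacher_Quality": None,
}
encoded_sample = encode_record(sample)
```

Keeping the mappings in one dictionary makes the ordinal order explicit and lets an unexpected label fail loudly instead of being silently assigned an arbitrary code.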
Thematic Group | Variable | Data Type | Description |
---|---|---|---|
Academic Performance | Hours Studied | Numeric | Number of hours spent studying per week. |
Academic Performance | Attendance | Numeric | Percentage of classes attended. |
Academic Performance | Previous Scores | Numeric | Scores from previous exams. |
Academic Performance | Exam Score | Numeric | Final exam score. |
Psychosocial and Demographic Factors | Motivation Level | Categorical/Ordinal | Student’s level of motivation. |
Psychosocial and Demographic Factors | Peer Influence | Categorical/Ordinal | Influence of peers on academic performance. |
Psychosocial and Demographic Factors | Learning Disabilities | Categorical/Binary | Presence of learning disabilities. |
Psychosocial and Demographic Factors | Gender | Categorical/Binary | Gender of the student. |
Lifestyle and Health | Sleep Hours | Numeric | Average number of hours of sleep per night. |
Lifestyle and Health | Physical Activity | Numeric | Average number of hours of physical activity per week. |
Lifestyle and Health | Extracurricular Activities | Categorical/Binary | Participation in extracurricular activities. |
Parental and Family Context | Parental Involvement | Categorical/Ordinal | Level of parental involvement in the student’s education. |
Parental and Family Context | Parental Education Level | Categorical/Ordinal | Highest education level of parents. |
Parental and Family Context | Family Income | Categorical/Ordinal | Family income level. |
Resource Access and Support | Access to Resources | Categorical/Ordinal | Availability of educational resources. |
Resource Access and Support | Internet Access | Categorical/Binary | Availability of internet access. |
Resource Access and Support | Tutoring Sessions | Numeric | Number of tutoring sessions attended per month. |
School and Contextual Factors | Teacher Quality | Categorical/Ordinal | Quality of the teachers. |
School and Contextual Factors | School Type | Categorical/Binary | Type of school attended. |
School and Contextual Factors | Distance from Home | Categorical/Ordinal | Distance from home to school. |
Variable | Mean | Std. Dev. | Min | Median | Max |
---|---|---|---|---|---|
Hours Studied | 19.98 | 5.99 | 1 | 20 | 44 |
Attendance (%) | 79.98 | 11.55 | 60 | 80 | 100 |
Sleep Hours | 7.03 | 1.47 | 4 | 7 | 10 |
Previous Scores | 75.07 | 14.40 | 50 | 75 | 100 |
Tutoring Sessions | 1.49 | 1.23 | 0 | 1 | 8 |
Physical Activity | 2.97 | 1.03 | 0 | 3 | 6 |
Exam Score (%) | 67.24 | 3.89 | 55 | 67 | 100 |
Attribute | % of Missing Values |
---|---|
Teacher quality | 1.18% |
Parental education level | 1.36% |
Distance from home | 1.01% |
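Percentages like those in the table above can be obtained by counting missing entries per column. The sketch below is a plain-Python illustration, assuming records are dictionaries and that a missing value is represented as `None`. The `impute_mode` helper shows one common treatment for categorical gaps (filling with the most frequent value); this is an assumption for illustration and is not confirmed here as the treatment the authors applied. The toy rows are hypothetical.

```python
from collections import Counter


def missing_percentage(rows, column):
    """Percentage of rows whose `column` is None (treated here as missing)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return 100.0 * missing / len(rows)


def impute_mode(rows, column):
    """Return new rows with missing `column` entries filled by the column's mode.

    Mode imputation is an illustrative assumption, not necessarily the
    treatment used in the study.
    """
    observed = [row[column] for row in rows if row.get(column) is not None]
    if not observed:
        return [dict(row) for row in rows]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [
        {**row, column: mode_value} if row.get(column) is None else dict(row)
        for row in rows
    ]


# Hypothetical toy data: one of four Teacher_Quality values is missing.
toy_rows = [
    {"Teacher_Quality": "High"},
    {"Teacher_Quality": None},
    {"Teacher_Quality": "Medium"},
    {"Teacher_Quality": "Medium"},
]
pct_missing = missing_percentage(toy_rows, "Teacher_Quality")
filled_rows = impute_mode(toy_rows, "Teacher_Quality")
```

With missing rates between roughly 1% and 1.4%, either imputing or dropping the affected rows would change the dataset only slightly; the choice matters more when gaps are concentrated in particular subgroups.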
Aspect | Global Explainability | Local Explainability |
---|---|---|
Scope | Entire dataset | Single instance |
Purpose | Understand overall feature importance and trends | Explain why a specific prediction was made |
Techniques | SHAP summary plot, feature importance ranking | SHAP waterfall plot, local surrogate models |
Useful for | Policy design, feature selection, model trust | Individual recommendations, personalized feedback |
Stakeholders | Researchers, administrators, policymakers | Teachers, tutors, academic advisors |
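The distinction between the two scopes can be made concrete with a toy example. The sketch below does not use the SHAP library or the tree ensembles evaluated in this study; instead it assumes a linear model with independent features, for which the SHAP value of feature i has the closed form w_i · (x_i − mean(x_i)). The weights, feature names, and data are hypothetical. Local explainability corresponds to the attributions for one student, and global explainability to the mean absolute attribution over all students.

```python
def linear_shap_values(weights, background, instance):
    """SHAP values of a linear model under feature independence.

    phi_i = w_i * (x_i - mean_i), where mean_i is the average of feature i
    over the background data. The values sum to f(x) minus the mean prediction.
    """
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (x - m) for w, x, m in zip(weights, instance, means)]


def global_importance(weights, background):
    """Mean absolute SHAP value per feature across the whole background set."""
    per_instance = [linear_shap_values(weights, background, row) for row in background]
    n = len(per_instance)
    return [sum(abs(phi[i]) for phi in per_instance) / n for i in range(len(weights))]


# Hypothetical model: score = b + 0.2 * Attendance + 0.3 * Hours_Studied.
feature_names = ["Attendance", "Hours_Studied"]
weights = [0.2, 0.3]
background = [[60, 10], [80, 20], [100, 30]]

# Local view: why does this one student's prediction differ from the average?
local_phi = linear_shap_values(weights, background, [100, 10])

# Global view: which features matter most on average?
global_phi = global_importance(weights, background)
```

For the sample student, high attendance pushes the prediction up while below-average study hours pull it down, which is the kind of per-student reading a waterfall plot conveys. The global ranking aggregates such attributions over every student, as a summary plot does.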
Model | RMSE | MAE | R² |
---|---|---|---|
Random Forest | 2.175 | 1.084 | 0.665 |
XGBoost | 2.226 | 0.986 | 0.649 |
LightGBM | 1.957 | 0.804 | 0.729 |
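For reference, the three metrics in the table follow their standard definitions: RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The following is a minimal plain-Python sketch of those definitions; the function name and the toy scores are hypothetical and are not taken from the study's data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for paired lists of true and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R^2 is undefined when the true values are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return rmse, mae, r2


# Hypothetical exam scores and predictions, for illustration only.
rmse, mae, r2 = regression_metrics([65, 70, 68, 72], [66, 69, 68, 70])
```

Because RMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does, so a model can rank differently on the two metrics, as Random Forest and XGBoost do in the table.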
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Santana-Perera, B.; García-Barceló, C.; González Arcas, M.; Gil, D. Exploring Predictive Insights on Student Success Using Explainable Machine Learning: A Synthetic Data Study. Information 2025, 16, 763. https://doi.org/10.3390/info16090763