# On Developing Generic Models for Predicting Student Outcomes in Educational Data Mining


## Abstract


## 1. Introduction

## 2. Review of Related Studies

#### 2.1. Related Study Exploring Prediction Using Generalised Model

#### 2.2. Related Study Exploring Early Prediction of Student Performance

## 3. Datasets

#### 3.1. Action Logs from Moodle

#### 3.2. Enrolment Management System

#### 3.3. Student Management System

## 4. Method

#### 4.1. Predictive Modelling

#### 4.2. Machine Learning Procedure

#### 4.3. Model Evaluation

#### 4.4. Experimental Design

## 5. Results

#### 5.1. Overall Performance Comparison of Classifiers

#### 5.2. Classifier Performance Snapshots

#### 5.3. Feature Importance

## 6. Discussion

## 7. Conclusions

## Author Contributions

## Funding

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- The New Media Consortium. Available online: http://www.hp.com (accessed on 15 June 2021).
- Junco, R.; Clem, C. Predicting course outcomes with digital textbook usage data. Internet High. Educ. **2015**, 27, 54–63. [Google Scholar] [CrossRef]
- Schumacher, C.; Ifenthaler, D. Features students really expect from learning analytics. Comput. Hum. Behav. **2018**, 78, 397–407. [Google Scholar] [CrossRef] [Green Version]
- Yang, C.C.Y.; Chen, I.Y.L.; Ogata, H. Toward Precision Education. Educ. Technol. Soc. **2021**, 24, 152–163. [Google Scholar] [CrossRef]
- Cavus, N. Distance Learning and Learning Management Systems. Procedia-Soc. Behav. Sci. **2015**, 191, 872–877. [Google Scholar] [CrossRef] [Green Version]
- Romero, C.; Ventura, S. Educational Data Mining: A Review of the State of the Art. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) **2010**, 40, 601–618. [Google Scholar] [CrossRef]
- Conijn, R.; Snijders, C.; Kleingeld, A.; Matzat, U. Predicting Student Performance from LMS Data: A Comparison of 17 Blended Courses Using Moodle LMS. IEEE Trans. Learn. Technol. **2017**, 10, 17–29. [Google Scholar] [CrossRef]
- Lust, G.; Elen, J.; Clarebout, G. Students’ tool-use within a web enhanced course: Explanatory mechanisms of students’ tool-use pattern. Comput. Hum. Behav. **2013**, 29, 2013–2021. [Google Scholar] [CrossRef]
- López-Zambrano, J.; Lara, J.A.; Romero, C. Towards Portability of Models for Predicting Students’ Final Performance in University Courses Starting from Moodle Logs. Appl. Sci. **2020**, 10, 354. [Google Scholar] [CrossRef] [Green Version]
- Namoun, A.; Alshanqiti, A. Predicting student performance using data mining and learning analytics techniques: A systematic literature review. Appl. Sci. **2021**, 11, 237. [Google Scholar] [CrossRef]
- Chen, F.; Cui, Y. Utilizing Student Time Series Behaviour in Learning Management Systems for Early Prediction of Course Performance. J. Learn. Anal. **2020**, 7, 1–17. [Google Scholar] [CrossRef]
- Nakayama, M.; Mutsuura, K.; Yamamoto, H. The possibility of predicting learning performance using features of note taking activities and instructions in a blended learning environment. Int. J. Educ. Technol. High. Educ. **2017**, 14, 6. [Google Scholar] [CrossRef] [Green Version]
- Gašević, D.; Dawson, S.; Rogers, T.; Gasevic, D. Learning analytics should not promote one size fits all: The effects of instructional conditions in predicting academic success. Internet High. Educ. **2016**, 28, 68–84. [Google Scholar] [CrossRef] [Green Version]
- Riestra-González, M.; Paule-Ruíz, M.d.; Ortin, F. Massive LMS log data analysis for the early prediction of course-agnostic student performance. Comput. Educ. **2021**, 163, 104108. [Google Scholar] [CrossRef]
- Queiroga, E.; Lopes, J.L.; Kappel, K.; Aguiar, M.S.; Araujo, R.M.; Munoz, R.; Villarroel, R.; Cechinel, C. A Learning Analytics Approach to Identify Students at Risk of Dropout: A Case Study with a Technical Distance Education Course. Appl. Sci. **2020**, 10, 3998. [Google Scholar] [CrossRef]
- Zhao, Q.; Wang, J.-L.; Pao, T.-L.; Wang, L.-Y. Modified Fuzzy Rule-Based Classification System for Early Warning of Student Learning. J. Educ. Technol. Syst. **2020**, 48, 385–406. [Google Scholar] [CrossRef]
- Ramaswami, G.S.; Susnjak, T.; Mathrani, A.; Umer, R. Predicting Students Final Academic Performance using Feature Selection Approaches. In Proceedings of the 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Gold Coast, Australia, 16–18 December 2020. [Google Scholar] [CrossRef]
- Howard, E.; Meehan, M.; Parnell, A. Contrasting prediction methods for early warning systems at undergraduate level. Internet High. Educ. **2018**, 37, 66–75. [Google Scholar] [CrossRef] [Green Version]
- Wolpert, D.H.; Macready, W.G. No Free Lunch Theorems for Optimization. IEEE Trans. Evol. Comput. **1997**, 1, 67–82. [Google Scholar] [CrossRef] [Green Version]
- Tayebinik, M.; Puteh, M. Blended Learning or E-learning? Available online: http://ssrn.com/abstract=2282881 (accessed on 30 September 2021).
- Estacio, R.R.; Raga, R.C., Jr. Analyzing students online learning behavior in blended courses using Moodle. Asian Assoc. Open Univ. J. **2017**, 12, 52–68. [Google Scholar] [CrossRef]
- Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. **2002**, 6, 429–449. [Google Scholar] [CrossRef]
- Dorogush, A.V.; Ershov, V.; Gulin, A. CatBoost: Gradient Boosting with Categorical Features Support. October 2018. Available online: http://arxiv.org/abs/1810.11363 (accessed on 30 June 2021).
- Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for big data: An interdisciplinary review. J. Big Data **2020**, 7, 94. [Google Scholar] [CrossRef]
- Mingyu, Z.; Sutong, W.; Yanzhang, W.; Dujuan, W. An interpretable prediction method for university student academic crisis warning. Complex Intell. Syst. **2021**, 1–14. [Google Scholar] [CrossRef]
- Breiman, L. Random forests. Mach. Learn. **2001**, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
- Domingos, P.; Pazzani, M. On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Mach. Learn. **1997**, 29, 103–130. [Google Scholar] [CrossRef]
- Hosmer, D.W.; Lemeshow, S. Applied Logistic Regression; John Wiley & Sons: New York, NY, USA, 2000. [Google Scholar]
- Hechenbichler, K.; Schliep, K. Weighted k-Nearest-Neighbor Techniques and Ordinal Classification. 2004. Available online: http://epub.ub.uni-muenchen.de/ (accessed on 4 October 2021).
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; et al. Scikit-learn: Machine Learning in Python. 2011. Available online: http://scikit-learn.sourceforge.net (accessed on 7 October 2021).
- Ferri, C.; Hernández-Orallo, J.; Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recognit. Lett. **2009**, 30, 27–38. [Google Scholar] [CrossRef]
- Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. **2006**, 27, 861–874. [Google Scholar] [CrossRef]
- Rice, M.E.; Harris, G.T. Comparing effect sizes in follow-up studies: ROC Area, Cohen’s d, and r. Law Hum. Behav. **2005**, 29, 615–620. [Google Scholar] [CrossRef]
- Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Available online: https://github.com/slundberg/shap (accessed on 12 October 2021).

**Figure 2.** F-measure and AUC plot of CatBoost on the Computer Applications and the Information Age—Semester 03 hold-out dataset.

**Figure 5.** Two SHAP scatter dependence plots: (**a**) left graph depicting the interaction between LMS engagement scores and assignment scores; (**b**) right graph depicting the interaction between learner age and LMS engagement scores.
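SHAP attributes each prediction to the input features, and a model-wide feature-importance ranking (as reported in Section 5.3) is conventionally obtained by averaging the absolute SHAP values per feature over all students. The study used the `shap` library (see References); the snippet below is only an illustrative stdlib sketch of that aggregation step, with toy values for hypothetical features, not the authors' code or data.

```python
def global_importance(shap_values, feature_names):
    """Global importance = mean absolute SHAP value per feature,
    averaged over all instances; larger means more influential."""
    n = len(shap_values)
    totals = [0.0] * len(feature_names)
    for row in shap_values:           # one row of per-feature attributions per student
        for j, v in enumerate(row):
            totals[j] += abs(v)
    # Rank features from most to least influential.
    return sorted(zip(feature_names, (t / n for t in totals)),
                  key=lambda kv: kv[1], reverse=True)

# Toy attributions for three of the study's features (values invented).
names = ["assignment score", "LMS engagement score", "age"]
values = [[0.30, -0.10, 0.02],
          [-0.25, 0.20, -0.01],
          [0.35, -0.15, 0.03]]
ranking = global_importance(values, names)
```

Signs cancel in individual predictions (a feature can push some students toward high risk and others away), which is why the absolute value is taken before averaging.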

| Authors | Prediction Goal | Evaluation Measures | Methods Compared | Best Performers |
|---|---|---|---|---|
| **Prediction Using Generalised Model** | | | | |
| [11] | Binary classification | AUC | LSTM | LSTM |
| [9] | Binary classification | AUC, AUC loss | DT | Proposed method |
| [7] | Binary classification | Accuracy | Linear and logistic regression | Proposed method |
| [12] | Course grades | R-squared and prediction error | Support vector regression | Proposed method |
| [13] | Binary classification | AUC | LR | Proposed method |
| **Early Prediction of Students’ Performance** | | | | |
| [19] | Binary classification | AUC, F-measure | DT, NB, LR, MLP neural network, and SVM | MLP |
| [14] | Multiclass classification | AUC | DT, RF, MLP, LR, ADA, GA | GA |
| [15] | Binary classification | F-measure | FRBCS and modified FRBCS | Modified FRBCS |
| [16] | Binary classification | F-measure, accuracy | kNN, RF, NB, and LR | LR |
| [17] | Final grades | MAE | RF, BART, PCR, KNN, NN, and SVM | BART |

| Course Name | Semester | Size (M) | Size (F) | Assessments | High Risk | Low Risk | Forum % | Quiz % | Folder % | Assign % | Resource % | Book % | URL % | Page % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Introduction to Finance | 1 | 39 | 73 | 3 | 65 | 47 | 30 | 34.1 | 9.8 | 8.9 | 1.4 | 8.6 | 3.7 | 1.6 |
| Introduction to Finance | 2 | 51 | 73 | 3 | 67 | 57 | 27 | 31.3 | 12.8 | 6.9 | 1.1 | 12.6 | 4.6 | 2.5 |
| Computer Applications and the Information Age | 1 | 67 | 48 | 4 | 94 | 21 | 42.1 | - | 1.1 | 24 | 21.1 | 7 | 0.9 | 2.9 |
| Computer Applications and the Information Age | 3 | 21 | 12 | 4 | 20 | 13 | 43.5 | - | 0.8 | 23.9 | 22.1 | 5 | 0.9 | 2.9 |
| Fundamentals of Information Technology | 1 | 78 | 15 | 3 | 68 | 25 | 34.2 | - | - | 23.1 | 17.4 | 4.9 | 16.6 | 3.1 |
| Fundamentals of Information Technology | 2 | 75 | 42 | 6 | 71 | 46 | 35.8 | - | - | 22.2 | 16.6 | 5.8 | 17.2 | 1.5 |
| Application Software Development | 1 | 87 | 10 | 6 | 58 | 39 | 23.4 | - | - | 34.7 | 34.3 | - | 7.5 | - |
| Internet Programming | 2 | 61 | 5 | 4 | 37 | 29 | 33.7 | - | 0.3 | 17 | 37.1 | - | 7.3 | 4.2 |
| System Analysis and Modelling | 2 | 68 | 25 | 4 | 65 | 28 | 33.1 | 7 | 4.1 | 8.4 | 21.2 | 16.5 | 6.6 | 0.09 |

Course size is split by gender (M/F), the grade distribution by risk class, and the logged-activity columns (Forum through Page) give each Moodle activity type as a percentage of all logged activities.

| Feature Name | Description | Type |
|---|---|---|
| Average score of prior courses | The mean score achieved by a student across all previous course scores | Numerical |
| Maximum score achieved in prior course | The maximum score achieved by a student from their previous courses | Numerical |
| Prior course deviation score | The Z-score of a student with respect to the deviation from the cohort mean | Numerical |
| Assignment score | The assignment scores received by a student | Numerical |
| Assignment deviation score | The Z-score of the student’s mean assignment score as a deviation from the cohort mean | Numerical |
| Prior role description | Student’s previous year’s primary activity | Numerical |
| LMS deviation score | The engagement score of a student expressed as a Z-score deviation from the cohort mean | Numerical |
| LMS engagement score | The count of all activities performed by a student on the Moodle platform | Numerical |
| Citizenship | The nationality of the student | Categorical |
| Age | The age of the student | Categorical |
| Highest school qualification | Highest school qualification at admission | Categorical |
| Study mode | Study by distance/online or on-campus | Categorical |
| Gender | Gender of the student | Categorical |
| English proficiency test | English proficiency test result | Categorical |
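The deviation features (prior course, assignment, and LMS deviation scores) all follow the same recipe: a student's raw value standardised against the cohort mean and standard deviation. A minimal stdlib sketch of that transformation, with an invented toy cohort, might look as follows; this is an illustration of the Z-score definition, not the authors' pipeline.

```python
from statistics import mean, pstdev

def deviation_scores(raw_scores):
    """Express each student's raw score as a Z-score relative to the
    cohort: (score - cohort mean) / cohort standard deviation."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)       # population SD over the cohort
    if sigma == 0:                   # degenerate cohort: everyone identical
        return [0.0 for _ in raw_scores]
    return [(s - mu) / sigma for s in raw_scores]

# Toy LMS engagement scores (Moodle activity counts) for five students.
engagement = [120, 80, 100, 60, 140]
lms_deviation = deviation_scores(engagement)
```

Standardising per cohort is what lets a single generic model compare students across courses whose absolute activity volumes differ widely.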

| Classifiers | F-Measure | Accuracy (%) | AUC |
|---|---|---|---|
| CatBoost | 0.77 ± 0.024 | 75 ± 2.1 | 0.87 ± 0.023 |
| Random Forest | 0.67 ± 0.025 | 67 ± 2.4 | 0.74 ± 0.015 |
| Naïve Bayes | 0.67 ± 0.023 | 68 ± 2.3 | 0.71 ± 0.034 |
| Logistic Regression | 0.68 ± 0.031 | 67 ± 3.0 | 0.73 ± 0.025 |
| K-Nearest Neighbors | 0.71 ± 0.02 | 71 ± 2.4 | 0.72 ± 0.022 |
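The AUC column above summarises ranking quality: the probability that a randomly chosen high-risk student receives a higher risk score than a randomly chosen low-risk one, with ties counted as half (the Mann–Whitney formulation; see the Fawcett reference). As a stdlib-only illustration of that definition, with invented labels and scores:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive instance is scored higher; ties contribute 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = high risk, 0 = low risk; scores are predicted risk.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
```

Because it depends only on the ranking of scores, AUC is insensitive to the class imbalance noted in the datasets, which is why it complements the threshold-dependent F-measure and accuracy columns.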

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ramaswami, G.; Susnjak, T.; Mathrani, A.
On Developing Generic Models for Predicting Student Outcomes in Educational Data Mining. *Big Data Cogn. Comput.* **2022**, *6*, 6.
https://doi.org/10.3390/bdcc6010006
