# Towards Predicting Student’s Dropout in University Courses Using Different Machine Learning Techniques


## Abstract


## Featured Application

**The model with the best performance metric values, identified by comparing several machine learning classifiers, can identify students at risk even when the set of educational data obtained during the course run is scarce. As a result, a suitable form of intervention can be applied in time at the level of the individual e-learning course.**


## 1. Introduction

## 2. Related Works

## 3. Data Preparation

#### 3.1. Business Understanding

#### 3.2. Data Understanding

#### 3.3. Data Cleaning and Feature Selection

- Accesses represent the total number of course views by a student in the observed period.
- Assignments represent a total score from different types of evaluated activities within the observed period.
- Tests represent a total score from the midterm and final tests during the semester.
- The Exam and Project features were also considered. Although they offer high analytic value, they were excluded: both are realized in the last third of the term, their values are directly tied to completing the course, and they are therefore still unknown at the time the prediction is expected to be made (during the semester).
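Under these constraints, an early-prediction feature matrix keeps only the attributes known mid-semester. A minimal sketch in pandas (column names and values are illustrative, not the study's dataset):

```python
import pandas as pd

# Illustrative records, one row per student, mirroring the features above.
df = pd.DataFrame({
    "accesses":    [440, 605, 898, 260],       # total course views in the observed period
    "assignments": [11.0, 14.4, 24.0, 20.0],   # summed scores from evaluated activities
    "tests":       [34.1, 40.5, 46.5, 35.0],   # midterm + final test scores
    "exam":        [9.0, 14.0, 16.0, 11.3],    # known only at the end of the term
    "project":     [66.4, 75.1, 84.6, 76.8],   # known only at the end of the term
})

# Exam and Project are unavailable at prediction time, so drop them
# before training an early-warning model.
X = df.drop(columns=["exam", "project"])
print(list(X.columns))  # ['accesses', 'assignments', 'tests']
```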

- There is a medium correlation between the key attributes, with their values correlating with the resulting number of points at the end of the course. This finding is not surprising.
- The variables project and tests correlate slightly more strongly with result_points than the other attributes, so it can be deduced that they have a more critical impact on the student's final outcome. They should therefore be considered carefully with respect to when they occur in the course; if used later in the course, they would have a more significant impact on prediction accuracy.
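Correlations of this kind can be obtained from a pandas correlation matrix; the data below are fabricated for illustration only:

```python
import pandas as pd

# Toy data with the attributes discussed above; values are illustrative,
# not the study's dataset.
df = pd.DataFrame({
    "accesses":      [300, 450, 620, 800, 950, 1200],
    "assignments":   [5, 10, 14, 22, 28, 34],
    "tests":         [10, 22, 35, 41, 50, 56],
    "project":       [20, 40, 60, 75, 85, 95],
    "result_points": [25, 44, 61, 78, 88, 97],
})

# Pearson correlation of every attribute with the final score.
corr = df.corr()["result_points"].drop("result_points").sort_values(ascending=False)
print(corr)
```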

## 4. Model

- A decision tree (DT) recursively partitions the input space and fits a local model in each resulting region. The algorithm thereby divides a complex classification task into a set of simple classification tasks [31].
- A Support Vector Machine (SVM) was initially designed for binary classification; it constructs an optimal hyperplane that separates the dataset into classes [34].
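A sketch of training these two classifier families with scikit-learn on synthetic stand-in data (the study's actual features and hyperparameters are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Synthetic binary "dropout" data standing in for the course features.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Decision tree: recursive partitioning of the input space.
dt = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# SVM: separating hyperplane (here with an RBF kernel).
svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

print(f"DT accuracy:  {dt.score(X_te, y_te):.2f}")
print(f"SVM accuracy: {svm.score(X_te, y_te):.2f}")
```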

## 5. Results and Evaluation

#### 5.1. Performance Metrics Comparison

#### 5.2. Comparison of Used Machine Learning Classifiers

## 6. Discussion

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Narayanasamy, S.K.; Elçi, A. An Effective Prediction Model for Online Course Dropout Rate. Int. J. Distance Educ. Technol. **2020**, 18, 94–110.
- Wang, W.; Yu, H.; Miao, C. Deep Model for Dropout Prediction in MOOCs. In Proceedings of the 2nd International Conference on Crowd Science and Engineering, Beijing, China, 6–9 July 2017; pp. 26–32.
- Prenkaj, B.; Velardi, P.; Stilo, G.; Distante, D.; Faralli, S. A Survey of Machine Learning Approaches for Student Dropout Prediction in Online Courses. ACM Comput. Surv. **2020**, 53, 1–34.
- Queiroga, E.M.; Lopes, J.L.; Kappel, K.; Aguiar, M.; Araújo, R.M.; Munoz, R.; Villarroel, R.; Cechinel, C. A Learning Analytics Approach to Identify Students at Risk of Dropout: A Case Study with a Technical Distance Education Course. Appl. Sci. **2020**, 10, 3998.
- Lu, X.; Wang, S.; Huang, J.; Chen, W.; Yan, Z. What Decides the Dropout in MOOCs? In Proceedings of the International Conference on Database Systems for Advanced Applications, Suzhou, China, 27–30 March 2017; pp. 316–327.
- Yang, B.; Qu, Z. Feature Extraction and Learning Effect Analysis for MOOCs Users Based on Data Mining. Educ. Sci. Theory Pract. **2018**, 18, 1138–1149.
- Moreno-Marcos, P.M.; Alario-Hoyos, C.; Munoz-Merino, P.J.; Kloos, C.D. Prediction in MOOCs: A Review and Future Research Directions. IEEE Trans. Learn. Technol. **2019**, 12, 384–401.
- Mubarak, A.A.; Cao, H.; Zhang, W. Prediction of students’ early dropout based on their interaction logs in online learning environment. Interact. Learn. Environ. **2020**.
- Jin, C. MOOC student dropout prediction model based on learning behavior features and parameter optimization. Interact. Learn. Environ. **2020**.
- Drlik, M.; Munk, M.; Skalka, J. Identification of Changes in VLE Stakeholders’ Behavior Over Time Using Frequent Patterns Mining. IEEE Access **2021**, 9, 23795–23813.
- Shaun, R.; Baker, J.D.; Inventado, P.S. Educational Data Mining and Learning Analytics; Springer: Berlin/Heidelberg, Germany, 2014; Chapter 4; pp. 61–75.
- Siemens, G.; Baker, R.S.J.D. Learning analytics and educational data mining. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Vancouver, BC, Canada, 29 April–2 May 2012; pp. 252–254.
- Alamri, A.; AlShehri, M.; Cristea, A.; Pereira, F.D.; Oliveira, E.; Shi, L.; Stewart, C. Predicting MOOCs Dropout Using Only Two Easily Obtainable Features from the First Week’s Activities. In Proceedings of the International Conference on Intelligent Tutoring Systems, Kingston, Jamaica, 3–7 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11528, pp. 163–173.
- Romero, C.; Ventura, S.; Baker, R.; Pechenizkiy, M. Handbook of Educational Data Mining (Chapman & Hall/CRC Data Mining and Knowledge Discovery Series); CRC Press: Boca Raton, FL, USA, 2010; ISBN 1584888784.
- Skalka, J.; Drlik, M. Automated Assessment and Microlearning Units as Predictors of At-risk Students and Students’ Outcomes in the Introductory Programming Courses. Appl. Sci. **2020**, 10, 4566.
- Ifenthaler, D.; Yau, J.Y.-K. Utilising learning analytics to support study success in higher education: A systematic review. Educ. Technol. Res. Dev. **2020**, 68, 1961–1990.
- Drlik, M.; Munk, M. Understanding Time-Based Trends in Stakeholders’ Choice of Learning Activity Type Using Predictive Models. IEEE Access **2018**, 7, 3106–3121.
- Hämäläinen, W.; Vinni, M. Classifiers for educational data mining. In Educational Data Mining Handbook; Romero, C., Ventura, S., Pechenizkiy, M., Baker, R.S.J.D., Eds.; CRC Press: Boca Raton, FL, USA, 2010; pp. 57–74.
- Lang, C.; Siemens, G.; Wise, A.; Gasevic, D. (Eds.) Handbook of Learning Analytics; SOLAR: Beaumont, AB, Canada, 2017.
- Kloft, M.; Stiehler, F.; Zheng, Z.; Pinkwart, N. Predicting MOOC Dropout over Weeks Using Machine Learning Methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs, Doha, Qatar, 25–29 October 2014; pp. 60–65.
- Uden, L. Learning Technology for Education in Cloud. Commun. Comput. Inf. Sci. **2015**, 533, 43–53.
- Baneres, D.; Rodriguez-Gonzalez, M.E.; Serra, M. An Early Feedback Prediction System for Learners At-Risk Within a First-Year Higher Education Course. IEEE Trans. Learn. Technol. **2019**, 12, 249–263.
- Kennedy, G.; Coffrin, C.; De Barba, P.; Corrin, L. Predicting success. In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, Poughkeepsie, NY, USA, 16–20 March 2015; pp. 136–140.
- Olivé, D.M.; Huynh, D.Q.; Reynolds, M.; Dougiamas, M.; Wiese, D. A supervised learning framework: Using assessment to identify students at risk of dropping out of a MOOC. J. Comput. High. Educ. **2019**, 32, 9–26.
- Benko, L.; Reichel, J.; Munk, M. Analysis of student behavior in virtual learning environment depending on student assessments. In Proceedings of the 2015 13th International Conference on Emerging eLearning Technologies and Applications (ICETA), Stary Smokovec, Slovakia, 26–27 November 2015; IEEE: Piscataway, NJ, USA; pp. 1–6.
- Herodotou, C.; Rienties, B.; Boroowa, A.; Zdrahal, Z.; Hlosta, M.; Naydenova, G. Implementing predictive learning analytics on a large scale. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference, Vancouver, BC, Canada, 13–17 March 2017; pp. 267–271.
- Márquez-Vera, C.; Cano, A.; Romero, C.; Noaman, A.Y.M.; Fardoun, H.M.; Ventura, S. Early dropout prediction using data mining: A case study with high school students. Expert Syst. **2016**, 33, 107–124.
- Charitopoulos, A.; Rangoussi, M.; Koulouriotis, D. On the Use of Soft Computing Methods in Educational Data Mining and Learning Analytics Research: A Review of Years 2010–2018. Int. J. Artif. Intell. Educ. **2020**, 30, 371–430.
- Romero, C.; Espejo, P.G.; Zafra, A.; Romero, J.R.; Ventura, S. Web usage mining for predicting final marks of students that use Moodle courses. Comput. Appl. Eng. Educ. **2010**, 21, 135–146.
- Rastrollo-Guerrero, J.L.; Gómez-Pulido, J.A.; Durán-Domínguez, A. Analyzing and Predicting Students’ Performance by Means of Machine Learning: A Review. Appl. Sci. **2020**, 10, 1042.
- Xing, W.; Chen, X.; Stein, J.; Marcinkowski, M. Corrigendum to “Temporal predication of dropouts in MOOCs: Reaching the low hanging fruit through stacking generalization”. Comput. Human Behav. **2017**, 66, 409.
- Youssef, M.; Mohammed, S.; Hamada, E.K.; Wafaa, B.F. A predictive approach based on efficient feature selection and learning algorithms’ competition: Case of learners’ dropout in MOOCs. Educ. Inf. Technol. **2019**, 24, 3591–3618.
- Obonya, J.; Kapusta, J. Identification of Important Activities for Teaching Programming Languages by Decision Trees. In Proceedings of the 12th International Scientific Conference on Distance Learning in Applied Informatics (DIVAI), Štúrovo, Slovakia, 2–4 May 2018; Turcani, M., Balogh, Z., Munk, M., Kapusta, J., Benko, L., Eds.; Kluiwert: Sturovo, Slovakia, 2018; pp. 481–490.
- Hagedoorn, T.R.; Spanakis, G. Massive Open Online Courses Temporal Profiling for Dropout Prediction. In Proceedings of the 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), Boston, MA, USA, 6–8 November 2017; pp. 231–238.
- Lacave, C.; Molina, A.I.; Cruz-Lemus, J.A. Learning Analytics to identify dropout factors of Computer Science studies through Bayesian networks. Behav. Inf. Technol. **2018**, 37, 993–1007.
- Doleck, T.; Lemay, D.J.; Basnet, R.B.; Bazelais, P. Predictive analytics in education: A comparison of deep learning frameworks. Educ. Inf. Technol. **2020**, 25, 1951–1963.
- Ali, M. PyCaret: An Open Source, Low-Code Machine Learning Library in Python. 2021. Available online: https://pycaret.readthedocs.io/en/latest/index.html (accessed on 10 February 2021).
- Dietterich, T.G. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput. **1998**, 10, 1895–1923.
- Ferguson, R.; Cooper, A.; Drachsler, H.; Kismihók, G.; Boyer, A.; Tammets, K.; Monés, A.M. Learning analytics. In Proceedings of the Fifth International Conference on Learning Analytics and Knowledge, Poughkeepsie, NY, USA, 16–20 March 2015; pp. 69–72.

**Figure 1.** Distribution of students’ dropout over several academic years (orange: number of failed students; blue: number of students who successfully finished the course).

**Figure 3.** Comparison of model performance using receiver operating characteristic (ROC) curves and the area under the curve (AUC).

| | Access | Tests | Assignments | Exam | Project |
|---|---|---|---|---|---|
| count | 261.000 | 261.000 | 216.000 | 261.000 | 261.000 |
| mean | 680.295 | 37.941 | 18.077 | 11.555 | 65.513 |
| std | 374.662 | 13.113 | 10.831 | 6.303 | 30.615 |
| min | 13.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | 440.000 | 34.079 | 11.000 | 9.000 | 66.410 |
| 50% | 605.000 | 40.500 | 14.400 | 14.000 | 75.140 |
| 75% | 898.750 | 46.500 | 24.000 | 16.000 | 84.610 |
| max | 2392.000 | 57.000 | 40.000 | 20.000 | 99.480 |

| | Access | Tests | Assignments | Exam | Project |
|---|---|---|---|---|---|
| count | 60.000 | 60.000 | 60.000 | 60.000 | 60.000 |
| mean | 925.533 | 45.114 | 25.533 | 11.842 | 74.405 |
| std | 309.606 | 14.320 | 10.357 | 5.192 | 33.238 |
| min | 260.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | 694.250 | 35.000 | 20.000 | 11.334 | 76.800 |
| 50% | 846.000 | 49.525 | 28.000 | 13.375 | 88.470 |
| 75% | 1152.500 | 56.360 | 34.000 | 14.901 | 94.555 |
| max | 1886.000 | 68.100 | 39.000 | 19.400 | 105.000 |

| Model | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| Naïve Bayes (NB) | 0.77 | 0.72 | 0.93 | 0.82 |
| Random Forest (RF) | 0.93 | 0.96 | 0.86 | 0.91 |
| Neural Network (NN) | 0.88 | 0.89 | 0.86 | 0.88 |
| Logistic Regression (LR) | 0.93 | 0.98 | 0.79 | 0.90 |
| Support Vector Machines (SVM) | 0.92 | 0.96 | 0.79 | 0.88 |
| Decision Tree (DT) | 0.90 | 0.98 | 0.71 | 0.85 |
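For reference, the four metrics in the table follow directly from confusion-matrix counts; the counts below are illustrative, not taken from the study:

```python
# Metric definitions underlying the table above, computed from a
# confusion matrix (tp, fp, fn, tn are illustrative counts).
tp, fp, fn, tn = 48, 8, 2, 42

accuracy  = (tp + tn) / (tp + fp + fn + tn)
recall    = tp / (tp + fn)        # sensitivity to actual dropouts
precision = tp / (tp + fp)        # how many flagged students truly drop out
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} f1={f1:.2f}")
```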

| Model | LR | DT | RF | NB | SVM |
|---|---|---|---|---|---|
| LR | - | p = 1.000 | p = 0.500 | p = 0.001 | p = 1.000 |
| DT | - | - | p = 0.250 | p = 0.000 | p = 1.000 |
| RF | - | - | - | p = 0.007 | p = 0.250 |
| NB | - | - | - | - | p = 0.000 |
| SVM | - | - | - | - | - |
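Pairwise p-values like those above can come from several tests for comparing classifiers evaluated on the same test set; one standard option, analyzed in Dietterich's work on approximate statistical tests (cited in the references), is McNemar's test. A sketch with illustrative disagreement counts (not the study's data):

```python
from scipy.stats import chi2

def mcnemar_p(b: int, c: int) -> float:
    """Asymptotic McNemar test with continuity correction.

    b: cases classifier A got right and classifier B got wrong;
    c: cases classifier B got right and classifier A got wrong.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2.sf(stat, df=1)

# Illustrative disagreement counts for two classifiers.
print(f"p = {mcnemar_p(12, 3):.3f}")
```

A low p-value indicates the two classifiers' error patterns differ beyond chance; large counts of symmetric disagreement (b ≈ c) push p toward 1, matching the many p = 1.000 cells above.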

| Model | Execution Time (s) |
|---|---|
| LR | 0.36 |
| DT | 0.84 |
| RF | 0.46 |
| NB | 0.38 |
| SVM | 0.52 |
| NN | 8.95 |
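Training times of this kind can be measured with a simple wall-clock timer; the snippet below times a scikit-learn logistic regression fit on synthetic stand-in data (absolute numbers will differ per machine):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the course features.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Wall-clock training time for a single fit.
start = time.perf_counter()
LogisticRegression(max_iter=1000).fit(X, y)
elapsed = time.perf_counter() - start
print(f"LR fit time: {elapsed:.3f} s")
```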

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kabathova, J.; Drlik, M. Towards Predicting Student’s Dropout in University Courses Using Different Machine Learning Techniques. *Appl. Sci.* **2021**, *11*, 3130.
https://doi.org/10.3390/app11073130

**AMA Style**

Kabathova J, Drlik M. Towards Predicting Student’s Dropout in University Courses Using Different Machine Learning Techniques. *Applied Sciences*. 2021; 11(7):3130.
https://doi.org/10.3390/app11073130

**Chicago/Turabian Style**

Kabathova, Janka, and Martin Drlik. 2021. "Towards Predicting Student’s Dropout in University Courses Using Different Machine Learning Techniques" *Applied Sciences* 11, no. 7: 3130.
https://doi.org/10.3390/app11073130