Article

Exploiting the Regularized Greedy Forest Algorithm Through Active Learning for Predicting Student Grades: A Case Study

by Maria Tsiakmaki 1, Georgios Kostopoulos 2 and Sotiris Kotsiantis 1,*
1 Educational Software Development Laboratory (ESDLab), Department of Mathematics, University of Patras, 265 04 Rio, Greece
2 School of Social Sciences, Hellenic Open University, 263 31 Patras, Greece
* Author to whom correspondence should be addressed.
Knowledge 2024, 4(4), 543-556; https://doi.org/10.3390/knowledge4040028
Submission received: 13 June 2024 / Revised: 16 October 2024 / Accepted: 22 October 2024 / Published: 24 October 2024

Abstract:
Student performance prediction is a critical research challenge in the field of educational data mining. To address this issue, various machine learning methods have been employed with significant success, including instance-based algorithms, decision trees, neural networks, and ensemble methods, among others. In this study, we introduce an innovative approach that leverages the Regularized Greedy Forest (RGF) algorithm within an active learning framework to enhance student performance prediction. Active learning is a powerful paradigm that utilizes both labeled and unlabeled data, while RGF serves as an effective decision forest learning algorithm acting as the base learner. This synergy aims to improve the predictive performance of the model while minimizing the labeling effort, making the approach both efficient and scalable. Moreover, applying the active learning framework for predicting student performance focuses on the early and accurate identification of students at risk of failure. This enables targeted interventions and personalized learning strategies to support low-performing students and improve their outcomes. The experimental results demonstrate the potential of our proposed approach as it outperforms well-established supervised methods using a limited pool of labeled examples, achieving an accuracy of 81.60%.

1. Introduction

Educational data mining (EDM), an important branch of data mining, has received extensive attention in recent decades [1]. Apart from extracting interesting and useful knowledge from students’ raw data, EDM tackles significant educational applications [2]. These applications include student performance prediction and dropout prevention, course recommendation systems, personalized learning, collaborative learning, curriculum design, and many more [3,4]. The ultimate objective of these efforts is to advance education and improve its outcomes. As such, EDM has become a highly active research field [5,6,7].
One important task of EDM is predicting whether a student is going to pass or fail a certain course. Identifying students at risk of failure benefits both students and educational environments [8,9,10]. To help such students avoid failure and improve their performance, remedial actions and advising interventions can be organized [11]. At the same time, teachers can critically analyze the reasons for failures, determine the preferences and difficulties that arise in learning, and refine their teaching strategies [1]. Finally, institutions can enhance the learning experience, develop effective methods, and ultimately reduce failure rates [12]. Consequently, the focus of our study is on predicting student grades based on current performance and behaviors.
Related studies have applied a variety of data mining methods to predict student dropout and pass/fail outcomes. These include probabilistic models, Bayesian learning, optimization algorithms, Support Vector Machines (SVMs), tree-based and rule-based algorithms, artificial neural networks, as well as ensemble learning methods like bagging and boosting. Ensemble methods, which combine multiple learners to improve their performance, have been widely explored in the literature. An alternative to tree boosting is the Regularized Greedy Forest (RGF) algorithm [13]. RGF integrates two main ideas into the learning formulation: (1) it searches for the optimal structural change to the current forest that minimizes the loss function (i.e., splitting an existing leaf node or starting a new tree), and (2) it adjusts the leaf weights of the entire forest to minimize the loss function. This fully corrective regularization strategy yields promising results.
Moreover, acquiring high-quality data samples is essential for training an effective model [14]. Some data may be redundant or unnecessary, or there may be cases where there are insufficient training samples. In this direction, active learning [15,16] identifies and selects the most valuable samples for the training process, while preserving or enhancing the model’s performance. In the pool-based scenario, given a large pool of unlabeled samples, an active learning algorithm iteratively selects the most valuable examples according to a certain query strategy and asks an expert to provide their true labels. In this manner, it aims to select high-quality data samples and eliminate the need for large, costly datasets. The advancement of educational technologies has enabled the collection of extensive learning data generated by the participants within educational environments [17]. Therefore, active learning can play a crucial role in navigating through educational datasets, selecting the most relevant and informative samples for further analysis and model training.
The current study has two primary objectives. Firstly, we aim to conduct a comparative analysis of the efficacy of the Regularized Greedy Forest (RGF) algorithm within the field of educational data mining (EDM). To achieve this, we evaluate six well-established classifiers against the RGF algorithm for predicting student academic performance using log data from an open educational dataset. A series of experiments are carried out, yielding promising results for accurately identifying students at risk of failure. Secondly, we propose a hybrid method that harnesses the power of active learning while integrating the RGF algorithm. Specifically, we investigate the efficiency of a margin sampling strategy, particularly in datasets where the decision boundary between classes is not well defined. The experimental results on the selected dataset show that this approach yields a highly accurate predictive model. To the best of our knowledge, this study is the first to simultaneously employ both the RGF classifier and active learning methodologies for predicting student performance, marking a significant step in this direction.
The remainder of this paper is organized as follows: Section 2 reviews recent research on the implementation of machine learning and active learning methods in educational contexts. Section 3 details the research methodology, while Section 4 outlines the experimental procedure. Section 5 analyzes the results and Section 6 concludes the paper with considerations for future research directions.

2. Related Works

Forecasting students’ learning behaviors and outcomes is one of the most critical tasks in the field of educational data mining (EDM). While traditional supervised learning methods have been widely studied, there is a notable scarcity of research investigating the effectiveness of boosting techniques for predicting students’ academic performance. Furthermore, studies incorporating active learning methodologies are even fewer, highlighting the need for deeper exploration in this area.

2.1. Recent Works on EDM

Mai et al.’s study [17] explores entropy’s applicability in EDM and learning analytics (LA). The authors propose a method for calculating learning behavioral entropy to track student progress. By utilizing data from 391 university students over three academic years in a programming course, they monitored student interactions on an online platform. The analysis included 2305 transition frequency features from which a comprehensive data matrix was constructed. The resulting entropy values served as indicators of student engagement, revealing that low entropy was associated with poor performance. This research underscores the potential of the entropy metric in educational research, highlighting its importance in understanding student behavior and learning outcomes in online educational environments.
The study by Altaf et al. [18] pursued the objective of precision education. To this end, the authors developed a hybrid deep learning model that integrates a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network. The model was trained over 50 epochs. They conducted a survey consisting of 30 questions and gathered a dataset of 14,000 instances from college and university students. The resulting prediction model achieved an impressive accuracy of 98.8%.
The authors of [19] argue that valuable EDM applications, such as forecasting students’ marks and grades, are closely related to the goals of sustainable education. The final dataset used in their experiments comprised 30 attributes of 90,000 students. These instances were exported from the Board of Intermediate and Secondary Education, a system for conducting intermediate and secondary education examinations in Peshawar. The authors built regression and classification models with k-NN (k = 10) and decision tree algorithms in conjunction with genetic algorithms (GA). The results revealed that the proposed GA-based decision tree classifier and regressor outperformed the alternatives, achieving 96.64% accuracy and a root mean square error of 5.34.
The research by Villegas-Ch et al. [20] emphasizes the need for tools to detect factors influencing student learning. The researchers collected educational data from a private university’s learning management system (LMS) in Ecuador, along with surveys regarding students’ use of IT tools and their perceptions of the hybrid learning modality. They employed various machine learning techniques, including J48, Naive Bayes, SMO-SVM, multilayer perceptron, and simple K-means clustering. The study identified five key factors influencing learning outcomes: the frequency of studying (measured by time spent using the LMS), the educational resources provided, the time dedicated to work activities, the learning environment, and the perceived value of study hours.
The CatBoost machine learning algorithm was utilized by Asad et al. in their recent research on personalized precision education [21]. CatBoost utilizes gradient boosting on decision trees, with modifications designed to mitigate overfitting. In this study, a dataset was collected via a survey administered to 4000 higher education students in various universities in Pakistan. The survey comprised 30 questions focused on assessing factors related to student performance. The resulting prediction model achieved an impressive accuracy rate of 96.8%.
The use of clickstream data in predictive analysis is the focus of Liu et al.’s research [22]. In the context of education, clickstream data refer to the paths students take through various learning resources. For this study, the researchers selected demographic and interaction data, utilizing pass/fail labels for analysis. They generated two feature sets for the training models: one based on weekly click counts and another based on monthly click counts. Their experiments included both traditional machine learning and deep learning methods. The results indicated that the Long Short-Term Memory (LSTM) algorithm outperformed the other methods, achieving a classification accuracy of up to 90.25%.
The feasibility of using predictive analytics in students’ learning is explored in the research by Xing et al. [23]. The authors propose a methodological workflow aimed at assisting struggling students. To achieve this, they utilized Radial Basis Function-based Support Vector Machines (RBF-SVMs) to develop performance prediction models and employed the Extra Trees classification method to rank the importance of various features. The dataset was constructed using log files from the Energy3D software, an engineering tool that allows students to design and test prototype projects in renewable energy. The results demonstrated that RBF-SVMs outperformed other baseline algorithms, including k-NN, Naïve Bayes, and decision trees.

2.2. Related Works on Active Learning

One of the first studies dealing with the implementation of active learning in the educational field is presented in [24]. A set of six familiar classification algorithms was used as base learners in a pool-based sampling scenario employing the margin sampling query strategy for predicting student performance. The experimental results revealed that the Sequential Minimal Optimization algorithm achieved an accuracy of 80.82% before the final examinations, thus promoting the early identification of low performers. In a similar study [25], the same active learning framework was employed for predicting student dropout in distance higher education, with seven supervised algorithms used as base learners. The active learner using the NB classifier as a base learner proved to be the most efficient, with an accuracy ranging from 66.29% at the beginning of the academic year to 84.56% in the middle of the academic year.
In a recent study [26], an active learning-based approach was proposed for categorizing students from a Canadian public university according to their social and cognitive presence. The random forest classifier was employed as the base learner to build two active learning models. The results indicated that the proposed approach achieved results similar to the classical supervised classification method in terms of accuracy and Cohen’s kappa while using only 20% of the fully labeled dataset.
Five active learning methods were employed in [27] for detecting student affect from data collected in real classrooms, namely uncertainty sampling, random sampling, Expected Variance Reduction, Model Change, and the Linear Minimum Mean Square Error Estimator (L-MMSE). The experiments indicated that the L-MMSE outperformed the others in terms of AUC. For cases where the active learning model does not have access to sufficient labeled data for picking the “best” instance to augment the initially labeled pool, three active learning methods were proposed in [28] based on the uncertainty sampling strategy, the random sampling strategy, and the L-MMSE. Uncertainty sampling and random sampling proved to be more robust for detecting student affect and improving the model performance as new data arrive.

3. Research Methodology

3.1. Research Motivation

The motivation of this research is to investigate the potential of efficient and scalable machine learning techniques to improve the prediction of student academic performance, particularly in identifying students at risk of failure. Despite their advantages, traditional classifiers have limitations in terms of predictive performance and require extensively labeled data [13,29,30]. By examining the Regularized Greedy Forest (RGF) algorithm within an active learning framework, this study aims to improve prediction performance while reducing the labeling effort. To the best of our knowledge, no previous study has employed this approach for predicting student performance.

3.2. The RGF Classification Algorithm

The Regularized Greedy Forest (RGF) algorithm [13] is an effective decision forest learning algorithm for dealing with a wide range of ML problems. A known issue with decision trees is their tendency to overfit. Ensemble learning techniques, such as bagging and boosting, address this issue by combining a number of different models. Boosting involves sequentially training weak models, with each model learning from the mistakes of the previous one. In general, boosted decision trees are considered one of the most effective off-the-shelf nonlinear learning methods for a wide variety of application problems.
The key ideas of RGF are as follows. Like Friedman’s general gradient descent boosting paradigm [31], the RGF algorithm can handle any differentiable loss function. To achieve better performance, the algorithm uses a fully corrective greedy search to optimize all the coefficients of the decision rules in the forest. In addition, to prevent overfitting, the algorithm imposes explicit regularization via the structured sparsity concept. By working directly with the underlying forest structure, RGF builds more accurate models.
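For intuition, the optimization RGF performs can be summarized compactly. The following is a simplified rendering of the formulation in [13], with notation of our own choosing: the forest F defines basis functions b_j(x), one per leaf, with weights w_j, and training minimizes a regularized empirical loss:

f_F(x) = \sum_{j} w_j \, b_j(x)

\min_{F,\, w} \;\; \frac{1}{n} \sum_{i=1}^{n} \ell\big( f_F(x_i),\, y_i \big) \; + \; \lambda\, R(F)

Here \ell is any differentiable loss and R(F) is the structured-sparsity regularizer. The greedy search changes the structure F one step at a time (splitting a leaf or starting a new tree), while the fully corrective step re-optimizes all weights w jointly, in contrast to standard boosting, which fixes the weights of previously added trees.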

3.3. Active Learning

Active learning or query learning is a typical paradigm for learning with both labeled and unlabeled data [15]. The active learning framework is illustrated in Figure 1. Initially, a classification algorithm (the base learner) is trained on a small pool of labeled data (L) to create a learning model (the Classifier). This classifier is then applied to a pool of unlabeled data (U) from which the most informative instances are selected. An oracle provides the true labels for these instances, allowing the labeled pool (L) to be augmented. The classification algorithm is subsequently retrained, and this process is repeated until a predetermined stopping criterion is met.
Several scenarios and query strategies have been proposed to enhance classifier performance. In this study, we adopted a pool-based scenario along with the margin sampling strategy. The pool-based scenario involves a small pool of labeled data (L) and a larger pool of unlabeled data (U). The most informative instances are selected using the margin sampling strategy, which is a widely used approach for classification problems.
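To make the query criterion concrete, the following minimal sketch (our illustration, not the authors’ code) computes the margin, i.e., the difference between the two highest predicted class probabilities, for every unlabeled instance and returns the most ambiguous one:

import numpy as np

def margin_query(classifier, X_unlabeled):
    # Posterior class probabilities for every unlabeled instance, shape (n, n_classes)
    proba = classifier.predict_proba(X_unlabeled)
    # Move each row's two largest probabilities into its last two positions
    top2 = np.partition(proba, -2, axis=1)
    # Margin: best minus second-best probability; a small margin means ambiguity
    margins = top2[:, -1] - top2[:, -2]
    # The instance whose label the current model is least certain about
    return int(np.argmin(margins))

The returned index identifies the instance handed to the oracle for labeling in each iteration of the loop described above.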

3.4. Proposed Methodology

This study explores the potential of enhancing student grade prediction models by incorporating the RGF algorithm [13] as the base learner in the active learning framework [15]. Learning a decision forest through a fully corrective regularized greedy search has been shown to yield superior results compared to other boosting approaches. Active learning focuses on selecting the most informative unlabeled examples, thereby reducing the number of labeled instances required for training an accurate classification model. In this study, we first explore the use of the Regularized Greedy Forest (RGF) algorithm to enhance the prediction of student performance based on their interactions with an e-learning management system. Next, we investigate various active learning scenarios, employing different metrics to evaluate classification quality. Our objective is to determine whether, and to what extent, active learning can reduce the number of labeled examples needed to train an effective learning model for educational data mining purposes. Ultimately, we demonstrate that these innovative methods can be successfully applied in educational settings. The pseudo-code for the proposed method is presented in Table 1.

4. Experiments and Results

4.1. Dataset

The dataset (https://www.kaggle.com/datasets/aljarah/xAPI-Edu-Data (accessed on 5 August 2024)) used in the present study is described in detail in [32]. The dataset consists of educational data collected from a learning management system called Kalboard 360, along with an activity tracker tool known as xAPI. This tracker adheres to the established Experience API (xAPI) specification for learning technology, enabling the collection of educational data on a wide range of student activities throughout the learning process. Examples of the data that xAPI can gather for each student include the frequency of accessing learning materials, participation in discussion forums or chats, and the time taken to complete assessments. Given these specific characteristics, we selected this dataset because it closely aligns with our research objectives.
The dataset comprises 16 features, which are categorized into four main groups: demographic features, academic background features, parental participation features, and behavioral features related to students’ interactions with the e-learning management system. Table 2 outlines the features within each category along with the types of data utilized.
Table 3 lists the features of the dataset, providing their descriptions and typical values to give an overview of the data characteristics.
The dataset contains records from a total of 480 students, comprising 175 females and 305 males. The output class represents the students’ total grades, categorized as {low, medium, high}, which frames the problem as a multiclass classification task. The distribution of the output classes is as follows: 26% for low, 44% for medium, and 30% for high (see Table 4). This distribution indicates that the dataset is balanced.
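For reference, the class distribution can be verified directly from the data. The following is a minimal sketch; the file name xAPI-Edu-Data.csv and the Class column label follow the Kaggle distribution of the dataset and may differ in other copies:

import pandas as pd

# Load the xAPI-Edu-Data dataset (file name as distributed on Kaggle)
df = pd.read_csv("xAPI-Edu-Data.csv")

# 480 rows: 16 features plus the three-valued target column
print(df.shape)
# Relative class frequencies; expected roughly L: 26%, M: 44%, H: 30%
print(df["Class"].value_counts(normalize=True))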

4.2. Experimental Setup

The purpose of our study is to predict students’ performance given features collected from the online learning management platform. The experiments were executed in two steps. First, we examined the predictive performance of six different classifiers in order to investigate the effectiveness of the RGF algorithm in educational environments compared to the others. Next, we examined the RGF algorithm in active learning scenarios using the margin sampling strategy. Margin sampling focuses on instances that the model finds most ambiguous [33,34].
The performance of the RGF algorithm was tested in the first step against familiar classification algorithms, which are briefly presented below:
  • The C4.5 decision tree algorithm: A well-known extension of ID3, which was introduced by Quinlan [35]. We used the J4.8 decision tree learning implementation.
  • Multilayer perceptron: an artificial neural network architecture (ANN) [36] that utilizes feedforward connections and backpropagation for training learning models.
  • Naïve Bayes: a probabilistic machine learning algorithm that is based on the Bayes theorem [37].
  • Bagging: The bagging predictor first generates multiple instances of a predictive model and then combines them to obtain a more accurate predictor [38]. We tested bagging with Naïve Bayes, multilayer perceptron, and C4.5 classifiers.
  • Boosting: A technique that sequentially trains weak predictive models to improve the overall predictive performance of the classifier [10,39]. We tested boosting with Naïve Bayes, multilayer perceptron, and C4.5 classifiers.
  • Random forests (RF): an ensemble learning method consisting of multiple decision trees, widely utilized for both classification and regression problems [40].
The primary goal of the first phase was to examine the effectiveness of the RGF algorithm against common categories of learning algorithms and ensemble methods using educational data. A label encoder was used to transform non-numerical labels into numerical labels. All the experiments were conducted using the 10-fold cross-validation method. To evaluate the performance of each classifier, we calculated four metrics: accuracy, precision, recall, and F1 score. Feature selection was not performed in this study.
The experiments were conducted using WEKA [41] and the RGF wrapper implementation for Python. The wrapper has a scikit-learn interface and supports multiclass classification problems. Table 5 lists the values of the RGF hyperparameters used in our experiments, chosen to balance model complexity, generalization, and computational efficiency.
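A minimal sketch of this first-phase setup follows (our reconstruction, not the authors’ script), assuming the rgf_python package and the dataset loaded as above; the hyperparameter values are those listed in Table 5, while macro-averaging of precision, recall, and F1 is our assumption:

import pandas as pd
from rgf.sklearn import RGFClassifier
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("xAPI-Edu-Data.csv")

# Transform non-numerical labels into numerical labels, as described above
X = df.drop(columns=["Class"]).copy()
for col in X.select_dtypes(include="object"):
    X[col] = LabelEncoder().fit_transform(X[col])
y = LabelEncoder().fit_transform(df["Class"])

# RGF hyperparameters taken from Table 5
rgf = RGFClassifier(max_leaf=10000, test_interval=100, algorithm="RGF_Sib",
                    loss="Log", reg_depth=1.0, l2=1.0, sl2=0.001,
                    min_samples_leaf=10, learning_rate=0.001)

# 10-fold cross-validation over the four reported metrics
scores = cross_validate(rgf, X, y, cv=10,
                        scoring=["accuracy", "precision_macro",
                                 "recall_macro", "f1_macro"])
print({name: values.mean() for name, values in scores.items()
       if name.startswith("test_")})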
In the second phase, we assessed the effectiveness of the Regularized Greedy Forest (RGF) classifier within the framework of active learning. To accomplish this, we utilized the margin sampling strategy, employing the RGF classifier from the previous phase as the base learner for each iteration.
Initially, the 480 instances of the dataset were divided into two subsets: 30% (144 instances) were designated as the testing set, while the remaining 70% (336 instances) served as the training set. From the training set, 20 random examples were selected to form the initial labeled pool, with the remaining instances constituting the unlabeled pool. In each iteration, the most informative unlabeled example was chosen until all training instances were labeled. This process was repeated 10 times, with a different train–test split used for each repetition. The experiments were conducted using the modAL active learning library (modAL: A modular active learning framework for Python3: https://modal-python.readthedocs.io, accessed on 5 August 2024).
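A condensed sketch of this second-phase loop using modAL follows (our reconstruction under the setup described above, not the authors’ exact script); X and y are the feature and label arrays prepared in the previous sketch, and the random seeds are placeholders:

import numpy as np
from modAL.models import ActiveLearner
from modAL.uncertainty import margin_sampling
from rgf.sklearn import RGFClassifier
from sklearn.model_selection import train_test_split

# 70/30 train-test split (336 training / 144 testing instances)
X_train, X_test, y_train, y_test = train_test_split(
    np.asarray(X), np.asarray(y), test_size=0.30, random_state=0)

# 20 random training examples seed the initial labeled pool
rng = np.random.RandomState(0)
init_idx = rng.choice(len(X_train), size=20, replace=False)
X_pool = np.delete(X_train, init_idx, axis=0)
y_pool = np.delete(y_train, init_idx, axis=0)

# RGF as the base learner (the Table 5 hyperparameters can be passed here,
# as in the previous sketch), queried via the margin sampling strategy
learner = ActiveLearner(estimator=RGFClassifier(),
                        query_strategy=margin_sampling,
                        X_training=X_train[init_idx],
                        y_training=y_train[init_idx])

# Label the most informative instance per iteration until the pool is empty,
# recording one point of the learning curve after each query
accuracy_curve = []
while len(X_pool) > 0:
    query_idx, _ = learner.query(X_pool)
    learner.teach(X_pool[query_idx], y_pool[query_idx])  # oracle supplies the label
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
    accuracy_curve.append(learner.score(X_test, y_test))

Averaging such curves over the ten train–test splits yields learning curves of the kind reported in Figure 2.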

5. Results

5.1. First Phase of Experiments: Comparing the Performance of the RGF Classifier

To evaluate the performance of the classifiers, we calculated several metrics, including accuracy, recall, precision, and F1 score. Table 6 summarizes the results for each classifier, with the highest score for each metric indicated by a star. Overall, the Regularized Greedy Forest (RGF) classifier outperformed the other classifiers across all metrics. Specifically, for this dataset, the RGF classifier achieved an accuracy of 81.60%, a recall of 81.81%, a precision of 82.29%, and an F1 score of 81.86%. Notably, it demonstrated superior performance compared to other boosting algorithms.
Subsequently, we evaluated the predictive performance of the selected classifiers using the Friedman Aligned Ranks non-parametric test [30] at a significance level of α = 0.05. The classifiers are ordered by rank, where the lowest rank value corresponds to the best performance (Table 7). Overall, the test confirms that RGF is the best-performing classifier.

5.2. Second Phase of Experiments: Incorporating the RGF Classifier Within an Active Learning Scenario

In the second phase of the experiments, we assessed the effectiveness of the Regularized Greedy Forest (RGF) classifier within the context of active learning. For this purpose, we employed a pool-based scenario where the RGF classifier served as the base learner, utilizing the margin sampling strategy.
Figure 2 illustrates the averaged learning curves of the active learning algorithm across four metrics. In all cases, the metrics show an increasing trend with the number of labeled examples. The learning curves indicate that accuracy reaches the performance level of the full dataset after 216 iterations, while precision, recall, and F1 score achieve their optimal values after 210 iterations. This demonstrates that the active learning approach is as effective as traditional learning models while utilizing only about two-thirds of the training data, underscoring its robustness.
Consequently, this study highlights that the combination of RGF with an active learning framework represents a promising methodology for accurately identifying students at risk of failure.

6. Discussion

Using the Regularized Greedy Forest (RGF) algorithm within an active learning framework offers several advantages:
  • Enhanced Prediction Accuracy: The Regularized Greedy Forest (RGF) is recognized for its effective feature selection capabilities, which enhance the construction of accurate predictive models by concentrating on the most relevant features. Additionally, RGF is adept at handling complex interactions and non-linearities within the data, thereby improving the model’s predictive accuracy.
  • Efficient Use of Data: Active learning optimizes the training process by selecting the most informative data points for labeling, ensuring that the model is trained on instances that yield the greatest improvements. This approach reduces the amount of labeled data required while strategically querying only the most valuable samples, thereby conserving resources and minimizing effort.
  • Regularization Benefits: The Regularized Greedy Forest (RGF) algorithm incorporates explicit regularization, such as L2 penalties on the leaf weights (the l2 and sl2 hyperparameters in Table 5), to prevent overfitting and ensure that the model generalizes effectively to new, unseen data. These regularization methods help maintain model simplicity by penalizing complexity, which enhances interpretability and further reduces the risk of overfitting.
  • Scalability: The Regularized Greedy Forest (RGF) algorithm is capable of efficiently processing large datasets with high-dimensional feature spaces, making it particularly suitable for educational datasets that frequently encompass numerous variables. Additionally, the iterative nature of the active learning framework enables incremental updates to the model, enhancing its scalability and adaptability to new data.
  • Adaptability: The active learning framework enables the model to dynamically adapt to new data and evolving patterns, ensuring high prediction accuracy over time. Its versatility allows it to accommodate different types of data and acquisition functions, making it suitable for a wide range of educational settings and requirements.
Although our study focused on a single query strategy, future research should explore other methods, such as uncertainty sampling, query by committee, and density-weighted strategies, to provide more comprehensive results.

7. Conclusions

In this paper, we explore the potential of integrating the Regularized Greedy Forest (RGF) algorithm with an active learning framework to enhance the prediction of student grades. Our approach capitalizes on RGF’s robust feature selection and its ability to handle complex interactions, while also harnessing the efficiency and effective data utilization that active learning offers. This synergy enables the development of highly accurate predictive models while significantly reducing the labeling effort required—an important advantage in educational settings where data collection can be resource-intensive.
Our experiments with an educational dataset reveal that the proposed method significantly outperforms traditional models in prediction accuracy. The active learning component strategically selects the most informative data points, ensuring that each iteration of model training is both effective and efficient. This approach not only reduces the overall volume of labeled data required but also accelerates the model development process.
Moreover, the regularization techniques inherent in RGF effectively prevent overfitting, ensuring that the models generalize well to new data. This capability is essential for maintaining the reliability of predictions in dynamic educational environments. Additionally, the iterative nature of active learning allows the model to remain adaptable and responsive to new information, facilitating continuous improvement in predictive performance.
In conclusion, the integration of the Regularized Greedy Forest (RGF) algorithm with active learning represents a powerful approach for predicting student grades, yielding substantial enhancements in accuracy, efficiency, and scalability. This methodology offers significant potential for educational institutions aiming to implement data-driven strategies for student support and resource allocation. Future work can explore the extension of this framework to other educational outcomes, multiple datasets, and its application in real-time adaptive learning systems.

Author Contributions

Conceptualization, M.T. and G.K.; methodology, S.K.; software, M.T.; validation, G.K.; formal analysis, S.K.; investigation, M.T.; resources, M.T.; data curation, M.T.; writing—original draft preparation, M.T.; writing—review and editing, G.K.; visualization, M.T.; supervision, S.K.; project administration, S.K.; funding acquisition, S.K. All authors have contributed equally to the work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Romero, C.; Ventura, S. Educational data mining: A survey from 1995 to 2005. Expert Syst. Appl. 2007, 33, 135–146. [Google Scholar] [CrossRef]
  2. Rahman, M.M.; Watanobe, Y.; Kiran, R.U.; Thang, T.C.; Paik, I. Impact of practical skills on academic performance: A data-driven analysis. IEEE Access 2021, 9, 139975–139993. [Google Scholar] [CrossRef]
  3. Romero, C.; Ventura, S. Educational data mining: A review of the state of the art. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2010, 40, 601–618. [Google Scholar] [CrossRef]
  4. Charitopoulos, A.; Rangoussi, M.; Koulouriotis, D. On the use of soft computing methods in educational data mining and learning analytics research: A review of years 2010–2018. Int. J. Artif. Intell. Educ. 2020, 30, 371–430. [Google Scholar] [CrossRef]
  5. Kabathova, J.; Drlik, M. Towards predicting students dropout in university courses using different machine learning techniques. Appl. Sci. 2021, 11, 3130. [Google Scholar] [CrossRef]
  6. Du, X.; Yang, J.; Shelton, B.E.; Hung, J.-L.; Zhang, M. A systematic meta-review and analysis of learning analytics research. Behav. Inf. Technol. 2021, 40, 49–62. [Google Scholar] [CrossRef]
  7. Rafique, A.; Khan, M.S.; Jamal, M.H.; Tasadduq, M.; Rustam, F.; Lee, E.; Washington, P.B.; Ashraf, I. Integrating learning analytics and collaborative learning for improving students academic performance. IEEE Access 2021, 9, 167812–167826. [Google Scholar] [CrossRef]
  8. Wolff, A.; Zdrahal, Z.; Herrmannova, D.; Knoth, P. Predicting student performance from combined data sources. In Educational Data Mining: Applications and Trends; Springer: Cham, Switzerland, 2014; pp. 175–202. [Google Scholar]
  9. Andrade, T.L.d.; Rigo, S.J.; Barbosa, J.L.V. Active Methodology, Educational Data Mining and Learning Analytics: A Systematic Mapping Study. Inform. Educ. 2021, 20, 171–204. [Google Scholar] [CrossRef]
  10. Dien, T.T.; Luu, S.H.; Thanh-Hai, N.; Thai-Nghe, N. Deep Learning with Data Transformation and Factor Analysis for Student Performance Prediction. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 2020, 11, 711–721. [Google Scholar] [CrossRef]
  11. Campbell, J.P.; DeBlois, P.B.; Oblinger, D.G. Academic analytics: A new tool for a new era. EDUCAUSE Rev. 2007, 42, 40. [Google Scholar]
  12. Vachkova, S.N.; Petryaeva, E.Y.; Kupriyanov, R.B.; Suleymanov, R.S. School in digital age: How big data help to transform the curriculum. Information 2021, 12, 33. [Google Scholar] [CrossRef]
  13. Johnson, R.; Zhang, T. Learning nonlinear functions using regularized greedy forest. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 942–954. [Google Scholar] [CrossRef] [PubMed]
  14. Zhang, J.; Dai, Q. A cost-sensitive active learning algorithm: Toward imbalanced time series forecasting. Neural Comput. Appl. 2022, 34, 6953–6972. [Google Scholar] [CrossRef]
  15. Settles, B. Active Learning Literature Survey; Department of Computer Sciences, University of Wisconsin-Madison: Madison, Wisconsin, 2009. [Google Scholar]
  16. Settles, B. From theories to queries: Active learning in practice. PMLR 2011, 16, 1–18. [Google Scholar]
  17. Mai, T.T.; Crane, M.; Bezbradica, M. Students learning behaviour in programming education analysis: Insights from entropy and community detection. Entropy 2023, 25, 1225. [Google Scholar] [CrossRef]
  18. Altaf, S.; Asad, R.; Ahmad, S.; Ahmed, I.; Abdollahian, M.; Zaindin, M. A Hybrid Framework of Deep Learning Techniques to Predict Online Performance of Learners during COVID-19 Pandemic. Sustainability 2023, 15, 11731. [Google Scholar] [CrossRef]
  19. Hussain, S.; Khan, M.Q. Student-performulator: Predicting students academic performance at secondary and intermediate level using machine learning. Ann. Data Sci. 2023, 10, 637–655. [Google Scholar] [CrossRef]
  20. Villegas-Ch, W.; Mera-Navarrete, A.; García-Ortiz, J. Data Analysis Model for the Evaluation of the Factors That Influence the Teaching of University Students. Computers 2023, 12, 30. [Google Scholar] [CrossRef]
  21. Asad, R.; Altaf, S.; Ahmad, S.; Mohamed, A.S.N.; Huda, S.; Iqbal, S. Achieving personalized precision education using the Catboost model during the COVID-19 lockdown period in Pakistan. Sustainability 2023, 15, 2714. [Google Scholar] [CrossRef]
  22. Liu, Y.; Fan, S.; Xu, S.; Sajjanhar, A.; Yeom, S.; Wei, Y. Predicting student performance using clickstream data and machine learning. Educ. Sci. 2022, 13, 17. [Google Scholar] [CrossRef]
  23. Xing, W.; Li, C.; Chen, G.; Huang, X.; Chao, J.; Massicotte, J.; Xie, C. Automatic assessment of students engineering design performance using a Bayesian network model. J. Educ. Comput. Res. 2021, 59, 230–256. [Google Scholar] [CrossRef]
  24. Kostopoulos, G.; Lipitakis, A.-D.; Kotsiantis, S.; Gravvanis, G. Predicting student performance in distance higher education using active learning. In Engineering Applications of Neural Networks. EANN 2017. Communications in Computer and Information Science; Springer: Cham, Switzerland, 2017. [Google Scholar]
  25. Kostopoulos, G.; Kotsiantis, S.; Ragos, O.; Grapsa, T.N. Early dropout prediction in distance higher education using active learning. In Proceedings of the 2017 8th International Conference on Information, Intelligence, Systems & Applications (IISA), Larnaca, Cyprus, 27–30 August 2017. [Google Scholar]
  26. Rolim, V.; Mello, R.F.; Nascimento, A.; Lins, R.D.; Gasevic, D. Reducing the size of training datasets in the classification of online discussions. In Proceedings of the 2021 International Conference on Advanced Learning Technologies (ICALT), Tartu, Estonia, 12–15 July 2021. [Google Scholar]
  27. Yang, T.-Y.; Baker, R.S.; Studer, C.; Heffernan, N.; Lan, A.S. Active learning for student affect detection. In Proceedings of the 12th International Conference on Educational Data Mining, EDM 2019, Montréal, QC, Canada, 2–5 July 2019. [Google Scholar]
  28. Karumbaiah, S.; Lan, A.; Nagpal, S.; Baker, R.S.; Botelho, A.; Heffernan, N. Using past data to warm start active machine learning: Does context matter? In LAK21: 11th International Learning Analytics and Knowledge Conference; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar]
  29. Hamalainen, W.; Vinni, M. Classifiers for educational data mining. In Handbook of Educational Data Mining, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series; CRC Press: Boca Raton, FL, USA, 2021; pp. 57–71. [Google Scholar]
  30. Hodges, J.; Lehmann, E. Ranks methods for combination of independent experiments in analysis of variance. Ann. Math. Stat. 1962, 33, 482–497. [Google Scholar] [CrossRef]
  31. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  32. Amrieh, E.A.; Hamtini, T.; Aljarah, I. Mining educational data to predict students academic performance using ensemble methods. Int. J. Database Theory Appl. 2016, 9, 119–136. [Google Scholar] [CrossRef]
  33. Campbell, C.; Cristianini, N.; Smola, A. Query learning with large margin classifiers. In ICML ‘00: Proceedings of the Seventeenth International Conference on Machine Learning; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000. [Google Scholar]
  34. Schohn, G.; Cohn, D. Less is more: Active learning with support vector machines. In ICML ‘00: Proceedings of the Seventeenth International Conference on Machine Learning; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000. [Google Scholar]
  35. Quinlan, J.R. C4.5 Programs for Machine Learning; Morgan Kaufmann: Burlington, MA, USA, 1993. [Google Scholar]
  36. Witten, I.H.; Frank, E.; Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011. [Google Scholar]
  37. Mitchell, T.M. Machine Learning; McGraw-Hill: New York, NY, USA, 1997. [Google Scholar]
  38. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  39. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  40. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  41. Holmes, G.; Donkin, A.; Witten, I.H. Weka: A machine learning workbench. In Proceedings of the ANZIIS ‘94—Australian New Zealand Intelligent Information Systems Conference, Brisbane, QLD, Australia, 29 November–2 December 1994. [Google Scholar]
Figure 1. The active learning framework.
Figure 2. Learning curves of the RGF classifier. (a) Average accuracy learning curve, (b) average F1 learning curve, (c) average precision learning curve, (d) average recall learning curve.
Table 1. The pseudo-code for the proposed approach.
Initialize:
 - Set of labeled data (X_l, y_l)
 - Set of unlabeled data X_u
 - Set of base learners (trees) F = {}
 - Initial model f(x) = 0
 - Learning rate η
 - Regularization parameters λ1 and λ2
 - Maximum number of iterations T
 - Budget B (number of data points to label)
 - Acquisition function A(x, f) to measure informativeness of data points
Algorithm:
1. While budget B is not exhausted:
 a. Train the RGF model on the labeled data (X_l, y_l):
  i. For t = 1 to T:
   - Compute gradients and hessians for current model predictions:
    - Gradient: g_i = ∂L(y_i, f(x_i))/∂f(x_i)
    - Hessian: h_i = ∂²L(y_i, f(x_i))/∂f(x_i)²
   - Initialize new tree:
    - Tree t_new = {}
   - For each node split in the tree:
    - Select the best split that maximizes the gain:
     - Gain = ∑ g_i²/(∑ h_i + λ2) − λ1
     - Choose split with the highest gain
    - Update the tree structure with the best split
   - Add the new tree to the forest:
    - F = F ∪ {t_new}
   - Update the model with the new tree:
    - f(x) = f(x) + η * t_new(x)
 b. Select the most informative unlabeled data points using the acquisition function:
  - Calculate informativeness score for each x in X_u: A(x, f)
  - Select top B’ points with highest scores (B’ ≤ B)
 c. Query the oracle to obtain labels for the selected points:
  - Get labels y_selected for X_selected
 d. Update the labeled and unlabeled datasets:
  - X_l = X_l ∪ X_selected
  - y_l = y_l ∪ y_selected
  - X_u = X_u\X_selected
 e. Reduce the budget:
  - B = B − B’
2. Output the final model:
  - f(x) = ∑ t∈F η * t(x)
Table 2. Categories of features of the dataset.

Category                 Features                                                                                        Value Types
Demographics             student’s nationality, gender, place of birth, parent responsible (4 features)                  Categorical
Academic background      educational stage, grade level, section, semester, course topic, absence days (6 features)     Categorical
Parent’s participation   parent answering survey, parent’s school satisfaction (2 features)                              Categorical
Behavioral               discussion groups, visited resources, raised hand in class, announcement viewing (4 features)  Numerical
Class                    student performance {high, medium, low}                                                         Categorical
Table 3. Description of features of the dataset.

Feature                      Description                                                  Values
nationality                  student’s nationality                                        {Egypt, Kuwait, …, USA, Venzuela}
gender                       student’s gender                                             {m, f}
place of birth               student’s place of birth                                     {Egypt, Kuwait, …, USA, Venzuela}
parent responsible           parent responsible for the student                           {father, mother}
educational stage            student’s level of schooling                                 {primary, middle, high}
grade level                  student’s grade                                              {G-01, G-02, …, G-12}
section                      student’s classroom                                          {A, B, C}
semester                     school year semester                                         {first, second}
course topic                 course topic                                                 {math, biology, …, science}
absence days                 student’s total absence days                                 {under 7 days, above 7 days}
parent answering survey      whether the parent answers the surveys provided at school   {y, n}
parent school satisfaction   the degree of the parent’s satisfaction with the school     {good, bad}
discussion groups            total number of discussion group visits                      numeric
visited resources            total number of resource visits                              numeric
raised hand in class         total number of hand raises in class                         numeric
announcement viewing         total number of announcement views                           numeric
class                        student’s total grade                                        {high, medium, low}
Table 4. Descriptive statistics of the dataset.

          Female   Male    Low    Medium   High   Total
Dataset   175      305     127    211      142    480
          36.5%    63.5%   26%    44%      30%    100%
Table 5. Tuned RGF hyperparameters for training.

max_leaf = 10,000, test_interval = 100, algorithm = “RGF_Sib”, loss = “Log”, reg_depth = 1.0, l2 = 1.0, sl2 = 0.001, min_samples_leaf = 10, learning_rate = 0.001
Table 6. Results for each classifier.

Classifier         Accuracy   Recall     Precision   F1
C4.5 (J4.8)        0.7583     0.758      0.760       0.759
ANN                0.7937     0.794      0.793       0.793
NB                 0.6770     0.677      0.675       0.671
Bagging (J4.8)     0.7437     0.744      0.743       0.743
Bagging (ANN)      0.7812     0.781      0.781       0.781
Bagging (NB)       0.6770     0.677      0.676       0.672
Boosting (J4.8)    0.7791     0.779      0.779       0.779
Boosting (ANN)     0.7937     0.794      0.793       0.793
Boosting (NB)      0.7229     0.723      0.724       0.718
RF                 0.7666     0.767      0.766       0.766
RGF                0.8160 *   0.8181 *   0.8229 *    0.8186 *
Table 7. Friedman Aligned Ranks test results (significance level of 0.05)—Statistic: 39.41323, p-value: 0.00002.

Rank       Algorithm
2.50000    RGF
8.50000    Boosting (ANN)
8.50000    ANN
14.50000   Bagging (ANN)
18.50000   Boosting (J4.8)
22.50000   RF
26.50000   C4.5 (J4.8)
30.50000   Bagging (J4.8)
34.50000   Boosting (NB)
40.25000   Bagging (NB)
40.75000   NB
