Implementing AutoML in Educational Data Mining for Prediction Tasks

Educational Data Mining (EDM) has emerged over the last two decades as a field concerned with the development and application of data mining methods to facilitate the analysis of the vast amounts of data originating from a wide variety of educational contexts. Predicting students’ progression and learning outcomes, such as dropout, performance and course grades, is regarded as one of the most important tasks in the EDM field. Therefore, applying appropriate machine learning algorithms to build accurate predictive models is of utmost importance for both educators and data scientists. Considering the high-dimensional input space and the complexity of machine learning algorithms, building accurate and robust learning models requires advanced data science skills and is, in most cases, time-consuming and error-prone. In addition, choosing the proper method for a given problem formulation and configuring the optimal parameter values for a specific model is a demanding task, while it is often very difficult to understand and explain the produced results. In this context, the main purpose of the present study is to examine the potential of advanced machine learning strategies in educational settings from the perspective of hyperparameter optimization. More specifically, we investigate the effectiveness of automated Machine Learning (autoML) for the task of predicting students’ learning outcomes based on their participation in online learning platforms. At the same time, we limit the search space to tree-based and rule-based models in order to achieve transparent and interpretable results. To this end, a plethora of experiments were carried out, revealing that autoML tools consistently achieve superior results. We hope that our work will help nonexpert users (e.g., educators and instructors) in the field of EDM to conduct experiments with appropriate automated parameter configurations, thus achieving highly accurate and comprehensible results.


Introduction
Educational Data Mining (EDM) is the research field concerned with applying data mining methods and tools in educational settings [1,2]. Its main objective is to analyze these environments in order to find appropriate solutions to educational research issues [3], all of which are directed at improving teaching and learning [4]. EDM results help students improve their learning performance, provide personalized recommendations, enhance teaching performance, evaluate learning effectiveness, organize institutional resources and educational offerings, and much more [1,5].
Three common concerns that employ EDM techniques are the detection of whether a student is going to pass or fail a certain course, the prediction of students' final marks, and the identification of students who are likely to drop out. The ability to predict students' performance and their underlying learning difficulties is a significant task and leads to benefits for both students and educational institutions.

Related Work on Predicting Student Grade
Personalized multiple regression-based methods and matrix factorization approaches based on recommender systems were used by Elbadrawy et al. (2016) to forecast students' grades in future courses and in-class assessments [29]. Briefly, the first method was course-specific regression, which predicted the grade that a student would achieve in a specific course as a sparse linear combination of the grades that the student had obtained in past courses. The second method was personalized linear multiregression, which employed a linear combination of k regression models weighted on a per-student basis. The third method was a standard matrix factorization approach that approximated the observed entries of the student-course grade matrix. The fourth method was matrix factorization based on factorization machines. The evaluations showed that the factorization machines produced the lowest error rates for next-term grade prediction.
Predicting student performance was also the main focus of the study conducted by Xu et al. (2017) [30]. Their goal was to predict a student's final cumulative Grade Point Average (GPA), given his/her background, the grades already known, and predictions for the courses not yet taken. To enable such progressive predictions, the authors proposed a two-layer architecture. The first layer implements base predictors for each course, given the performance of graduated students on courses relevant to the targeted course; for discovering the relevant courses, a course clustering method was developed. In the second layer, ensemble-based predictors were developed, able to keep improving themselves by accumulating new student data over time. The authors' architecture was compared with four classic machine learning algorithms, namely Linear Regression (LR), Logistic Regression (LogR), RF and kNN. The proposed method yielded the best prediction performance.
Predicting students' final grade was also one of the two research goals of Strecht et al. (2015) [31]. The authors evaluated various popular regression algorithms, i.e., Ordinary Least Squares, SVM, CART, kNN, RF and AdaBoost R2. The experiments were carried out using administrative data from the Student Information System (SIS) of the University of Porto, concerning approximately 700 courses. The algorithms with the best overall results were SVM, RF and AdaBoost R2.
The method proposed by Meier et al. (2015) made personalized and timely predictions of the grade of each student in a class [32]. Using data obtained from a pilot course, the authors' methodology suggested that it is effective to perform early in-class assessments such as quizzes, which result in timely performance predictions for each student. The study compared the proposed algorithm against four different prediction methods: two simple benchmarks (a single performance assessment, and past assessments with weights) and two well-known data mining algorithms (Linear Regression (LR) employing the Ordinary Least Squares (OLS) method, and the kNN algorithm with k = 7). The error of the proposed method decreased approximately linearly as more homework and in-class exam results were added to the model. Sweeney et al. (2016) also framed the problem of student performance prediction as a regression task [33]. They explored three classes of methods for predicting students' next-term grades: (1) simple baselines, (2) Matrix Factorization (MF)-based methods, and (3) common regression models. For the third category, four different regression models were tested: RF, Stochastic Gradient Descent (SGD), kNN and personalized LR. The obtained results revealed that a hybrid of the RF model and the MF-based Factorization Machine (FM) was the best performer.
The first study that applied Semi-Supervised Regression (SSR) methods for regression tasks in educational settings was carried out by Kostopoulos et al. (2019) [34]. In order to predict final grades in a distance learning course, the authors proposed a Multi-Scheme Semi-Supervised Regression Approach (MSSRA) employing RF and a set of three k-NN algorithms as the base regressors. A plethora of features related to students' characteristics, academic performance and interactions within the learning platform throughout the academic year formed the training set. The results indicated that the proposed algorithm outperformed typical classical regression methods.
Finally, a recent study utilized eight familiar supervised learning algorithms (LR, Random Forests (RF), the Sequential Minimal Optimization algorithm for regression problems (SMOreg), 5-NN, M5 Rules, M5, Gaussian Processes (GP) and Bagging) for predicting students' marks [35]. The training data contained selected demographic variables and students' first semester grades, along with the number of examination attempts per course. The reported error values, ranging from 1.217 to 1.943, seem rather satisfactory. It was observed that RF, Bagging and SMOreg took precedence over the other methods.

Related Work on Predicting Student Dropout
One interesting study used data gathered from 419 high school students in Mexico [36]. The authors carried out experiments to predict dropout at different stages of the course and to select the best indicators of dropout. Results showed that their classifier (named ICRM2) could predict student dropout within the first 4-6 weeks of the course.
Student retention and the identification of potential problems as early as possible was the main aim of Zhang et al. (2010) [37]. The authors used data from the Thames Valley University systems that were related to the background and the academic activities of the students. Three algorithms, namely NB, SVM and DTS, were chosen and different configurations for each algorithm were tested in order to find the optimum result. Finally, NB was reported to have achieved the highest prediction accuracy.
Moreover, Delen (2010) used five years of institutional data along with several popular data mining techniques (four individual and three ensemble techniques) in order to build models that predict and explain the reasons behind students dropping out [38]. The data contained variables related to students' academic, financial, and demographic characteristics. SVM produced the best results compared to ANN, DTS and LogR, while the information fusion-type ensemble model produced the best results compared with the Bagging and Boosting ensembles. Lykourentzou et al. (2009) presented a dropout prediction method for e-learning courses based on three machine learning techniques: NNs, SVM and the probabilistic ensemble simplified fuzzy ARTMAP [39]. The results of these techniques were combined using three decision schemes. The dataset consisted of demographic attributes, prior academic performance, time-varying characteristics depicting the students' progress during the courses, and their level of engagement with the e-learning procedure. The decision scheme in which a student was considered a dropout if at least one technique had classified him/her as such was reported to be the most appropriate for achieving and maintaining high accuracy, sensitivity and precision in predicting at-risk students. Superby et al. (2006) applied NNs, discriminant analysis, DTS and RF on survey data from three universities to classify new students into low-risk, medium-risk and high-risk categories [40]. The authors found that scholastic history and socio-family background were the most significant predictors of students at risk. The best, though still modest, result among the four methods was achieved by NNs, with a total rate of correctly classified students of 57.35%.
Herzog's study (2006) examined the predictive accuracy of DTS and NNs on the problem of predicting college freshmen retention [41]. The author used three sources to produce the data set: the institutional student information system for student demographics; the American College Test (ACT)'s Student Profile Section for parent income data; and the National Student Clearinghouse for identifying transfer-out students. Overall prediction results showed that the DTS and NNs performed substantially better than the LR baseline. Also, the differing results of the three NN variations confirmed the importance of exploring the available setup options.
Finally, a recent study explored the use of semi-supervised techniques for the task of dropout prediction [42]. The dataset consisted of 344 instances in 2 classes, characterized by 12 attributes. The authors compared familiar SSL techniques: Self-Training, Co-Training, Democratic Co-Training, Tri-Training, RASCO and Rel-RASCO. In their two separate experiments, C4.5 and NB were the base classifiers, while C4.5 was the dominant supervised algorithm. The results revealed that the Tri-Training (C4.5) algorithm outperformed the rest of the SSL algorithms, as well as the supervised C4.5 decision tree.
To sum up, various researchers have investigated the problem of predicting students' performance employing a plethora of data mining techniques. The results reveal that there is a strong relationship between students' logged activities in LMSs and their academic achievements. Most of the proposed prediction models achieved notable results (accuracy above 80%). However, variation among the best performers is observed; i.e., no single method can be shown to be better than the others for educational settings (Table 1). Moreover, to the best of our knowledge, there is no research demonstrating the use of advanced machine learning strategies, such as autoML, in these settings.

Introduction to Bayesian Optimization for Hyperparameter Optimization
For the prediction of students' academic performance, we explore the use of autoML to automatically find the optimal learning model without human intervention. The task of constructing a learning model usually includes supplementary processes: attribute selection, learning algorithm selection, and hyperparameter optimization. Therefore, to model the problem of automatically tuning the machine learning pipeline for optimal performance (i.e., the goal of autoML), the overall hyperparameter configuration space covers the choice among various preprocessing and machine learning algorithms, along with their relevant hyperparameters.
This optimization problem is currently addressed by various techniques. In this section, we briefly discuss algorithms belonging to a powerful and popular family referred to as "black-box optimization" techniques. More specifically, our focus is the Bayesian optimization algorithm, as this is the method we employ in our experiments. We also refer to prominent autoML software packages and to the importance of autoML, especially for non-ML-expert users. First, we provide some definitions related to optimization in general and to the hyperparameter optimization problem.

Definitions
In general, optimization refers to the process of finding the best result under specified circumstances [43]. More formally, it is the (automatic) process of finding the value, or set of values, of a function (a real-valued function called the objective function) that maximizes (or minimizes) its result. It is consistent with the principle of maximum expected utility or minimum expected loss (risk) [44]. It is well known that no single method can solve every optimization problem efficiently; choosing the best method for a given problem formulation can be challenging, and complex, computationally expensive processes are usually required. Nevertheless, optimization methods can obtain high-quality results with reasonable effort.
In machine learning systems, the targets of automation include mechanisms that optimize machine learning pipelines, such as feature engineering, model selection, hyperparameter selection, etc. The problem of identifying hyperparameter values that optimize the system's performance is called the hyperparameter optimization problem [45]. Hyperparameters are the parameters whose values are set before the learning process begins, e.g., the number of neighbors k in the nearest neighbor algorithm, or the depth of the tree in tree-structured algorithms. In contrast, model parameters are parameters that are learned during the training process, e.g., the weights of neurons in a neural network. The central focus of autoML is the hyperparameter optimization (HPO) of machine learning processes.
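To make the distinction concrete, the following minimal sketch (in Python with scikit-learn on synthetic data; the library and dataset are illustrative choices, not part of our WEKA-based setup) fixes a hyperparameter before training and then reads the model parameters learned during training.

```python
# Hyperparameter vs. model parameter: a minimal illustration on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

model = LogisticRegression(C=1.0)  # hyperparameter: set before learning starts
model.fit(X, y)                    # training estimates the model parameters
print(model.coef_)                 # model parameters: the learned weights
```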
Formally, the general statement of the hyperparameter optimization problem is defined as [44]:

x* = argmin_{x ∈ X} f(x),

where f is the objective function, given a set of hyperparameters x from a hyperparameter space X. We are interested in finding the set x* that minimizes the expected loss (the value of f(x)). One important property of the function f is that its evaluation is expensive (costly) or even impossible to compute [46]. To make f more concrete, consider the context of machine learning applications, where, for example, f can be a system that predicts student dropout rates (e.g., a set of preprocessing and classification algorithms) with adjustable parameters x (e.g., the learning algorithm, or the depth of the tree when tree-structured algorithms are tested), and an observable metric y = f(x), computed on data collected from learning systems (e.g., data from a learning management system).
As manual tuning is an error-prone process that also requires time and domain experts [47], various automatic configuration methods have been proposed. A popular approach is to treat the problem as black-box optimization. Grid search is the traditional and most basic, yet extremely costly, method; a fairly efficient alternative is random search [48]. Other families of methods that have been applied are gradient-based algorithms [49], racing algorithms, e.g., [50], evolutionary optimization algorithms [51], and population-based search, e.g., [52]. Next, we focus on another strategy employed to obtain the optimal set of hyperparameters, that of Bayesian optimization, whose advances our research also leverages.
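As a brief illustration of the two basic black-box strategies just mentioned, the sketch below contrasts scikit-learn's exhaustive grid search with random search over the same budget of 12 configurations; the classifier, parameter ranges and synthetic data are assumptions made for the example.

```python
# Grid search tries every combination; random search samples the space.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"max_depth": [2, 4, 8, 16], "n_estimators": [50, 100, 200]},
                    cv=10)                                   # all 12 combinations
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                          {"max_depth": randint(2, 17),
                           "n_estimators": randint(50, 201)},
                          n_iter=12, cv=10, random_state=0)  # 12 random samples
for search in (grid, rand):
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))
```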

Bayesian Optimization
Bayesian optimization is an effective strategy for minimizing (or maximizing) objective functions that are costly to evaluate. The main advantage of this method is that it uses previous results to pick the next point to try, while balancing the dilemma of exploration and exploitation. As such, it reaches the optimal solution with fewer evaluations [44,53].
Bayesian optimization uses the famous Bayes theorem, which states that the posterior probability P(M|D) of a model M given data D is proportional to the likelihood P(D|M) of D given M multiplied by the prior probability P(M): P(M|D) ∝ P(D|M) P(M). As for the hyperparameter optimization problem, the model M should not be mistaken for the output model of a machine learning algorithm. On the contrary, M is actually a regression model that represents our assumptions about f [54]:

P(f | D_{1:t}) ∝ P(D_{1:t} | f) P(f),

where D_{1:t} = {(x_{1:t}, f(x_{1:t}))} denotes our accumulated observations of the objective function f on the sequence of data samples x_{1:t} [44]. The prior prescribes our belief (what we think we know) over a space of objective functions. The likelihood captures how likely the data we observed are, given our belief about the prior. These two combined give us the posterior, which represents our belief about the objective function f. Additionally, the posterior conceptualizes the surrogate function, the function used to estimate f. As it is much easier to optimize the surrogate probability model than the objective function, Bayesian optimization selects the next set of hyperparameters to evaluate based on its performance on the surrogate. During the execution, the accuracy of the surrogate model increases by continually incorporating the evaluations of the objective function.
Bayesian optimization is considered a sequential design strategy, formally Sequential Model-Based Optimization (SMBO) [10,55]. First, a surrogate probabilistic regression model of the objective function is built. Until a budget limit is reached, new samples of hyperparameters are sequentially selected by optimizing an acquisition function S, which uses the surrogate model. Each suggested sample is evaluated on the true objective function, and the new observations are used to update the surrogate model (Algorithm 1). There are several variants of the SMBO formalism, specified by the choice of the probabilistic model and the criterion (acquisition function) used to select the next hyperparameters.
In the model-based optimization literature, the most recognized choices for the surrogate model are Gaussian Processes [56], Tree Parzen Estimators (TPE) [10,57], and Random Forests [58]. In this research, we selected Random Forests for our experiments. Among the advantages of this method are that it can represent the uncertainty of a given prediction and that it can handle categorical and conditional parameters [59]. Hutter et al. [54] suggested that tree-based models (i.e., TPE and Random Forests) work best for large configuration spaces and complicated optimization problems. The same conclusions are reached by the authors of [60], who further note that, in their experiment, Random Forests could evaluate four more configurations within the same time budget compared to TPE and GP, owing to their support for a small number of folds in the cross-validation procedure. Random Forests are used by the Sequential Model-based Algorithm Configuration (SMAC) library (https://github.com/automl/SMAC3).

The role of the acquisition function is to decide which point the surrogate model should evaluate next. It determines the utility of the candidate data points, trading off exploration and exploitation: exploration favors new, uncertain areas in the objective space, while exploitation favors areas that are already known to have advantageous results [11]. Expected Improvement is among the typically used acquisition functions [61]. Other strategies have also been suggested, such as the upper confidence bound (UCB), the Probability of Improvement [62], and the more recently proposed Gaussian process upper confidence bounds (GP-UCB) [63].
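The following is a minimal, self-contained sketch of the SMBO loop of Algorithm 1, under simplifying assumptions: a single hyperparameter (tree depth), a Random Forest surrogate whose per-tree spread stands in for predictive uncertainty, and Expected Improvement as the acquisition function S. It only illustrates the mechanics; real tools such as Auto-WEKA/SMAC search far larger, conditional configuration spaces.

```python
# SMBO sketch: fit surrogate on history H, maximize acquisition, evaluate f, repeat.
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

def objective(depth):
    """Expensive objective f(x): 10-fold CV error of a tree with max_depth=depth."""
    acc = cross_val_score(DecisionTreeClassifier(max_depth=depth, random_state=0),
                          X, y, cv=10).mean()
    return 1.0 - acc  # we minimize the error

candidates = np.arange(1, 21).reshape(-1, 1)  # hyperparameter space X
history_x, history_y = [3], [objective(3)]    # initial observations H

for t in range(10):                           # budget T
    # 1. Fit the surrogate regression model on the observations H.
    surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
    surrogate.fit(np.array(history_x).reshape(-1, 1), history_y)
    # 2. Mean/std per candidate from the spread of the individual trees.
    per_tree = np.stack([est.predict(candidates) for est in surrogate.estimators_])
    mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0) + 1e-9
    # 3. Acquisition S: Expected Improvement over the best error seen so far.
    best = min(history_y)
    z = (best - mu) / sigma
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    # 4. Evaluate the true objective at the most promising candidate; update H.
    x_next = int(candidates[np.argmax(ei), 0])
    history_x.append(x_next)
    history_y.append(objective(x_next))

print("best depth:", history_x[int(np.argmin(history_y))], "error:", min(history_y))
```

Each iteration mirrors Algorithm 1: the surrogate is refit on H, the acquisition is maximized over the candidates, the expensive objective is evaluated once, and the new observation is appended to H.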

Use Cases
In the context of machine learning, each time we try a different set of hyperparameters, we build a model on the training dataset and assess its predictions on the validation dataset using the evaluation metric. Considering the high-dimensional search space and the complexity of models such as deep neural networks or ensemble methods, this process is practically intractable by hand, while in most cases it requires advanced data science skills. Therefore, autoML processes make machine learning more accessible, reducing the human expertise required and, ultimately, improving model performance [45]. Moreover, automatic tuning is reproducible and generally supports fair comparison of the produced results [57], while, when the search space is limited by design to specific learning algorithms, the produced results can be more transparent and interpretable to end users, as our study shows.

Research Goal
The goal of this study is to examine the potential of advanced machine learning strategies to improve the prediction of students' performance based on their participation in online learning platforms. Specifically, we investigated the effectiveness of Automated Machine Learning (autoML) on educational data for predicting students' final performance early. We experimented with autoML for the tasks of algorithm selection, hyperparameter tuning, feature selection and preprocessing. Furthermore, to achieve explainable machine learning decisions, we made only tree-based and rule-based classifiers available in the configuration space for the task of selecting a learning algorithm. We examined log data obtained from several blended courses that used the Moodle platform. We studied whether, and to what extent, the use of autoML could improve the performance of the predictions and provide reasonable results rather early, compared with standard ML algorithms. Our work will hopefully help nonexpert users, such as teachers, to conduct experiments with the most appropriate settings and hence achieve improved results.

Procedure
The selected compulsory courses "Physical Chemistry I", "Physics III (Electricity-Magnetism)" and "Analytical Chemistry Laboratory" were held in the spring semester of 2017-2018 at the Aristotle University of Thessaloniki (Table 2). In total, 591 students attended the courses, 322 of whom were male and 269 female (mean = 295.5, standard deviation = 26.5). The total number of students in the first course was 282 (122 female and 160 male), in the second 180 (90 female and 90 male), and in the third 129 (57 female and 72 male). The final grade was based on a weighted average of the grades that students received on the online assignments and the final examination. The students attended the courses between February 2018 and July 2018. For all courses, the face-to-face teaching was supported with online resources and activities on the Moodle learning platform. All course material was added into sections as web pages, files or URLs, and remained available to the students until the end of the semester. Most sections also contained learning activities that were evaluated for a grade. Each Moodle course preserved the default Announcements forum, in which announcements were created as posts. It should be noted that the courses were not directed or specially designed for the experiments described in this research.

Data Collection
For the collection of the datasets, we developed a custom plugin for Moodle (Figure 1). The implemented extension computes an outline report of the course's activities. For each student, the outline calculates the number of views of each available module (e.g., activity, resource, folder, forum), the grades of each activity (e.g., assign, workshop, choice, quiz), the number of created posts (if any), aggregated event counters, and his/her final grade. By default, the report is computed from the course starting date to the current date, but results can also be filtered by a specified date. Data are exported in various formats for regression and classification machine learning tasks. Most of the data are retrieved from the logs table in conjunction with aggregate functions.
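The sketch below gives a rough idea of the kind of aggregation involved, using pandas on a toy log table; the column names (userid, module, action) are hypothetical simplifications of Moodle's actual log store schema, not the plugin's real implementation.

```python
# Toy aggregation: count views per student per module type, as the report does.
import pandas as pd

logs = pd.DataFrame({
    "userid": [1, 1, 2, 2, 2],                                # hypothetical rows
    "module": ["page", "forum", "page", "page", "url"],
    "action": ["viewed"] * 5,
})

# One feature column per module type, one row per student.
views = logs.pivot_table(index="userid", columns="module",
                         aggfunc="size", fill_value=0)
views["course_total_activity"] = views.sum(axis=1)  # aggregated event counter
print(views)
```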
Figure 1. Screenshot of the custom report plugin on Moodle. The report outline lists all the available learning activities of the course (modules) along with a corresponding score for each participant (student). Scores are calculated according to the resource type (e.g., for the 'page' module we count the total number of student accesses). The report can be exported in two formats (arff, xls) and for various data mining tasks (classification, regression). Results can be calculated until a specified date.
For each machine learning experiment (dropout, pass/fail, regression) we collected six samples of the exported reports, one for each month of the semester. We aimed at knowing how precise the results would be at each point in time, so as to predict failures as soon as possible during the semester. Table 3 lists the total number of logs that were available per course in the Moodle platform. In total, more than 130,000 log events were parsed. As expected, the number of events increases during the semester (Figure 2). The students' interest in the course varies, depending on the course activities. In general, course registrations start in February, final examinations take place in June, and final grades are announced by professors in July. An exception is the "Physical Chemistry I" course, for which the logged interactions had not started until March; as such, we did not build models for its first period.


Data Analysis
Data related to 6 types of learning activities were collected; they are presented in Table 4. For each module, we calculated a numeric representation. Specifically, for the forum module, we counted the total number of times a student viewed the forum threads. Respectively, we counted the number of times a page, a resource, a folder, and a URL module had been accessed by each student. For the assign module, we counted the number of times a student accessed its description, and the grade that s/he obtained on the 0-10 scale. We also included a counter aggregating the number of times a student viewed the course (course total views), and the total number of log entries of any kind written for a student in the specific course (course total activity). We were not able to examine additional demographic attributes apart from gender. The class attribute was the final grade the student achieved, transformed to the 0-10 scale or to a nominal value according to the supervised machine learning problem in question. Students who finally succeeded in a course are those who scored greater than or equal to 5. Dropout students were considered those whose final grade was an empty value. The dataset does not contain missing values: students who did not access a learning activity score 0, and learning resources that were not accessed by any student were not included in the experiments. The descriptive statistics are presented in Table 5. The mean and standard deviation of the final grades were, respectively, 3.98 and 2.92 in "Physical Chemistry I", 3.62 and 3.17 in "Physics III", and 6.27 and 2.73 in "Analytical Chemistry Laboratory". 48% of the students passed the first course, 41% the second and 81% the third. Only 28% of the students dropped out of the first course, 24% of the second and 6% of the third. We conclude that Physical Chemistry I and Physics III were more challenging courses, while the high average grades in the Analytical Chemistry laboratory affirm that this course was easier.
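A small sketch of how the three class attributes just described can be derived from the final grade; the final_grade column and its encoding (NaN for an empty, dropout grade) are assumptions made for illustration.

```python
# Deriving the three supervised targets from the final grade on the 0-10 scale.
import numpy as np
import pandas as pd

df = pd.DataFrame({"final_grade": [7.5, 3.0, np.nan, 5.0]})  # NaN = empty grade

df["dropout"] = df["final_grade"].isna()   # dropout task: empty final grade
df["pass"] = df["final_grade"] >= 5        # pass/fail task: grade >= 5 passes
df["grade"] = df["final_grade"]            # regression task: the grade itself
print(df)
```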

Feature Importance
During our research, additional procedures were performed to understand the datasets more comprehensively. More specifically, we used extremely randomized trees [65] to evaluate the importance of features for the classification and regression tasks (we used the ExtraTreesClassifier and ExtraTreesRegressor ensemble methods from the sklearn Python library, which return the feature importances; e.g., see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html). We thereby indicated the informative features for each course (Figure 3). Table 6 summarizes the results by listing the module categories whose estimated importance was above the average, per course and per task. However, such approaches can rarely lead to conclusive results on their own. As expected, features related to assignment grades were far more informative than view counters for the regression tasks. A representative example is the laboratory course of Analytical Chemistry, as such courses concentrate on assignments related to the theory of an associated course. In-class exams were also indicated as good predictors by Meier et al. [32]. On the other hand, the pass/fail classification tasks also list among the effective features some that relate to how many times a student accessed resources and pages. It is worth noting that among the highly rated documents are those related to result announcements: features 22 and 23 of the "Physical Chemistry I" course correspond to resources where the results of the handwritten exams and the final grades were listed. Finally, the dropout classification tasks list the widest variety of important features compared to the other tasks: seventeen out of 42 features had estimated importance above the average of all feature importances in the "Physical Chemistry I" course.
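A minimal sketch of this estimation, using the ExtraTreesClassifier named above on synthetic stand-in data; selecting the features whose importance exceeds the average mirrors the criterion behind Table 6.

```python
# Feature importance with extremely randomized trees (classification variant).
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

# Keep the features whose estimated importance is above the average.
avg = forest.feature_importances_.mean()
informative = [i for i, imp in enumerate(forest.feature_importances_) if imp > avg]
print(informative)
```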

Evaluation Measures
For evaluating the performance of the classification models, we calculated the accuracy metric, which is defined in accordance with the confusion matrix (Table 7) as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

In addition, in order to evaluate the regression models, we make use of the Mean Absolute Error (MAE) measure, which is defined as follows:

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,

where y_i is the actual value and ŷ_i the predicted one. Finally, the dropout classification performance was measured with the Receiver Operating Characteristic (ROC), since it is appropriate for datasets with imbalanced class distributions [66,67]. The ROC curve is a two-dimensional graph that illustrates the performance trade-off of a given classification model [68].
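For reference, the three measures can be computed as follows with scikit-learn (the values are toy examples; our actual experiments used WEKA's implementations):

```python
# The three evaluation measures on toy predictions.
from sklearn.metrics import accuracy_score, mean_absolute_error, roc_auc_score

y_true_cls, y_pred_cls = [1, 0, 1, 1], [1, 0, 0, 1]
print(accuracy_score(y_true_cls, y_pred_cls))        # (TP + TN) / total

y_true_reg, y_pred_reg = [7.5, 3.0, 5.0], [6.8, 3.9, 5.2]
print(mean_absolute_error(y_true_reg, y_pred_reg))   # mean |y_i - y_hat_i|

y_score = [0.9, 0.2, 0.4, 0.7]                       # predicted probabilities
print(roc_auc_score(y_true_cls, y_score))            # area under the ROC curve
```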
Given this experimental set-up, a statistical test is necessary to verify whether the improvement is statistically significant. We apply the paired, one-tailed t-test to compare the maximum accuracy (or minimum error) obtained by a set of classic machine learning algorithms using their default parameter values (marked with a star *) with the results obtained using autoML. All t-tests were performed at a significance level of α = 0.05; thus, if the p-value is less than or equal to 0.05, we conclude that the difference is significant.
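A brief sketch of this test with SciPy, assuming two aligned result vectors (one value per monthly dataset); the numbers are illustrative, not our measured results, and the alternative keyword requires SciPy 1.6 or newer.

```python
# Paired, one-tailed t-test at significance level alpha = 0.05.
from scipy.stats import ttest_rel

best_classic = [0.71, 0.74, 0.78, 0.80, 0.82]  # best default-parameter results
auto_ml      = [0.75, 0.79, 0.83, 0.84, 0.86]  # autoML results, same datasets

# One-tailed alternative: the autoML scores are greater than the classic ones.
t, p = ttest_rel(auto_ml, best_classic, alternative="greater")
print(p <= 0.05)  # True -> the improvement is statistically significant
```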

Environment
To apply the data mining techniques, we used the WEKA implementations [69] without customizing the default parameter values. In addition, we employed Auto-WEKA [70], the autoML implementation for WEKA, which uses SMAC to determine the classifier with the best performance.
During the experiments, the classic algorithms were evaluated using the 10-fold cross-validation method: each dataset was divided into 10 equally sized subsets (folds), and the procedure was repeated 10 times, each time using 9 of the 10 subsets to form the training set and the remaining one as the test/validation set.
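As an illustration of the procedure, the following sketch runs 10-fold cross-validation on synthetic data with scikit-learn; our experiments used WEKA's equivalent mechanism.

```python
# 10-fold cross-validation of one classic algorithm with default parameters.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
print(scores.mean())  # average accuracy over the 10 validation folds
```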

ML Algorithms
For our study we used various classification and regression techniques for predicting final student performance. We made sure to choose one representative example from each of the six main categories of learners, i.e., Bayes classifiers, rule-based, tree-based, function-based, lazy and meta classifiers [71].

1. Bayes classifiers: Based on the Bayes theorem, the Bayes classifiers constitute a simple approach that often achieves impressive results. Naïve Bayes is a well-known probabilistic induction method [72].
2. Rule-based classifiers: In general, rule-based classifiers classify records using a collection of "if ... then ..." rules. PART uses separate-and-conquer: in each iteration the algorithm builds a partial C4.5 decision tree and makes the "best" leaf into a rule [73]. M5Rules is a rule learning algorithm for regression problems; it builds a model tree using M5 and makes the "best" leaf into a rule [74].
3. Tree-based classifiers: Tree-based classifiers are another approach to the learning problem. An example is the Random Forest classifier, which generates a large number of random trees (i.e., a forest) and uses majority voting to classify a new instance [75].
4. Function-based classifiers: Function-based classifiers learn a mathematical function of the input attributes. SMO implements the sequential minimal optimization algorithm for training a support vector classifier, while SMOreg is its counterpart for regression problems.
5. Lazy classifiers: Lazy learners do not train a specific model. At prediction time, they evaluate an unknown instance based on the most related instances stored as training data. IBk is the k-nearest neighbors classifier, which analyzes the closest k training instances (nearest neighbors) and returns the most common class or the mean of the k nearest neighbors for the classification and regression tasks, respectively [81].
6. Meta classifiers: Meta classifiers either enhance a single classifier or combine several classifiers. Bagging predictors is a method for generating multiple versions of a predictor and using them to get an aggregated predictor [82].

Results
In this section, we present the main results of the experiments outlined in Section 4. We show that applying the autoML technique to educational datasets significantly improves on the efficiency of classic machine learning algorithms.

Predicting Pass/Fail Students
At first, we conducted a series of experiments to identify the effectiveness of classic data mining algorithms in predicting, early enough, students who are likely to fail. We applied 6 classic machine learning algorithms to the datasets of the 3 courses, split into chronological sets (month A, month B, etc.). An exception is the "Physical Chemistry I" course, which did not have any logs in the first month; as such, we did not conduct experiments for that period. Tables 8-10 present the effectiveness results, represented by the accuracy measure.
We observe that the Bagging and Random Forest ensemble algorithms outperform the others in most cases in the first course; Naïve Bayes, PART and IBk outperform the others in most cases in the second course; and SMO and Random Forest outperform the others in most cases in the third course. In total, SMO and Random Forest are noted among the best performers 9 times each, and Bagging 7 times. As such, we could not single out one best performer for our datasets.
In addition, we used Auto-WEKA to run automated machine learning experiments on the corresponding datasets. The results indicate that in most cases the accuracy was significantly increased; Auto-WEKA was able to improve the results by up to 10% (Figure 4). The suggested classifiers and hyperparameters again vary across the datasets, including LMT, J48, PART, and JRip, to name a few. Overall, tree-based classifiers were suggested most often; more specifically, the LMT method was the outperformer 8 out of 17 times. Finally, by applying the t-test on the results shown in Tables 8-10, we obtained the following p-values: "Physical Chemistry I" p-value = 0.0365; "Physics III" p-value = 0.0317; "Analytical Chemistry Laboratory" p-value = 0.0223. Thus, we can conclude that autoML yields a statistically significant improvement on the specific educational datasets.

Predicting Students' Academic Performance
In addition, we conducted a series of experiments to identify the effectiveness of classic data mining algorithms in predicting students' grades early enough. Next, we compared the results with the predictions of models created by applying autoML. Similar to Section 5.1, the experiments comprise six phases, one for each month of the semester. Tables 11-13 present the regression results, represented by the mean absolute error measure.
Depending on the dataset, we observe that the Bagging and Random Forest algorithms outperform the others in most cases in the first course, Bagging and SMOreg in most cases in the second course, and Random Forest in most cases in the third course. In total, Bagging is noted among the best performers 10 times, Random Forest 9 times, and SMOreg 5 times. Overall, we could not single out one best performer for our datasets.
In addition, we used Auto-WEKA to run the automated machine learning experiments on the corresponding datasets. The results indicate that in all cases the mean absolute error was significantly decreased; Auto-WEKA was able to reduce the error by 0.0188 to 0.4055 (Figure 5). The suggested classifiers and hyperparameters again vary across the datasets, including M5P, Random Tree, REPTree and M5Rules. Overall, tree-based methods were primarily suggested; more specifically, M5P and Random Tree were the outperformers 5 out of 17 times each.
Finally, by applying the t-test on the results shown in Tables 11-13, we obtained the following p-values: "Physical Chemistry I" p-value = 0.0366; "Physics III" p-value = 0.0021; "Analytical Chemistry Laboratory" p-value = 0.0016. Similarly, a statistically significant decrease in error was observed across the measurements.


Predicting Dropout Students
Lastly, we conducted a series of experiments to identify the effectiveness of classic data mining algorithms in predicting, early enough, students who are likely to drop out. We compared the results with the predictions of models created by applying autoML. Similar to Sections 5.1 and 5.2, the experiments consist of six phases, one for each month of the semester. As the Analytical Chemistry Laboratory course displays a large degree of class imbalance (the dropout class represents 6% of the dataset), we did not include it in this category of experiments. Since it is easy to obtain high accuracy on imbalanced datasets without actually making useful predictions [66,83], for the dropout problem we used the ROC measure to compare the results of the classifiers. For the same reason, in Auto-WEKA we set the 'area above ROC' as the metric to optimize. Tables 14 and 15 present the dropout results.

Depending on the dataset, we observe that the SMO and Bagging algorithms outperform the others in most cases in the first course, while Bagging outperforms the others in most cases in the second course. In total, Bagging is noted among the best performers 7 times, and SMO 4 times. Again, we could not single out one best performer.
The Auto-WEKA experiments once again proved more effective. The results indicate that in all cases the ROC curve measure was increased; Auto-WEKA was able to improve the results by 0.018 to 0.202 (Figure 6). The suggested classifiers and hyperparameters again vary across the datasets, including LMT, Random Tree, and J48. Overall, only tree-based algorithms were suggested; more specifically, LMT was the outperformer 6 out of 17 times. Finally, by applying the t-test on the results shown in Tables 14 and 15, we obtained the following p-values: "Physical Chemistry I" p-value = 0.0308; "Physics III" p-value = 0.0039. A statistically significant increase was observed in both courses.


Discussion
The results illustrated in the previous sections unveil the effectiveness of autoML methods in educational data mining processes. Without any human intervention, the suggested models often report performance much better than that of classic supervised learning techniques with default hyperparameter settings. Even from the first half of the semester, the predictions reach satisfactory values. The resulting models can help educators and instructors to identify weak students, improve retention and reduce academic failure rates, and can ultimately lead to improved educational outcomes.
Finding appropriate models and hyperparameter configurations is one of the most difficult processes when building a machine learning solution, and indeed it is central to the pursuit of predictive accuracy. The main advantage of autoML and tools like Auto-WEKA is that they provide an out-of-the-box mechanism for building reliable machine learning models without the need for advanced data science knowledge. People in IT, educational administration, teaching, research or learning support roles, who want to explore how their students perform, are not always experts in educational data mining. Therefore, even with only basic knowledge, novice ML users can easily converge to a suitable algorithm and its related hyperparameter settings, and develop trusted machine learning models to support their educational institutions.
In our comparative study we focused on classification and regression using educational data. We evaluated widely known ML algorithms from 6 different categories, versus the 2 categories allowed in Auto-WEKA, on 51 datasets exported from 3 compulsory courses that used the Moodle LMS as a complement to face-to-face lectures. For each course, we collected 6 samples of data (one for each month of the semester) in order to predict students' final performance as soon as possible. The 591 students who attended the courses produced more than 130,000 log events between February 2018 and July 2018. Logs were triggered by the variety of learning modules that structured the courses: fora, pages, resources, assignments, folders, and URLs. Assignment grades were scaled to a common metric. The final performance for each ML task (pass/fail, regression and dropout) was adjusted to represent the class attribute appropriately.
We further indicated the informative features for each course using extremely randomized trees. We concluded that attributes related to assignment grades were more important than view counters for the regression tasks. For pass/fail classification, apart from assignment grades, some features related to how many times a student accessed a resource or a page also had high importance scores. Lastly, the dropout classification tasks yielded the widest variety of important features compared with the other tasks.
As previous studies have shown [17,84,85], there is no algorithm that is best across all classification problems. Our experiments reached a similar conclusion, as we could not single out one best performer in the educational data mining tasks illustrated in Section 5. We applied 6 well-known supervised machine learning algorithms (Naïve Bayes, Random Forest, Bagging, PART/M5Rules, SMO/SMOreg and IBk), one representative of each of the six main learner categories. Each time, we compared the performance of the classic models with the model computed by Auto-WEKA, which was set to allow only the ten tree-based or rule-based classifiers. The automated model was in most cases better, and significant differences were confirmed by the t-test. We could not single out an overall winning classifier suggested by Auto-WEKA either. However, when we grouped the proposed classifiers under the 6 learner categories, it was clear that Auto-WEKA's suggestions were mostly tree-based classifiers (Table 16). Table 16. Classic outperformers and Auto-WEKA suggestions grouped under the 6 learner categories. Twelve out of 17 proposed Auto-WEKA classifiers (71%) were tree classifiers. Similarly, 14 out of 17 regressors (82%) and 11 out of 11 (100%) suggested dropout models also belong to the tree classifier category.

There are several advantages to using decision trees in classification and prediction applications. Decision tree models are easy to interpret and explain [14]. Compared to other algorithms, they require less effort for data preparation during pre-processing: a decision tree requires neither normalization nor scaling of the data.
Finally, automated hyperparameter optimization familiarized us with a wide range of models and configurations that were practically applicable to our settings [10]. It allowed us to test models with many variables that would be complicated to tune by hand.
However, some limitations should also be noted. First, it was not clear what time limit was appropriate to impose for our (relatively small) datasets. We observed differences between the models suggested for the same dataset when Auto-WEKA was executed, e.g., for 5 min versus 1 or 2 h. In some cases, as the time limit increased, the performance was not necessarily better. This behavior is probably due to the fact that the parameter space is too large to explore, and different randomizations (e.g., during the train/test splitting, or in the underlying SMAC optimizer) allow a smaller or larger part of it to be explored. Another possible explanation is that datasets of such size are prone to overfitting and underfitting. In any case, the suggested model was generally beneficial compared to using default values. Secondly, autoML methods, by definition, take much longer to train; this cost may be acceptable given the reliability of the results.

Conclusions and Future Research Directions
This study has investigated the effectiveness of autoML techniques for the early identification of students' performance in three compulsory courses supported by the Moodle e-learning platform. In our experimental evaluation, we focused on classification and regression. We further limited the configuration space of the autoML methods to allow only tree-based and rule-based classifiers, in order to enhance the interpretability and explainability of the resulting models. Our results provide evidence that tools that optimize hyperparameters, rather than relying on default values, achieve state-of-the-art performance in educational settings as well. Our comparison reveals that, in the majority of cases, hyperparameter optimization results in better performance than the default values of a set of classic learning models. We also note that in most cases the proposed configuration included tree-based classifiers. On this basis, we believe that autoML procedures and tools like Auto-WEKA can help people in education, both experts and novices in the field of data science.
Moreover, the proposed method may serve as a significant aid in the early estimation of students' performance, thus enabling timely support and effective intervention strategies. Appropriate software extensions within learning management systems could be built to enable non-expert users to benefit from autoML. Meanwhile, such tools should not lack transparency: it is essential to incorporate features that enable, to a certain extent, the interpretability and explainability of the produced results. Understanding why a student is prone to failure can help to better align the learning activities and support with the students' needs. This assumption could be addressed in future studies. Overall, the potential use of automatic machine learning methods in the educational field opens up new horizons for educators to enhance their use of data coming from educational settings and, ultimately, improve academic results.