Predicting Students Success in Blended Learning—Evaluating Different Interactions Inside Learning Management Systems

Algorithms and programming are some of the most challenging topics faced by students during undergraduate programs. Dropout and failure rates in courses involving such topics are usually high, which has raised attention towards the development of strategies to attenuate this situation. Machine learning techniques can help in this direction by providing models able to detect at-risk students earlier, so that lecturers, tutors or staff can pedagogically try to mitigate this problem. To early predict at-risk students in introductory programming courses, we present a comparative study aiming to find the best combination of datasets (sets of variables) and classification algorithms. The data collected from Moodle was used to generate 13 distinct datasets based on different aspects of student interactions (cognitive presence, social presence and teaching presence) inside the virtual environment. Results show that there is no statistically significant difference among the models generated from the different datasets and that the counts of interactions together with derived attributes are sufficient for the task. The performances of the models varied for each semester, with the best of them able to detect at-risk students in the first week of the course with AUC ROC from 0.7 to 0.9. Moreover, the use of SMOTE to balance the datasets did not improve the performance of the models.


Introduction
Student dropout and failure are two major problems faced during the teaching-learning process of computer programming at any education level [1]. These disciplines have high failure rates around the world, sometimes exceeding 50% [1][2][3][4][5]. According to the literature, many factors may contribute to this low approval rate, such as difficulties related to the abstraction required for the proper development of algorithms, difficulties in problem-solving, and also the early stage at which programming courses are placed in the curricula [6][7][8].
In Brazil, for example, there is a huge demand for Information Technology (IT) professionals, but the formal teaching-learning environments (schools, courses, universities, etc.) cannot meet this demand. The predicted demand for IT professionals is around 70,000 per year between 2020 and 2024 [9], but Brazilian universities graduate around 46,000, which leaves a deficit of 24,000 per year.
Some works use only the count of students' interactions within the LMS to predict students at-risk, with the justification of using information that is not restricted to a given type of learning management system [25]. The idea here is to test the sole use of the interaction count to see if it achieves a satisfactory result.
In this work, we do not distinguish our analysis between students' failure and dropout. Therefore, unless the difference is explicitly stated, the term "at-risk students" means students who got a final grade lower than 6.0 (on a scale from 1.0 to 10.0). In this context, students who dropped out got zero as their final grade and, consequently, are included in this definition.
For those reasons, the present work addresses the following research questions:

RQ1. Which are the most appropriate datasets to early predict at-risk students?
RQ2. Is the sole use of the count of students' interactions sufficient to early predict students' failure in the course?
RQ3. Does the use of oversampling techniques (SMOTE) help the models to achieve better performances?
RQ4. Does the use of data from questionnaires applied at the beginning of the course help to improve model performance?
The remainder of this paper is organized as follows: Section 2 presents related work. Section 3 presents the methodology: the process of data collection, its description, the dataset generation and the model evaluation. Section 4 reports the results obtained in this study, and Section 5 presents the conclusions and discussion.

Related Work
Predicting at-risk students in higher education is a relatively well-established task in the literature, as is the notion of interactions within the LMS. Several works show that students' performance has often been associated with different measures of LMS interaction, usually with a high correlation with their success in the course. This section presents a non-exhaustive review of the literature that used user interaction data to predict at-risk students.

Programming Courses
In the context of introductory programming courses, Costa et al. [24] presented a comparative study measuring the effectiveness of educational data mining techniques for predicting at-risk students. The results have shown that the techniques used in the study can identify students at risk of failing, with the best results achieved by the Support Vector Machine (SVM) algorithm. Azcona et al. [26] presented a research methodology to detect at-risk students in computer programming courses as well. The authors provide adaptive feedback to students based on weekly generated predictions. The models used students' offline data and information from the activity logs. Results show that the students who followed the personalized guidance and recommendations performed better in exams. The usage of online learning material in an introductory programming course was also used to predict academic success [27]. The results obtained have shown that the time spent with the material is a moderate predictor of student success. The performance of the models depends on the amount of data used to train them (the predictions become more accurate as the course progresses).

Computer Science/IT Courses
In a computer science course, Tillmann et al. [28] used exam result data from the LMS as indicators of academic success. Results show that data on domain-specific skills could help to improve accuracy, while student interaction data had almost no effect on the results. Using interaction logs from three computer science courses, Sheshadri et al. [29] tried to predict students' performance in a blended course. Results show that performance can be predicted using data from the LMS as well as from a forum, a version control system and a homework system. Using a plug-in to capture data from Moodle, Jokhan et al. [30] tried to predict students' performance in the first year of an IT literacy course.
A regression model was used to determine if there is any correlation between students' online behavior and performance. Results show that the performance in this course could be predicted based on their average logins per week and the average completion rates of activities.

University
In a university context, a model to early predict students who are at risk of failing was presented by Sandoval et al. [31]. The data come from the university's LMS (activity logs for each user) and from the administrative information system called DARA (past and current academic status and demographic data). The results outperform other approaches in terms of accuracy, cost and generalization. In Mwalumbwe and Mtebe [32], the authors designed a Learning Analytics tool to determine the causation between LMS usage and students' performance at Mbeya University of Science and Technology. Results show that discussion posts, peer interaction and exercises are significant factors for students' academic achievement in blended learning at the university.

Fully Online
In a fully online course, Hung et al. [33] used time-series clustering to early identify at-risk online students. Data were collected from an online graduate program in the United States, and results show that the proposed approach could generate models with higher accuracy compared with traditional frequency aggregation approaches. In Soffer and Cohen [34], the authors used learning analytics methods on engagement data from online courses to find their impact on academic achievements. Results showed that there are significant differences between students who complete the course and those who do not. For example, the students who complete the course are twice as active as those who do not (except for forum activities). Kostopoulos et al. [35] combined classification and regression algorithms for predicting students' performance in a distance web-based course. When compared with some machine learning methods, the results show that the proposed model is accurate while remaining comprehensible. Baneres et al. [36] propose to identify at-risk students using an adaptive predictive model based on students' grades, trained for each course. They also present an early warning system using dashboard visualizations for stakeholders. The results show the effectiveness of the approach on data coming from a fully online university's LMS.

Blended Courses
In a blended course context, Conijn et al. [37] processed LMS data from 17 courses. Results show that the performance of predictive models varies strongly across courses, even when they are generated with data collected from a single institution. In Sukhbaatar et al. [38], the authors used a decision tree analysis on LMS data with the goal of predicting, by the middle of the semester, students who are at risk of failing or dropping out of a blended course. Results showed that this approach worked well to predict dropouts. However, to predict students at risk of failing, the method presented poor performance.

Multiple Data Sources
In Adejo & Connolly [39], the authors compared the use of multiple data sources (student record system, LMS and survey) and different classification algorithms aiming to predict student's academic performance. The main result is that using multiple data sources combined with an ensemble of classifierhigh accuracy in the s brought a high accuracy in the prediction of student performance. Umer et al. [40] used machine learning algorithms and the LMS data to predict students at-risk of failing. Results show that those data can be used to predict students' outcomes. However, the count of activities alone is not enough. In other words, the combination of LMS data and assessment scores can improve the accuracy of prediction models. Olivé et al. [41] tried to find which students would likely submit their assignments on time based on LMS data until two days before the deadline. The main goal was to perform an early prediction of at-risk students. The authors added contextual information to improve their predictions using neural networks, achieving satisfactory results.

Only Interaction Data/Log Files
Using only course log files, Cohen [42] analyzed data accumulated in three academic courses to check whether student activity on course websites may assist in providing early identification of learner dropout. Results show that identifying changes in student activity during the course period could help in detecting at-risk students. Kondo et al. [43] proposed an automatic method to detect at-risk students using LMS log data. Experimental results indicated that, using this log data, some characteristics of learning behavior that affect student outcomes can be detected. Also using interaction data from the LMS, Usman et al. [44] applied EDM and pre-processing techniques to predict students' performance. Results show that the Decision Tree achieved the best performance, followed by Naive Bayes and kNN. In Detoni et al. [25], the authors presented a methodology to classify students using only the interaction count in the LMS. Three machine learning methods were tested and results showed that the patterns in the data could provide useful information to classify at-risk students, allowing personalized activities that try to avoid student dropout. In Zhou et al. [45], the authors created a feature selection framework to pre-process the data coming from internet access logs and generate models to predict students' performance. Results have shown that this approach can identify most of the high-risk students. Some online characteristics were also discovered that can help educational professionals to understand the relation between students' internet use and academic performance.

Early Prediction
Aiming to find the optimal time in a course to apply an early warning system, Howard et al. [46] examined eight prediction methods to identify at-risk students. The course has weekly continuous assessment and a large proportion of resources on the LMS. The results show that the optimal time to implement an early warning system is in weeks 5-6 (halfway through the semester). This way, students can still make changes in their study patterns. One of the objectives in Lu et al. [47] was to find the moment at which at-risk students could be predicted. For that, the authors used a learning analytics and big educational data approach to predict students' performance in a blended calculus course. Results show that the performance can be predicted when one-third of the semester is complete. With a similar idea, Gray and Perkins [48] proposed a new descriptive statistic for student attendance and applied machine learning methods to create a predictive model. Results show how at-risk students can be identified as early as week three of the fall semester. Appendix A presents an overview of the main characteristics of the works discussed in this section.

Approach Novelty
The novelty of our approach is based on the extensive comparison of datasets and classification algorithms, resulting in 65 combinations (13 datasets and 5 classification algorithms). We also used a pre-processing technique (SMOTE) to tackle the lack of samples to train and test the algorithms. Questionnaire data were used to aggregate more information, adding social and demographic variables to the analysis. We also used three types of presence (cognitive, teaching and social), in line with an existing theory about how interactions work inside Virtual Learning Environments, aiming to generate more informative data to predict students at risk of failing.
The idea of making early predictions is to find out as early as possible whether a student is at risk of failing. For that, from the data available, we used only those related to the weeks up to the middle of the semester (week 8). In this way, we are testing models that can be used in time to provide information that helps professors intervene in order to avoid student failure.

Methodology
This section describes the methods used to achieve the goal of this paper. This research investigates thirteen different datasets (sets of attributes), for each of four distinct semesters (2016-1, 2016-2, 2017-1 and 2017-2) of an Introductory Programming Course, to evaluate whether the types of presence presented by Garrison et al. [22] influence the performance of predictive models to early detect at-risk students. The overview of the adopted methodology is shown in Figure 1 and the steps of the methodology are shown in Figure 2, where the "Pre-Processing" step is drawn in dotted lines because it is optional, i.e., applied only in some experiments.
The first step consists of gathering data from Moodle logs, considering that the platform records the interactions that the students make in the VLE. The next step consists of generating the datasets containing different attributes to compare them and verify those which achieve the best results. Next, we employ some pre-processing techniques, such as oversampling, intending to increase the performance of the models. The fourth step consists of the generation and evaluation of the classification models. In the final step, we compare the obtained results to answer the research questions. The next subsections describe, in more detail, the steps followed.

Data Collection and Description
Data was collected from the Moodle logs of Introductory Programming courses of the Information and Communication Technologies (ICT) undergraduate program at the Federal University of Santa Catarina (UFSC). The introductory course is offered at ICT at night and has 108 hours in total over 18 weeks. There are three classes per week, one of which is an online activity. Every type of activity is computed as an "interaction". In other words, independently of the type of activity (log in, click on a given link, send a file, etc.), the interaction count is incremented by one. Table 1 shows the summary of the data. The Average of Interactions is calculated as the Interaction Count divided by the Total of Students. The lecturer was the same for all four semesters. It is important to note that in 2016 the C programming language was used, but in 2017 we started teaching the Python programming language. The activities were gradually developed by the lecturer for each semester and he also reused content from previous semesters.
For example, every semester, the old exams are posted in the course. In 2017-1 (first semester of 2017), the lecturer posted new programming exercises, video classes and several links to other video classes and content. The LMS course became richer in material than before. At the end of the 2017-1 course, the lecturer created a new LMS course environment, reorganizing the topics and content, describing the environment to assist students in following instructions and making the course very attractive to the students. In addition to the video classes and content, most of the activities cited here are programming exercises that can be developed, executed and evaluated in Moodle through the Virtual Programming Laboratory (VPL) plugin [49].
For all the courses, throughout the semester, students had three assessments. In 2016-1, students had two tests (weeks 10 and 17) and a final assignment in week 18. The tests were handwritten, that is, students did not take them in VPL, because the modified Moodle environment for tests (Moodle's Test), which prohibits students' access to the internet, was not available on the campus. In 2016-2, Moodle's Test was installed and students had two VPL tests in weeks 8 and 16 and a final assignment in week 15. In 2017-1, students had two VPL tests in weeks 10 and 16 and a final assignment in week 17. In 2017-2 it was a bit different: students had three VPL tests in weeks 9, 15 and 17.
It is important to note that in both semesters of 2016 the final assignment was done by groups of up to three students; it was implemented at home over four weeks and posted in Moodle. In 2017-1, the assignment was done by two students per group during two classes (the same pair on both days). In 2017-2, all the tests were taken within VPL during classes. The final score is calculated from the tests and the final assignment, where FS is the final score, T1 and T2 are the tests and ASGMT is the final assignment (or the final test in 2017-2).
Every student interaction in the LMS is saved in the logs together with the description of the activity performed. From that, we calculated the interaction count for each week of the course for every student. Figure 3 shows the frequency distribution of interactions in each week of the four semesters considered in this work. Regarding the weeks when the first test was applied, it is important to note that one or two weeks before the test there was a peak of interactions, as seen in 2016-2, 2017-1 and 2017-2. It is also interesting that the students made little use of Moodle in 2016-1, even though there were 53 non-mandatory VPL activities there.
In 2016-1, there is no peak per se, but the highest number of interactions happens in Week 1. For 2016-2 and 2017-1, most of the interactions happen in Week 7. The 2017-2 semester has the highest number of interactions, with the peak in Week 8. From the interactions, we generated thirteen datasets with different sets of derived attributes to compare the performances of the models. Table 2 shows the description of the attributes generated. It is important to point out that each attribute in Table 2 is gathered at student level, that is, every calculation is based on data collected for each student.

Cognitive Count, Teaching Count and Social Count are the counts of the Cognitive, Teaching and Social Presences presented in Swan [23]. "Other count" is a category created by us for all the other interactions that do not fit the three previously mentioned categories. In other words, the sum of these four types of interactions results in Count i (Equation (1)). Table 3 presents how different interactions inside Moodle fall into the three types of presence evaluated in our work. Students' interactions are normally highly correlated with engagement in distance learning settings, reflecting the behaviour students have in relation to their course. According to Moore [51], interaction with content (cognitive presence), interaction with instructors (teaching presence) and interaction among peers (social presence) are the three kinds of interactivity that affect the learning process. Each of these interaction types supports learning and, in practice, none of them works independently [23]. The idea of using these types of interaction is to better discriminate each kind of interaction, aiming to generate predictive models that better capture students' behaviour in those learning settings. The implicit idea is that students who fail present different interactions in the different types of presence than students who succeed, and that difference helps to generate better models.
Following, we formalize every attribute contained in the datasets.
Equation (1) represents the sum of interactions on every day j in each week i.
Equation (2) represents the average number of interactions in week i, summing up the interactions on each day j and dividing by the seven days of the week (to calculate the average, we used the .mean() method of the Pandas library [52] (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)).
Equation (3) represents the Median, where n represents the number of samples in the vector. It is the value in the middle of the vector ordered in ascending order. If the number of samples is even, the median is the mean of the two middle values of the vector (to calculate the median, we used the .median() method of the Pandas library (http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)).
Equation (4) represents the number of weeks in which the student had zero interactions, up to week i. For example, if the student had zero interactions in week i, the result is incremented by one. If the student had at least one interaction in week i, the result stays the same.
The Average of the Difference represents the average of the difference of interactions between week i − 1 and week i.
The Commitment Factor represents the ratio between the interaction count of a student in week i and the average interaction count of the class, where j represents a student and n the number of students in the class.

Table 4 shows the set of attributes/variables included in each dataset. The idea when creating the datasets was to "separate" the derived attributes from each work. For example, in DB3 (dataset 3), we only use the attributes from Swan [23] (together with the interaction count). In DB5, we used attributes similar to Detoni et al. [50]. In DB4, the idea is similar, but with the addition of the average and median. In DB1, there is only the interaction count, and in DB2, the attributes are derived from the interaction count only (not using the type of interaction). DB6 contains strictly variables from Swan [23]. In DB7, DB8 and DB9, we created combinations of two variables from the same work [23]. In DB10, DB11 and DB12, the counts of each type of presence were used. DB13 is the dataset that contains all the variables shown in Table 2 together.
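As an illustration, a minimal sketch of how these derived attributes could be computed from the weekly logs with pandas is shown below. The log schema (columns student, week, day and hits), the sample values and the exact window over which each statistic is taken are assumptions made for the example, not the authors' original script.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative daily interaction counts extracted from Moodle logs:
# one row per (student, week, day). The schema is an assumption, not the authors' format.
daily = pd.DataFrame({
    "student": np.repeat([1, 2, 3], 2 * 7),
    "week":    np.tile(np.repeat([0, 1], 7), 3),
    "day":     np.tile(np.arange(1, 8), 6),
    "hits":    rng.integers(0, 6, size=42),
})

# Equation (1): Count_i -- sum of the daily interactions within week i.
count_i = daily.groupby(["student", "week"])["hits"].sum()

# Equation (2): Average_i -- weekly interactions divided by the seven days (pandas .mean() over daily values).
average_i = daily.groupby(["student", "week"])["hits"].mean()

# Equation (3): Median_i -- median of the ordered daily values (pandas .median()).
median_i = daily.groupby(["student", "week"])["hits"].median()

# Attributes derived across the weeks observed so far (students x weeks matrix of weekly counts).
weekly = count_i.unstack("week")
zero_weeks = (weekly == 0).sum(axis=1)        # Equation (4): weeks with zero interactions so far
avg_diff = weekly.diff(axis=1).mean(axis=1)   # Average of the Difference between consecutive weeks
commitment = weekly[1] / weekly[1].mean()     # Commitment Factor for week 1: student count / class average

print(pd.DataFrame({"count_w1": weekly[1], "zero_weeks": zero_weeks,
                    "avg_diff": avg_diff, "commitment_w1": commitment}))
```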
Information from the logs was used together with a socio-demographic-motivational questionnaire that was applied to the students in the first week of the course. Questions were created to outline students' profiles, such as their usage of computer/smartphone (whether, how, and how much they use them), the reasons why they chose the ICT program, previous skills in computing and computer programming languages, among others. The idea of using data from the questionnaire was to test to which extent the inclusion of socio-demographic-motivational data about/from the students would improve the performances of the models in comparison with using only data coming from students' interactions within the LMS. The motivational part is a single question related to the main reason students chose the ICT program, that is, personal satisfaction, getting a better job/position, family satisfaction (pressure), applying for a PhD in the future or just getting a degree.

Dataset Generation
The interaction counting was made week by week. It begins in Week 0 (the last week before the beginning of the semester) and goes to Week 17 (the last week of the semester). However, in this work, we have used only data from Week 0 to Week 8 (middle of the semester), since the objective was to early predict the students at risk of failing. It is important to note that in this work we do not distinguish between failure and dropout. We consider as at-risk any student who gets a final grade below 6.0, the grade necessary to pass the course. So, even if a student drops out of the course in the first weeks, he is considered a failing student. Table 1 shows the number of students in each semester and it is clear that there are not many samples. Therefore, we applied the over-sampling technique called Synthetic Minority Over-sampling Technique (SMOTE) [53] to generate synthetic data. This allowed us to compare the performances of the models when using the original datasets and the balanced datasets. The script was developed in Python using the SMOTE implementation in the Imbalanced-learn library [54].
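A minimal sketch of how SMOTE can be applied with the Imbalanced-learn API is given below, assuming synthetic placeholder data; as in the experiments (see Section 4), oversampling is applied only to the training portion after the data has been split, never to the held-out test data.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the weekly datasets (the real features come from the Moodle logs).
X, y = make_classification(n_samples=60, n_features=5, weights=[0.7, 0.3], random_state=42)

# Split first; SMOTE is applied to the training portion only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("class counts before:", Counter(y_train), "after:", Counter(y_res))
```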

Generation and Evaluation of the Models
For classification, we used Naive Bayes, Random Forest, AdaBoost, Multilayer Perceptron (MLP), k-Nearest Neighbor (kNN) and Decision Tree algorithms. However, during the experiments, we removed the last one due to its over-fitting. All these algorithms were implemented using the Scikit-learn [55] library in Python.
Since the number of samples to train, validate and test the classifiers is small, we used Leave-One-Out Cross-Validation. The performance was measured using the Area Under the ROC Curve (AUC ROC) [56]. It is a measure of performance for classification tasks at various threshold settings and represents how well a model can distinguish between classes. Consequently, the higher the AUC, the better the model is at predicting. AUC has been used as a reference metric in related literature such as Gašević et al. [57].
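The evaluation can be sketched with scikit-learn as follows. Because each leave-one-out fold contains a single student, the out-of-fold predicted probabilities are pooled and a single AUC is computed from them; the synthetic data and the choice of AdaBoost here are illustrative assumptions, not the exact experimental setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for one weekly dataset (in the study, X holds the derived attributes up to week i).
X, y = make_classification(n_samples=50, n_features=6, weights=[0.65, 0.35], random_state=0)

# Pool the out-of-fold probabilities from leave-one-out and compute a single AUC over them.
proba = cross_val_predict(AdaBoostClassifier(random_state=0), X, y,
                          cv=LeaveOneOut(), method="predict_proba")[:, 1]
print("AUC ROC:", roc_auc_score(y, proba))
```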

Comparison between Cases
To compare the results and check whether there are differences between performances, we applied the Mann-Whitney U test [58]. This test is suitable for situations where the requisites for the application of Student's t-test have not been met. The Mann-Whitney U test is a non-parametric test applied to two independent samples of the same size, checking the null hypothesis that they come from the same distribution. If the p-value is below a threshold (0.05 in this work), the difference between the two samples is considered statistically significant, that is, unlikely to have occurred by chance.
To answer the research questions, we performed the comparisons presented in Section 4. The difference between performances (and the improvement of one configuration over the other) is considered to exist when there is a statistical difference between them (i.e., the p-value of the test between the two samples is lower than 0.05).
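For reference, the test can be run with SciPy as sketched below; the two vectors of weekly AUC values are placeholders, not the study's actual results.

```python
from scipy.stats import mannwhitneyu

# Placeholder weekly AUC values (weeks 0-8) for two dataset-classifier combinations.
auc_combo_a = [0.78, 0.81, 0.75, 0.80, 0.79, 0.82, 0.77, 0.80, 0.81]
auc_combo_b = [0.74, 0.79, 0.73, 0.78, 0.76, 0.80, 0.75, 0.77, 0.79]

# Non-parametric test of the null hypothesis that both samples come from the same distribution.
stat, p_value = mannwhitneyu(auc_combo_a, auc_combo_b, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")  # a difference is accepted only when p < 0.05
```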

Results
As previously mentioned, the ROC curve was calculated for each model generated for each DB. Sixty-five models were generated for each semester. Models were trained with the weekly counts available up to the prediction week. For example, to get the results for the first week, we fed the classifier with interaction data from week 0 only. For the second week, we used the counts of week 0 and week 1, and so on.
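The week-by-week procedure can be sketched as follows, using one synthetic feature column per week as a stand-in for the weekly attributes and the same leave-one-out AUC evaluation; the median over weeks 0-8, used below to rank the combinations, is reported at the end. The data and the classifier choice are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in: one feature column per week (weeks 0..8) for each student.
# In the study, each week contributes the (non-cumulative) weekly count plus its derived attributes.
X_full, y = make_classification(n_samples=50, n_features=9, n_informative=5, random_state=1)

weekly_auc = []
for week in range(9):                                   # weeks 0 to 8 (first half of the semester)
    X_week = X_full[:, : week + 1]                      # only the features available up to this week
    proba = cross_val_predict(AdaBoostClassifier(random_state=0), X_week, y,
                              cv=LeaveOneOut(), method="predict_proba")[:, 1]
    weekly_auc.append(roc_auc_score(y, proba))

print("weekly AUC:", [round(a, 2) for a in weekly_auc])
print("median AUC over weeks 0-8:", np.median(weekly_auc))
```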
We calculated the mean and median of the AUC values until week 8 (middle of the semester) and sorted the results in descending order by the median, obtaining our Top-5 combinations for each semester. It is essential to say that the interaction count is made individually for each week and is not cumulative. Table 5 shows the Top-5 performances for each semester, considering the combination of the DB and the classifier used. To do this, we need to compute the AUC value for each week. So, we get the predictions on the test set for each week using leave-one-out validation, followed by the calculation of the ROC curve and the computation of the AUC for that week. The median is calculated using the AUC values from Week 0 to Week 8, that is, we take the median of these nine values. The results show that the AdaBoost classifier appears in 11 of 20 cases, making it the most frequent algorithm. Next, we have Random Forest and kNN, each appearing four times. Last, we have the MLP, which appears only once, in the fifth position in 2016-1. The next sections answer the proposed research questions, focusing on the five best results.

RQ1. Which Are the Most Appropriate Datasets to Early Predict at-Risk Students?
To answer this question, Table 5 provides information about the Top-5 DB-classifier combinations for each semester, ordered by the median. The combination that presents the best results is DB2 with the AdaBoost classifier, followed by DB5 with the same classifier, in almost every semester. The exception is 2016-1, where DB12 with AdaBoost (again) achieved the best results. However, the two previously cited combinations (DB2 and DB5) are in the second and third positions, respectively.
To confirm to what extent the differences between the best performances and the others were significant, we applied the Mann-Whitney test. Table 6 shows the results of the tests for each combination.
According to the results of the Mann-Whitney test, there is no significant statistical difference between the best five results. This means that one could use any of the combinations (model + DB) without a significant loss or gain in prediction performance. From now on, we will use "DB2-AdaBoost" as the best combination, since there was no significant difference between it and "DB12-AdaBoost". We chose the former because it also appears as the best combination in the other semesters (2016-2, 2017-1 and 2017-2). It is important to highlight, though, that from these findings there is no single best dataset configuration that one should use to train the models.

RQ2. Is the Sole Use of the Count of Students' Interactions Sufficient to Early Predict Students' Failure in the Course?
We considered the best combination to be DB2 with AdaBoost, since it brought the best result in almost every semester. However, it is necessary to say that there is no significant statistical difference between this combination and the other four in the Top-5 (Table 6). So, we may say that this combination is enough to predict at-risk students. The dataset consists of the interaction count, with the average and median, both variables derived from the first. This dataset may have brought the best results because it has the information on the interaction count and, with the other two variables, gives a notion of the behavior of the students in the past weeks, up to the moment of the prediction. It can also bring some insights into students' engagement during the weeks. Considering that there is no statistical difference between the models generated using DB1 and the other datasets, one could say that the counting of student interactions could be sufficient to early predict students' failure. In other words, the inclusion of several different derived attributes was not sufficient to improve the performance of the models at a statistically significant level. At the same time, it is essential to point out that the best results were obtained from DB2 and DB5, which are variations of DB1 that do not consider the different types of presence (cognitive, social and teaching).

RQ3. Does the Use of Oversampling Techniques (SMOTE) Help the Models to Achieve Better Performances?
To answer this question, we calculated the median of the performances of the models generated with the original DBs (without the application of SMOTE) and compared them with the performances of the models generated with the oversampled DBs. SMOTE was applied on the training set after splitting the data into training/testing sets. To check whether there is any statistical difference between them, we applied the Mann-Whitney test again. Table 7 summarizes the results. Results show that there is an improvement in the median of the AUC values in 9 out of the 16 cases (only for DB5-AdaBoost in the 2016-2 semester did the results get worse), but these differences are statistically significant in only 7 out of the 16 cases (p-value smaller than 0.05). In 2017-1 (Figure 6), it can be seen that the use of SMOTE improved the prediction results, since the AdaBoost-DB5 combination with SMOTE presented the best results across the eight weeks. However, in Week 8, we can see that the Random Forest-DB5 combination presented similar results.
In Figure 7, we can see that the application of SMOTE brought the biggest difference compared with the data without SMOTE. The Random Forest classifier (with DB2 and DB5) presented the best results and the application of SMOTE improved them. However, in Week 8, the results of Random Forest (without SMOTE) and AdaBoost-DB2/DB5 were quite similar.
From the results, one can say that the use of SMOTE helps to improve the performances of the models in only 43.75% of the cases, considering all four semesters. Moreover, the use of SMOTE showed the biggest improvement in 2017-2, where the AUC value for Random Forest with DB2 and DB5 stayed above all the other combinations in all the weeks.

RQ4. Does the Use of Data from Questionnaires Applied at the Beginning of the Course Help to Improve Model Performance?
To answer this question, we used the same methodology as in the previous question, initially without the use of SMOTE and then applying it. Table 8 presents the results.
On the one hand, in Table 8 it is possible to see an improvement in the performances (median of AUC values) in only 3 out of 16 cases, of which only 2 are statistically significant, both in 2016-1 with DB2 and DB5 with AdaBoost. On the other hand, there are some cases where the inclusion of data from the questionnaire decreases the performances of the models, two of which at a statistically significant level. According to these results, we can say that using questionnaire data, without performing feature selection, does not help with the prediction of student failure. We also analyzed the case where the over-sampled data were compared with the DBs plus questionnaire data (also over-sampled). Table 9 shows the results.
In Table 9, it is important to point out that the results got better in 8 out of 16 cases. However, among these 8 cases with better results, there are statistically significant differences in six of them. Based on these results, we can reinforce our previous statement that data from the questionnaire does not help to improve the performance of the models (even when the datasets are balanced).

Conclusions
This work presented a comparative study aiming to find the best combination of dataset and classification algorithm (with and without pre-processing algorithms) to early predict at-risk students in introductory programming courses. Thirteen datasets together with five classification algorithms (k-Nearest Neighbor, Multilayer Perceptron, Naive Bayes, AdaBoost and Random Forest) were used in the experiments.
The literature contains other works that analyze log data from Moodle to generate predictive models for the early identification of at-risk students; the present work differs from them by providing a categorization of the counts of the logs according to the three elements required for a successful computer-mediated learning experience proposed by Garrison et al. [22], that is, the cognitive, social and teaching presences.
We tested to which extent the classification of the counting of the logs into these three dimensions would serve as better datasets for the generation of more accurate predictive models. The main idea was that the different classes of students (Approved versus Reproved) would interact differently in those dimensions of presence and that this could help the models to better capture students' behavior in the learning settings. However, results have shown that there is no improvement in the performance of the models using those three dimensions: cognitive, social and teaching presences. Because of that, one can assume that the simple counting of interactions can be used to generate predictive models, corroborating previous work [59]. This contradicts the findings of other authors, such as Conijn et al. [37], who say that predictive models cannot be generalized from LMS data logs alone and that additional data sources are needed.
Considering that our interest is to early predict at-risk students, we measured the performances of the models until the middle of the semester (8 weeks). It is possible to say that the models achieved performances that can be considered satisfactory (with AUC ROC values of up to 0.9 already in the first week), similar to the results found in the literature, for example, Detoni et al. [25], Howard et al. [46], Sandoval et al. [31] and Lu et al. [47]. These results were found considering the pre-processing of the datasets using SMOTE to balance the classes. Even with the datasets being highly unbalanced, the use of SMOTE did not help to increase the performance of the models, improving them in only 43.75% of the cases. Improvements in the performance of models to predict at-risk students by applying SMOTE were reported in the literature in Costa et al. [24].
Finally, we tested whether the inclusion of general, demographic and motivational information about the students would help to increase the performances of the models. The results show that data coming from the questionnaire did not help to improve the performance, contradicting results of other experiments reported by Tillmann et al. [28] and Adejo and Connolly [39], but corroborating previous findings of Brooks et al. [60].
The performance of the models varied according to the semester and the machine learning algorithm in use. The decision of which model to apply and the best moment to do so would depend on the specifics of the semester. For instance, in some cases, it is possible to observe a drop in the performance of the models for some algorithms as the semester approaches the middle. This is the case, for instance, of AdaBoost-DB2 and AdaBoost-DB5 at week 5 of semester 2017-2 (see Figure 7); in this scenario, it is recommended to use the models generated by Random Forest with the use of SMOTE. From the figures, one could say that the best moment for predicting with good performance, and before any significant loss, would be week 3. Again, for each semester, a given configuration should be picked accordingly.
One of the main contributions of our work is the investigation of the effectiveness of EDM techniques to early detect at-risk students and the extensive comparison of different combinations of classifiers and datasets (five classification algorithms with 13 DBs, generating 65 combinations for each semester). We also investigated the effect of pre-processing algorithms, such as SMOTE, and the use of questionnaire data.
Regarding the courses' context, activities, tests and assignments, an important discussion that we can provide is about the activities and materials the lecturer provided during the four semesters covered in this work. The lecturer gradually improved the quality and the quantity of the resources of the course. This includes VPL exercises, which increased from 53 in 2016-1 to 86 in 2017-2. The number of other resources (slides, websites, examples, tutorials and so on) also increased from 23 to 60 by the end of 2017-2. A deeper analysis of these aspects shows that, after the 4th week, students are autonomous in interacting with the course's resources; more specifically, they can start programming using VPL. There are many interactions in Moodle within the VPL exercises. It seems that the course structure of the 2017-2 version is more intuitive to the students and lets them interact more precisely with the resources. We are able to conclude that a more structured course, with dozens of materials, better fits the students' needs, because they can have good interactions with the course and, consequently, succeed. It also seems that student interaction means engagement, and more engagement leads students to succeed.
The limitation of the work lies in the small number of cases included in each dataset (semester), although this limitation was softened by the use of leave-one-out validation during the training and testing of the models and by the use of SMOTE (which generates and includes new synthetic cases in the samples).
Future work includes testing more pre-processing techniques, aiming to improve the quality of the data, since the number of samples used in this work was small. We also intend to use other classification algorithms or even combinations of them. Deep Learning techniques can also be used for classification. When available, we intend to process data from 2018 and 2019 to check whether there are any differences in the results.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: