Predicting Student Academic Performance by Means of Associative Classiﬁcation

: The Learning Analytics community has recently paid particular attention to early predict learners’ performance. An established approach entails training classiﬁcation models from past learner-related data in order to predict the exam success rate of a student well before the end of the course. Early predictions allow teachers to put in place targeted actions, e.g., supporting at-risk students to avoid exam failures or course dropouts. Although several machine learning and data mining solutions have been proposed to learn accurate predictors from past data, the interpretability and explainability of the best performing models is often limited. Therefore, in most cases, the reasons behind classiﬁers’ decisions remain unclear. This paper proposes an Explainable Learning Analytics solution to analyze learner-generated data acquired by our technical university, which relies on a blended learning model. It adopts classiﬁcation techniques to early predict the success rate of about 5000 students who were enrolled in the ﬁrst year courses of our university. It proposes to apply associative classiﬁers at different time points and to explore the characteristics of the models that led to assign pass or fail success rates. Thanks to their inherent interpretability, associative models can be manually explored by domain experts with the twofold aim at validating classiﬁer outcomes through local rule-based explanations and identifying at-risk/successful student proﬁles by interpreting the global rule-based model. The results of an in-depth empirical evaluation demonstrate that associative models (i) perform as good as the best performing classiﬁcation models, and (ii) give relevant insights into the per-student success rate assignments.


Introduction
Predicting student performance is an established Learning Analytics (LA) problem [1]. In the context of university-level courses, the research community has acknowledged the importance of predicting student performance as early as possible, thus enabling timely interventions targeted to at-risk students [2].
To early predict exam failures or course dropout, many research efforts have been devoted to analyzing data about learners and their learning contexts through machine learning techniques (e.g., [3][4][5][6]). A common approach entails predicting the per-student success rate of an exam well before the end of course by means of classification techniques [7]. Classification aims at learning predictive models from a set of labeled data (i.e., student-related data for which the exam success rate is known). A test set is used to know whether models perform accurately enough. Our goal here is to forecast the success rate of a student based on the output of a classification model. Since student-related data and contextual information change over time, model training is repeated multiple times at different time points (e.g., before student enrolment, at the beginning of the course, immediately before the beginning of the exam session). In this way, classification models incorporate all the information about the students and the learning activities available at the current time.
A strong limitation of many state-of-the-art machine learning models is their limited interpretability and explainability. Model interpretability is ensured whenever the classification model can be easily understood and explored, whereas explainability entails understanding the underlying model characteristics that mainly influence its predictions [8].
In the context of student performance prediction, the lack of model interpretability and explainability could be particularly critical, because teachers can neither verify the appropriateness of the success rate prediction nor tailor the subsequent actions to the actual learners' needs.
Previous attempts to address students' performance prediction using interpretable classification models such as decision rules and fuzzy rules have already been made (e.g., [9][10][11]). However, as discussed in [12], these models may suffer from the problem of adaptability since they could take decisions based on small samples of data and, therefore, the final classifier could not be representative of the overall trends.
This paper proposes to early predict university-level student performance by means of associative classification. Associative classifiers are intepretable yet accurate classification models consisting of association rules [13]. Associative classifiers are known to be more accurate than traditional decision trees and rule-based algorithms. They overcome the problems of decision tree and fuzzy models by focusing on the features of the given test instance, thus increasing the chance of generating more rules that are useful for classifying the test instance [12].
Associative rules represent strong implications between recurrent combinations of feature values and the predicted success rate. For example, a rule may indicate that if a student is female and she has accessed the majority of the course modules, then she is very likely to pass the upcoming exam. The rules are automatically extracted from a (potentially large) labeled dataset, filtered and sorted by relevance, and then applied to unlabeled data. Thanks to their inherent interpretability, associative models can be manually explored and validated by domain experts. The rules applied in the prediction of the exam success rate of each student are known. Hence, they give an insight into the actual motivation behind rate assignment (i.e., local explanations). Assessing the appropriateness of the generated predictions could help teachers to trust the data-driven model, to decide whether to collect new data or not, and to tailor the subsequent actions to specific student profiles. For example, if at-risk profiles are shown to rarely access the online course materials then teachers could foster the use of online materials in order to prevent exam failures.
Motivated by the aforesaid model characteristics, this paper aims at exploring the potential of associative classification techniques to support early student performance prediction. Specifically, the following research questions will be addressed: RQ(1) Are associative models as accurate as the best performing classifiers in predicting the exam success rates of university-level students? RQ (2) What are the most discriminating features to forecast the exam success rates at different time points? RQ(3) Which combinations of feature values have frequently been used to assign the exam success rates?
To answer question (RQ1), we analyze learner-generated data acquired by our technical university, which adopts a blended learning model, using a variety of classification models with various levels of explainability. Specifically, as a case study, we considered about 5000 students enrolled in B.S. engineering courses, and their personal, admission, and scholastic career data. Besides, the learning platform traces students' interaction with educational material, where a massive educational video service, which delivers video recordings of the in-class lectures, has a central role [14]. The achieved results show that the performance of L 3 , a state-of-the-art associative classifier [15] is comparable to that of the best performing (not explainable) models and, on average, superior to than that of decision tree models. To answer questions (RQ2) and (RQ3), we thoroughly analyze the characteristics of the rules applied by the associative classifier to the students under consideration. Furthermore, we monitor profile changes over time. In the considered case study, the performed analyses have allowed us to better understand the career progress of the students enrolled to our university.
The rest of the paper is organized as follows. Section 2 overviews the literature related to early student performance prediction. Section 3 thoroughly describes the proposed methodology. Section 4 presents the main experimental results. Section 5 draws conclusions and discusses the future research directions.

Literature Review
In the context of student performance prediction, recent findings [3] have shown the high variability of classifier performance according to the learning context, the analyzed features, and the considered algorithms. The main research works conducted in this field have mainly addressed the following research questions: (A) What are the most discriminating features to forecast the exam success rates? (B) From which time point can classifier predictions be deemed as reliable? (C) What are the most effective classification techniques? (D) Can we make Learning Analytics solutions interpretable and transparent to the end-users?
A thorough analysis of each branch of research is given below.
(A) What are the most discriminating features? Learner-and context-related features describe the interactions between students, teachers, and Learning Management Systems under multiple aspects. Based on the type of interaction, in [16] the features are classified as student-student, student-teacher, and student-content categories. The results achieved on the Moodle LMS data and presented in [17,18] show that student-student and studentteacher interactions are relevant to predict the success rate of fully online courses, while student-content interactions are deemed as relevant to in-class lectures. The results reported in [19] confirm that learners' habits, social activities, and teamwork styles are relevant to identify the key factors influencing students performance. To deepen the analysis of the interactions between students and LMSs, in [20,21] the authors analyze the data acquired from the Moodle LMS to discover which features (e.g., total time online, number of downloads, amount of communications with peers) are significantly correlated with the final grade. The results show that the total time online and the number of files viewed are the most discriminating features. In [22,23], the authors identify a set of key performance factors influencing student performance in both K-12 and higher education environments. The goal of this work is different to those of all the aforesaid approaches. It proposes a methodology, based on associative classification, to accurately perform predictions and to interpret the results. By exploring the mined association rules, the most significant features can be automatically identified. Notice that since the presented methodology is general, it can be applied to an arbitrary feature set.
(B) From which time point can classifier predictions be deemed as reliable? To allow timely interventions, performance predictions should be performed as early as possible. The works presented in [7,24,25] address this specific issue under different viewpoints. As expected, the prediction accuracy increases when the examination session is approaching. However, for a relevant percentage of students the final grade appears to be strongly correlated with the result of the entry test [25]. Therefore, fruitful information about at-risk students is available very soon. The number of clicks to the online materials made in the week immediately before the course turns out to be another significant predictor [7]. Results of ongoing assessments or previous examination sessions are, as expected, strongly correlated with the grade as well [26]. Hence, they should be considered as soon as they become available. Even the presence of unlabeled data could contribute to the improvement of predictors' reliability [27]. In [3], the authors analyze 17 blended learning courses using the Moodle LMS. The prediction models achieve high recall values (i.e., they identify most of the at-risk students) but low precision values (i.e., the number of false positives is fairly high). The main conclusion drawn by the authors is that it is very hard to find a compre-hensive set of variables that can be used to consistently predict student performance across multiple courses. Hence, there is a need for tailoring prediction models to the learning context under analysis and to deepen the analysis of the extracted correlations among data. In this way, this paper explores the potential of associative classification models, which combine accuracy and interpretability.
(C) What are the most effective classification techniques? Various classification techniques have successfully been applied to early predict student performance. Specifically, a significant effort has been devoted to training traditional and Deep Neural Networks (e.g., [28,29]). In parallel, established approaches such as Support Vector Machines [30], distance-based classifiers [31], ensembles of classification methods (i.e., Gradient Boosting and Random Forest) [32], and time series forecasting methods [6] have achieved fairly high accuracy values. This paper is, to the best of our knowledge, the first attempt to use associative models to address early student performance prediction.
(D) Can we make Learning Analytics solutions transparent to the end-users? Explainable Learning Analytics (XLA) focuses on enhancing the transparency of Machine Learning techniques in support of the Learning Analytics domain [33]. Since most of the ML models act as black-boxes, the outcomes are often hardly interpretable. Hence, the XLA community aims at making the ML outcomes explainable by tailoring them to the particular stakeholders and end-users [8]. For instance, in [9] the authors used the Open University Learning Analytics Dataset to predict students' outcomes. They highlighted the need for using XAI in the educational field. In the context of automated essay scoring, the authors in [34] have studied the impact and trustworthiness of neural networks by means of the SHAP explanation framework [35]). Similar attempts have been made in the domains of computation thinking [36] and knowledge tracing [29]. SHAP produces visual explanations by correlating the input features with the target class. Conversely, the associative classifiers adopted in the present work are not aimed at studying the impact value of a specific feature, but rather focuses on identifying the specific combinations of feature values that are likely to be relevant to predict a given class label.
A parallel effort has been devoted to explaining the early predictions of student performance using tree-based models (e.g., [10,11]). The associative classifiers used in the present work are known to be more accurate than traditional decision trees and rule-based algorithms because they rely on global, co-occurrence-based models [12].

Materials and Methods
Associative classifiers are used to generate per-student predictions of the exam success rates. The architectural schema of the proposed methodology is depicted in Figure 1. The main steps can be summarized as follows. • Data acquisition and preparation: Student-related data are acquired over the whole academic year by the Learning Management Systems (LMS) adopted by the university, collected into a unified repository, and prepared for the classification process. • Associative model learning: Multiple associative classifiers, consisting of a selection of association rules [37], are trained from the prepared data at different time points (e.g., before student enrollment, before the beginning of the course, before the beginning of the examination session). Each classifier describes the most significant correlations between a combination of data features and the success rate of the upcoming exam. •

Model interpretation:
The associative model is manually explored to identify atrisk/successful student profiles based on the extracted rules and to validate the per-student rate predictions based on the associated profile.
A more thorough description of each step is given in the following sections.

Data Acquisition and Preparation
Learning Management Systems are nowadays able to acquire, collect, and store data related to a variety of different student-related data. According to the classification given in [6], we may categorize the collected data features into the following categories:

2.
Student's engagement indicators customized to the course under analysis, e.g., course satisfaction, frequency of logins to online portals, frequency of course materials' accesses and downloads, frequency of video lectures' accesses and downloads, number of discussions posted.

3.
Assessment scores: result of the entry test, grade earned in the previous exams.
In the proposed methodology the values of all the potentially relevant data features are acquired, collected in a unique repository (independently of their corresponding category), and stored into a relational dataset.
A relational dataset consists of a set of records, where each record is a set of items. In our context, items are pairs (feature, value). feature is a textual description of the studentrelated characteristics, while value is the corresponding value taken by the feature. More formal definitions of item and relational dataset follow. Definition 1. Item. Let f i be a label, called feature, which describes a peculiar student-related characteristics. Let Ω i be the discrete domain of feature f i . Each pair ( f i , value i ), where value i ∈ Ω i , is an item.
To deal with continuous attributes, the domain is discretized into intervals, where intervals are mapped into consecutive positive integers. For example, (Entry test result, From 70 to 85) is an item, which indicates that the grade earned by the student at the entry test is between 70 and 85.
. . , f n } be a set of features and Ω = {Ω 1 , Ω 2 , . . . , Ω n } the corresponding domains. A relational dataset D is a set of records, where each record r is a set of items and contains at most one item for each feature in F .
For classification purposes, a feature (hereafter denoted as class) is selected as prediction target. Hereafter, we will consider as class the success rate of the upcoming exam. . . , f n } its corresponding feature set. Let f n ∈ F be the class and let Ω n = C be the class domain. For each dataset record r i such that r i ∈ D, let c i ∈ C be its class value. D is a labeled relational dataset.
A record r ∈ D for which the class value is known is called training (labeled) record. Conversely, a record r t ∈ D for which the value is unknown is called test (unlabeled) record. Table 1 reports an example of labeled training dataset whose records are related to different students of the Mathematical Analysis course at the same time point. Notice that the dataset contains both time-invariant features, e.g., the grade of the entry test (e.g., 65 over 100), and time-dependent ones, e.g., the number of accessed video-lectures). To predict the students' success rates at a given time point, we can train the classifier on the running example dataset (excluding student ids) by setting as class the Success rate feature.

Associative Model Learning
Classification of relational data entails first generating a model from a set of (labeled) training records and then applying it to a set of (unlabeled) test records. Many different classification approaches have been proposed in literature (e.g., Bayesian classifiers, decision trees, Support Vector Machines (SVMs), Neural Networks, and associative classifiers). A relevant drawback of many classification techniques (e.g., Bayesian classifiers, SVMs, Neural Networks) is the limited explainability of the generated models. Predicting student academic performance by using non-explainable models entails fully trusting the prediction outcome, because the patterns used to assign the student's success rate for a given course are unknown [38].
Decision trees and associative classifiers are the two main classes of explainable classification models. Users could explore these models in order to gain insights into classifiers' decisions. Decision trees are popular classification techniques based on treebased structures built on the training dataset [39]. Decision trees perform a greedy search for rules by heuristically selecting the most promising features. Such greedy (local) search may discard important rules. Associative classifiers, on the other hand, perform a global search for rules satisfying some quality constraints (i.e., minimum support) [40]. More specifically, as discussed in [41], decision trees suffer from the problem of adaptability since they could take decisions based on small samples of data and, therefore, the final classifier could not be representative of the overall trends. Associative classification overcomes the aforesaid problem by focusing on the features of the given test instance, thus increasing the chance of generating more rules that are useful for classifying the test instance. Association rules represent strong associations between sets of feature values and the class. Rules are extracted, filtered, and ordered prior to be included in the classification model. In this work we apply the rule extraction and selection methodologies adopted by the L 3 state-of-the-art algorithm [15]. A more detailed description of the rule extraction and evaluation steps is reported below.
Association rules. Let D be a relational dataset (see Definition 1) and let X be an arbitrary set of items in D. An itemset is a set of items, i.e., a combination of feature values occurring in a dataset. In the context of relational data, X is an itemset if all of its items belong to distinct features.
Recalling the running example in Table 1, {(Fraction of video-lectures accessed, <5%), (Success rate, fail)} is an itemset representing the co-occurrence of two items (related to different features) in the source dataset. It indicates that the students who accessed less than 5% of the video-lectures of the course failed the exam.
Itemsets are characterized by their support value [37]. It indicates the fraction of records in the source dataset in which all the items in the itemset co-occur. For example, {(Entry test, [60, 70]), (Success rate, fail)} has support equal to 33% in the running example dataset, because the two items co-occur in two records out of six.
An association rule is an implication X → Y, where X and Y are disjoint itemsets. A more formal definition follows.

Definition 4. Association rule.
Let X and Y be two itemsets in D such that X ∩ Y=∅. An association rule is represented in the form R : X → Y, where X and Y are the body and the head of the rule respectively.
X and Y are also denoted as antecedent and consequent of rule X → Y. For example, {(Entry test, [60, 70]), (Video-lectures accessed, <5%)} → (Success rate, fail) is an association rule. It indicates that the co-occurrence of two specific conditions, i.e., passing the entry test with a grade between 60 and 70 and accessing less than 5% of the video-lectures, is correlated with an exam fail.
Association rule extraction is commonly driven by support (sup), confidence (conf), and correlation (corr) quality indexes [37]. The support of the rule indicates the frequency of occurrence of the implication in the source dataset, while the confidence index represents the rule strength.
Definition 5. Support of an association rule. Let D be a relational dataset. The support (sup) of an association rule R : X → Y is defined as the support of X ∪ Y in D. Definition 6. Confidence of an association rule. Let D be a relational dataset. The confidence (conf) of an association rule R : X → Y is the conditional probability of occurrence in D of itemset Y given itemset X, i.e., con f (R) = sup(X∪Y) sup(X) .
For example, the association rule {(Entry test, [60, 70]), (Video-lectures accessed, <5%)} → (Success rate, fail) has support equal to 33% and confidence equal to 100%, because in all the records in which the antecedent occurs the consequent occurs as well. In our context, the confidence index indicates that in all the cases in which the entry test grade is between 60 and 70 and the number of video-lectures accessed is very low the success rate is fail.
In some cases, measuring the strength of a rule in terms of support and confidence may be misleading [42]. When the rule consequent is characterized by relatively high support value, the corresponding rule may be characterized by a high confidence even if its actual strength is relatively low. To overcome this issue, the correlation (or lift) index [42] may be used to measure the (symmetric) correlation between body and head of the extracted rules. Definition 7. Correlation of an association rule. Let X → Y be an association rule. Its correlation index (corr) is given by corr(X, Y) = conf(X→Y) and conf(X → Y) are, respectively, the rule support and confidence, and sup(X) and sup(Y) are the support counts of the rule antecedent and consequent.
If corr(X,Y) is equal to or close to 1, itemsets X and Y are not correlated with each other. Correlation values significantly below 1 show negative correlation, whereas values significantly above 1 indicate a positive correlation between itemsets X and Y, i.e., X and Y co-occur more than expected. Classification rules. The L 3 classifier consists of a subset of high-quality association rules, hereafter denoted as strong classification rules. A classification rule [13] is an association rule whose consequent is a class item.
Classification rules are selected because they can be directly applied to label test records whose non-class feature values match those appearing in the rule antecedent.
A classification rule is strong if its support, confidence, and correlation values are above (analyst-provided) thresholds.

High-Quality Rules
In [15], the strong classification rules are further partitioned into (i) high-quality rules, i.e., rules used in the classification of training data, and (ii) unchecked rules, that is, rules unused during the training phase, but potentially useful to classify test data. In the next step, we will exclusively consider high-quality rules.

Profile Extraction and Ranking
The associative models generated by the L 3 classifier at different time points are collected and analyzed to gain knowledge about classifiers' decisions.
Classification rules related to rate fail describe at-risk student profiles. For example, rule {(Entry test, [60, 70]), (Fraction of video-lectures accessed, <5%)} → {(Success rate, fail)} describes a profile of students who have achieved fairly good test outcomes and who have not downloaded the video-recordings of the in-class lectures. Conversely, classification rules related to rate pass describe successful student profiles.
Profiles can be classified as (i) at-risk profiles, if they are peculiar to the success rate fail, or (ii) successful profiles, if they are peculiar to success rate pass. Note that profiles are peculiar to a given course and period of time. Hence, they may change while considering different courses and periods. Further profile categorizations can be based on the number and type of features involved. Specifically, • Single-feature profiles are profiles characterized by a single feature category. According to the feature categorization reported in Section 3.1, they can be further classified as profiles on student-specific characteristics (SSC profiles, in short), profiles on student's engagement indicators (SEI profiles), or profiles on assessment scores (AS profiles) depending on the category of the reference feature. • Mixed-feature profiles are profiles that are modeled on multiple data feature categories. They extend single-feature profiles by combining features of different categories.
While single-feature profiles could be identified also using decision tree models, mixedfeature ones are peculiar to associative models, which encompass rules including items belonging to multiple features in the rule antecedent. For the sake of simplicity, mixedfeature profile categories will be denoted by combining the same abbreviations indicated above of single-feature profiles. For example, SSC+AS profiles are those characterized by features belonging to both student-specific characteristics and assessment scores.
For example, rule {(Entry test, [60, 70]), (Fraction of video-lectures accessed, <5%)} → {(Success rate, fail)} can be considered to build an at-risk SEI+AS profile, because the assigned success rate is fail, while the non-class features in the rule antecedent belong to classes Student's engagement indicators and Assessment scores, respectively.
Teachers could be interested either in looking at the profile associated with a specific student or in identifying the most recurrent student profiles. In the latter case, once an associative model has already been applied to a significant number of students, the corresponding profiles can be filtered based on their actual usage. Specifically, identifying which classification rules have actually been applied by the classifier to each student allows teachers to associate students with recurrent risk profiles. For example, if rule {(Entry test, [60, 70]), (Fraction of video-lectures accessed, <5%)} → {(Success rate, fail)} has been used to classify 5% of students, then the corresponding at-risk profile could be deemed as particularly relevant to understand the career progress of the students.
At-risk and successful profiles are first ranked by decreasing number of classified students. Then, the top ranked profiles are manually explored in order to gain insight into the analyzed students. When the students described by a profile have attended the course exam, exam outcomes are used to discriminate between reliable profiles and not. For example, if the majority of the students who were classified using rule {(Entry test, [60, 70]), (Fraction of video-lectures accessed, <5%)} → {(Success rate, fail)} actually failed the exam, then the corresponding profile could be deemed as reliable.
Analyzing reliable students profiles well before the end of the course allows teachers to perform timely interventions. For each profile, one or more targeted actions can be recommended. Actions targeted to reliable at-risk profiles are preventive, because their aim is to prevent failures. Conversely, actions targeted to reliable successful profiles are aimed at reinforcing good practices.
For example, let us suppose that, based on the exploration of the mined rules, a mixedfeature at-risk profile, consisting of all students whose entry test grade is below average and whose percentage of accessed video lectures is very low (<5%), is identified. A strong correlation between entry test grades and accessed video lectures and exams' outcome could prompt university managers to plan (i) additional courses on basic concepts tailored to at-risk students or (ii) reinforcement actions triggered by the discovered profile (e.g., send reminders throughout the course notifying the publication of new materials) Similarly, a profile may represent foreign students who have neither downloaded nor accessed using streaming the video lectures of the course. Periodic recommendations targeted to inactive students can be exploited to foster students' engagement.

Results
We empirically analyzed the applicability of the proposed methodology on studentrelated data acquired by our technical university.
To perform quantitative and qualitative evaluation on the analyzed data, we run experiments on an Intel(R) Core(TM) i7-8550U CPU with 16 GB of RAM running Ubuntu 18.04 server.

Learning Context and Related Data
The university provides B.S. and M.S. engineering courses and adopts a blended learning model. The blended model relies on a massive educational video service, which delivers video recordings of the in-class lectures. The access to the video lectures of a course is allowed to all the enrolled students. Hence, students can exploit video lectures as complementary materials in addition to in-class lectures [43]. Since 2010 the university has video-recorded in the classroom all the courses of the first year of the B.S. in Engineering, that is common to all B.S. engineering curricula. The number of students enrolled to the 1st-year courses is approximately 5000 per academic year.
The learning platform of the university traces the students' interactions with the provided educational materials. Specifically, through the platform students can (i) access and download educational materials (e.g., slides, lecture notes, exam simulations), and (ii) watch (streamed from the educational Web service or downloaded) the video-records of the in-class lectures.
As a case study, we applied the proposed methodology on data acquired in the academic year 2018-2019. Specifically, we focused on predicting the exam success rate for all the students enrolled to the 1st-year B.S. courses. The 1st-year curricula encompasses the following courses: Mathematical Analysis (MA), Chemistry (CH), Computer Science (CS), Linear Algebra (LA), and Physics (PH), plus an elective course which is not considered in the present analysis because data are not comparable. In the considered academic year, courses MA, CH, and CS were held in the first semester (i.e., from 1 October 2018 to 15 January 2019), while courses LA and PH were held in the second semester (from 1 March 2019 to 15 June 2019).
Three examination sessions were scheduled within an academic year: (i) the 1st semester (winter) session, which is held at the end of the first semester (i.e., from 22 January 2019 to 28 February 2019), (ii) the 2nd semester (summer) session, which is held at the end of the second semester (i.e., from 16 June 2019 to 22 July 2019), (iii) the autumn session, which is held after the summer break (i.e., from 1 September 2019 to 30 September 2019). Students are free to attend any and all examination sessions, provided that they have already attended the course.
To early identify at-risk students, for each student and exam we predicted the success rate in the upcoming examination session. Specifically, we trained separate classifiers at the twelve different time points reported in Table 2. For the sake of brevity, each time point will be hereafter denoted by the corresponding identifier. At each time point we predicted the outcome of the considered student in the upcoming exam (independently of the actual exam attendance). More specifically, for the 1st-semester courses (MA, CH, CS) at time points from t 0 to t 5 we predicted the success rate of the winter session exam, from points t 6 to t 9 we predicted the success rate of the exam held in the summer session, while at points t 10 and t 11 we predicted the rate of the autumn session. For the 2nd semester courses, from t 0 to t 9 we predicted the success rate of the exam in the summer session (because the winter session is not eligible for students at first enrolment), while at points t 10 and t 11 we predicted the rate of the autumn session.
For the courses held in the first semester (MA, CH, and CS), the first exam after course attendance is scheduled in the winter session, while for the other courses (LA, PH) the first exam is scheduled in the summer session. However, for all the 1st-year courses students may undergo the corresponding exam in any of the aforesaid sessions (e.g., they can undergo the MA exam the first time during the summer session).

Details on the Source Data
The considered data source consists of two main tables:  Table 5 reports the number of records per time point and course as well as the percentage of records per class. It indicates for each course the number of students considered in the training data at different time points. For example, for the MA course at time points from t 0 to t 5 (i.e., before the first examination) training data consist of 4092 students, among which 1515 students who will pass the upcoming exam and 2577 students who will fail it.  4 15 January 2019 Close to 1st semester exams t 5 22 January 2019 Start of 1st semester exam session t 6 28 February 2019 End of 1st semester exam session t 7 31 March 2019 Early 2nd semester t 8 30 April 019 Mid-way 2nd semester t 9 15 June 2019 Start of 2nd semester exam session t 10 22 July 2019 End of 2nd semester exam session t 11 31 August 2019 After summer break

Performance Comparison between Different Algorithms
We conducted an empirical evaluation of the performance of various classification algorithms on the analyzed dataset. The goal of the experimental analysis is to answer to the following research question (RQ1): Are associative classifiers as accurate as the best performing ones in predicting student academic performance?
To address the above issue, we compared the performance of the L 3 classifier with that of a variety of other classifiers. Notice that despite similar experimental comparisons have already been conducted in previous studies (e.g., [3]), to the best of our knowledge associative models have not been considered yet.

Classification models
We considered the following classifiers [39] k-Nearest Neighbor (kNN): a lazy distance-based classifier (lazy classifiers do not create models, but on-the-fly compute the distances between a test record and each of the training records). 7.
For the L 3 classifier we used the C++ implementation provided by the authors, while for all the other algorithms we used the Python implementations available in the Scikit-Learn library [44]. To tune classifier performance at each considered time point, we performed a grid search by varying the values of the most significant parameters. Hereafter, for the sake of brevity, we will report the results achieved by the best configuration separately for each algorithm.

Time complexity
The time complexity for training and testing the classification models ranged between few seconds on simpler datasets (e.g., 7 s for MLP, 31 s for L 3 ) to approximately one hour in the worst cases. However, most prediction models were generated in less than 60 s.

Performance metrics
To quantitatively evaluate classifier performance at each time point, we applied a stratified 5-fold cross-validation strategy and computed the following performance metrics: • Precision of class fail: It is the ratio between number of students who have been correctly labeled as belonging to class fail (TN) divided by the total number of students assigned to class fail (TN + FN).

•
Recall of class fail: It is the ratio between number of students who have been correctly labeled as belonging to class fail (TN) divided by the total number of students who actually belong to class fail (TN + FP). • F1-score of class fail: It is the harmonic mean of precision and recall of class fail.

•
Balanced Accuracy: It is the average of the recall computed over the two classes and is given by 1 2 ( TP TP + FP + TN TN + FN ) [45]. It evaluates the ability of the classifier to correctly assign both class labels. It is especially useful when the classes are imbalanced in the test sample since it rewards the correct predictions on the minority class. When the test samples are balanced over the two classes, it corresponds to the conventional accuracy measure (i.e., percentage of correct predictions).
While the accuracy measure is independent of the class, the other metrics are specific to class fail. Since the main goal of student performance prediction is to early detect at-risk students, we specifically analyzed the ability of the classifiers to correctly classify this particular category of students. the performance decreases after the first examination session, because predicting students' outcomes at the next sessions (i.e., at the second trial) is significantly more challenging. For the 2nd semester courses (PH, LA) similar results were achieved after the first examination, which is scheduled after time t 9 . L 3 performed as good as the best performing classifiers (K-NN, MLP) while decision tree classifiers (DT and RF) performed worse. Slight fluctuations in the series of accuracy values have shown in the last time points (t 9 -t 11 ). The reason is that since the number of students under evaluation decreases (because the majority of them have already passed the exams), the models are less robust and more sensitive to noise. Figure 2f-t, respectively, plot for each course the F1-score, precision, and recall of class fail achieved by the classifiers at different time points. They describe the ability of the classifiers to accurately predict exam failures. Precision and recall values increase as time goes by (e.g., for MA the L 3 precision ranges from 65% at t 0 to 90% t 10 ). The recall values significantly increase after the first examination session (at t 5 for the 1st semester courses, at t 9 for the 2st semester ones) because students who failed once are more likely to fail again. L 3 performed as good as the best performing classifiers for all the courses and for most of the considered time points.
In order to assess the statistical significance of the performance variations (computed in terms of accuracy, F1-score, precision, and recall of the class fail), we applied the Wilcoxon signed rank t-tests [46] using a significance level equal to 0.5%. The results show that L 3 performed significantly better than DT and RF at specific time points for the majority of the analyzed courses, while it performed as good as the best performing approaches (K-NN, MLP). Hence, the L 3 associative model could be deemed as a reliable model for early predicting student performance.  age decreases its importance. The activities carried out in one course may influence other courses.

533
For example, since MA provides students with the basic concepts for both LA and PH, the exam 534 grade (whenever available) is particularly discriminating.

535
The radar plots in Figure 4 deepen the feature analysis for the 2nd-semester LA course.

Model Exploration
To explore the high-quality rules generated by the L 3 classifier we focused the attention on the most reliable student profiles. Specifically, we first counted, for each high-quality rule, the number of classified students as well as the percentages of correctly and incorrectly classified ones. Next, we analyzed the subset of high-quality rules that (i) have been applied to at least 1% of the students, and (ii) have exclusively produced correct predictions. All the other rules were discarded, as they potentially generated unreliable predictions. Rule exploration aims at answering the research questions RQ2 and RQ3 posed in Section 1. It entails exploring model interpretability since the extracted rules can be easily understood by non-expert end-users.
Research question (RQ2): What are the most discriminating features to forecast the exam success rates at different time points?
To answer to this question, we analyzed the frequency of occurrence of the single features and of the pairs of features in the selected student profiles. Figure 3 shows the percentage of rules including specific features in their antecedent at three representative time points: t 0 (before the entry test), t 1 (after the entry test and the BS course choice, but before the start of the semester) and t 5 (at the end of the first semester). The plot highlights the features that mostly influence student performance separately for each course and time point. At t 0 only the student-specific features are considered and, unsurprisingly, the most important one is the high school grade (HS-gr). Age turned to be strongly correlated with the success rate as well, because students who are older than average often achieved bad results during high school. At t 1 the entry test grade (GRE-gr) plays a significant role in rate prediction, in combination with HS-gr. The BS course choice (e.g., mechanical engineering vs computer science engineering) is also an important feature, because the attitude of the students towards the different disciplines depends on the perceived importance in their future. At t 5 the effect of the download of educational material (e.g., MA-mat) and of the streaming activity (e.g., MA-str) for the different courses are visible. The high school degree remains very relevant, while in general age decreases its importance. The activities carried out in one course may influence other courses. For example, since MA provides students with the basic concepts for both LA and PH, the exam grade (whenever available) is particularly discriminating. The radar plots in Figure 4 deepen the feature analysis for the 2nd-semester LA course. Features are grouped in different categories (personal data and scholastic history, entry test and BS course choice, activity in two sample courses, MA and CS, educational material download and exam success) and their importance is compared at all the time points (from 0 to 9). For example, graph (a) confirms that the high school grade (HS-gr) has an important impact on student performance during the whole academic year. On the contrary, gender is relevant only at the beginning, because the student activity during the semester becomes more and more important to foresee exam success rate. Graph (b) shows that the test grade (GRE-gr) and the BS course choice have an impact only in the first semester, and during the second semester (when the LA course is on) other features become much more important. Graph (c), about students' activity in the MA course, shows that studying MA is strongly related to passing the LA exam. It also evidences that the streaming activity (MA-str) is mainly relevant at the end of the semester and during the exam break: students use educational materials from the beginning of the semester, but concentrate the streaming activity closer to exam sessions. Graph (e) shows that passing the 1st-semester exams (MA-gr, CH-gr, CS-gr) has a strong influence on passing LA in the second semester (the influence is higher in case of MA than CS, coherently with the topic focus). The performance in the first semester of the first year is probably the most discriminating factor in classifying students as active or inactive. Extracting association rules allows us to explore the most relevant item co-occurrences relative to different feature combinations [40]. Figure 5 shows the importance of a subset of relevant combinations of feature for LA at different time points. For example, the scholar history (HS-loc, HS-gr) is relevant to predict success rate before any university activity (i.e., before t 1 ), including entry test and BS course choice. After that, and until almost the end of the semester (t 4 ), the high school grade and the entry test grade coupled with the BS course choice (HS-gr, BS course), (GRE-gr, BS course) became the most relevant feature combination. During the first semester (from t 2 to t 5 ) educational material download of two courses has the strongest effect (MA-mat, CS-mat), showing student active participation. Finally, during the second semester (when LA is given), the most relevant feature combination is the number of downloads of educational material combined with the grade achieved for the MA course (LA-mat, MA-gr).  Table 9 reports a selection of mixed-feature rules. For each rule we report the average support and confidence values computed over all the five cross-validation folds as well as their corresponding standard deviations.
The table content is organized as follows. Each rule has a progressive number. For each rule we report the time point at which the rule was extracted (from t 0 to t 11 , according to the notation introduced in Table 2) and its content (in the form body → head) as well as its main quality indices. For the sake of brevity, the discretized feature values within each rule are abbreviated as None (N), Low (L), Fair (F), or High (H). Since similar rules may be extracted for the other courses, for each rule we indicated which courses have similar rules in column Similar rules. Finally, a short comment on the rule is given in Column Description.
The rules give interesting insights into the career progress of the enrolled students. For example, rule 4 in Table 6 indicates that the students who received fairly low admission grades and who are older than average are more likely to fail the MA exam. To prevent exam failures, the university could organize recovery courses for the students who have not achieved a sufficient grade in the part of the admission tests related to mathematics.
Rule 10 in Table 7 indicates that the students who have downloaded a large part of the educational materials by the first half of the MA course are likely to pass the exam. The extraction of such a profile prompts specific reinforcement action, i.e., stressing the importance of using educational material and encourage students to keep going like they are doing.
Rule 25 in Table 9 indicates that the students who have just downloaded the videorecording of the MA course without accessing the rest of the teaching materials are less likely to pass the exam in the upcoming session, because the number of downloads is positively correlated with the success rate only in conjunction with other indicators (e.g., a high number of accesses to the teaching materials). Conversely, the number of streaming accesses to the video-lectures is correlated with the success rate independently of the other features (see rule 24). Table 6. High-quality single-feature rules mined from the MA datasets including only SSC features. minsup = 1%, minconf = 50%, mincorr = 2. Support and confidence values of each of the selected rules are averaged over the 5 crossvalidation folds (average and standard deviation are specified).   Streaming of other courses has positive impact, confirms t 2 and t 3 CS, CH (with fail) MA-str = F, CS-str = F 10.7 ± 0.2 69.6 ± 0.9 7 Table 9. High-quality mixed-feature rules mined from the MA datasets. minsup = 1%, minconf = 50%, mincorr = 2. Support and confidence values of each of the selected rules are averaged over the 5 cross-validation folds (average and standard deviation are specified).

Takeaways
In light of the outcomes of the feature analysis and of the rule exploration in Section 4.3, the following conclusions can be drawn: • The high school degree heavily influence students' performance. In the example rules, this is already evident at t 0 (see Table 6) for the MA course, but the feature is very relevant during the whole academic year (see (a) in Figure 4) and the result is valid for all the courses (see Figure 3). Planning ad hoc remedial courses for students with low high school grade is therefore a suitable action to prevent student drop-out. • Age has also a significant impact. This is not surprising because students that are older than the average likely had below average results during high school or are part-time workers. Rules 3, 4 and 5 in Table 6 confirm this statement for the MA course, but the feature is always relevant, especially in the first part of the academic year (see (a) in Figure 4) and the result is valid for the majority of the courses (see Figure 3). The fact that the influence of this feature decreases during the semester shows that motivated students learn to react putting extra effort in the study. Awareness actions toward this category of student can have a positive effect. • Inactivity as regards educational material download is strongly related to failure. Rules 9, 13, and 16 reported in Table 7 for the MA course show that this holds during the whole semester. A proactive, reiterated invitation to use available educational materials could help students. • It is very important to start the study activity very soon. Table 7 shows that a limited number of downloads of the MA educational materials is enough at the beginning of the semester (rule 6), but it is not later on (rules 12 and 15). This result is coherent with graph (e) in Figure 4: the educational material download activity has a strong influence on student performance already at the beginning of the semester. This is a useful recommendation for the students who want to improve their academic performance. • Putting effort on many courses at the same time is a strategy that pays at the beginning of the semester but not close to the exam session. Rules 8 in Table 7 show that working on more that one course at t 2 increases the chance to pass the MA course, while rule 17 shows that the same kind of behaviour at t 4 yields opposite effects. Students should therefore be invited to work hard since the beginning of the semester, but they should also be warned that they should focus on a specific course when they are close to the exam. • The use of the video-lecture streaming service is always positive. Table 8 shows that videolecture streaming activity, even if limited, has a positive impact on passing the MA exam. This is valid for the use of MA video-lectures (rules 18, 20 and 22), but also for the use other course video-lectures (rules 19, 21 and 23), because this activity likely identifies active and motivated students. Rule 24 in Table 9 adds that streaming is positive even without downloading educational material. This outcome is very positive for our institution, since it proves that the video-lecture service is valuable, besides being appreciated by the students. Encouraging students to actively use the service is another fruitful action to prevent failure and drop-out. • Downloading video-lectures is not enough. Rule 25 in Table 9 shows that video-lectures download without use of educational material is not enough to pass the MA exam. This rule identifies the students that simply download all the video-lectures for a later use, but that very likely (since they do not download the accompanying educational material) do not actually watch them. This result is supported by what is evident in graphs (c) and (d) of Figure 4. Video-lecture streaming activity and educational material download activity are shown to be indicators for exam success, while this is not the case for the video-lectures download activity.

Conclusions
The paper proposes to exploit associative classification to early predict student academic performance. Associative models are shown to be as accurate as the best performing classifiers on real student-related data acquired by our university. Thanks to the interpretability of associative models, domain experts can identify relevant at-risk and successful student profiles. Thanks to their explainability, they can also validate the assignments of the exam success rates by exploring the rules applied during the classification phase.
In the reported case study, the analysis of the rule-based models has allowed us to classify as at-risk students the ones with any combination of the characteristics enumerated below: (i) low high-school or entry test grades; (ii) older than the average; (iii) limited use of educational material, in general, or simply download video-lectures (no streaming); (iv) do not start to work from the beginning of the semester; (v) do not concentrate their effort on one course at a time close to the exam session. The achieved results confirm the applicability of the proposed methodology and open the following research questions, which we plan to address as future work.
• Is the associative model effective in predictive exams' outcomes in other learning contexts (e.g., higher level courses, university-level M.S. courses)? • Could associative models be integrated into an automated decision support systems that triggers personalized alerts based on the outcomes of the early prediction process? • How can student profiles be effectively processed and visualized in order to continuously monitor the advances in the students' learning process?