Interpretable Success Prediction in Higher Education Institutions Using Pedagogical Surveys

Abstract: Indicators of student success at higher education institutions are continuously analysed to increase students' enrolment in multiple scientific areas. Every semester, students respond to a pedagogical survey that collects their opinions of curricular units in terms of content and teaching methodologies. Using this information, we intend to anticipate success in higher education courses and prevent dropouts. Specifically, this paper contributes an interpretable student classification method. The proposed solution relies on (i) a pedagogical survey to collect students' opinions; (ii) a statistical data analysis to validate the reliability of the survey; and (iii) machine learning algorithms to classify the success of a student. In addition, the proposed method includes an explainable mechanism to interpret the classifications and their main factors. This transparent pipeline was designed to have implications for both digital and sustainable education, impacting the three pillars of sustainability, i.e., economic, social, and environmental, where transparency is a cornerstone. The work was assessed with a dataset from a Portuguese higher education institution, contemplating multiple courses from different departments. The most promising results were achieved with Random Forest, reaching 98% in both accuracy and F-measure.


Introduction
There are more than 20 million students in higher education institutions (HEIs) in the European Union. However, according to Vossensteyn et al. (2015) [1], around seven million (36%) will not finish their studies. Shapiro et al. (2015) [2] present a similar picture where out of 20.5 million existing students, around 39% will interrupt their studies.
The Organisation for Economic Co-operation and Development (OECD) [3] reports that other countries show similar behaviours in higher education (Australia and New Zealand (20%), Israel (25%), and Brazil (52%)). Further noteworthy observations can be made regarding dropouts. In the United Kingdom, students who start higher education at 21 years old are more likely to drop out during the first year than those who enrolled right after high school (11.8% vs. 7.2%, respectively). The success of students may depend on the students' characteristics and on the conditions provided by HEIs. Specifically, students with a lower ranking in HEI admissions tend to underperform compared to the remaining students. By using both the students' characteristics and HEI information, it is possible to predict the success of students employing appropriate tools [4]. The advent of information technologies has dramatically increased the amount of data generated within HEIs, and it is important to extract useful knowledge from these data through analysis, classification, and explanation. The remainder of this paper is organised as follows: Section 2 reviews the related work; Section 3 describes the proposed method; Section 4 presents the experiments and results; finally, Section 5 concludes and highlights the achievements and future work.

Related Work
HEIs are evaluated either by the labour market or by national higher education assessment agencies. In addition, the universities' rankings are based on their students' success (academic and professional). To improve student progress, the data generated within academic activities can be used to recognise patterns and make suggestions to boost student performance. Despite the concern about reducing the number of dropouts, it is also essential to understand which decisions need to be implemented at the strategic level, concerning policies, strategies, and actions that institutions carry out as a whole.
Abdallah and Abdullah (2020) [23] and Rastrollo-Guerrero et al. (2020) [24] provide an extensive literature review of intelligent techniques used to predict student performance. In particular, Abdallah and Abdullah (2020) analyse three perspectives: (i) the learning results predicted; (ii) the predictive models developed; and (iii) the features which influence student performance. Both reviews conclude that most of the proposed methods focus on predicting students' grades. Regression models and supervised machine learning are the most used approaches to classify student performance.
The current literature review contemplates recent works that classify student performance by analysing both the methods and the data used. For instance, to predict student success, the authors of [29] predict the students who might fail final exams. The proposed model employs the course details and the students' grades, comparing Naive Bayes (NB), neural network (NN), support vector machine (SVM), and decision tree (DT) classifiers.
In addition to the face-to-face environment, other models provide learning conditions without spatial and temporal restrictions through online learning platforms, such as Massive Open Online Courses (MOOCs), Virtual Learning Environments (VLEs), and Learning Management Systems (LMSs). Although it gives students more autonomy, online learning has raised multiple challenges, such as a lack of interest and motivation and low engagement and outcomes. In this context, a blended learning methodology is seen as an alternative because it combines face-to-face and online learning approaches [30]. In the literature, student success classification has also been explored: for instance, the authors of [33] developed a predictive model to identify at-risk students across a wide variety of courses. The proposed method relies on course details as well as Moodle action logs to apply CatBoost (CB), Random Forest (RF), NB, Logistic Regression (LR), and k-nearest neighbours (KNN) classifiers.

Table 1 compares the reviewed classification models concerning success/failure or dropouts in HEIs. A significant number of research proposals explore the students' grades and demographic information, excluding the students' opinions concerning the course organisation. In addition, these ML models leave users with no clue about why a classification has been generated. Interpretability and explainability are essential to understanding ML-generated outputs. According to Berchin et al. (2021) [34], transparency promotes a change towards sustainability. Explainability and interpretability can often be used interchangeably [35]. Specifically, interpretability is "loosely defined as the science of comprehending what a model did (or might have done)" [36], implying a determination of cause and effect. A model is interpretable when a human can understand it without further resources. However, ML incorporates both self-explainable and opaque models: while opaque models behave as black boxes, interpretable mechanisms are self-explainable.
Naser (2021) [37] details the level of explainability of the models depicted in Table 2.

Table 2. Interpretability of models [37].

Model | Interpretability | Mechanism
Linear/Logistic Regression | Self-explainable | Mathematical-based

Interpretability and explainability have been explored to explain predictions [38], recommendations [39], or classifications [40]. In this context, several solutions were developed to be coupled with ML models [39,41-43] to provide explanations. However, scant research has addressed incorporating those explainable models in dropout prediction or success classification in HEIs. Wang and Zhan (2021) [44] identify interpretability as the main limitation of artificial intelligence technologies in higher education. To address this gap, the current work concentrates on: (i) employing students' opinions via pedagogical questionnaires; (ii) implementing a transparent classification method with explanations; and (iii) using a face-to-face environment. Therefore, we propose a method which promotes transparency in higher education, supporting the three pillars of sustainability, i.e., economic, social, and environmental.

Proposed Method
This paper proposes an explainable method to classify the success of HEI courses using pedagogical questionnaires. Figure 1 introduces the proposed solution, which adopts ML classification algorithms to generate interpretable success classifications. The proposed method encompasses: (i) as input, the pedagogical questionnaire (Section 3.1); (ii) a data analysis module to assess the reliability of the data (Section 3.2); (iii) a classification method (Section 3.3) to automatically classify the success; and (iv) an interpretability mechanism to describe and explain the outputs (Section 3.4). The transparency of the solution supports the concept of sustainability in HEIs. The effectiveness of the proposed method is evaluated using standard classification metrics (Section 3.5). Figure 1 depicts the different modules of the proposed solution.

Pedagogical Questionnaire
The proposed solution relies on a pedagogical questionnaire of a Portuguese HEI. The importance of pedagogical questionnaires has increased, as they contribute to continuously improving and adapting the teaching system in HEIs.
The questionnaire is identical for all curricular units (CUs), covering three categories: (i) the teaching activity (not included in this study); (ii) the CU (Cat 1); and (iii) the student's performance in the CU (Cat 2). Each category encompasses four different questions using a 7-level Likert scale (level 1 represents the lowest value and level 7 the highest). Table 3 details the questionnaire composition and the content of the questions. We used eight consecutive academic years (2013/2014 to 2020/2021) and four courses from three different departments. Table 3. Questionnaire description.

Variable | Categories | Scale
Year | 2013/2014 to 2020/2021 | Ordinal
Cat 1 - Curricular Unit | Q1: The usefulness of the CU is perceived. Q2: There was articulation between the syllabus. Q3: The program was adapted to the skills of the students. Q4: The bibliography is adequate to the contents of the program. | Ordinal (7-level Likert scale, where level 1 represents the lowest value and level 7 the highest)
Cat 2 - Student's performance in the CU | Q5: I was motivated for the CU. Q6: I performed the tasks proposed in class. Q7: I regularly studied the subjects. Q8: I used the materials provided by the teacher. | Ordinal (7-level Likert scale, where level 1 represents the lowest value and level 7 the highest)

Data Analysis
The current work employs quantitative research to analyse the students' opinions using structured questionnaires [45]. According to the most adopted approaches, i.e., power analysis [46] and Hair's rules of thumb [47], the sample size used is sufficient for this study. Specifically, the data analysis is a three-phase process composed of:

• Cronbach's alpha reliability analysis [48], used to verify whether the variability of the answers effectively resulted from differences in students' opinions.

• Descriptive analysis, conducted employing univariate and multivariate analysis. It uses descriptive and association measures, e.g., Pearson and Spearman correlations, graphical representations, and categorical principal component analysis (CATPCA). While Pearson's correlation coefficient measures the intensity and direction of a linear relation between two quantitative variables, Spearman's measures the dependence between ordinal variables using rankings [49]. In a positive correlation, two variables tend to follow the same direction, i.e., one variable increases as the other increases. In a negative correlation, the behaviour is the opposite, i.e., an increase in one variable is associated with a decrease in the other. Principal component analysis (PCA) can transform a set of p correlated quantitative variables into a smaller set of independent variables called principal components. Because the pedagogical survey is expressed on a 7-level Likert scale, the optimal scaling procedure was used to assign numeric quantifications to the categorical variables, i.e., CATPCA.

• Inferential analysis, employing the t-test for means and Levene's test for variances. The t-test is used to verify whether two population means are significantly different. This test requires the validation of the normality assumption in the two groups and of the homogeneity of variances (Kolmogorov-Smirnov and Levene's tests, respectively).
Therefore, for two populations 1 and 2, where Y follows a normal distribution, the hypotheses to be tested are H0: µ1 = µ2 versus H1: µ1 ≠ µ2. The hypothesis H0 is rejected if the p-value < α, where α is the adopted significance level (5% or 1%). When the population variances are not homogeneous, the test statistic used to assess the equality of means is Welch's t-test.
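As a rough illustration of this pipeline, the reliability and inferential steps can be sketched with numpy and scipy. All data below are invented for illustration: the Likert answers share a latent opinion factor, and the two grade groups stand in for small and big courses.

```python
import numpy as np
from scipy import stats

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) answer matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical Likert answers (rows = students, columns = Q1..Q4):
# a shared latent opinion plus small per-question noise.
rng = np.random.default_rng(0)
base = rng.integers(3, 8, size=(200, 1))
answers = np.clip(base + rng.integers(-1, 2, size=(200, 4)), 1, 7).astype(float)

alpha = cronbach_alpha(answers)

# Welch's t-test for the equality of two means, used when Levene's test
# rejects the homogeneity of variances.
grades_small = rng.normal(14.5, 1.5, size=120)   # hypothetical small-course means
grades_big = rng.normal(13.8, 2.0, size=300)     # hypothetical big-course means
levene_p = stats.levene(grades_small, grades_big).pvalue
t_stat, p_value = stats.ttest_ind(grades_small, grades_big, equal_var=False)
```

Passing `equal_var=False` to `ttest_ind` selects exactly the Welch variant described above, which drops the homogeneity-of-variances assumption.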

Classification
This work employs batch ML classification. The experiments involve multiple classification algorithms to identify the most promising results. The binary classification algorithms selected from scikit-learn (available at https://scikit-learn.org/stable, accessed on 12 April 2022) are well-known interpretable models with good performance:

• NB is a probabilistic classifier based on Bayes' theorem [50].

• DT can be employed in prediction or classification tasks. The model is based on a tree structure which embeds decision rules inferred from data features. Decision trees are self-explainable algorithms, easy to understand and interpret [51].

• RF is an ensemble learning model which combines multiple DT classifiers [52] to provide solutions for complex problems.

• Boosting Classifier (BC) is an ensemble of weak predictive models which allows the optimisation of a differentiable loss function [53].

• KNN determines the nearest neighbours using feature similarity to solve classification and regression problems [54].
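A minimal sketch of such a comparison, using scikit-learn estimators corresponding to the list above on a synthetic stand-in for the survey features. The dataset, hyperparameters, and the choice of GradientBoostingClassifier for BC are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey feature matrix: the real inputs are
# Q1..Q8 plus per-CU aggregates, which are not reproduced here.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

classifiers = {
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "BC": GradientBoostingClassifier(random_state=42),  # boosting stand-in
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
accuracies = {name: clf.fit(X_train, y_train).score(X_test, y_test)
              for name, clf in classifiers.items()}
```

Each entry of `accuracies` is the test-set accuracy of the corresponding classifier, which is how the candidates can be ranked before the deeper macro/micro evaluation.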

Interpretability
Interpretability and explainability are relevant to describing and understanding the outputs generated by ML models.
The proposed solution adopts local interpretable model-agnostic explanations (LIME) [42], which determine the impact of input feature variations on the output. LIME allows us to understand the impact of the pedagogical survey variables in the success or failure scenarios. Therefore, for each classification, LIME generates the corresponding explanations, promoting the transparency and, consequently, the sustainability of the method.
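To illustrate the idea behind LIME without depending on the lime package, the following sketch hand-rolls its core mechanism: perturb an instance, query the black-box model, weight the perturbed samples by proximity, and fit a local linear surrogate whose coefficients approximate feature impacts. The toy model, data, and kernel width are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy black-box model: "success" driven mainly by feature 0
# (a stand-in for the percentage of failed students).
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

def lime_style_explanation(model, x, n_samples=500, width=0.75):
    """LIME's core idea: perturb x, weight samples by proximity to x,
    and fit a linear surrogate whose coefficients are feature impacts."""
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    p = model.predict_proba(Z)[:, 1]                 # black-box outputs
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / width ** 2)               # exponential proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_

# Explain an instance sitting on the decision boundary.
coefs = lime_style_explanation(model, np.zeros(4))
```

In practice the paper uses the LIME library itself; this sketch only shows why the resulting bar charts (Figures 4-6) can be read as signed per-feature contributions.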

Evaluation
The model evaluation allows us to assess the feasibility and effectiveness of the solution. The proposed method employs offline processing to learn the behaviour of the students. Specifically, the data are partitioned into a training set, used to build an initial model, and a testing set, used to assess the quality of the classifications. The quality of the classifications is assessed using standard evaluation metrics:

• Classification accuracy indicates the performance of the ML model. For binary classification, it uses the number of positives and negatives in the classification.

• F-measure, in macro- and micro-averaging scenarios, assesses the effectiveness of the model employing precision and recall. While precision is the percentage of correct positive classifications, recall is the ability of a model to find all relevant cases in the dataset. The combination of macro- and micro-averaging provides an overall evaluation of the models across all target classes (assigning the same weight to each class) or over all individual instances.
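These metrics can be computed directly with sklearn.metrics. The labels below are hypothetical; note that for a binary task the micro-averaged F-measure coincides with accuracy:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth (1 = success, 0 = failure) and classifier output.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)   # correct positive classifications
recall = recall_score(y_true, y_pred)         # relevant cases found
f1_macro = f1_score(y_true, y_pred, average="macro")  # same weight per class
f1_micro = f1_score(y_true, y_pred, average="micro")  # pooled over all samples
```

With one false positive and one false negative out of ten instances, accuracy and micro F-measure are 0.8, while precision and recall for the success class are both 5/6.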

Experiments and Results
We conducted several offline experiments using the results obtained from pedagogical questionnaires to evaluate the proposed method. Our system has the following specifications: • Operating System: Windows, 64-bit.

Dataset
The dataset was built from pedagogical surveys in a Portuguese HEI. The data collected encompass: (i) two semesters; (ii) eight consecutive academic years (2013/2014 to 2020/2021); (iii) four courses (of 3 or 4 years); and (iv) three different departments. The dataset contains 87,752 valid responses to eight questions of two categories (Section 3.1). Table 4 describes the content of the dataset. In particular, the target was defined considering the median of failed students, i.e., a response is considered a success if the failed percentage is lower than the median. The distribution of classes is balanced, with 44,089 responses marked as success and 43,636 as failure.
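The median-based target construction can be sketched as follows; the DataFrame columns and values are illustrative placeholders, not the real dataset:

```python
import pandas as pd

# Hypothetical per-CU aggregates; column names and values are illustrative only.
cu = pd.DataFrame({
    "cu_id": ["A", "B", "C", "D"],
    "failed_pct": [5.0, 12.0, 30.0, 18.0],
})

# Success (class #1) when the failure percentage is below the median,
# failure (class #0) otherwise.
median_failed = cu["failed_pct"].median()
cu["target"] = (cu["failed_pct"] < median_failed).astype(int)
```

Splitting at the median is what makes the resulting class distribution approximately balanced, as reported for the real dataset.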

Data Analysis
A feature analysis identifies the most promising independent features to predict the target variables. Therefore, we start employing a descriptive and exploratory analysis of the dataset.
Cronbach's alpha reliability analysis verifies whether the variability of the answers results from differences in students' opinions. The eight questions of Cat 1 and Cat 2 present a Cronbach's alpha of 0.953, revealing excellent reliability. To assess whether students consciously responded to the questionnaire, the percentage of students who attributed the same value to all eight questions was explored. The results indicate that only 1.4% of the answers have the same value, which is a very positive aspect of the dataset. The descriptive analysis starts with the mean, median, mode, standard deviation (SD), and coefficient of variation (CV), depicted in Table 5. The three location measures, i.e., mean, median, and mode, tend to be good and similar for all the questions except Q7. In the remaining questions, the most frequent answer is 7, while in question Q7 it is 5, i.e., the classifications tend to be lower. The variability measured by the CV shows moderate dispersion, indicating some lack of homogeneity in the responses. Therefore, the most appropriate location measure is the median, which was used to create the target variable in the dataset.
In terms of correlations, we started by employing Spearman's correlation coefficient, illustrated in Figure 2. The results show that the variables are highly positively correlated, i.e., the students' opinions in all questions tend to follow the same direction. The highest correlations (rs > 0.7) are observed among variables from Cat 2 (student's performance in the CU). All correlations are statistically significant at 1%. Because the courses in the dataset have different sizes, we employed the same analysis for two groups, i.e., small courses (<200 students) and big courses (≥400 students). The results point to a more favourable opinion among students enrolled in small courses in terms of motivation (Q5) and regular study of the subjects (Q7).
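A Spearman correlation of this kind can be reproduced on synthetic answers with scipy.stats.spearmanr; the two Likert columns below are invented to co-vary, mimicking the strong monotone association observed among the Cat 2 questions:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical Likert answers built to co-vary: Q7 follows Q5 up to noise.
q5 = rng.integers(1, 8, size=300)                       # motivation
q7 = np.clip(q5 + rng.integers(-1, 2, size=300), 1, 7)  # regular study

r_s, p_value = spearmanr(q5, q7)
```

Spearman's coefficient is preferred here over Pearson's because the Likert answers are ordinal: it compares rankings rather than assuming a linear relation between interval-scaled values.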
Regarding the behaviour of students in both semesters, it is interesting to highlight that motivation (Q5) and frequent study of the subjects (Q7) decrease from the 1st to the 2nd semester. Accumulated tiredness over the academic year can explain this scenario. Additionally, the 2nd semester contains several academic breaks and activities, which can decrease the amount of time spent studying.
In addition to the information collected using the pedagogical questionnaires, we associate with each CU the number of enrolled students (# students), the number of those who failed (# students failed), the number of students who suspended/annulled the CU (# suspended), and the mean grade of the CU. In this context, it is relevant to assess the correlation between the mean and the remaining variables. Because the real data encompass the number of enrolled students per CU, the correlations were calculated with the numbers of students instead of percentages (Table 6). The results indicate strong positive correlations between # students and # suspended (0.769) and between # students and # students failed (0.654). It is also worth noting a strong negative correlation between the CU mean and # students failed (−0.568), i.e., as the number of failures increases, the mean decreases.
Given these high correlation values, we chose to build new variables representing the percentage of students who suspended/annulled and the percentage of students who failed.
An inferential analysis with t-test was used to validate the hypothesis that the size of the courses has an impact on the mean. Specifically, t-test determines whether there are significant differences between the means of the two groups. The normality and the homogeneity of variances in the two groups were evaluated, with the Kolmogorov-Smirnov test (p-value = 0.06 for group 1 and p-value = 0.04 for group 2) and with the Levene test (p-value = 0.000), respectively.
Although group 2 does not follow a normal distribution, the asymmetry and kurtosis values (−0.36 and −0.14, respectively) are not very high, so we can proceed with the test. On the other hand, once the hypothesis of the equality of variances is rejected, the test statistic used for the equality of means is Welch's t-test (Table 7).
The differences between the means are considered statistically significant because the p-value = 0.00 leads to the rejection of the hypothesis H0: µ1 = µ2. In short, small courses have higher mean grades than big courses. In this context, it is important to measure the dependence of the dataset variables on the size of the courses, i.e., small courses (S) or big courses (B). In addition, the answers in small courses are more homogeneous (i.e., the CV in small and big courses is 21.5% and 26%, respectively).

The CATPCA multivariate statistical analysis explores the principal components concerning S and B. The component retention criterion used was an eigenvalue greater than 1. Table 8 presents the Cronbach's alpha and the variance accounted for in total. According to the eigenvalue-greater-than-1 rule, it is possible to summarise the relational information between the variables in two orthogonal components (dimensions 1 and 2). More than 80% of the total variance of the 11 variables is explained. The internal consistency of each component was measured with Cronbach's alpha; no further components were retained, as their very low Cronbach's alpha values indicate unreliability. Specifically, the Cronbach's alpha of dimensions 1 and 2 is 0.919 and 0.571, respectively. These values represent the reliability of each dimension and are not cumulative; the total value (0.965) represents the reliability of the general model composed of dimensions 1 and 2.

Table 9 presents the weights of the variables for each component, i.e., the component loadings. We selected, for each dimension, the variables with component loadings greater than 0.5 in absolute value. Therefore, while dimension 1 is determined by variables Q1 to Q8, dimension 2 uses the mean per CU, the percentage of students who failed (% Failed CU), and the percentage of students who suspended/annulled (% Suspended CU).
We can conclude that dimension 1 defines the "students' opinion about the CUs" and dimension 2 the "CUs' performance". Figure 3 graphically illustrates the principal components provided by CATPCA. Furthermore, in general, we can conclude that small courses present, in dimension 1, higher averages and a lower percentage of failures and dropouts than big courses. Therefore, the results obtained by the CATPCA analysis reinforce the preliminary descriptive study. In the appendix, we include further results concerning the CATPCA with objects labelled by course size.
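CATPCA itself is not available in scikit-learn; as an approximation of the retention step only, the following sketch applies standard PCA with the eigenvalue-greater-than-1 (Kaiser) rule to synthetic data in which two latent factors drive 11 observed variables, mirroring the two retained dimensions. All data here are invented for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical stand-in for the 11 analysed variables: one latent factor
# drives the eight questions, another the three per-CU aggregates.
opinion = rng.normal(size=(500, 1))
performance = rng.normal(size=(500, 1))
X = np.hstack([
    opinion + 0.3 * rng.normal(size=(500, 8)),       # Q1..Q8
    performance + 0.3 * rng.normal(size=(500, 3)),   # mean, % failed, % suspended
])

Xs = StandardScaler().fit_transform(X)
pca = PCA().fit(Xs)
eigenvalues = pca.explained_variance_

# Kaiser criterion: retain components with eigenvalue greater than 1.
n_components = int((eigenvalues > 1).sum())
explained = float(pca.explained_variance_ratio_[:n_components].sum())
```

On this construction the rule retains two components that together explain most of the total variance, analogous to the 80%+ reported for the real data; the optimal-scaling step of CATPCA is deliberately omitted from the sketch.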

Classification
The proposed method estimates whether the student will succeed using the flag target feature. While class #0 represents the students who failed, class #1 stands for success. Based on the data analysis results using the entire dataset, the experiments have been performed employing three different sets:

1. Complete dataset combining small and big courses;
2. Big courses;
3. Small courses.

Table 10 contains the macro and micro classification results. The values obtained are consistent in most cases for all classifiers, with accuracy increasing according to the set used. In addition, the proposed solution provides promising results with the differentiation of courses with distinct numbers of students. The best classifiers in all the experiments are RF and the boosting classifier (BC), with approximate results. When compared with BC, RF requires less computational effort. Therefore, to explain the classifications, the proposed method focuses on the RF algorithm.

Explanations
The RF classifier provides promising results and is based on decision trees, which are interpretable models. The proposed method applies LIME to each subset to understand the impact of the multiple features on the final classifications. A final explanation is generated via decision tree visualisation.

Features Impact
Subset 1 encompasses the answers of courses of different dimensions. Figure 4 displays the LIME explanations for a success scenario. The classification of this student was predicted with a probability of success of 78% based on the features failed, suspended, mean, Q6, Q8, and Q7. From the LIME explanations, we can conclude that the variables which contribute the most to the classifications are the percentage of students who failed and the mean of the CU. In terms of questions, the questions from Cat 1 are associated with failure classifications in this subset.

Subset 2 encompasses the answers of courses with a higher number of students. Figure 5 displays the LIME explanations for a failure scenario in big courses. The classification of this student was predicted with a probability of 99% based on the features failed, mean, suspended, and Q7. From the LIME explanations, we can conclude that the variables which contribute the most to the classifications are the percentage of students who failed and the mean of the CU. In terms of questions, in this subset, which integrates big courses, the answers tend to be lower. In addition, the questions from Cat 2 appear to be associated with the success classification.

Subset 3 encompasses the answers of courses with a smaller number of students. Figure 6 displays the LIME explanations for a failure scenario in small courses. The classification of this student was predicted with a probability of 99% based on the features failed, Q1, Q4, and Q3.

Visualisation
For the proposed method, the RF classifier provides the best results. It is based on decision trees, which are interpretable models. Therefore, in addition to the LIME explanations, we generate automatic sentences from the tree covering the relevant subset of branches, namely from the root to the classification leaf, using the categories of the pedagogical survey (CU and student's performance in the CU). Figure 7 depicts the decision tree extracted from subset 1. For this tree, we generate an explanation to understand why a student has failed, using only the leaves associated with questions: "At the CU level, the program was not adapted to the skills of the student (Q3). Concerning the student's performance, the tasks proposed in class were not achieved (Q6) because the student was not motivated for the CU (Q5)". With these explanations, the CU can be adapted in order to minimise the failure rate.

Conclusions
An HEI aims to increase its ranking by seeking to provide up-to-date teaching and research methodologies and, consequently, provide good professionals to the labour market. In this regard, an HEI will collect and analyse indicators related to the students' performance. In particular, every semester, students respond to a pedagogical survey to manifest their opinion concerning the CU content and teaching methodology.
Emphasising the student opinion, this work proposes a transparent method to predict the student's success. Particularly, the designed solution includes: (i) a data analysis to validate and explore statistically the collected dataset; (ii) ML classification; and (iii) the integration of an explainable mechanism to generate explanations for classifications.
The experiments were performed with a balanced dataset with 87,752 valid responses. The data analysis identified relevant differences between small and big courses. Small courses have higher mean grades in the CU, being more homogeneous than big courses. Furthermore, it can be concluded that courses with a high score in the mean of the CU have a lower percentage of failures or dropouts.
The proposed method was evaluated using standard classification metrics, achieving 98% classification accuracy and F-measure. This result shows that it is possible to explain and classify student performance using pedagogical questionnaires. The proposed method allows taking pre-emptive actions to avoid early cancellations or dropouts. Specifically, it provides explanations which help to understand the reasons for the success or failure of each curricular unit. With this knowledge, the curricular unit can be adapted to avoid failures, early cancellations, or dropouts. The explanations provide information at both levels, the curricular unit and the students' efforts, being a starting point to improve the pedagogical method in the next school year. In addition, the proposed solution aims to support the three pillars of sustainability, i.e., economic, social, and environmental.
In future work, we plan to integrate more information about student behaviour and enhance the explanations. Moreover, we intend to analyse the performance of deep learning models and integrate more explainable mechanisms. In addition, we intend to create an environment-agnostic model for student performance prediction, i.e., dynamically adapting the ML model regardless of whether the setting is face-to-face, online, or blended.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Figure A1 plots the weights of each variable in each component, i.e., the set of answers, and the corresponding component loadings. In the figure, small courses are identified as S and big courses as B. We can observe the relationships between objects and variables. We can conclude that, in general, small courses present, in dimension 1, higher averages and a lower percentage of failures and dropouts than big courses.