Performing Learning Analytics via Generalised Mixed-Effects Trees

: Nowadays, the importance of educational data mining and learning analytics in higher education institutions is being recognised. The analysis of university careers and of student dropout prediction is one of the most studied topics in the area of learning analytics. From the perspective of estimating the likelihood of a student dropping out, we propose an innovative statistical method that is a generalisation of mixed-effects trees for a response variable in the exponential family: generalised mixed-effects trees (GMET). We performed a simulation study in order to validate the performance of our proposed method and to compare GMET to classical models. In the case study, we applied GMET to model undergraduate student dropout in different courses at Politecnico di Milano. The model was able to identify discriminating student characteristics and estimate the effect of each degree-based course on the probability of student dropout.


Introduction
The present work is part of the international SPEET project (Student Profile for Enhancing Engineering Tutoring), an ERASMUS + project aiming to provide a new perspective to university tutoring systems.It intends to extract useful information from academic data provided by its partners 1 and to identify different engineering student profiles across Europe [1].Our goal was to find out which indicators may discriminate between two different student profiles: dropout students, who permanently abandon their Bachelor of Science (BSc) programs, and graduate students, who attain the academic qualification.This was motivated by the fact that, across all SPEET partners, almost one student out of two leaves his/her engineering studies before obtaining a BSc degree.If it were possible to know promptly to which profile a student belongs, tutors could improve counselling actions.
Data provided by universities usually include indicators about socio-economic background, and both the current and previous performance data of the students.However, academic success depends on different factors, both internal and external [2].The dataset we used in our analysis includes information on more than 18,000 BSc students from Politecnico di Milano (PoliMi): it essentially consists of student records, so it does not include all possibly relevant factors.Datasets with similar structures have already been used in recent developments oriented toward performance prediction and detection of future dropouts or students at risk of dropping out [3].The hypothesis is that background and performance indicators together are enough to identify the students at risk and to draw the attention of tutors, who should complete each student's profile with further information.
In our case study, students were naturally nested within degree-based courses.Further levels of hierarchy are possible, such as programmes within faculties, faculties within universities and finally universities within countries.While investigating the learning process, it is necessary to disentangle the effects given by each level of hierarchy [4].Indeed, if the clustered aspect of the data is not inspected, it may result in a loss of likely valuable information.Multilevel models take into account the hierarchical nature of data and are able to quantify the portion of variability in the response variable that is attributable to each level of grouping [5].Generalised linear mixed models (GLMM) fit a multilevel model on a binary response variable, but they impose a linear effect of covariates on a transformation of the response variable [6].On the contrary, tree-based methods such as the classification and regression tree (CART) model learn the relationships between the response and the predictors by identifying dominant patterns in the training data [7].In addition, these methods allow a clear graphical representation of the results that is easy to communicate.The goal of our work was to create a novel method, for a non-Gaussian response variable, which is able to preserve the flexibility of the CART model and to extend it to a clustered data structure, where multiple observations can be viewed as being sampled within groups.
This was not the first time that tree-based methods have been adopted to deal with longitudinal and clustered data.In Sela and Simonoff [8], a regression tree method for longitudinal or clustered data was proposed.This method is called the random effects expectation-maximization (RE-EM) tree.Independently, in Hajjem et al. [9] a mixed-effect regression tree (MERT) model was proposed.If clustered observations are considered, these are extensions of a standard regression tree to the case of individuals nested within groups.These methods use observation-level covariates in the splitting process and can deal with the possible random effects associated with those covariates.However, they both deal only with Gaussian response variables, and they are not suitable for classification problems.Our proposed method intends to generalise the RE-EM tree approach, thereby extending its use to different classes of response variables that belong to the exponential family 2 : this should allow one to extend it, for example, to a classification setting.At the same time, this method can deal with the grouped data structure, similarly to traditional multilevel models.As in RE-EM tree estimation, we developed an algorithm that disentangles the estimations of fixed and random effects.That is, an initial tree is built ignoring the grouped data structure, a mixed-effects model is fitted based on the resultant tree structure and a final mixed-effects tree is reported.
Similar methods were proposed in Hajjem et al.
[11] and Speiser et al. [12], but following different approaches.In Hajjem et al. [10] the MERT approach was extended to non-Gaussian data, and a generalised mixed effects regression tree (GMERT) was proposed.This algorithm is basically the penalised quasi-likelihood (PQL) algorithm used to fit GLMMs, where the weighted linear mixed-effect pseudo-model is replaced by a weighted MERT pseudo-model.In particular, the authors used a first-order Taylor-series expansion to linearise the response variable.In Fokkema et al. [11], the authors proposed the generalised linear mixed-effects model tree (GLMM tree) algorithm, which alternates the estimates of a GLM tree and a mixed-effects model until convergence.Its main distinction from the GMET algorithm is that the GLMM tree algorithm builds on model-based recursive partitioning (MOB, Zeileis et al. [13]), instead of on CART, as GMET does.Lastly, the most recent work was presented in Speiser et al. [12].The authors developed a decision tree method for modelling clustered and longitudinal binary outcomes.Even if the aim of their model is very similar to ours, their model only handles binary outcomes using a Bayesian GLMM, and it allows a random intercept, but not random slopes.Differently from these cited methods, GMET starts by initialising the random-effects to zero; it estimates the target variable through a GLM (using suitable link functions depending on the response family distribution); builds a regression tree using the estimated target variable as the dependent variable; and then fits a mixed-effects model to estimate the random-effects part, using the fixed-effects part estimated by the tree as an offset.
In the last few decades, learning analytics, and specifically, the topic of dropouts at university, is receiving particular attention.The investigation of the dropout phenomenon within higher education institutions (HEIs) has always been a concern for educators, university managers and policy makers.The academic literature distinguishes between two approaches to investigating the features of this phenomenon: theory-driven and datadriven.The first analyses the reasons and the psychological constructs behind withdrawing decisions, thereby identifying theoretical fundamentals and contributing to a conceptual model to guide the inquiry.Different authors [14][15][16][17][18] proposed models to show the processes of interactions among students, their features and their institutions that lead to dropping out [18].Basically, their models rely on an interdisciplinary approach to explain the dropout process.In particular, the model considers the interactions between the student and the university environment-individuals are exposed to influences, expectations and demands from a variety of sources (such as courses, faculty members, administrators and peers).The interactions between these two aspects contribute to a student's success or failure in both the academic system and the social system [17].Hence, these studies focus on the necessity to contextualise the student's educational career in a community structure.
The alternative approach is data-driven.In it, students' characteristics are analysed longitudinally to find the best statistical models predicting dropout or graduation [2,[19][20][21].In this case, researchers are less interested in explaining the phenomenon per se; the focus is on finding the best performing model in terms of forecasting student withdrawal.The prediction of low performers is increasingly getting the attention of academics, which is attributable to the applicability of remedial learning, which in turn serves the institutional goals of providing high-quality education ecosystems [22].In addition, the data mining approach to education is quickly becoming an important field of research due to its ability to extract new knowledge from a large amount of student data [23].
The goal behind the present study was the development of a clear theoretical framework, in the midway point between the two approaches, which considers the educational process and the need for predicting students' outcomes as early as possible.We applied the GMET model to the Politecnico di Milano data, collected within the ERASMUS + SPEET project, thereby identifying which fixed-effect covariates discriminate between dropout and graduate students.Through the GMET model, we relaxed the assumption of linear effects of student-level covariates on their performances, and we identified which interactions relevantly influence dropout status.We included the most common student characteristics in a flexible and interpretable model that takes into account the enrolment in different degree programs.A multilevel model allows one to estimate the degree programme's effect on the predicted probability of obtaining the degree.Machine learning and tree-based methods have been applied in the literature to model student dropout [24][25][26][27][28][29], but to the best of our knowledge, we are presenting the first time that a multilevel tree-based method has been applied to predict student dropout probability.
The paper is organised as follows.In Section 2 we describe the model and methodsthe generalised mixed tree algorithm (GMET).In Section 3 we show a simulation study.In Section 4 we describe the PoliMi dataset, we report the application of the proposed algorithm to the case study and we outline the results.Finally, in Section 5 we draw our conclusions.
All the analysis was performed using R software [30].The R code for the GMET algorithm and for all the simulations is available in Supplementary Materials Data S1.

Model and Methods
In this section, we present the proposed generalised mixed-effects tree model (Section 2.1) and the algorithm for the estimation of its parameters (Section 2.2).

The Generalised Mixed-Effects Tree Model
We start by considering a generic GLMM.This model is an extension of a generalised linear model that includes both fixed and random effects in the linear predictor [6].Therefore, GLMMs handle a wide range of response distributions and a wide range of scenarios where observations are clustered into groups rather than being completely independent.For a GLMM with a two-level hierarchy, each observation j, for j = 1, . . ., n i , is nested within a group i, for i = 1, . . ., I. Let y i = (y 1i , . . ., y n i i ) be the n i -dimensional response vector for observations in the i-th group.Conditionally on random effects denoted by b i , a GLMM assumes that the elements of y i are independent, with density function from the exponential family, of the form where a(•) and c(•) are specified functions, η ij is the natural parameter and φ is the dispersion parameter.In addition, we have A monotonic, differentiable link function g(•) specifies the function of the mean that the model equates with the systematic component.Usually, the canonical link function is used, i.e., g = a −1 .From now on, without loss of generality, the canonical link function is used.In this case, the model is the following [31]: ( where i is the group index, I is the total number of groups, n i is the number of observations within the i-th group and ∑ I i=1 n i = J. η i is the n i -dimensional linear predictor vector.In addition, X i is the n i × (p + 1) matrix of fixed-effects regressors of observations in group i, β is the (p + 1)-dimensional vector of their coefficients, Z i is the n i × q matrix of regressors for the random effects, b i is the (q + 1)-dimensional vector of their coefficients and Ψ is the q × q within-group covariance matrix of the random effects.Fixed effects are identified by parameters associated with the entire population, whereas random ones are identified by group-specific parameters.
Our proposed generalised mixed-effects tree (GMET) method expands the use of tree-based mixed models to different classes of response variables from the exponential family.At the same time, the method can deal with the grouped data structure as GLMMs do.We now specify the GMET model.The random component of this model consists of a response variable Y from a distribution in the exponential family.The fixed part in the GMET is not linear as in (1), but is replaced by the function f (X i ) that is estimated through a tree-based algorithm.Thus, the matrix formulation of the model is the following: ( where i is the group index, I is the total number of groups, n i is the number of observations within the i-th group and ∑ I i=1 n i = J.In addition, η i is the n i -dimensional linear predictor vector and g(•) is the link function.Finally, X i is the n i × (p + 1) matrix of fixed-effects regressors of observations in group i, Z i is the n i × q matrix of regressors for the random effects, b i is the (q + 1)-dimensional vector of their coefficients and Ψ is the q × q withingroup covariance matrix of the random effects.As in a GLMM, b i and b i are independent for i = i .Fixed effects are identified by a non-parametric CART tree model associated with the entire population, whereas random ones are identified by group-specific parameters.
Without loss of generality, let us now specify model (2) for the case of a binary random variable and univariate random effect.The logit function is the canonical link function: Here, the random-effects structure simplifies to a random intercept.The model formulation for observation y ij may therefore be written as: ( where we observe x ij = (x 1ij , . .., x ijp ) T , a (p + 1)-dimensional vector of fixed-effects covariates for each observation j in group i.

Generalised Mixed-Effects Tree Estimation
In this subsection we show the algorithm for the estimation of the parameters of the GMET model (2).Following the approach of the RE-EM tree, the basic idea behind the algorithm is to disentangle the estimation of fixed and random effects, with the difference that the GMET algorithm is not iterative.The structure of the algorithm is the following: 1.
Initialise the estimated random effects b i to zero.

3.
Build a regression tree approximating f using μij as dependent variable and x ij = (x ij1 , . .., x ijp ) T as vector of covariates.This regression tree identifies a number L of terminal nodes R , for = 1, . . ., L, and each observation ij, described by its set of covariates x ij , belongs to one of the terminal nodes.Through this regression tree, we define a set of indicator variables I(x ij ∈ R ), for = 1, . . ., L, where I(x ij ∈ R ) takes value 1 if observation ij belongs to the -th terminal node and 0 otherwise.

4.
Fit the mixed effects model (2), using y ij as a response variable and the set of indicator variables I(x ij ∈ R ) as fixed-effects covariates (dummy variables).Specifically, for i = 1, . .., I and j = 1, . .., n i , we have g(µ ij ) = I(x ij ∈ R )γ + z T ij b i .Extract bi from the estimated model.

5.
Replace the predicted response at each terminal node R of the tree with the estimated predicted response g( γ ) from the mixed-effects model fitted in step 4.
The GLM in step 2 is fitted through maximum likelihood.The maximum likelihood estimates can be found using an iteratively reweighted least squares algorithm or a Newton-Raphson method [32].
The fitting of the tree in step 3 can be achieved using any tree algorithm, based on any tree-growing rules that are desired.Here, tree building is based on the CART tree algorithm [7].After building a large tree T 0 , pruning is advised to avoid overfitting on training data.In principle, any tree-pruning rule could be used; here, we propose costcomplexity pruning [33].It considers a sequence of nested trees indexed by a nonnegative tuning parameter α which controls the trade-off between the subtree's complexity and its fit to the training data.For each value of α exists a subtree T ⊂ T 0 to minimise Here, |T| indicates the number of terminal nodes of tree T. When α = 0, then the subtree T will simply be equal to T 0 .However, as α increases, the quantity (4) will tend to be minimised for a smaller subtree.We can select a value of α using a validation set or using k-fold cross-validation: for example, we can pick α to minimise the average CV error.Tree building and pruning is implemented in R library rpart [34], according to the CART tree-building algorithm and cost-complexity pruning.In order to ensure that initial trees are sufficiently large, we set the complexity parameter to zero.Thus, the largest tree is grown then pruned based on ten-fold cross-validation error.Instead of choosing the tree that achieves the lowest CV error, we use the so-called 1-SE rule: any CV error within one standard error of the achieved minimum is marked as being equivalent to the minimum.Among all these equivalent models in terms of CV error, the simplest one is chosen as the final tree model.
The generalised linear mixed model in step 4 can be estimated using fitting techniques that were previously described.Different statistical packages can estimate those types of models: the glmer function of the R library lme4 [35] is used here.It fits a generalised linear mixed model via maximum likelihood.For a GLMM the integral must be approximated: the most reliable approximation is the adaptive Gauss-Hermite quadrature, at present implemented only for models with a single scalar random effect; otherwise, Gaussian quadrature is used [36,37].
For what concerns the time efficiency, the GMET algorithm is very fast.Indeed, being a non-iterative algorithm, its running time is approximately equal to the sum of three steps' running times, i.e., the ones to fit a GLM (step 2), a regression tree (step 3) and a GLMM (step 4).

Predictions for New Observations
After estimating a GMET, it is possible to make out-of-sample predictions for new observations.Suppose the tree is estimated on data from groups i = 1, . .., I for observations y ij , j = 1, . .., n i .Given a new observation x ij , we are able to output its corresponding response, since we know the estimation of the fixed-effects function f (•), of the random effects b i and of the associated covariance matrix Ψ.The algorithm is able to provide two types of prediction, depending on whether the group i to which the new observation x ij belongs is a new group (i.e., not observed in the data used to train the model) or not: • Predict response y ij given a new observation x ij for a group in the sample i ∈ {1, . . ., I}.We define it a group-level prediction.

•
Predict response y i j given an observation x i j for a group i for which there were no observations in our sample, or for which we do not know the relevant group.We define it a population-level prediction.
Following the classical approaches for prediction in mixed-effects models [8,38], for the first type of prediction, we estimate f (x ij ) using the estimated tree and attributes x ij and then add z T ij b i on the linear predictor scale, and get back to the response scale through the inverse link function g −1 (•).As we underlined before, random-effects coefficients b i are known from the estimation process.For the second type of prediction, since we have no information with which to evaluate b i , we set it to its expected value of 0, yielding the value f (x i j ), and transform it back to the response scale through the inverse link function.As noted in Sela and Simonoff [8], in this case we might expect that methods that do not incorporate random effects would have comparable performances to those that do, as long as the sample is large enough so that the fixed-effects function f (x ij ) is well-estimated by both types of methods.

Simulation Study
In this section we compare the performance of the proposed GMET method to the performances of standard classification trees and different types of mixed-effects models on simulated binary outcome datasets.
We used a variation of a simulation design proposed in Hajjem et al. [10] and followed the data generating process presented in their paper.We simulated a two-level data structure of I = 50 groups with n i = 60 observations each: 10 observations in each group were included in the training sample, and the other 50 observations constituted the test sample.Therefore, N train = 500 and N test = 2500.By setting i = 1, . . ., I and j = 1, . . ., n i , the response values y ij were simulated according to a Bernoulli distribution with conditional probability of success µ ij .Both fixed and random effects were used to generate µ ij .Overall, we considered 10 different data generating processes (DGPs) outlined in Table 1 by combining different fixed and random-effect specifications 3 .
Let us define the fixed-effect structure.Eight random variables X 1 , . .., X 8 , independent and uniformly distributed in the interval [0, 10], were generated.While all of them were being used as predictors, only five of them were actually used to generate µ ij , based on the tree rule summarised in Figure 1.Each observation was classified into one of the six terminal nodes according to the values x ij1 , . .., x ij5 .Within each leaf, values ϕ 1 , . .., ϕ 6 denote the probabilities of success when the random effects b i are equal to zero: is the logit link function.Two different possibilities were specified for the fixed effects: in the large fixed-effects specification, the standard deviation of the typical probabilities across the leaves was higher than in the small one (0.37 versus 0.24).Random intercept and slope, which add a linear random effect for the fixed-effect covariate X 1 , uncorrelated from the random effect on the intercept.That is, Within each fixed-effects scenario with random effects, we considered two specifications (low and high) for the covariance matrix Ψ to account for different levels of magnitude of the between-group variability.

Simulation Results
We ran eight different models for each one of the 10 DGPs: The BiMM algorithm proposed in [12] considering (x 1 , . . ., x 8 ) as fixed covariates and a random intercept 4 .
As noted in Hajjem et al. [9], the MElog model could not be a real competitor of any other model.Indeed, it is not possible in practice to specify this parametric structure without knowing the underlying data generating process.This model only serves as a reference for the performances of the other models.In tree-based models, we fixed to 10 the maximum depth parameter and to 20 the minimum number of observations necessary to attempt a split 5 .After fitting each model on the training set, we could compute the corresponding predicted probability μij and the predicted class ŷij of observation j in group i in the test dataset.While the former was directly estimated by the algorithm, the latter depended on the threshold value µ * k used to classify subjects in the test set: μij ≥ µ * k ⇒ ŷij = 1 where (i, j) ∈ test.There were at most K distinct fitted values µ k , with K ≤ I|T|.We used each of them to classify observations in the training set and we fixed the threshold µ * k as the one that yields the closest proportion of class 1 to the actual proportion of class 1 in the training set.
We measured the predictive performance by: • The predictive mean absolute deviation The mean, median, standard deviation, minimum and maximum of the PMAD and the PMCR over 100 runs were calculated and are reported in Table 2.
We observed that when there was no random effect (DGPs 1 and 2), the standard classification tree algorithm performed better than the mixed-effects models, especially when the fixed effect was large.Nonetheless, in the latter scenario, the performances of GLMERT and GMERT were very close to Std ones, proving to be robust even in absence of a true random effect.However, when random effects were present (DGPs 3 to 10), mixed-effects classification trees performed better than the standard classification tree in terms of average PMAD and PMCR.BiMM is the only mixed-effects tree algorithm whose performance was very close to Std ones, for all DGPs 6 .When the DGP included only a random intercept, GLMERT had the best predictive performance, and was directly followed by RI.When the true random effect structure included both random intercept and random slope, GMERT, GLMERT and RIS performances were very close.There was a slightly better performance by GLMERT when the fixed effect was large and of RIS when the fixed effect was small.The highest improvement in PMAD using a mixed tree model was observed when both the fixed and the random effects were large.The lowest improvement was observed when both the fixed and the random effects were small.Analogous statements can be made about PMCR.In addition, GMET performed better than standard trees even when we fit a mixed tree whose random component was over-specified (as in DGPs 3-6, Std vs RIS) or under-specified (as in DGPs 7-10, Std vs RI) in relation to the true data generating process.Next, we compare the performance of the GMET approach to the results of the MElog reference model.If the DGP did not include random effect, the difference between PMAD and PMCR was higher when the fixed effect was large (DGP 1).When the random effect was large and the fixed effect was small (DGPs 6 and 10), the GMET model performed similiarly to the MElog model.In terms of PMAD, the differences were 4.61% and 4.48% for DGPs 6 and 10, respectively; in terms of PMCR, they were 3.02% and 3.36%, respectively.The difference in predictive accuracy between the two models reached its maximum when the random effect was small and the fixed effect was large (DGPs 3 and 7).In terms of PMAD, the differences were 9.59% and 9.69% in DGPs 3 and 7, respectively; in terms of PMCR, they were 7.33% and 8.25%, respectively.
With respect to the other existing tree-based mixed-effects models, the fact that the GMET algorithm is not iterative makes it less performant when fixed and when its random effects are small and easier to be confused; and its performs better when they are large and easier to be disentangled.Moreover, its step through a glm makes it perform worse when the DGP includes only a large (in this case, nonlinear) fixed effect, but makes it competitive with the other existing methods when data have an important random-effects structure.In order to investigate the performance and to deepen the comparison across methods under different settings, we report, in Appendix B, additional simulations and results: we provide more details about the model's predictive quality in this simulation, e.g., the recovery of the right tree structure or the identification of the right number of leaves.We ran new simulations for different DGPs (linear and non-linear fixed-effects) and for a different response variable in the exponential family, i.e., Poisson.Results show that GMET on average outperformed all other tree-based methods when data had a linear structure, for both a binary and a Poisson response variable.

Case Study: Application of the Mixed-Effects Tree Algorithm to Education PoliMi Data
In this section, we describe the PoliMi dataset.We applied the generalised mixedeffects tree algorithm to these data.Using a GMET model, we could identify discriminating fixed-effects covariates and estimate the degree programme's effect on the predicted success probability.In addition, we also analysed the accuracy of this model for predicting dropout.
The PoliMi dataset consists of 18,612 records in Bachelor of Science (BSc) students that began between A.Y. 2010/2011 and 2013/2014.Students are nested within I = 19 degree programmes.Table 3 reports the list of the 19 degree programmes and the number of students enrolled in each degree program.A descriptive analysis showed that a high percentage of students leave the Politecnico before obtaining a degree.In particular, the sample shows a 37.11% dropout rate.Therefore, our goal was to find out which student-level indicators could discriminate between two different profiles: dropout and graduate students.We assumed a binary GMET model (3) where student j was nested within degree programme i.The response variable Y was the status, a two-level factor we coded as a binary variable: • status = 1 for studies definitely completed with graduation; • status = 0 for studies definitely concluded with dropping out.We would like to make predictions at the very early stages of students' academic careers.Thus, we chose as predictors five variables available at the time of enrolment and three more variables collected just after the first semester of study.The list and explanation of student-level variables to be included as covariates is reported in Table 4.In addition, we chose as the grouping variable the degree programme at the time of enrolment (factor DegreeProgramme) which has 19 levels.The influence of the grouping factor on the predictor was modelled through a group-level intercept b i .We randomly split the dataset into training and test subsets, with a ratio of 80% for training and 20% for evaluation.Thus, the training subset included 14,890 students and the test subset had 3722.While growing the tree, we fixed to 10 the maximum depth parameter and to 20 the minimum number of observations necessary to attempt a split.Figure 2 shows the estimated mixed-effects tree for the probability of graduation.Every internal node had a corresponding condition that split it into two children: if the condition was true, observations were sent down the tree through the left child; if the condition was false, through the right child.In addition, all nodes reported two values: the estimated probability of graduation and the percentage of observations in the node over the total training set.We remind the reader that variable PreviousStudies has been coded as a three-level factor with levels S (Liceo Scientifico), T (Istituto Tecnico) and O (other high school studies).The number of ECTS obtained in the first semester of the first year was used as the first split: students who obtained less than 13 ECTS were associated with lower success probability (0.16 versus 0.86).Then, students were further classified using other explanatory variables: we can see that Italian students who obtained more than 24 ECTS had the highest predicted success probability (0.95).Other variables actually used to split smaller internal nodes were Nationality and PreviousStudies: in these nodes, students who attended Istituto Tecnico and foreign students had lower predicted success than the others.Through this model, it was possible to find out significant interactions among the covariates: for example, variable Nationality was used to split the group of students that obtained at least 13 ECTS, but this same variable did not appear in the complementary branch of the tree.Finally, covariates Sex, AdmissionScore and AvgAttempts1.1 were not compared in the trees, so they do not appear to have strong influences on how one's studies end.Using the tree structure in Figure 2, we could get population-level predictions for new observations that did not include the effect of the programme.However, if we also specified the level of the random effect covariate, our model was able to adjust this prediction to account for the effect and make a group-specific prediction.Indeed, we extracted coefficients bi from the full estimated mixed model (3) and provide different predictions for different programmes within each leaf of the tree structure.Figure 3 shows the ranking of the 19 estimated random-effects intercepts, one for each degree program.Light blue points correspond to the point estimates bi , for i = 1, . . ., 19, and the horizontal black lines represent the 95% confidence intervals of the estimates.When the 95% confidence interval does not overlap with 0 (identified by the dashed vertical line), we have evidence to assert that the degree program's effect was significantly different from zero, i.e., from the average.For many groups, the 95% confidence interval does not overlap with the vertical line at zero, underlining substantial differences between the groups.If we use this model to estimate the probability of graduation, many degree programs will give results significantly different from the average.In particular, degree programs whose confidence intervals are entirely higher (lower) than zero are associated with higher (lower) dropout likelihood with respect to the average, all else being equal.After fixing all other covariates, Environmental and Land Planning Engineering and Civil and Environmental Engineering had large positive effects on the intercept: one student from one of those programmes improves the log odds by 1.051 or 0.705, respectively.On the contrary, studying either Civil Engineering or Electrical Engineering penalises the log odds by 0.680 and 0.546 respectively.Since we were using a multilevel model, we were able to account for the interdependence of observations by partitioning the total variance into different components due to the clustered data structure in model (3).The variance partition coefficient (VPC) is a possible measure of intraclass correlation: it is equal to the percentage of variation that is found at the higher level of hierarchy over the total variance [39].The idea of VPC was extended using the latent variable approach, to define a method to partition the total variance in the case of a binary response and the group-specific intercept for the randomeffects structure [40].In this case, the variance partition coefficient was constant across all individuals, and it can be estimated as:

DegreeName.out
where ψ is the estimated variance of the random intercept and σ 2 lat is the residual variability that can be explained by neither fixed effects, nor the group features that are represented by the random intercept.In this case, it is equal to the variance of the standard logistic distribution.This VPC value means that 6.12% of variation in the response is attributable to the classification by degree type.This value underlines the need to use a mixed model.
We can now evaluate the performance of the model and its predictive quality using the area under the ROC curve (AUC) and other performance indexes: accuracy, sensitivity and specificity.For each test observation, we were given a full set of covariates; therefore, we were able to compute an estimate p of the probability of successfully concluding a BSc and getting a degree.We used this estimate to define a binary classifier based on model (3): we chose p 0 = 0.6 as the optimal cutoff value through ROC curve analysis, as shown in Figure 4.For 20 iterations, we randomly split the observations into training and test sets.We fit a GMET model with the training set, and we classified test observations using the optimal threshold value.Finally, we computed the average accuracy, sensitivity and specificity values and their standard deviations, which are reported in Table 5. High values of accuracy, sensitivity and specificity indicate a good model.The model's performance was robust, as highlighted by the low standard deviations of the mean performance indexes and the high AUC, equal to 0.9127 (Figure 4).In addition, Table 6 reports the means and standard deviations of accuracy, sensitivity and specificity, computed separately for each degree program.It is interesting to compare these average performance indexes against those obtained using different methods.Our approach had similar accuracy to a standard classification tree (0.878 versus 0.879), but its accuracy showed less variability across the iterations.For example, its standard deviation of accuracy was 0.5%; compare that to 2.8% for the classification tree.Since we were interested in the detection of dropout careers, we compared mean sensitivity using different models.Using mixed-effects trees, we attained higher sensitivity than using standard classification trees (0.835 versus 0.800).Thus, the choice of a mixed-effects model seemed appropriate: the degree programme is a meaningful covariate for the prediction of status.The mixed-effects tree was slightly less sensitive than a classifier built through a GLMM (0.835 versus 0.850), suggesting that a tree-like structure for fixed effects might not be as suitable as the GLMM one.However, it has other advantages, such as offering an easily interpretable model that could be graphically displayed and understood.Overall, the good performance of GMET in this application was due to two reasons.The first is that the variability at the highest level of grouping, i.e., degree programs, was not negligible, and therefore, taking it into account improved the predictive performance of the models.The second is that the good performance of the GLMM suggests that the association between the most important covariate, i.e., TotaleCredits1.1 (the most relevant variable in the tree of Figure 2), and the response can be well approximated by a linear function.Therefore, μij , estimated at step 2 of the GMET algorithm and used as the input for the tree built at step 3, was very precise and representative of the real dynamics, helping the GMET algorithm to fit the data well.Appendix A reports the results of the application of GMERT and BiMM algorithms to the PoliMi case study and a comparison with GMET results presented in this section.

Conclusions
We proposed a multilevel tree-based model for a non-Gaussian response (GMET algorithm), showed a simulation study and applied the GMET algorithm to the PoliMi careers dataset as a tool to find student-level variables to discriminate between two different student profiles (graduate and dropout) and to estimate the degree programme's effect on the predicted success probability.
The GMET model can deal with a grouped data structure, while providing easily interpretable models that can outline complex interactions among the input variables.In the simulation study, the performance of the proposed mixed-effects tree method was a marked improvement over the CART model when the data generating process (DGP) included random effects, even if they were of small magnitude.In addition, the performance of the GMET model was similar to that of the benchmark logistic model that was fitted assuming the whole specification of the DGP.GMET's performance was comparable to that of other existing tree-based mixed-effects models, outperforming them when data had a linear structure, and it had a clear advantage in convergence time.Although our study focused on the binary response case, the mixed-effects tree approach could be extended to other types of response variables.Using a suitable link function, we could study if the method is appropriate to model different outcomes, such as count data or a multinomial factor response.Overall, the main advantages of the GMET algorithm are its flexibility and interpretability [41].By relaxing the linear assumption of the fixed-effects part, the method could model more complex functional forms, easily treating potential interactions among covariates.This complexity is then summarised in a tree structure, which is easy to interpret and communicate.At the same time, when data present a hierarchy, the method is able to take into account the dependence structure within observations and to model it.In the educational data mining context, this aspect is essential in order to better understand students and the settings in which they learn.On the other hand, GMET, as CARTs, suffers from high variance.This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results could be quite different.Ensemble methods which use a mixed-effects tree as a base learner together with a random forest approach may be developed.
In our case study, the effectiveness of the GMET model in dropout prediction was comparable to the effectiveness of more established classification methods.A GMET model with high accuracy and sensitivity was obtained by considering information available at the time of the admission and the results of the first semester of studies.In addition, our work identifies a significant effect of the engineering programme on dropout probability.The estimated student success probability might be used as a tool to conduct policy experiments at the institutional level, aimed at identifying the best practices to help and retain at-risk students.In this setting, PoliMi started an experimental early intervening program that invites at-risk students (identified by the GMET algorithm) to attend dedicated tutorship to support them during the beginning of their studies at PoliMi.
In the context of the SPEET project, a future development could be the extension of our analysis to the other project partners in order to compare the programme effect at the country level.This would allow us to relate this effect to programme-level variables, and we could establish whether the same profiles of students at risk of dropout arise at country level.Moreover, in accordance with the validity and the potential of the GMET method when applied to modelling student dropout prediction, our future perspective goes in the direction of major applications in the learning analytics area.This method, when applied to educational data, can be a useful tool to support the definition of best practices and new tutoring programmes aimed at enhancing student performances and reducing student dropout.A worthwhile consideration is also the approach that teachers and students have with respect to its results.Indeed, this method is also valuable from the perspective of recommendation systems, since, if its results are interpreted and communicated in the right way, they can be used to drive students in their career choices.
We ran GMERT and BiMM algorithms considering the same set of fixed-effects covariates shown in Table 4 and a random intercept given by the grouping of students within degree programmes.Equivalently to GMET inputs, we fixed to 10 the maximum depth parameter and to 20 the minimum number of observations necessary to attempt a split in the GMERF algorithm.Since the BiMM algorithm does not receive in input rpart control parameters, we ran the algorithm with the default parameters.Figures A1 and A2 report the fixed-effects trees and the random intercepts estimated by GMERT and BiMM.Regarding the fixed-effects, the trees identified by GMET and GMERT are very similar: the variables that were determined important are coherent across the two methods (i.e., TotalCredits1.1 as the most important one, followed by WeiAvgEval1.1,PrevStudies and AccessAge).BiMM tree performs a unique split, identifying TotalCredits1.1 as the most important covariate.Regarding the random-effects, comparing Figures 3 and A2, we can observe that the random intercepts estimated by the three methods are quite consistent.In particular, the correlation coefficient between random intercepts estimated by GMET and GMERT is equal to 0.95, whereas the one between random intercepts estimated by GMET and BiMM is equal to 0.73.The variance of random intercepts ψ estimated by GMERT is smaller that estimated by GMET.Indeed, the VPC estimated by model GMERT is 0.0479 (against VPC GMET = 0.0612).The variance ψ estimated by the BiMM algorithm is higher, and VPC BiMM = 0.0634.Regarding the predictive performances of GMERT and BiMM, Figure A3 reports the ROC curves obtained on the test set and Table A1 reports performance indexes of the classifiers based on the two methods, computed following the same procedure of GMET.The predictive performances of GMET and GMERT were very similar, with the small differences that the AUC of GMERT was slightly higher than that of GMET, but the accuracy, sensitivity and specificity indexes of GMERT had higher values than those of GMET.It is worth noting that the time of convergence for GMERT was significantly higher than that for GMET.BiMM seemed to perform slightly worse than the other two methods in terms of predictive power.

Appendix B. Additional Simulations and Results
In this section, we provide more details about the simulations presented in Section 3 and also the results from other simulations with different DGPs.

Appendix B.1. Recovery of the Right Tree Structure
The predictive performances of the GMET algorithm and other tested methods are given in Table 2, Section 3.Here we present the results about the ability of the methods to recover the right tree structure.Following the approach presented in [10], three different ways of looking at this aspect are presented.We evaluate if the tree has: (1) the right number of leaves (i.e., six), (2) the right structure and right splitting covariates and (3) the right structure, right splitting covariates and right cutpoints.The third criterion was achieved if the estimated cutpoints came within 1 unit of the true ones (which were all 5), that is, if 4 < cutpoint < 6 for all cutpoints.These criteria are in increasing order of difficulty.If the estimated tree achieved (3), then it achieved ( 2) and (1), and so on.Table A2 presents the results.In terms of number of leaves, Std, GLMERT and GMERT did fairly well with median numbers of leaves of six or sometimes five in the first four DGPs.GLMERT and GMERT tended to underestimate the number of leaves in the last six DGPs.BiMM tended to underestimate the number of leaves in all DGPs.Contrarily to the others, GMET tended to overestimate the number of leaves of the tree in all the DGPs.This result might depend on the second step of the GMET algorithm, in which the response variable is linearised through a GLM.As a consequence, the tree is built not using the original binary response y ∼ Be(p) as the target variable, but the p estimated by the GLM.This can lead to a different tree structure.
Table A2.Results of the 100 simulation runs, presented in Table 2, Section 3, in terms of recovering the right tree structure.# right splits reports the number of times out of 100 in which we obtained a tree of six leaves with the right splitting covariates; # right cutpoints reports the number of times out of 100 in which we obtained a tree of six leaves with the right splitting covariates and cutpoints (i.e., 4 < cutpoint < 5).The DGP presented in the simulation of Section 3 is tree-shaped.To complete this simulation, we investigated two other DGPs, the first based on data with linear fixed effects and the second based on data with non-linear fixed effects.Again, 10 different scenarios were considered, involving small/large fixed effects and models with/without random effects.The same cluster configuration and random components shown in Section 3 were used, and the results are based on 100 runs.

DGP
Again, we followed the DGP presented in [10].The first DGP had linear fixed effects f (x ij ).The large-effects scenario used f (x ij ) = 1.20x 1ij − 0.3x 2ij − 0.2x 3ij , and the smalleffects scenario had f (x ij ) = −0.6x1ij − 0.15x 2ij − 0.10x 3ij .The results are presented in Table A3.As expected, the GLMM had the best predictive performance, since it used the true fixed and random effect structures.Nevertheless, RI's and RIS's performances were very similar to that of GLMM, and they outperformed all other methods, for all random effects scenarios.This is the best result regarding the GMET method, which when data had a linear structure, thanks to its step involving a GLM model, had outstanding performances.
The second DGP had non-linear fixed effects f (x ij ).The large-effects scenario used The results are presented in Table A4.Again, GLMM had the best predictive performance, followed by GLMERT.GMET and GMERT had similar performances that increased when random effects were high; they got very close to GLMERT;s performance.In all the simulations presented in previous sections, we always considered the case of a binary response variable and balanced clusters.Here, to extend the simulation to a broader scenario, we consider DGPs for data with a different response variable in the exponential family, i.e., a Poisson response variable, and unbalanced clusters.We investigated 10 different scenarios involving small/large linear fixed-effects and 10 different scenarios involving small/large tree-shaped fixed-effects, both cases involving models with/without random effects.Random and fixed components for the 10 DGPs with tree-shaped fixed effects are shown in Table A5.Random components for the DGPs with linear fixed effects were those in Table A5, and linear fixed effects were f (x ij ) = 0.6x 1ij + 0.3x 2ij + 0.2x 3ij for the large fixed-effects scenario and f (x ij ) = 0.3x 1ij + 0.15x 2ij + 0.1x 3ij for the small fixed-effects scenario.The variables X 1 , . . ., X 8 were generated as uniformly distributed on the interval [0, 5].Regarding the cluster configuration, we simulated a new scenario in which the 50 clusters were unbalanced, while considering 10 different cluster sizes.In particular, the number of observations within clusters took values in {62, 64, 66, . . ., 80}, considering five clusters for each pair between 62 and 80. Within each cluster, about 30% of observations were used as the training set and 70% as the test set.For the Poisson response variable simulations, we compared GMET with the standard tree (Std), GLMM and GLMERT.We omitted BiMM because it can handle only a binary response variable and GMERT because the code did not work for a Poisson family distribution.Results in terms of PMAD are reported in Tables A6 and A7 for the tree-shaped and linear fixed effects, respectively.In particular, the proposed method can deal with response variables that belong to the following families: binomial, Gaussian, gamma, inverse-Gaussian, Poisson, quasi, quasi-binomial, quasi-Poisson (i.e., the distributions handled by GLMM).
3 Fixed-effects covariates, random effect coefficents and binary response variables were generated using the runif(), rnorm() and rbinom() functions implemented R software, respectively.Parameters of these functions are reported in Figure 1 and Table 1.The random intercept was the only random effect structure that BiMM algorithm handled.
We chose 20 as the minimum number of observations to attempt a split because it is the default number within the rpart R package; 10 as maximum depth was chosen in order not to grow "overly large" trees, but interpretable ones.The final depth of each tree was chosen by cross-validation (the complexity parameter of the tree was automatically chosen by cross-validation within the algorithm), and it was always smaller than 10.
This might have also been due to the fact that BiMM was disadvantaged, since it does not handle a random slope but only a random intercept.

Figure 1 .
Figure 1.Mixed-effects tree structure used to generate the conditional probability of success µ ij in the simulation study.The random component b i ∼ N(0, Ψ) was generated according to three different possibilities: • No random effects: Ψ = 0; • Random intercept: z ij = 1 ∀i, ∀j and Ψ = ψ 11 ; •Random intercept and slope, which add a linear random effect for the fixed-effect covariate X 1 , uncorrelated from the random effect on the intercept.That is, z ij =

• A standard binary
classification tree model (Std); • A random intercept GMET model (RI); • A random intercept and slope GMET model (RIS); • A parametric mixed-effects logistic regression model (MElog) that used the true model leaves' indicators as fixed covariates and the true random effect structure; • A parametric mixed-effects logistic regression model (GLMM) that used (x 1 , . . ., x 8 ) as fixed covariates and the true random effect structure; • The GLMERT algorithm proposed in [11] considering (x 1 , . . ., x 8 ) as fixed covariates and the true random effect structure; • The GMERT algorithm proposed in [10] considering (x 1 , . . ., x 8 ) as fixed covariates and the true random effect structure; •

Figure 2 .
Figure 2. The estimated mixed-effects tree of model (3) for the probability of graduation.Each node reports the percentage of observations belonging to the node (second line of the node) and the estimated probability that responses relative to these observations are equal to 1 (first line of the node).Regarding the splitting criteria, left branches correspond to the case in which the condition is satisfied, and right branches correspond to the complementary case.

Figure 3 .
Figure 3.Estimated random intercept for each degree programme in model (3).For each engineering programme, the blue dot and the horizontal line mark the estimate and the 95% confidence interval of the corresponding random intercept.

Figure 4 .
Figure 4. ROC curve computed on the PoliMi test set.Standing on this evidence, we chose 0.6 as the optimal value of p 0 to be used in the prediction as the threshold value for classification.

Figure A1 .
Figure A1.Fixed-effects trees estimated by the GMERT algorithm (left panel) and the BiMM algorithm (right panel) for the probability of graduation.GMERT tree leaves do not report probability of class 1 as GMET and BiMM leaves do, but they report the estimated linearised response variable (obtained using a first-order Taylor-series expansion).BiMM notation is 2 for graduate and 1 for dropout.

Figure A2 .
Figure A2.Random intercept for each degree programme, estimated by GMERT (left panel) and BiMM (right panel).For each engineering programme, the blue dot and the horizontal line mark the estimate and the 95% confidence interval of the corresponding random intercept.

Figure A3 .
Figure A3.ROC curve computed on the PoliMi test set for the GMERT model (left panel) and BiMM model (right panel), respectively.Standing on this evidence, we choose 0.6 as the optimal value of p 0 to be used in the prediction as the threshold value for classification.

Table 1 .
Data generating processes (DGP) for the simulation study.

Table 2 .
Results of the 100 simulation runs in terms of predictive probability mean absolute deviation (PMAD) and predictive misclassification rate (PMCR) for the eight models for the 10 DGPs.DGPs for which the performance gap between MElog and GMET was the largest or the smallest are marked in bold.

Table 4 .
A list and explanations of variables at the student level which were included as covariates in the GMET model.

Table 5 .
Performance indexes of the classifier based on the mixed-effects tree of model (3).

Table 6 .
Performance indexes of the classifier based on the mixed-effects tree of model (3), computed for each degree program.

Table A1 .
Performance indexes of a classifier based on the mixed-effects tree estimated by GMERT and BiMM algorithms, computed on 20 iterations, randomly splitting the observations into training and test sets.

Table A2 .
Cont.Appendix B.2. Simulations Based on Data with Linear and Non-Linear Fixed Effects

Table A3 .
Results of the 100 simulation runs in terms of predictive probability mean absolute deviation (PMAD) and predictive misclassification rate (PMCR) of the seven models for the 10 DGPs based on data with linear fixed effect.GMET outperformed all other tree-based methods.

Table A4 .
Results of the 100 simulation runs in terms of predictive probability mean absolute deviation (PMAD) and predictive misclassification rate (PMCR) of the seven models for the 10 DGPs based on data with non-linear fixed effect.

Table A7 .
Results of the 100 simulation runs in terms of predictive probability mean absolute deviation (PMAD) of the five models for the 10 DGPs with a Poisson response variable and linear fixed effects.