A Method to Automate the Prediction of Student Academic Performance from Early Stages of the Course

Abstract: The objective of this work is to present a methodology that automates the prediction of students' academic performance at the end of the course using data recorded in the first tasks of the academic year. Analyzing early student records helps to predict later results, which is useful, for instance, for early intervention. With this aim, we propose a methodology based on the random Tukey depth and a non-parametric kernel. This methodology allows teachers and evaluators to define the variables that they consider most appropriate to measure those aspects related to the academic performance of students. The methodology is applied to a real case study, obtaining a success rate in the predictions of over 80%. The case study was carried out in the field of Human-computer Interaction. The results indicate that the methodology could be of special interest for developing software systems that process the data generated by computer-supported learning systems and warn the teacher of the need to adopt intervention mechanisms when low academic performance is predicted.


Introduction
Recent technological innovations are currently reflected in the proliferation of groupware systems aimed at facilitating communication and coordination between users, as well as providing shared workspaces where users build artifacts that solve tasks. Collaboration supported by groupware is characterized by a large number of interactions that each user performs to cooperate with other members of a common group. An analysis of these interactions can be used to improve these collective processes. Duque et al. [1] propose a methodology for carrying out this analysis based on the following three phases: (i) to capture descriptive information of the interactions, (ii) to categorize and characterize the information collected and (iii) to intervene in the improvement of the collaborative activity.
Among these improvements, it is worth highlighting those that refer to providing better mechanisms to be aware of the interactions performed by other users [2], optimizing business processes to achieve strategic goals of organizations [3], and adapting academic processes supported by collaborative learning environments [4].
Computer-Supported Collaborative Learning (CSCL) is the research field that studies how groupware can be exploited in academic environments. Thus, groupware systems support processes that enable students to build new knowledge. These processes are usually oriented towards solving academic problems using social interaction with classmates. Students discuss and interchange ideas about solutions that solve a problem proposed by the teacher. Therefore, students acquire new knowledge due to the arguments and reasonings that arise in these discussions. One of the main research challenges in the CSCL

Our Research Contribution
This work proposes a methodology that enables teachers to identify the factors with the most impact on the academic performance of the students in a course, using an interaction and collaboration analysis of the earliest activities of the subject. The idea is to obtain a flexible methodology that can be adapted to any subject and to any software system that supports the learning process, individually or collaboratively. Thus, the methodology is not tied to predefined indicators or competencies; instead, the teacher establishes, in a flexible manner, how to measure those aspects of the learning process that he/she considers of interest. Additionally, the methodology is based on a statistical technique that can be carried out in an automated manner by software tools. Therefore, the amount of information generated by CSCL systems is not an obstacle to the prediction process, as it is automated by software support. The result is a generic and flexible solution that automates the prediction of academic performance while leaving the teacher free to configure it. This methodology is useful to intervene not only in specific problem-solving activities but also in adapting the course development to the students. It is based on statistical data depth [18] and nonparametric kernel classification [19] and is here applied to the Human-computer Interaction subject of the Computer Science degree at the University of Cantabria. The rest of this paper is organized as follows. Section 2 describes the methodology for predicting the academic performance of the students from the interactions collected in early stages. Section 3 shows the results of a case study in which the methodology is applied to predict academic performance in a university course. Section 4 discusses the results of this work.
The computations have been carried out using the R software.

Materials and Methods
Our main research problem is whether it is possible to successfully predict the performance of students in an academic course from the earliest activities supported by groupware, by making use of the performance of the students who took the course in previous years. Denoting by N = 6 the number of tasks performed by the students, the research problem is divided into the following research subproblems:

1. Is it possible to successfully predict the average grade over the N tasks based on the first two tasks performed by the students?
2. Is it possible to successfully predict the average grade over the N tasks based on the first three tasks performed by the students?
3. Is it possible to successfully predict the average grade over the N tasks based on the first four tasks performed by the students?
4. Is it possible to successfully predict the average grade over the N tasks based on the first five tasks performed by the students?
To design a methodology that allows for this, the following three types of data, commonly used to characterize groupware [20], are taken as input:
• Communication between classmates: These data measure the fluency in exchanging ideas on how to solve the activities (e.g., contributions from each student, perception of the quality of the proposals of others, etc.).
• Coordination to distribute tasks: These data allow us to quantify how the efforts are distributed between the members of the group (e.g., hours of work of each member of the group, perception of the effort of the classmates, etc.).
• Collaboration for building quality solutions: These data quantify whether the collective process allowed the student to improve solutions (e.g., grades in collaborative activities, perception of how solutions are improved by the classmates, etc.).
The measurement of the students' academic performance was carried out through the analysis indicators proposed by [5]. This proposal includes a set of indicators that measure three dimensions of the students' academic work:
1. The individual work of the students. Examples of these indicators are the number of proposals of each learner and the amount of individual interaction with the solution.
2. The degree of collaboration. Examples of these indicators are the number of proposals commented on by other learners and the degree to which the task distribution was equitable.
3. The solutions generated. Examples are the degree to which the solution is well-formed according to the syntax rules of the programming language and the assessment of whether the solution solves the task goals.
Finally, the technological framework proposed by [21] was used to automate the calculation of a single variable that measures student performance as an average of the values of these indicators. Each indicator has the same weight in the calculation of the final variable. These data are multivariate and are processed by means of data depth to reduce their dimensionality, resulting in univariate data, which makes it easy to predict the students' performance. This prediction is done in terms of non-parametric supervised classification. We employ the random Tukey depth [22] as the statistical data depth. As this is a more novel technique, we explore it in what follows. After that, we present the methodology employed in practice and introduce the studied dataset.

Statistical Data Depth
According to the recent paper [23], statistical depth is currently a hot research topic in statistical analysis; see [24][25][26][27][28] for some papers on the topic. Given a probability distribution P on $\mathbb{R}^p$, a statistical depth function orders the points in $\mathbb{R}^p$ from the "center of P" to the "outer of P". Obviously, this problem includes data sets if we take P to be the empirical distribution associated with the dataset at hand. Note that in the one-dimensional case this order is trivial, it being reasonable to order the points using the order induced by the function
$$D_1(x, P) := \min\{P(-\infty, x],\; P[x, \infty)\}. \qquad (1)$$
This implies that the data are ordered according to the absolute difference between 50 and their percentiles (the smaller the difference, the deeper the point), and the deepest points are the medians of P. Ordering multivariate data is, however, neither trivial nor pursued in a unique manner. Therefore, several multidimensional depths have been proposed [29][30][31][32]. Here we are mainly interested in the random Tukey depth function, which is a random approximation of the Tukey (or halfspace) depth [33]. The drawback of the Tukey depth is the high computation time it requires [34]. This issue is addressed by its random approximation. According to Zuo and Serfling [18], the Tukey depth behaves very well in comparison with the existing competitors. The random Tukey depth inherits the good theoretical properties of the Tukey depth and, in particular, it characterizes discrete distributions [35], which comes in handy for the study performed in this paper.
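As a minimal sketch of the one-dimensional ordering just described, the Python snippet below computes the depth of a point with respect to the empirical distribution of a sample, assuming the usual form of the one-dimensional depth (the minimum of the fraction of points at or below x and the fraction at or above x). The function name `depth_1d` and the grade values are illustrative assumptions, not taken from the paper, whose computations were carried out in R.

```python
def depth_1d(x, sample):
    """Empirical one-dimensional depth of x: the smaller of the fraction of
    sample points <= x and the fraction of sample points >= x."""
    n = len(sample)
    at_or_below = sum(1 for s in sample if s <= x)
    at_or_above = sum(1 for s in sample if s >= x)
    return min(at_or_below, at_or_above) / n


# Hypothetical grades, used only to illustrate the center-outward ordering.
grades = [2.0, 4.5, 5.0, 6.5, 7.0, 8.0, 9.5]

# Sort from the deepest (most central) grade to the most outlying one.
center_outward = sorted(grades, key=lambda g: depth_1d(g, grades), reverse=True)
deepest = center_outward[0]
```

With an odd number of distinct values, the deepest point under this definition is the sample median, in line with the remark that the deepest points are the medians of P.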
For $x \in \mathbb{R}^p$, the random Tukey depth of x with respect to P, $D_R(x, P)$, is the minimal probability that can be attained over a set of randomly chosen closed halfspaces containing x; i.e., $D_R(x, P)$ is the minimum of the one-dimensional depths (see (1)) of a finite number of randomly chosen one-dimensional projections of x, where those depths are calculated with respect to the corresponding marginal of P. In this paper we make use of 50 random projections.
Let us now explain the idea of deepness behind the definition of random Tukey depth. Given n points, let us denote one of them by x. We want to compute the random Tukey depth of x with respect to the set of n points. For that, we count the number of points in the set that are contained in each of the randomly chosen closed halfspaces that have x on their border. Then, we take any of those halfspaces containing the fewest points from the set, and the depth of x is this number of points divided by n. In the left-hand side plot of Figure 1, in $\mathbb{R}^2$, n is equal to ten and the random Tukey depth of x is given, among others, by the randomly obtained closed halfspace painted in pastel blue. As there are four points inside this halfspace, the random Tukey depth of x is 0.4. Note that x is the deepest point in the set. From the right-hand side plot of Figure 1 we can observe that, taking sufficiently many randomly chosen halfspaces, the random Tukey depth of point y is 0.3 because, among all the closed halfspaces that have y on their border, the ones that contain the fewest points from the set contain three points. Alternatively, taking into account that (1) coincides with the definition of random Tukey depth in $\mathbb{R}$, to compute the random Tukey depth of a point $x \in \mathbb{R}^p$ with respect to a set $A \subset \mathbb{R}^p$ of size n we can do the following.
For each randomly selected vector v in the unit sphere of $\mathbb{R}^p$, we compute the one-dimensional depth, (1), of the projection of x on v with respect to the projection of A on v. Then, the minimum of the one-dimensional depths over the drawn vectors v is the random Tukey depth of x. Note that when A is finite it suffices to take a number of vectors v equal to the number of combinations of n − 1 elements taken p − 1 at a time without repetition.
Figure 1. Illustration in $\mathbb{R}^2$ with n equal to ten: the random Tukey depth of x (left-hand side) and y (right-hand side) is given, among others, by the randomly obtained closed halfspaces painted in pastel blue.
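The projection-based procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: directions are drawn uniformly on the unit sphere by normalizing Gaussian vectors (the text does not specify how the random vectors are drawn), 50 directions are used as in the paper, and the function names and the toy point set are hypothetical.

```python
import math
import random


def depth_1d(x, sample):
    """Empirical one-dimensional depth: min of the fractions <= x and >= x."""
    n = len(sample)
    return min(sum(s <= x for s in sample), sum(s >= x for s in sample)) / n


def random_direction(p, rng):
    """Assumed sampling scheme: a normalized Gaussian vector in R^p."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def project(point, v):
    """Scalar projection of a point on the direction v."""
    return sum(a * b for a, b in zip(point, v))


def random_tukey_depth(x, data, directions):
    """Minimum, over the drawn directions, of the one-dimensional depth of the
    projected point with respect to the projected data set."""
    return min(
        depth_1d(project(x, v), [project(a, v) for a in data])
        for v in directions
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
directions = [random_direction(2, rng) for _ in range(50)]

# Toy data in R^2: a symmetric cross around the origin plus one outlying point.
data = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (3.0, 3.0)]
depth_center = random_tukey_depth((0.0, 0.0), data, directions)
depth_outlier = random_tukey_depth((3.0, 3.0), data, directions)
```

In this toy set the central point should come out deeper than the outlying one. The same set of directions is reused for every point, so that the depths of different points are comparable.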

Methodology in Practice
To evaluate the performance of the students, we make use of their grades. There are a variety of grade systems. For instance: The methodology we present here is valid for any grading system. The reason is that any system can be translated into a success percentage. That is, any grade can be transformed into a number g ∈ [0, 100] with g% the percentage of right answers, for instance. Thus, to particularize it, we focus on the Spanish grading system.
These intervals have been set taking into account that a student passes with a grade greater than or equal to 5. Thus, we use the interval [0,4) for those grades where the student clearly fails. The largest possible grade also has its own interval, {10}, since it is a distinction.
The idea is to first construct a model making use of the training sample. To construct it, we simply employ a supervised classification procedure in which the random Tukey depth is first used to reduce the dimensionality and a normal kernel classifier is then applied to perform the classification. In what follows we explain what this classifier consists of; we refer to Ferraty and Vieu [36] and Ferraty and Vieu [37], Chapter 8, for more technical details, consistency, and rate of convergence of posterior probabilities. For that, we assume that the pairs $(r_i^{(d)}, IR_i)$ and $(e_j^{(d)}, IE_j)$, with $i \in \{1, \ldots, m\}$ and $j \in \{1, \ldots, n\}$, are independent and identically distributed (i.i.d.) as (X, Y), where X takes values in $\mathbb{R}^d$ and Y takes values in G. The classifier is based on a general Bayes classification rule. For a general pair (x, g), where $g \in G$ and $x \in \mathbb{R}^d$, the posterior probability is defined as
$$p_g(x) = P(Y = g \mid X = x).$$
Note that P denotes the underlying probability. Then, $x \in \mathbb{R}^d$ is classified to the class $g \in G$ yielding maximum posterior probability. In particular, for classifying points in the training sample we take $(x, g) = (r_i^{(d)}, IR_i)$ for some $i \in \{1, \ldots, m\}$, while for classifying points in the test sample $(x, g) = (e_j^{(d)}, IE_j)$ for some $j \in \{1, \ldots, n\}$. For this purpose, we need to estimate $p_g(x)$. As the training sample $(r_i^{(d)}, IR_i)$, $i \in \{1, \ldots, m\}$, consists of i.i.d. copies of (X, Y), we use it to estimate the underlying probability distribution. Specifically, we replace $p_g(x)$ by its Nadaraya-Watson estimator [38,39], which is given by
$$\hat{p}_g(x) = \frac{\sum_{i:\, IR_i = g} K\left(h^{-1}\,\|x - r_i^{(d)}\|\right)}{\sum_{i=1}^{m} K\left(h^{-1}\,\|x - r_i^{(d)}\|\right)},$$
where h > 0, $\|\cdot\|$ is the Euclidean norm on $\mathbb{R}^d$, and K is a probability kernel satisfying K(0) > 0, K(u) = 0 for u < 0, and K non-increasing in u for positive u. Notice that the sum in the numerator is only over those i such that $IR_i = g$, yielding $\sum_{g \in G} \hat{p}_g(x) = 1$.
Additionally, the closer the point x is to $r_i^{(d)}$, the closer the quantity $h^{-1}\,\|x - r_i^{(d)}\|$ is to 0, the maximal point of the kernel K, thus yielding a higher probability. Specifically, we choose K(u) to be 2 times the standard normal density if u is non-negative and 0 otherwise. The parameter h is chosen so that the classification error in the training sample is minimized.
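The classifier described above can be sketched in Python for the one-dimensional data obtained after the depth reduction. This is a sketch under assumptions rather than the authors' R implementation: inputs are scalars, the training error is computed by resubstitution (each training point is classified by a model that includes it), h is picked from a small hypothetical grid, and the training values and the labels "A" and "B" are invented for illustration.

```python
import math


def half_normal_kernel(u):
    """Twice the standard normal density for u >= 0, and 0 otherwise."""
    if u < 0:
        return 0.0
    return 2.0 * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def posterior_estimates(x, train_x, train_labels, h):
    """Nadaraya-Watson estimates of P(Y = g | X = x) for each label g."""
    weights = [half_normal_kernel(abs(x - r) / h) for r in train_x]
    total = sum(weights)
    labels = sorted(set(train_labels))
    if total == 0.0:
        # Every kernel weight underflowed: fall back to class frequencies.
        return {g: train_labels.count(g) / len(train_labels) for g in labels}
    return {
        g: sum(w for w, lab in zip(weights, train_labels) if lab == g) / total
        for g in labels
    }


def classify(x, train_x, train_labels, h):
    """Assign x to the label with the largest estimated posterior."""
    post = posterior_estimates(x, train_x, train_labels, h)
    return max(post, key=post.get)


def training_error(train_x, train_labels, h):
    """Fraction of training points misclassified (resubstitution, assumed)."""
    wrong = sum(
        classify(x, train_x, train_labels, h) != g
        for x, g in zip(train_x, train_labels)
    )
    return wrong / len(train_x)


def choose_bandwidth(train_x, train_labels, candidates):
    """Pick the candidate h with the smallest training error (first on ties)."""
    return min(candidates, key=lambda h: training_error(train_x, train_labels, h))


# Hypothetical one-dimensional training values and labels.
train_x = [0.10, 0.15, 0.20, 0.40, 0.45, 0.50]
train_labels = ["A", "A", "A", "B", "B", "B"]
h = choose_bandwidth(train_x, train_labels, [0.05, 0.1, 0.2, 0.5])
posterior = posterior_estimates(0.12, train_x, train_labels, h)
predicted = classify(0.12, train_x, train_labels, h)
```

Because the numerator sums only over the training points with label g while the denominator sums over all of them, the estimated posteriors add up to one by construction.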

The Dataset
The proposed method was applied in the Human-computer Interaction (HCI) subject taken by students in the third year of the Computer Science degree at the University of Cantabria, in Spain. The HCI discipline deals with studying how people interact with computers. Some of the main objectives pursued by this discipline are the definitions of methodologies to develop more efficient and intuitive user interfaces, the creation of methods that allow evaluating and comparing the characteristics of user interfaces and the design of models that allow the interaction between people and computers to be represented. HCI studies the relationship of people with computers and this makes it necessary to apply knowledge from fields as varied as Psychology, Computer Science, Telecommunications and Sociology. Therefore, HCI has a multidisciplinary nature that bases it on many classical fields of knowledge.
This subject follows a Project-Based Learning (PBL) approach through tasks in which students work collaboratively to design and build different types of user interfaces (for mobile phones, web applications, or desktop tools). The methodology is applied to predict academic performance using data from a few early activities. The main goal of this experimentation is to identify elements of the learning process (tasks, group composition, etc.) on which to intervene in order to have a real impact on the academic performance of the students in a course.
The data collected quantify the activity of 205 students. As part of their academic course, these students performed 6 tasks that required designing user interfaces. These tasks are the following:

1. Prototype a mockup of a user interface for smartphones.
2. Build user interfaces using the Android platform.
3. Design and build user interfaces for desktop computers using a WIMP (windows, icons, menus, and pointers) style.
4. Design and build user interfaces for desktop computers using a WYSIWYG (What You See Is What You Get) style.
5. Design and build the user interfaces of a website.
6. Perform a usability test process.
The software support used by the students consisted of a videoconferencing tool with a shared whiteboard and chat, a shared folder, and Axure, a UX tool to prototype interfaces. These user interfaces were later built using Android technologies and the Java and HTML languages. Students collaborated in groups, resulting in a total of 79 groups. The dataset was used to experiment with the proposed methodology as shown in Figure 2. The students collaborated in groups to solve the proposed tasks. This collaboration was carried out with the support of software tools that recorded their communications, how the workload was distributed, and the solutions to the tasks. The teacher used all this information to grade each assignment. Finally, the methodology was applied to verify whether a small number of tasks sufficed to predict the final grade of the student. To illustrate the dataset, we have plotted in Figure 3 the grades of the groups over the six tasks of the different academic periods. These grades are in the range 0-10, 0 being the lowest possible grade and 10 the highest one. These grades are the result of quantifying the following three aspects: (i) the quality of the user interfaces, (ii) the extent to which group members distribute the workload equitably, and (iii) the contributions and proposals that arise to establish a real collaboration.
The left plot corresponds to the grades in Task 1 against those in Task 2, the central plot to those in Task 3 against those in Task 4, and the right plot to those in Task 5 against those in Task 6. The grades of the academic period 2017/18 are represented in black, those of 2018/19 in red, those of 2019/20 in green, and those of 2020/21 in blue. In each of these academic periods we have labeled each group by a number. Thus, we can observe, for instance, that:
• Group 3 of the academic period 2017/18 had grades in the interval [6,8) in Tasks 1 and 2 that improved to the range [8,10) for Tasks 3, 4 and 5 and decreased to the range [4,6) for Task 6.
• Group 5 of the academic period 2018/19 had grades in the range [8,10) in Task 1 that worsened to the range [2,4) for Tasks 2 and 3, slightly improved to the interval [4,6) for Tasks 4 and 5, and again improved to the interval [6,8) for Task 6.
• Group 19 of the academic period 2019/20 had grades in the range [6,8) in Tasks 1 and 2 that improved to the range [8,10) for Task 3, decreased to the range [4,6) for Tasks 4 and 5, and then increased sharply to a 10 for Task 6.
This shows that the grade patterns differ from group to group, which makes the analysis more difficult.

Results
The objective of this section is to provide the results for the research problem proposed in Section 2, which consists of predicting the average grade of the students taking an academic course based on their early performance in the course and the performance of the students who took the course in previous academic years. In particular, we use the grades of the courses 2017/18, 2018/19, and 2019/20 as the training sample, obtaining a training sample of size m = 50. We use as the test sample the early grades recorded during the course 2020/21 to predict the average grade over the six tasks (in groups), giving a test sample of size n = 29. The research problem has four subproblems: the first regards as early grades the first two grades, the second the first three, the third the first four, and the fourth the first five. Making use of the methodology described in Section 2, we perform here a supervised classification to predict the average grade over the six tasks of the groups in the test sample. To do this, we take d, the number of tasks done by the student, in the range from 2 to 5.
At the top of Table 1 we observe the results of applying the model to the test data with d = 2 (research problem 1). That is, we have constructed the model making use of the training data $(r_{i,1}, r_{i,2})$, which are the grades for the first two Tasks, and the label $IR_i$, which is the average grade of the six Tasks, for $i \in \{1, \ldots, 50\}$. Then, we have reported in the top panel of Table 1 the summary of the range values $\widehat{IE}_1, \ldots, \widehat{IE}_{29}$, that is, the estimated values of $IE_1, \ldots, IE_{29}$. We report the estimated values using a wider range than the one used to summarize the given values (closed intervals such as [6,8] or [7,9]). Thus, we obtain that the three test groups of students whose average grade is in the interval [0,4) are correctly classified in the interval [0,4]. Analogously, the four groups of students with average grade in the interval [7,8) are appropriately classified in the interval [6,8].
There are 13 groups of students in the average grade range [8,9); 3 of them are classified in the [6,8] interval and the other 10 in [7,9]. Although both classifications should be considered correct, to be on the safe side, we have only considered as successful for the later analysis (Figure 4) those classified in [7,9]. Similarly, there are seven groups of students with average grade in [9,10) that are correctly classified in [8,10] and another one that is not so clear, as it is classified in [7,9]. Thus, for the later analysis we only consider as correct the seven classified in [8,10]. Furthermore, there is one case in the analysis that is clearly misclassified: the group of students with average grade in [9,10) whose estimate is in [6,8].
Table 1. Four confusion matrices between the real average grade intervals (columns) and the estimated average grade intervals (rows) for the n = 29 test data (academic course 2020/21) on the model constructed using the m = 50 training data (academic courses 2017/18, 2018/19, 2019/20). 3 test data belong to the interval [0,4), 4 to [7,8), 13 to [8,9), and 9 to [9,10) (intervals reported in (6)). The analysis is based on: Tasks 1 and 2 (top matrix), Tasks 1, 2 and 3 (second matrix from the top), Tasks 1, 2, 3, and 4 (third matrix from the top) and Tasks 1, 2, 3, 4, and 5 (bottom matrix). The average grades make use of the six Tasks. The groups of students whose average grade is clearly correctly classified are in blue. The omitted values correspond to zeros.
When making use of the grades of Tasks 1, 2, and 3 (research problem 2) to predict the average grade over the six tasks, the results obtained when classifying the test sample are, as expected, slightly better than those obtained by just using Tasks 1 and 2 (research problem 1) and worse than when also using Task 4 (research problem 3). They are reported in the second block of Table 1. In particular, we can observe only one clear misclassification, which is that of a group with an average grade in the interval [0,4) whose estimated average grade belongs to the interval [6,8]. The results obtained when making use of just Tasks 1, 2, 3, and 4 (research problem 3) are the same as those obtained when also adding Task 5 (research problem 4). In particular, the absolute number of misclassifications increases to two in both cases. As reported in Table 1, they correspond to groups with average grade in the interval [0,4) that is estimated to be in the interval [6,8].
We have reported above the absolute misclassifications when predicting the test data with models ranging from one based on just Tasks 1 and 2 (research problem 1) to one based on Tasks 1, 2, 3, 4, and 5 (research problem 4). All these misclassifications are summarized in Table 2, where we report the relative success rate of the procedure under the different studied scenarios. There, we can observe that, when applying to the test data the model that only makes use of Tasks 1 and 2 (research problem 1), we are able to predict the interval to which the average grade over the six Tasks belongs with a success rate of 82.76%. This rate increases to 86.21% when also making use of Task 3 (research problem 2) and stabilizes at a 93.10% success rate when making use of Tasks 1, 2, 3, and 4 (research problem 3). A display of these success rates is in Figure 4, where a rapid increase and stabilization of the success rates is observed. It is worth saying that we have been conservative in computing these success rates, considering as successful only the entries in blue in Table 1 although, as explained above, there are other entries that could also be considered successful.
Table 2. Success rates for the supervised classification of average grade intervals over 6 tasks for the test data on the model constructed using the training data. Test data refer to the n = 29 groups in the academic course 2020/21 and training data to the m = 50 groups along the academic courses 2017/18, 2018/19, and 2019/20. The analysis is based on: Tasks 1 and 2 (left column), Tasks 1, 2, and 3 (second left column), Tasks 1, 2, 3, and 4 (third left column) and Tasks 1, 2, 3, 4, and 5 (right column). For the success rate, only the groups of students whose average grade is displayed in blue in Table 1 are counted as correctly classified.
Tasks used      1-2       1-3       1-4       1-5
Success rate    82.76%    86.21%    93.10%    93.10%
Figure 4. Success rates reported in Table 2. The OX-axis shows the tasks used in the model to predict the interval of the average grade; it goes from Tasks 1 and 2 (research problem 1) in the left corner of the axis to Tasks 1, 2, 3, 4, and 5 (research problem 4) in the right corner.
The obtained results are extremely good: after the students have completed just the first two tasks, we can predict their average final grade over the six tasks with a success rate of over 80%; after the first three tasks, with a success rate of over 85%; and after the first four tasks, with a success rate of over 90%.
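As a quick check of the arithmetic, the 82.76% success rate for Tasks 1 and 2 can be recomputed from the counts given in the description of the top matrix of Table 1. The Python snippet below is only an illustrative recalculation; the dictionary layout and variable names are assumptions, while the counts and the conservative success rule are those stated in the text.

```python
# (real interval, estimated interval) -> number of test groups, for d = 2,
# as described in the text for the top matrix of Table 1.
confusion_d2 = {
    ("[0,4)", "[0,4]"): 3,
    ("[7,8)", "[6,8]"): 4,
    ("[8,9)", "[6,8]"): 3,
    ("[8,9)", "[7,9]"): 10,
    ("[9,10)", "[8,10]"): 7,
    ("[9,10)", "[7,9]"): 1,
    ("[9,10)", "[6,8]"): 1,
}

# Conservative rule from the text: only these entries count as successes.
successes = {
    ("[0,4)", "[0,4]"),
    ("[7,8)", "[6,8]"),
    ("[8,9)", "[7,9]"),
    ("[9,10)", "[8,10]"),
}

total = sum(confusion_d2.values())
correct = sum(n for key, n in confusion_d2.items() if key in successes)
rate = 100.0 * correct / total
print(f"{correct}/{total} = {rate:.2f}%")
```

The counts add up to the 29 test groups, of which 24 are counted as successes. The other reported rates, 86.21% and 93.10%, are numerically consistent with 25 and 27 successes out of 29, respectively.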

Discussion
The algorithm used is powerful in that it makes use of the random Tukey depth. This is due to two main reasons:
1. The random Tukey depth is computationally effective in reducing the dimension to one even if the original data dimension is high, as happens with high-dimensional or functional data [40].
2. The random Tukey depth behaves adequately [35], as it generally inherits the good properties of the Tukey depth, which is the best known in the literature except for its expensive computation time.
Furthermore, the kernel classifier applied to the resulting one-dimensional data is a well-known one but could be substituted by any other one-dimensional classifier. The only requirement is that the process can be automated, as is the case here.
Other works have carried out data-driven analyses of student academic activity with different objectives. Thus, ref [41] stores information on how students solve collaborative activities using CSCL systems and analyzes it to propose 17 strategies to optimize student performance. In other cases, the studies have focused on predicting the optimal number of students who should collaborate on the tasks [42]. Our study has a broader temporal scope. It is not about analyzing specific activities but rather about predicting performance over an academic year.
The obtained results show a high success rate in predicting the average grade by just using the first two tasks performed by the students.
Thus, one possibility worth considering is reducing the number of tasks required of the students. Additionally, this would also allow an early intervention to improve the performance of the groups whose predicted grade is lower.
Although academic collaborative tasks involve greater richness and complexity due to the social interactions that take place, this work opens the door to considering the first tasks that students solve as predictive of academic performance in the rest of the course. The case study of this work illustrates how the proposed method can achieve this goal. This approach complements other works in the CSCL field that analyze collaboration and interaction without predicting future academic performance [1,13,14]. However, our work has not delved into mechanisms that detail the causes (lack of communication, problems with the groupware system, etc.) that lead to an academic performance problem, nor into proposing automatic intervention strategies (adapting the groupware system, changing the composition of working groups, etc.).

Conclusions
This work has presented a proposal to automatically predict the academic performance of students using only the data recorded in the first tasks of the academic year. The interactions that students carry out with software tools to solve academic activities allow us to have datasets with which to try to carry out this prediction. In many active learning methodologies these activities are carried out collaboratively. For this reason, the work focused on experimenting with a real case in which the students collaborate in solving the tasks.
The proposal is based on a supervised classification technique built on statistical depth, which first applies the random Tukey depth to reduce the data dimension to one and then applies a kernel classifier. This means that the prediction can be carried out in an automated way, using support software that processes a significant amount of data. The experimentation carried out during four academic years in a university subject shows promising results: making use of just the first two tasks that the students perform, we obtain a success rate of over 80% in predicting their final grade. This success rate increases to over 90% if the first four tasks, out of six, are known when predicting the final grade.
We propose that the results of the predictive mechanisms serve not only to inform what the academic performance of students would be at the end of the academic year but also to intervene in the automated development of activities. Our future work will address the analysis of the causes of task failures and the design of intervention mechanisms in CSCL systems.
University of Cantabria through the teaching innovation project "Implantación de la técnica focus group para diseñar interfaces de usuario en la asignatura Interacción Persona-Computador" and "Utilización de las TIC para monitorizar y gestionar actividades colaborativas orientadas a resolver tareas de programación de algoritmos en el Grado en Ingeniería Informática."
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: This study does not involve personal information. Grades have been treated confidentially according to the regulations of the University of Cantabria.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: