Assignments as Influential Factor to Improve the Prediction of Student Performance in Online Courses

Abstract: Studies on the prediction of student success in distance learning have mainly explored demographic factors and student interactions with virtual learning environments. However, remarkably few studies use information about the assignments submitted by students as an influential factor to predict their academic achievement. This paper aims to explore the real importance of assignment information for predicting students' performance in distance learning and to evaluate the beneficial effect of including this information. We investigate and compare this factor and its potential from two information representation approaches: the traditional representation based on single instances and a more flexible representation based on Multiple Instance Learning (MIL), which focuses on handling weakly labeled data. A comparative study is carried out using the Open University Learning Analytics dataset, one of the most important public datasets in education, provided by one of the largest online universities in the United Kingdom. The study includes a wide set of different types of machine learning algorithms addressed from the two data representations discussed, showing that algorithms using only assignment information with a MIL-based representation can improve accuracy by more than 20% with respect to a representation based on single-instance learning. Thus, it is concluded that applying an appropriate representation that eliminates the sparseness of the data reveals the relevance of a factor, such as the assignments submitted, not widely used to date to predict students' academic performance. Moreover, a comparison with previous works on the same dataset and problem shows that predictive models based on MIL using only assignment information obtain competitive results compared to previous studies that include other factors to predict students' performance.


Introduction
The popularization of Internet access and advances in the exploration of digital resources have led to a growing interest in distance education. The main advantages of this mode of education are accessibility (students can follow a course from anywhere in the world) and flexibility (students can fit their learning around their daily routine) [1]. Current distance studies could not be understood without a digital platform that provides fundamental features such as the publication of course contents, a channel for professor-student communication, and tools to track student progress. These systems, called Virtual Learning Environments (VLEs), include course content delivery instruments, quiz modules and assignment submission components, among other functionalities [2]. In addition, VLEs are very useful for monitoring student involvement in the course, since all of his/her activity is recorded in log files that can be analyzed [3].
Even though distance courses have a short history, they have experienced rapid expansion, with Massive Open Online Courses (MOOCs) as the most popular example. This new format has opened a broad line of research, due to its differences with respect to traditional face-to-face higher education. Among its features, an open environment regardless of the location of students stands out, as well as a higher number of enrollments [4]. This also implies an important growth in the number of dropouts and academic failures [5]. In this context, the prediction of student success according to the work collected by the VLE has become an essential task to discover the main features that characterize the students who successfully pass a course. Thus, some works analyze the impact of the chosen learning platform [6,7], others study the effectiveness of the student-instructor interaction in this engagement [8], while other works have focused on the dynamic adaptation of the e-learning system to the current level of knowledge of each student, based on the interaction with the exercises of the course [9] or on prior knowledge and other social factors of students [10].
In this context, this work explores a little-used factor for predicting student success in distance learning, analyzing how this information should be treated to extract its full potential. Specifically, the proposed factor is assignment information, understood as the tasks submitted by the students throughout the online course. The use of information about assignments is not widespread as an essential factor to determine students' performance [1,4,11]. A priori, it could be expected that submitted assignments may help predict student performance more effectively than the number of accesses or clicks on the course resources, which is the most widely used factor. The lower use of assignment information is probably due to a combination of the small number of students who complete assignments, relative to the total enrollments, and the substantial variation across courses in the assignments scheduled. In this context, our work proposes to use assignment information from a flexible data representation perspective based on Multiple Instance Learning (MIL). This learning framework, introduced by Dietterich et al. [12], is considered an extension of traditional supervised learning focused on weakly labeled data. MIL could make better use of the information provided by submitted assignments to predict students' achievement.
In the field of educational data mining there are not many public datasets, due to the sensitivity of the data being worked with. This makes it difficult to compare different proposals, since the data used by each study are usually too different to be compared. In this context, the availability of the Open University Learning Analytics Dataset (OULAD) [13], one of the few existing public datasets in the field, is of great relevance. OULAD collects a large amount of student data from an important distance university during two academic years, including demographic data, student interactions with the VLE and assignments submitted. Many works have used it to predict students' performance, but mainly centered on click activity, as discussed in the related work section. This work uses OULAD to obtain assignment information, adapt the representation to the MIL paradigm and store it in ARFF format to work with Weka, a popular framework for data mining. Thus, following the open access philosophy, these files have been made publicly available online for the scientific community that wants to continue this line of study.
In summary, this work carries out an exhaustive study to determine the relevance of assignment information for predicting students' performance. Specifically, this work addresses the following research questions:

1. How should the information about assignments be represented? Previous works in distance learning use a classical representation based on single instances. However, each course has a different type and number of assignments, and these are submitted by few students, which leads to high sparsity in the data. The representation should be adapted to this environment so that machine learning algorithms can perform well. We propose an optimized representation based on MIL, able to adapt to the specific information available for each student.

2. Are machine learning algorithms affected by the way that assignments are represented? A wide set of machine learning algorithms is analyzed using two different representations of assignment information: a representation based on single instances (used in previous studies) and one based on MIL (the representation proposed in the previous step). A significant performance difference between the same algorithms under both representations shows the relevance of an appropriate representation, so that assignments can be considered a very influential factor for predicting students' performance.

3. Is information about assignments a relevant feature to predict student performance? The accuracy in predicting student performance using MIL is compared with previous studies that use different factors, such as demographic features and interactions on VLEs, to address the same problem. Algorithms using only information about submitted assignments reach competitive results, achieving better accuracy than previous works that predict academic performance using other factors provided in the same dataset. This justifies the relevance of assignments to predict students' performance, if they are represented appropriately.
This paper is organized as follows. Section 2 presents a brief introduction to MIL and other background concepts for this study. In Section 3, a review of related work on predicting students' success in distance education is presented; this section also reviews the application of MIL in the educational environment. Section 4 addresses an in-depth analysis of the problem representation and the available information. Section 5 presents the experimentation carried out and the results obtained. Finally, Section 6 draws some conclusions and proposes ideas for future work.

Background
This section presents the basic background concepts needed to understand the rest of the work. On the one hand, a brief introduction to MIL is given. On the other hand, the algorithms that will be used in the comparative study, from both the traditional and MIL perspectives, are described.

Multiple Instance Learning
Multiple Instance Learning (MIL) was introduced by Dietterich et al. [12] to represent complicated objects [14]. Its inherent capacity to represent ambiguous information allows an efficient representation of different types of objects, such as alternative representations or different views of the same object [15], compound objects formed by several parts [16] or evolving objects composed of samples taken at different time intervals [17].
The main characteristic of MIL is its input space representation: patterns are represented as bags which can contain a variable number of instances. In a supervised learning environment based on multiple instances, each bag or pattern has a label; however, there is no information about the labels of the individual instances. Thus, the hypothesis that relates each instance with its bag depends on the type of representation used. One of the most used is known as the standard MI assumption, defined by Dietterich et al. [12]. This hypothesis determines that a bag represents a specific concept if at least one of its instances represents the concept to learn, and the bag does not represent the concept if none of its instances represent it. However, with the application of MIL to more domains, different assumptions have been proposed [18]. Formally, in a traditional machine learning setting, an object M is represented by a feature vector V(M) associated with a label f(M), that is, the pair (V(M), f(M)). In a multiple instance learning setting, each object M may have a variable number n of instances m_1, m_2, ..., m_n, and each instance has an associated feature vector V(m_i); thus, the complete training object M is represented as the set {V(m_1), V(m_2), ..., V(m_n)} associated with a label f(M), that is, ({V(m_1), V(m_2), ..., V(m_n)}, f(M)).
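The two settings just formalized can be contrasted in a few lines of code. The following is an illustrative sketch (not part of the original study; all names are hypothetical) of a single-instance example, a MIL bag, and the standard MI assumption:

```python
from dataclasses import dataclass
from typing import Callable, List

# Traditional setting: one fixed-length feature vector V(M) with a label f(M).
@dataclass
class SingleInstanceExample:
    features: List[float]   # V(M)
    label: str              # f(M)

# MIL setting: a bag holds a variable number of instance vectors V(m_i),
# and the label f(M) is attached to the bag, never to individual instances.
@dataclass
class Bag:
    instances: List[List[float]]   # {V(m_1), ..., V(m_n)}, n varies per bag
    label: str                     # f(M)

def standard_mi_label(bag: Bag, instance_matches: Callable[[List[float]], bool]) -> str:
    """Standard MI assumption: a bag is positive if at least one of its
    instances represents the target concept, and negative otherwise."""
    return "positive" if any(instance_matches(i) for i in bag.instances) else "negative"
```

Note that other MI assumptions [18] would replace only the aggregation rule in `standard_mi_label`, while the bag structure stays the same.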

Supervised Data Mining Techniques for Predicting Students' Performance
Predicting students' performance has been addressed with a wide range of popular methods within the field of supervised data mining [2]. Special attention is paid to models that are explainable, since they allow identification of the most determining factors in the result, e.g., student demographic information, VLE activity, etc. Thus, the most popular methods for predicting students' performance are those based on decision trees [11]. These methods offer one of the most intuitive solutions: internal nodes of the decision tree test specific predictive factors, and leaf nodes give a classification that applies to all students that reach that leaf. Similarly, rule-based methods offer a solution composed of an antecedent that presents several logical expressions and a consequent that gives the outcome for students covered by the rule. Bayesian methods and logistic regression are also among the most popular methods, since they offer predictions based on class likelihoods where it is possible to determine the influence of each factor on the result. Support vector machines (SVM) are relatively popular as well, with an approach based on finding the maximum-margin hyperplane in the factor space that separates the types of students. Artificial Neural Networks (ANN) are non-linear models composed of units organized in layers that transmit and transform an input, i.e., the student information, through the network to provide a prediction. These models are less popular [11] because of their lack of explainability, although, on the other hand, they tend to be more accurate in their predictions.
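The tree-based idea described above (a node tests one predictive factor, a leaf assigns a class) can be made concrete with a minimal one-level tree, i.e., a decision stump. This is only an illustrative sketch, not one of the algorithms evaluated in the paper; the factor (average assignment score) and threshold are hypothetical:

```python
def stump_predict(avg_score: float, threshold: float = 50.0) -> str:
    """One node tests a single factor against a threshold; leaves are classes."""
    return "pass" if avg_score >= threshold else "fail"

def train_stump(examples):
    """Pick the threshold maximizing training accuracy by exhaustive scan.
    examples: list of (avg_score, label) pairs."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted({score for score, _ in examples}):
        acc = sum(stump_predict(s, t) == y for s, y in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```

Full decision tree learners repeat this factor/threshold search recursively on each resulting partition, which is what makes the learned factors directly readable.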

Related Work
Although VLEs have been used in traditional education for several years, their application to distance education has important particularities. Distance education usually has a higher number of enrolled students, more diverse demographic characteristics and, in general, a lower motivation level. These characteristics cause more academic failures and higher dropout rates. In this context, the task of predicting student success in distance education is particularly challenging [3,4].
This section presents a review of previous works, and more specifically of works that use OULAD. As commented in the introduction, this dataset has had notable relevance in EDM. In addition, a review of the application of the MIL framework in education is addressed.

Predicting Student Success in Distance Higher Education
Predicting students' performance in higher education is a problem that has attracted great attention [1,11,19], due to the rise of VLEs, online activities and the increase in log data generated by these environments, which can be processed with machine learning techniques in order to detect at-risk students, measure the effectiveness of the e-learning system or give an idea of the success of the academic institution. In this context, several factors, such as student background, previous academic record, or activity during the course, can be selected to measure a student's engagement and, therefore, his/her chances of success in the course. According to [19], the most influential factors are prior academic achievement (44%) and demographic information (25%). In [11], these factors are also the most common, but e-learning activities (25%) are also included in the top-3 ranking. The e-learning category includes different statistics such as the number of logins, assignments or quizzes done. However, the number of clicks on the course resources is by far the most used factor in this category.
Recent proposals for predicting students' performance in distance higher education include several works such as [20]. This work combines demographic, assignment and click information with information about interaction with videos of the recorded classes. The work in [21] also explores tree-based methods combining assignment submission and click information. In [22], click information is explored, but from a frequency perspective rather than the number of clicks during the course. In [23], a novel proposal is explored based on graphical visualization of the logit leaf model that combines demographic information, number of clicks and submitted assignments. The case of weakly labeled data in the student record is also addressed from an active learning perspective [24] and from a semi-supervised learning approach [25], again based on the number of clicks. However, all these proposals work with sensitive data that cannot be published, and each one considers distinct demographic or assignment attributes, so it is difficult to compare proposals and results.
This study uses the OULA dataset, or OULAD [13], to evaluate results. It is one of the few existing open datasets in learning analytics and educational data mining. Moreover, it is collected from a real case study, specifically at the Open University (https://www.open.ac.uk/, accessed on 23 October 2021), the largest institution of distance education in the United Kingdom and one of the most important worldwide, with around 170,000 students per year and a wide range of degrees, as well as free courses under its platform OpenLearn (https://www.open.edu/openlearn/, accessed on 23 October 2021). The dataset contains information on 32,593 students and 7 courses in their different semesters. It is focused on students, aggregating their demographic data, information about their course enrollments, the number of daily accesses to course resources (clickstream), and records about assignments submitted to the VLE (referred to as assessments) during a course. Due to the characteristics of this dataset and the large amount of information, some works in the literature have referenced it as MOOC data [26][27][28][29]. Specific dataset details are addressed in depth in Section 4. An open dataset with these characteristics provides a common framework where different authors can compare their studies with previous works. In this context, although it is a recent dataset (published in 2017), it has reached high relevance in the field, with more than 20 works to date that use it to study the problem of predicting academic performance. Table 1 summarizes the main characteristics of these previous proposals, taking into account the purpose of the study, the criteria used and the algorithm proposed (or the main one among the proposals).
Considering the factors used to carry out the prediction tasks in OULAD, slight differences can be observed in the most used factors with respect to the general works discussed previously. Thus, 39% of studies use the number of accesses to resources (clickstreams) [26,[29][30][31][32][33][34][35], while 25% of studies combine this information with demographic data from the students [27,28,32,[36][37][38][39]. Focusing solely on assignment information, only one study [40] uses this factor exclusively. Concretely, it considers assignments an important factor to predict the student's performance. However, this study has important limitations, such as analyzing only two courses of the seven available. The other studies that use assignment information, 30% of the works, use this factor together with the rest of the sources [3,39,[41][42][43]. Thus, the real relevance of this factor in the final prediction cannot be analyzed.
This work explores how assignment information alone can improve the prediction of academic achievement. The purpose is to show that less data can be used more efficiently to obtain competitive results. For this aim, a study is carried out including all OULAD courses, as well as all specific characteristics of assignments in each course with different representations. The study is conducted over a set of machine learning algorithms belonging to different paradigms in order to provide a comparison as representative as possible.

MIL in Educational Data Mining
MIL has been used in a wide range of application domains, including classification, regression, ranking and clustering tasks [14]. This framework has experienced growing interest because of its characteristics in data representation. MIL can naturally adapt to complex problems and allows working with weakly labeled data. Prediction of student performance from VLE logs has also been addressed with this learning approach [16]. That previous research is set in a context different from traditional e-learning courses: the course is taught in combination with face-to-face classes and different factors are used in the study. However, it can be considered an example of the efficiency of MIL in representing educational data mining problems. From another perspective, [48] presents a tool to discover relevant e-activities for learners using MIL.

Materials and Methods
In this section, the information on assignments in the Open University Learning Analytics Dataset (OULAD) [13] is analyzed. It is an anonymized, public and open dataset supported by the Open University of the United Kingdom. It maintains information about courses, students and their interactions with the VLE.
First, the original structure of OULAD is analyzed; secondly, the problem of predicting student performance from the activity associated with his/her submitted assignments is discussed; finally, the MIL-based representation and its main differences with respect to the traditional representation are addressed.

Information Analysis of OULAD
The original source of OULAD has been published by Kuzilek et al. at (https://analyse.kmi.open.ac.uk/open_dataset, accessed on 23 October 2021). It contains 7 distance learning courses (called modules), all of them taught at the Open University in several semesters during the years 2013 and 2014 (called presentations). The courses cover different domains and difficulty levels. Thus, courses AAA, BBB and GGG belong to the Social Sciences domain and courses CCC, DDD, EEE and FFF to Science, Technology, Engineering and Mathematics (STEM). Concerning difficulty levels, AAA is a level-3 course, GGG is a preparatory course, and the rest are level-1 courses [36]. Each course has several resources on the VLE used to present the contents of the course, one or more assignments that mark the milestones of the course, and a final exam. In total, the dataset contains records of 32,593 students. There is demographic information, such as gender, region or age band. There is information related to their enrollment in the courses, such as the number of previous attempts or the final mark obtained in the course. There is also information related to their activity during the courses, including interactions with the resources in the VLE, number of clicks, and the assignments submitted during the course.
An overview of the course structure can be seen in Figure 1. Students can register in several courses during a semester. Moreover, courses are repeated in different years (they have different editions). The content of a course is usually available in the VLE a couple of weeks before the official course start. Course assignments are defined as assessments whose purpose is to track the student's evolution. During the presentation of the course, students' knowledge is evaluated by means of assignments which define milestones. Two types of assignments are considered: Tutor Marked Assessment (TMA) and Computer Marked Assessment (CMA). If the student decides to submit an assignment, the VLE collects information about the date of submission and the mark obtained. By contrast, if a student does not submit the assignment, no record is stored. At the end of a presentation, a student enrolled in a certain course takes a final exam and achieves a final mark. This mark can take three different values: pass, distinction or fail. Additionally, if the student does not take this exam, it is considered that he/she did not finish the course and the final mark is set as withdrawn. Table 2 shows a summary of the available information for each course, considering the number of times the course has been offered, the average number of students per course and its standard deviation (over the different times it has been offered), the number of assignments per course, the average number of assignments submitted by students, and the percentage of students that fail or drop out of the course relative to the total enrolled students. Assignments are divided between TMA and CMA types, as commented previously. As we can see, there are significant differences between courses: they have been offered at different times during the considered academic years and the number of enrolled students also differs.
There are also differences in terms of the number and type of assignments, as well as the average number of submissions per student. Figure 2 shows the differences between courses. Figure 2a shows the average number of enrollments in a course versus the average pass rate. Figure 2b shows the number of assignments available per course versus the average number of submitted assignments per student. It can be observed that the number of assignments differs in each course, and there are courses where the average percentage of submitted assignments is approximately 90% (such as course AAA), while in others, such as course DDD, this rate only reaches 40%. However, there is a tendency that seems to indicate that the more assignments are submitted, the more students pass the course. Thus, 71% of students pass course AAA, while only 42% pass course DDD.

Problem Representation Based on Assignment Information
In this study, the prediction of student performance to determine whether he/she will pass a course is focused on the information of submitted assignments. Table 3 shows the specific information provided by OULAD for the assignments submitted by a student in each course: assignment_type is a categorical value specifying the two types of assignments considered (TMA and CMA). Each assignment has a weight (assignment_weight) and a score (assignment_score). Normally, the weighted sum of all assignments in each course is 100. The score is a numerical value in the range from 0 to 100. assignment_advanced considers the number of days between the submission of the assignment by the student and its deadline. This is not a direct attribute in the OULA dataset, but it can be calculated as the difference between the deadline date and the day on which the student submitted the assignment. Finally, assignment_banked indicates whether the assignment has been transferred from a previous enrollment in that course. This study presents the traditional representation based on single instances and proposes a representation based on multiple instance learning to solve the problems of the traditional representation. Since OULAD is provided as several CSV tables, it has been converted to ARFF format [49] using both mentioned representations. This process involved loading the dataset into a MySQL database and slightly restructuring the data schema to ensure that Codd's normal form is maintained and data are not duplicated. Finally, from the database and through automated scripts, the different ARFF files with the considered relationships have been generated. These datasets have been published in open access mode in the web repository associated with this paper (http://www.uco.es/kdis/mildistanceeducation/, accessed on 23 October 2021). Thus, reproducible experimentation is facilitated to allow new advances in the area.
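The derived attribute described above (days between deadline and submission) can be computed directly from the raw CSV tables. The following is only an illustrative sketch, not the study's actual conversion scripts; the column names `id_assessment`, `date` (deadline, in days from course start) and `date_submitted` are assumed from the public OULAD schema:

```python
import csv

def load_deadlines(assessments_csv: str) -> dict:
    """Map id_assessment -> deadline day (relative to course start).
    Rows with no deadline (e.g., final exams) are skipped."""
    with open(assessments_csv, newline="") as f:
        return {row["id_assessment"]: int(row["date"])
                for row in csv.DictReader(f) if row["date"]}

def assignment_advanced(date_submitted: int, deadline: int) -> int:
    """Days the submission was ahead of the deadline (negative if late)."""
    return deadline - date_submitted
```

Joining `studentAssessment` rows with the deadlines table through `id_assessment` then yields the derived attribute for every submitted assignment.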
The following sections define the representations proposed to solve the problem.

Representation Based on Single Instance Learning
As commented, each student can submit a different number of assignments. Actually, assignments are not mandatory to pass the course, although they are recommended to obtain a better understanding of the course. This information should be kept in an appropriate form so that it can influence the prediction of the student's academic achievement, that is, with the aim of predicting whether a student passes (with or without distinction) or does not pass (aggregating failure and dropout) a course.
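The binarization of the class just described (pass and distinction collapse into the positive class; fail and withdrawn into the negative one) can be sketched as follows, assuming OULAD's four final result values:

```python
def binary_label(final_result: str) -> str:
    """Collapse OULAD's four course outcomes into the two classes used here."""
    if final_result in ("Pass", "Distinction"):
        return "pass"
    if final_result in ("Fail", "Withdrawn"):
        return "not_pass"
    raise ValueError(f"Unknown final result: {final_result}")
```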
The traditional supervised learning representation, used in previous studies with this dataset, represents each student enrolled in a course during a semester as a pattern or vector of characteristics. Each pattern keeps the student's activity by means of a fixed number of attributes. According to the information specified in Table 3, each assignment is represented by five attributes. Thus, each student is a pattern composed of 5 × X attributes, where X is the total number of assignments scheduled during that course, plus the final mark (whether the student passes or fails the course). Moreover, the student's participation in a course is specified by means of his/her identification, the course identification and the presentation identification that represents the edition of the course.
An illustrative example of the problem representation can be seen in Figure 3. Here, we can see two students who belong to course AAA, which has 5 assignments. Therefore, 25 attributes (5 × 5 = 25) are necessary to represent information about a student's assignments. One student submitted only two of the five assignments, while the other one submitted all of them. As we can see, in traditional supervised learning both students have the same number of attributes. Thus, if a student does not submit an assignment, the attributes related to this submission have an empty value, but they still have to be present. This representation forces all attributes related to non-submitted assignments to be filled, so there is a potential increase in the computational and storage resources needed for courses with a significant number of assignments. The equivalent representation of this example in ARFF format, to be processed by machine learning algorithms using the Weka framework, can be seen in Table 4. In this case, the 25 attributes must be defined in the header of the file. The instances are defined one per line following the @data label. Each instance represents one student and each attribute is separated by a comma, in the same order as defined in the header. Thus, even if a student does not submit an assignment, the information related to that assignment has to be filled in the instance. Another problem with this approach is that the representation depends on the course. Thus, if the previous example of course AAA is compared with an example of course DDD, shown in Table 5, since course DDD has 13 assignments instead of 5, the dataset would have 65 attributes instead of 25. As we can see, the representation becomes more inefficient for students who submit a low number of assignments. Moreover, working with different courses is limited because the representation is not uniform between courses.
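The fixed-width encoding just described can be sketched as follows. This is an illustrative helper (the function name is hypothetical), assuming Weka's "?" convention for missing ARFF values:

```python
def flat_row(course_assignments, submissions):
    """Build one fixed-width ARFF data row: 5 attribute slots per scheduled
    assignment, whether submitted or not.
    course_assignments: ordered ids of all assignments scheduled in the course.
    submissions: maps assignment id -> (type, weight, advanced, score, banked)."""
    row = []
    for a_id in course_assignments:
        if a_id in submissions:
            row.extend(str(v) for v in submissions[a_id])
        else:
            row.extend(["?"] * 5)  # ARFF missing values for non-submitted work
    return ",".join(row)
```

For a student of course AAA who submitted only the first of two listed assignments, the row carries five real values followed by five "?" placeholders, which illustrates how the sparsity grows with the number of scheduled assignments.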
It depends of the assignments by course. MIL allows a flexible representation that adapts itself to the specific information available for each student according to his/her work in the course. In MIL representation, each pattern is called bag and represents a student enrolled in a course during one semester. Each bag represents the student's activity. Thus, the bag is composed of a variable number of instances being each instance an assignment submitted by the student. Therefore, each bag has as many instances as assignments submitted by the student during a presentation of a course and one class attribute that can take, similarly to traditional representation, two values: the student passes (with distinction or without it) or does not pass the course (aggregating the failure and the dropout). This representation fits the problem perfectly because it can be customized by each student. Thus, the number of attributes in an instance is always the same, while the number of instances in a bag depends on the student's activity. There are five attributes in every instance described in Table 3: type, weight, days between the submission and the deadline, score obtained by the student and a status flag that indicates if the given assignment has been transferred from a previous presentation coursed by the student. The same example presented in traditional supervised learning (see previous Section 4.2.1) is addressed in Figure 4 from a flexible representation based on MIL. There are two students enrolled on AAA course: one of them submits only two assignments and he/she doesn't pass the course while the other one submits all assignments and he/she passes the course. In case of MIL, the data representation is much more efficient: each student is represented as a bag with so many instances as assignments he/she had submitted. As we can see, with this representation there are no empty fields and the representation. 
The representation adapts perfectly to the available information of each student. The corresponding ARFF representation used by machine learning algorithms in Weka can be seen in Table 6. In this case, the attributes of each instance must be defined as part of a relational attribute. Thus, they do not depend on the number of assignments in the course, achieving a uniform representation across courses. For every course there is one relational attribute (with five instance attributes), independently of the number of assignments in the course. Thus, the ARFF representation for course DDD would use the same number of attributes as course AAA. Each student is represented by one bag with all its instances enclosed in double quotes and separated by the character "\n", each one representing a submitted assignment. Table 6. Fragment of ARFF header for multiple instance representation in any course.

Experimentation and Results
The goal of the experimental study is to investigate the potential of assignments to predict whether or not a student will pass a course. As discussed in the related work, previous studies on this problem with the OULAD dataset mainly evaluate student interactions with resources in the VLE to determine success in the course. In contrast, this paper explores the potential of assignments to determine the level of engagement of students in a particular course. To validate this hypothesis, the performance of the same algorithms is analyzed using the same information, represented in one case with a traditional approach and in the other with a MIL approach. The flow of the experimental study is divided into five steps: first, the configuration of the algorithms used to predict student performance is presented in Section 5.1. Secondly, Section 5.2 presents and configures two procedures that allow algorithms to work with MIL problems. Then, Section 5.3 defines the evaluation metrics as well as their meaning from a classification perspective and from an educational perspective. Next, Section 5.4 addresses the results, contextualizing them in two comparative studies: one regarding representations and one regarding previous works. Finally, Section 5.5 discusses the obtained results.

Configuration of Classification Algorithms
The experimentation of this study is designed to offer a fair comparison between MIL and the traditional single-instance paradigm, evaluating the same metrics on the same problem with the same information and the same wide set of algorithms, which are part of the state of the art in supervised learning. Twenty-three algorithms have been selected, considering the main paradigms of machine learning and the most popular methods for predicting student performance (see Section 2.2).
The experimentation has been developed using Weka [49], a framework for machine learning in Java. To ensure a solid evaluation, each experiment is executed with a 10-fold cross-validation. In addition, stochastic algorithms are executed 5 times with different seeds, giving a total of 50 executions per algorithm and course. The datasets in ARFF format, ready to be used in Weka, have been published in open access mode in the web repository associated with this paper (http://www.uco.es/kdis/mildistanceeducation/, accessed on 23 October 2021). In order to easily reproduce the experimentation, this section presents the studied algorithms as well as their configurations. Since the purpose of this study is to compare types of learning under equal conditions, the configuration of the predictive algorithms should not favor one representation paradigm over the other. Thus, these configurations have been chosen based on the default settings specified by the authors of the Weka workbench [49], where more information can be consulted.

Configuration of Wrappers for MIL
With respect to the MIL representation, the use of two different wrappers available in Weka [49] is proposed to adapt the MIL problem to a single-instance, or traditional, learning problem. Once the problem is transformed, the same algorithms used with the single-instance representation (presented in Section 5.1) can be used with the MIL representation. This makes the comparison fairer, because the same algorithms and configurations are used. The proposed MIL wrappers are the following:
• SimpleMI [67]: this wrapper summarizes all the instances of a bag in order to build a unique instance that can be processed by a single-instance algorithm.
• MIWrapper [68]: this wrapper assumes that all instances contribute equally and independently to the bag's label. Thus, the method breaks the bag up into its individual instances, labeling each one with the bag label and assigning weights proportional to the number of instances in the bag. At evaluation time, the final class of the bag is derived from the classes assigned to its instances.
In the case of SimpleMI, there are two possible configurations to summarize the instances of a bag into a single instance:
• Configuration 1: computing the arithmetic mean of each attribute over all instances of the bag and using it in the summarized instance.
• Configuration 2: computing the geometric mean of each attribute over all instances of the bag and using it in the summarized instance.
In the case of MIWrapper, there are three configurations to compute the final class of the bag from the classes assigned to its instances at evaluation time. This study evaluates the accuracy of the different configurations to predict whether a student will pass or fail the course. The experimentation consists of a 10-fold stratified cross-validation for every combination of wrapper configuration, algorithm and course. The complete results of this experimentation can be downloaded from the web repository associated with this work (http://www.uco.es/kdis/mildistanceeducation/, accessed on 23 October 2021).
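The core idea of each wrapper can be sketched for numeric attributes as follows. This is a simplified illustration, not Weka's implementation: the real SimpleMI and MIWrapper filters also handle nominal attributes and class probability distributions, and the per-instance weight used here (one divided by the bag size, so every bag contributes equally) is one plausible choice, stated as an assumption.

```python
# Sketch of the two MIL wrapper ideas on numeric attributes.

def simple_mi(bag):
    """SimpleMI, configuration 1: summarize a bag into one instance by
    taking the arithmetic mean of each attribute over all instances."""
    n = len(bag)
    return [sum(inst[i] for inst in bag) / n for i in range(len(bag[0]))]

def mi_wrapper_split(bag, label):
    """MIWrapper idea: propagate the bag label to every instance and
    attach a weight (assumed here: 1/len(bag)) so each bag contributes
    equally to training, whatever its size."""
    w = 1.0 / len(bag)
    return [(inst, label, w) for inst in bag]

bag = [[10, 3, 78], [20, 1, 65]]         # two submitted assignments
print(simple_mi(bag))                     # [15.0, 2.0, 71.5]
print(mi_wrapper_split(bag, "Fail")[0])   # ([10, 3, 78], 'Fail', 0.5)
```

After either transformation, any of the 23 single-instance algorithms of Section 5.1 can be trained unchanged, which is what makes the comparison between paradigms fair.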
With the average accuracy of the cross-validation, a statistical analysis is carried out in order to find significant differences between the configurations of each MIL wrapper. Specifically, the non-parametric Wilcoxon signed-rank test [69] is used to carry out a pairwise statistical procedure between every pair of configurations. In each comparison the test is applied and an independent p-value is obtained, showing whether algorithms obtain significantly better accuracy values with a specific configuration. Table 9 shows the R+, R− and p-values for all the pairwise comparisons carried out. For both wrappers, and considering a confidence level of 99%, configuration 1 obtains significantly higher accuracy values than the others. Thus, for SimpleMI it is more convenient to summarize the bag with the arithmetic mean, and in the case of MIWrapper it is also better to use the arithmetic mean to combine the class probabilities of the instances into the final bag class.
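The R+ and R− statistics reported in Table 9 can be computed with a short sketch: rank the absolute paired differences (discarding zeros, averaging ranks over ties) and sum the ranks of the positive and negative differences separately. The accuracy values below are illustrative, and the p-value lookup step of the full test is omitted.

```python
# Minimal sketch of the Wilcoxon signed-rank statistics R+ and R-.

def wilcoxon_ranks(a, b):
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    ordered = sorted(diffs, key=abs)
    # assign average ranks to tied absolute differences
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        avg = (i + 1 + j) / 2.0          # mean of rank positions i+1 .. j
        for k in range(i, j):
            ranks.setdefault(abs(ordered[k]), avg)
        i = j
    r_plus = sum(ranks[abs(d)] for d in diffs if d > 0)
    r_minus = sum(ranks[abs(d)] for d in diffs if d < 0)
    return r_plus, r_minus

# accuracies of one algorithm under two configurations (illustrative):
conf1 = [0.91, 0.88, 0.93, 0.90, 0.87]
conf2 = [0.85, 0.89, 0.90, 0.86, 0.87]
print(wilcoxon_ranks(conf1, conf2))      # -> (9.0, 1.0)
```

A large imbalance between R+ and R− (here 9.0 versus 1.0) is what, once converted to a p-value, signals that one configuration is significantly better.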

Evaluation Metrics
The metrics used for evaluation are some of the most common ones in the field of classification. In this context, the classical concepts of binary classification are re-defined for our specific problem of succeeding in a course (passing it with or without distinction) or not (failure or dropout), as follows:
• tp is the number of students correctly identified as passing the course.
• tn is the number of students correctly identified as not passing the course.
• fp is the number of students incorrectly identified as passing the course (it is predicted that they pass the course, but they really do not pass).
• fn is the number of students incorrectly identified as not passing the course (it is predicted that they do not pass the course, but they really pass).
Given the nature of the problem, in this context it is especially interesting to focus on students who are likely to fail. Thus, the metrics studied are [70]:
• Accuracy is the proportion of correctly classified students, i.e., those correctly identified as passing or not passing the course: (tp + tn)/(tp + tn + fp + fn).
• Sensitivity is the proportion of students who pass the course that are correctly classified: tp/(tp + fn).
• Specificity is the proportion of students who do not pass the course that are correctly classified: tn/(tn + fp).
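The three metrics can be written out directly from the tp/tn/fp/fn counts. The confusion counts in the example are illustrative, not taken from the paper's results.

```python
# Standard binary-classification metrics from the confusion counts.

def accuracy(tp, tn, fp, fn):
    """Proportion of all students classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Proportion of passing students identified correctly."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of non-passing students identified correctly."""
    return tn / (tn + fp)

tp, tn, fp, fn = 70, 20, 5, 5             # illustrative counts
print(accuracy(tp, tn, fp, fn))           # 0.9
print(sensitivity(tp, fn))                # 70/75, about 0.933
print(specificity(tn, fp))                # 0.8
```

Reporting sensitivity and specificity together matters here: a model can reach high accuracy while still missing most at-risk (non-passing) students if the classes are imbalanced.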

Comparative Study
This section presents the experimental results of comparing the multiple-instance and single-instance representations for predicting student performance using only assignment activity. First, the performance of 23 machine learning algorithms is evaluated using both representations. Statistical tests are used to determine whether there are significant differences in performance between the representations. Then, the best results achieved in this study are compared with the results of the previous works shown in Section 3 that also predict the success of students on the same public dataset but use other student information available in OULAD.

Comparative Analysis between Different Representations
This section compares the performance of a wide set of algorithms on the problem of predicting student success in a distance course using the same student information with different representations: the traditional representation (based on single-instance learning) and the flexible representation (based on MIL). For solving the problem using the flexible representation, as commented in Section 5.2, two different methods (MIWrapper and SimpleMI) that transform the MIL problem are used.
The experimental study carries out a 10-fold stratified cross-validation. The full results of this experimentation can be downloaded from the web repository associated with this work (http://www.uco.es/kdis/mildistanceeducation/, accessed on 23 October 2021). Tables 10 and 11 show the average accuracy results for each course. Thus, for each representation and algorithm, the average accuracy results of each course in OULAD are presented. It can be observed that SimpleMI (using the flexible representation) obtains the highest accuracy for most algorithms in the different courses. With an accuracy between 85% and 95% for all algorithms, SimpleMI robustly outperforms the traditional representation. MIWrapper (also using the flexible representation) achieves results similar to SimpleMI and obtains better results than the traditional representation for most algorithms, although its values are somewhat lower. This affects the overall accuracy of MIWrapper (around 80%), which is lower than that of SimpleMI. Algorithms that use the traditional representation have a more variable performance; in general, this representation obtains lower accuracy (around 65%). In this case, more complex algorithms such as the multi-layer perceptron, LibSVM or Ridor are needed to reach results comparable to SimpleMI. This is a disadvantage in terms of interpretability, since these methods do not provide information about which attributes are most relevant for obtaining representative information to help students. In this line, methods based on rules or trees improve their results using the SimpleMI representation while maintaining interpretable results. Concretely, they obtain on average more than 20% higher accuracy, using the same data but with a more suitable representation that fits the problem better.
For a more detailed analysis, Table 12 shows the average results for accuracy, sensitivity and specificity, considering the average results over the seven courses for the different algorithms. The results are grouped by representation: traditional representation and flexible representation (SimpleMI and MIWrapper). A full report of the results can be seen at the web repository associated with the article (http://www.uco.es/kdis/mildistanceeducation/, accessed on 23 October 2021). These data help to see in more detail tendencies such as the superiority of SimpleMI, which achieves the best accuracy results in all courses. Thus, the general tendency is that the flexible representation (using SimpleMI) improves the algorithms' performance, obtaining better accuracy values than the traditional representation. In addition, this table shows in depth the differences in performance between methods with the different representations. On the one hand, the traditional representation causes algorithms to obtain better values for specificity (predicting students that do not pass the course) at the expense of obtaining worse values for sensitivity (predicting students that pass the course). On the other hand, the flexible representation using MIWrapper leads algorithms to obtain better values for sensitivity at the expense of obtaining worse values for specificity. This imbalance between the two measures translates into worse overall predictions. Again, the flexible representation using SimpleMI obtains the most balanced results for both sensitivity and specificity; for each of them it gets the best value or a very close one, thereby achieving the best accuracy.
To analyze the final results and show whether there are significant differences between the behavior of algorithms using the different representations, the Wilcoxon signed-rank test [69] is applied. A pairwise comparison is carried out between the single-instance (traditional) representation and the two MIL-based representations (SimpleMI, MIWrapper). Table 13 shows the results of the tests for the accuracy measure, reporting the R+, R− and p-values. With a confidence level of 99%, SimpleMI shows an improvement over the other representations. With a confidence level of 95%, MIWrapper shows an improvement over the traditional representation.
Analyzing the differences in sensitivity, Table 14 shows a similar tendency: both flexible representations significantly outperform the traditional representation with a confidence level of 99%. However, regarding the specificity results in Table 15, it can be appreciated that MIWrapper has difficulty distinguishing the negative class, which leads to poor performance compared to SimpleMI and the traditional representation. SimpleMI does not have this problem, reaching the best results in this metric too. The main conclusion extracted from this experimentation is the importance of an appropriate problem representation. It can be seen that assignments represented with single-instance learning obtain lower results. This may explain why assignments are not widely included as influencing factors in previous studies of predicting students' performance. Using the same type of information and the same learning algorithms, but with a representation based on MIL, algorithms can predict student success in distance education with greater accuracy. Thus, the flexible representation can obtain differences of more than 20% in performance compared with the traditional representation.

Comparative Analysis with Previous Works
Based on the previous studies shown in Section 3 (Table 1), this work deviates from the general trend marked by the use of clickstreams (i.e., student interactions with the VLE). As has been shown, a limited number of studies use information about assignments as an influential factor for prediction.
However, as shown in the previous comparative study, assignments can obtain equal or better results if they are processed with an appropriate representation. Thus, in this section a comparison of the accuracy in predicting student performance is carried out between the best MIL method according to the previous section, SimpleMI, and the related work. Table 16 shows these differences, with a special focus on previous works that have used the same algorithms but from a traditional learning perspective. Thus, the algorithm and the data from OULAD used in each previous work are shown, compared to the use of MIL. For example, we can focus on the only previous work that uses solely assignment data [40]. This work is limited to courses CCC and FFF and obtains an average accuracy of 83% using decision trees. In our work, for these courses and algorithms, the best results reached show an average accuracy of 92.1%. Among previous works focused on predicting student success or failure in a course but not based on assignment data, the best results are achieved by [30]. This work applies J48 over VLE activity data, obtaining an average accuracy of 90% over all the courses. In our case, using only assignment data with the same algorithm, the best result achieved for each course gives an average accuracy of 92.7%. Given the superiority in performance, it is also worth commenting on the advantage of using MIL in terms of the interpretability of results. In the context of predicting student performance, it is very important that the models are not a black box, so that they can be interpreted by mentors and tutors in order to correct in time trends that may lead to students failing or dropping out of the course. With MIL, a student's assignment information takes up to four times less space than with the single-instance representation.
This helps to create models more quickly and reduces redundancy in the information, which makes the results easier for a human to read. This, together with the absence of any need for deep learning or black-box algorithms to obtain results above 90% accuracy, means that the interpretability of MIL models remains high; they can be used in real-world tools to identify potential problems in distance learning courses with large numbers of failures, as well as to identify specific students at risk of failing.
On the other hand, the approach of this work also has limitations that should be taken into account. The authors identify two main problems that can be addressed in future works. First, this work has been carried out by analyzing the dropout and failure profiles together. Although both profiles correspond to students who do not pass the course, this may be due to different causes, so it is worth considering a separate analysis to identify each type of at-risk student. Second, the study considers all the activities of the course, i.e., an advanced point of the course must have been reached to have complete information on the submitted assignments. This limits the possibilities of action to prevent an at-risk student from failing the course. It would be desirable to adapt the approach to use only the information from the assignments up to a certain point in the course, in order to have enough time before the final exam to adequately guide the at-risk student and avoid failure.
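The early-prediction variant suggested in the second limitation could be sketched as a simple filter over the bags: keep only the assignments submitted up to a cutoff day, so the model can be applied before the final exam. This is a hypothetical sketch; the record fields (`submission_day`, `score`) and the helper name are invented for illustration and do not come from the paper.

```python
# Sketch: truncate each student's bag at a cutoff day of the course,
# simulating what the model would see at that point in time.

def truncate_bags(bags, cutoff_day):
    """Drop instances submitted after the cutoff; a student with no
    remaining submissions keeps an empty bag."""
    return [{**bag,
             "instances": [inst for inst in bag["instances"]
                           if inst["submission_day"] <= cutoff_day]}
            for bag in bags]

bags = [{"id": "s1",
         "instances": [{"submission_day": 30, "score": 80},
                       {"submission_day": 120, "score": 60}]}]
early = truncate_bags(bags, cutoff_day=100)
print(len(early[0]["instances"]))  # 1: only the day-30 submission remains
```

Because MIL bags already have a variable number of instances, this truncation needs no re-encoding of the feature space, unlike the fixed-width traditional representation.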

Conclusions and Future Work
This paper shows the impact of assignment information on predicting academic achievement. Online courses are characterized by a high number of enrolled students with, in general terms, low participation and engagement. Moreover, assignments depend on each course: each course has its own curriculum, scheduling and evaluation approach, and therefore offers a different number and type of assignments. This has led to assignments being ignored as a criterion to predict student performance, as the very limited number of works that study them proves. The main problem is that the traditional representation produces a very complex representation that machine learning algorithms cannot properly process.
This work shows that information about assignments can be very valuable for predicting student performance when it is appropriately represented. The comparative study has employed a public dataset in learning analytics, OULAD. This dataset allows working with a large amount of data and comparing the proposed study and its results within an existing common framework. Starting from this dataset, the appropriate transformations have been applied to use MIL as the learning paradigm, generating the ARFF files necessary to train the predictive models. These files are publicly available in the web repository associated with the article. Experimental results over a wide set of 23 machine learning algorithms and 7 courses show that, in general, using assignments in a flexible representation improves the accuracy with respect to using the same information in a traditional representation, achieving an important balance between the sensitivity and specificity measures. Statistical tests confirm these results, showing significant differences in every studied metric between the multiple-instance and single-instance representations. Finally, a comparison is carried out with previous studies that also use OULAD for predicting student performance from other factors, such as demographic information and student interactions with the VLE, showing the relevance of assignments as a very influential factor in determining student success or failure.
The great variety of information gathered in OULAD, together with the promising results obtained, opens the door to continuing this line of research. Thus, other sources of information to predict student success may be tested, such as click activity in the VLE, the number of times a student has taken a course, or demographic data. In addition, we propose extending the study to algorithms specific to the MIL paradigm, as well as exploring the different MI assumptions existing in the literature.