Article

Comparative Analysis of Self-Labeled Algorithms for Predicting MOOC Dropout: A Case Study

by George Raftopoulos 1, Georgios Kostopoulos 1,2,*, Gregory Davrazos 1, Theodor Panagiotakopoulos 3, Sotiris Kotsiantis 1 and Achilles Kameas 4

1 Department of Mathematics, University of Patras, 26504 Patras, Greece
2 School of Social Sciences, Hellenic Open University, 26331 Patras, Greece
3 Department of Management Science and Technology, University of Patras, 26334 Patras, Greece
4 School of Science and Technology, Hellenic Open University, 26335 Patras, Greece
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(22), 12025; https://doi.org/10.3390/app152212025
Submission received: 22 October 2025 / Revised: 7 November 2025 / Accepted: 10 November 2025 / Published: 12 November 2025

Abstract

Massive Open Online Courses (MOOCs) have expanded global access to education but continue to struggle with high attrition rates. This study presents a comparative analysis of self-labeled Semi-Supervised Learning (SSL) algorithms for predicting learner dropout. Unlike traditional supervised models that rely solely on labeled data, self-labeled methods iteratively exploit both labeled and unlabeled instances, alleviating the scarcity of annotations in large-scale educational datasets. Using real-world MOOC data, ten self-labeled algorithms, including self-training, co-training and tri-training variants, were evaluated across multiple labeled ratios. The experimental results show that ensemble-based methods, such as Co-training Random Forest, Co-Training by Committee and Relevant Random subspace co-training, achieve predictive accuracy comparable to that of fully supervised baselines, even with as little as 4% labeled data. Beyond predictive performance, the findings highlight the scalability and cost-effectiveness of self-labeled SSL as a data-driven approach for enhancing learner retention in massive online learning.

1. Introduction

Massive Open Online Courses (MOOCs) have profoundly transformed the global educational landscape by enabling large-scale, open access learning that transcends geographical, socioeconomic and institutional boundaries. Since their emergence in the early 2010s, MOOCs have been heralded as a democratizing force in higher education, fostering lifelong learning and providing equitable access to high-quality educational resources [1]. Despite this transformative potential, MOOCs continue to grapple with a persistent and critical challenge: exceptionally high dropout and attrition rates [2].
Empirical studies consistently report that fewer than 10% of registered learners complete their courses. This phenomenon raises critical questions about learner motivation, engagement, instructional design, and overall pedagogical effectiveness. High attrition not only undermines the pedagogical and social promise of MOOCs but also limits their potential contribution to educational equity and capacity building at scale [3].
Addressing this challenge is therefore imperative. Improving course completion rates should not be viewed merely as a metric of institutional success but as a necessary condition for realizing the broader societal mission of MOOCs to promote inclusive, sustainable and effective learning ecosystems [4]. A deep understanding of learner dropout behavior, informed by behavioral analytics, learning design research and educational psychology, is crucial for designing targeted interventions that enhance learner engagement and persistence.
Recent advances in Educational Data Mining (EDM) and Learning Analytics (LA) have facilitated predictive modeling of learner behavior, enabling early identification of at-risk participants and the development of adaptive support mechanisms [5]. Such data-driven insights, when integrated with pedagogically grounded strategies, such as fostering social presence, promoting self-regulated learning and designing interactive learning pathways, can significantly mitigate dropout tendencies and strengthen learner retention [6]. Consequently, understanding, predicting and preventing learner disengagement is central not only to the future success of MOOCs but also to the ongoing evolution of open and digital education worldwide.
Conventional Machine Learning (ML) approaches to MOOC dropout prediction have predominantly relied on supervised learning techniques, which depend on large volumes of accurately labeled data to build robust and generalizable predictive models [7]. However, in large-scale online learning environments, obtaining such labeled datasets is often costly, time-consuming and prone to bias, as it typically involves extensive manual annotation, post-course surveys, or retrospective analysis of learner behavior logs [8]. The limited availability of labeled samples constrains model performance and scalability, particularly when dealing with the heterogeneous and dynamic nature of learner engagement data in MOOCs.
To mitigate these challenges, Semi-Supervised Learning (SSL) approaches have emerged recently as promising alternatives. SSL methods exploit the abundance of unlabeled interaction data readily available in MOOCs, integrating it with a smaller labeled subset to iteratively refine model predictions. By leveraging both labeled and unlabeled data, these methods enhance predictive accuracy while substantially reducing the dependence on manual labeling efforts [9]. Consequently, semi-supervised approaches offer a scalable and cost-effective pathway for improving dropout prediction, enabling more timely and adaptive interventions to support learner retention in open online learning environments [10].
This study presents a comprehensive comparative analysis of self-labeled SSL methods for MOOC dropout prediction. A wide range of self-labeled algorithms, such as self-training, co-training and tri-training, are systematically evaluated on a real-world MOOC dataset to assess their predictive performance under varying proportions of labeled data. The goal of this investigation is twofold: first, to determine the capacity of SSL methods to maintain high predictive accuracy in scenarios with limited labeled information, and second, to explore their potential for early identification of at-risk learners. These findings show that SSL techniques provide a promising framework for timely dropout prediction, enabling MOOC providers to derive data-driven, actionable insights that can inform targeted pedagogical interventions, enhance learner engagement and ultimately improve overall course retention and effectiveness [7,11].
The main contributions of this paper can be summarized as follows:
  • We present a systematic evaluation of semi-supervised learning methods, comparing Self-Training, Co-Forest and CoBC across multiple performance metrics;
  • We provide an empirical analysis of how ensemble-based methods improve classification accuracy under varying levels of labeled data availability;
  • We discuss the trade-off between performance gains and computational efficiency, offering practical insights for practitioners operating under resource constraints.
The remainder of this paper is organized as follows: Section 2 provides an overview of recent advancements in the application of SSL methods within EDM and LA, with a particular focus on their utilization in MOOC-related research. Section 3 describes the semi-supervised methodologies adopted, including their algorithmic foundations and implementation details. Section 4 introduces the dataset employed in this study, detailing its structure, key features and relevance to the dropout prediction task. Section 5 and Section 6 report the experimental outcomes and offer a comprehensive analysis of the predictive performance and comparative behavior of the evaluated algorithms. Finally, Section 7 and Section 8 discuss the broader implications of the findings for the MOOC domain, highlight methodological insights and outline promising directions for future research.
Building on these motivations, recent research has investigated the application of SSL for predicting MOOC dropout under limited label availability. The following section reviews key studies that have applied SSL and related methods in the educational field.

2. Related Work

Predicting student dropout in MOOCs and distance learning environments has become a central issue in the fields of EDM and LA. Despite the increasing popularity of MOOCs, dropout rates remain persistently high, and research applying SSL methods to this problem is still relatively scarce. Most existing studies rely on fully supervised techniques, which require large volumes of labeled data, a costly and time-consuming requirement in educational contexts. Consequently, there is a growing need for comparative and exploratory studies that assess the effectiveness of SSL algorithms in exploiting the abundant unlabeled data available in MOOCs.
The study by Kostopoulos et al. [12] represents one of the first attempts to employ SSL for student dropout prediction in the educational field. To this end, the authors explored the application of several semi-supervised classification methods for predicting dropout in a distance learning course. Although the study does not focus on MOOCs specifically, it provides empirical evidence that SSL algorithms can enhance classification accuracy when labeled examples are limited. Their experiments confirm that incorporating unlabeled instances into the training process enables models to generalize better, highlighting the value of SSL in real-world educational settings where annotation is often incomplete.
Building upon the previous study, Li et al. [13] introduced a multi-view SSL framework to address the challenge of predicting student attrition in MOOCs. Using behavioral data from the KDD Cup 2015 competition, the authors integrated heterogeneous feature sets, such as forum participation, assessment performance, and video engagement, to model diverse aspects of learner behavior. By employing a co-training strategy that exploits complementary information across multiple feature spaces, their approach achieved superior robustness and predictive accuracy compared to single-view baselines and traditional supervised models. This work provided one of the first empirical demonstrations that the multifaceted behavioral dynamics of MOOC learners can be effectively captured through a multi-view SSL paradigm.
Another notable contribution is that of Goel and Goyal [10]. The authors conducted an extensive evaluation of the self-training algorithm, one of the most fundamental SSL techniques, using a real-world MOOC dataset. Their findings indicate that self-training can achieve robust predictive performance even with a small ratio of labeled data, achieving F1-scores above 94%, thus highlighting its practical applicability for large-scale educational platforms where annotation resources are constrained.
Zhou et al. [14] introduced the M-ISFCM method, a hybrid semi-supervised clustering and classification algorithm for anomaly detection in MOOC learning behaviors. The proposed algorithm combined fuzzy clustering with incremental learning to identify atypical interaction patterns that often precede dropout events. This research highlighted the applicability of semi-supervised anomaly detection as a proactive diagnostic tool in large-scale online education such as MOOCs.
More recently, Cam et al. [15] introduced a hybrid deep learning model integrating Recurrent Neural Networks (RNNs) with a semi-supervised Support Vector Machine (S3VM) for early identification of students at risk of dropping out. The hybrid architecture captured sequential behavioral dependencies while leveraging unlabeled data to enhance model generalization. Experimental results showed superior performance (over 92% accuracy) compared to standard supervised baselines, underscoring the potential of deep semi-supervised architectures for early intervention systems.
Beyond these studies, a few additional works have explored related directions using primarily supervised learning. For example, Psathas et al. [16] applied traditional supervised models combined with data balancing strategies to predict MOOC dropout in the context of self-regulated learning, while Chi et al. [17] employed Random Forest and k-Nearest Neighbors (k-NN) algorithms on large HarvardX datasets to analyze learners’ dropout behavior. Although these approaches rely on fully labeled data, they underscore the ongoing interest in applying machine learning for dropout prediction.
Overall, these studies collectively demonstrate that while supervised learning has dominated MOOC dropout research, semi-supervised methods remain underexplored yet highly promising, particularly when labeled data are limited or costly to obtain. SSL techniques have shown encouraging results in capturing behavioral complexity, improving generalization, and identifying at-risk learners earlier. Moreover, SSL can enhance model generalizability to an appreciable extent while minimizing the reliance on manual annotation, which often proves to be expensive and time-consuming in learning settings. For a comprehensive overview of research trends, techniques, and challenges in this field, readers may refer to the systematic review by Chen et al. [7].
Building on these insights, the next section outlines the main principles of SSL and the specific self-labeled algorithms evaluated in this study.

3. Semi-Supervised Learning

SSL represents a class of ML approaches that exploit both labeled and unlabeled data to improve model performance. In many real-world scenarios, labeled data are scarce and/or costly to obtain, whereas unlabeled data are abundant. SSL methods bridge this gap by integrating information from both data sources, often assuming that unlabeled instances share underlying structural similarities with labeled ones [18]. In general, SSL algorithms can be broadly classified into four main categories: graph-based approaches, self-labeled methods, generative models and maximum-margin techniques. Despite their methodological differences, these paradigms share a common goal: to leverage the intrinsic geometry and probabilistic distribution of both labeled and unlabeled data in order to enhance model generalization and decision boundary refinement [9]. These methods have proven effective in various domains, including natural language processing, computer vision and EDM, particularly when annotation costs are high or expert labels are limited [19,20].
Within the broader SSL paradigm, self-labeled methods constitute a simple yet powerful group of semi-supervised classification techniques. In these methods, a model is first trained on a small set of labeled samples and then used to predict labels for the remaining unlabeled data. The most confident predictions are treated as pseudo-labels, which are iteratively added to the labeled dataset to retrain and refine the model [21]. Through this process, the model effectively “teaches itself”, leveraging its own predictions to enhance learning. Self-labeled techniques have been shown to perform competitively with fully supervised methods, particularly when the initial labeled set is representative and model confidence thresholds are properly managed [22]. In the context of MOOC dropout prediction, where labeled instances (e.g., confirmed dropouts) are often limited and highly imbalanced, self-labeled methods offer a pragmatic and efficient way to harness the large volumes of unlabeled behavioral and activity data generated by learners. Consequently, these methods are well suited to EDM applications, supporting scalable predictive modeling and early intervention strategies.
For the purposes of this study, a comprehensive set of widely recognized self-labeled SSL methods was employed to explore their effectiveness in the context of MOOC dropout prediction. In total, ten (10) distinct algorithms were applied, encompassing both single-view and multi-view approaches, as well as single-classifier and multi-classifier architectures. These methods, which collectively represent a diverse spectrum of self-labeled paradigms, are presented and discussed in detail below.
  • Self-training [23] is one of the earliest and most fundamental SSL techniques and often serves as a baseline for evaluating other self-labeled approaches. The method begins by training a base classifier on a small set of labeled examples (L). The trained model then predicts labels for the unlabeled instances (U), and the most confidently classified samples are added to L for retraining. This iterative process continues until a stopping criterion is met (a minimal code sketch of this loop is given after this list). Self-training has been successfully applied in diverse domains, including text classification, bioinformatics and EDM, due to its conceptual simplicity and ability to leverage abundant unlabeled data. However, a key limitation of this approach lies in the potential propagation of labeling errors, which can degrade model performance over iterations [24].
  • SetRed (Self-training with editing) enhances the robustness of self-training by introducing a noise-filtering mechanism that mitigates the risk of error propagation [25]. After pseudo-labels are assigned, SetRed applies an editing phase, commonly based on the k-NN algorithm, to evaluate the local consistency of pseudo-labeled samples. Instances whose labels disagree with the majority of their neighbors are removed before the next training iteration. By filtering out unreliable pseudo-labels, SetRed produces cleaner training sets and more stable classifiers, which is particularly useful in noisy or imbalanced data contexts such as EDM, where dropout cases or minority outcomes can distort training.
  • Co-training [26] is a seminal disagreement-based SSL method that assumes that each instance can be described by two distinct and conditionally independent feature subsets (views). Two classifiers are trained independently on each view using the labeled data. Each classifier then predicts labels for the unlabeled set and the most confident predictions from one classifier are used to augment the labeled set of the other. This mutual exchange of high-confidence examples continues iteratively until a stopping criterion is reached. Co-training has inspired numerous extensions and remains a foundational paradigm for multi-view learning and ensemble-based SSL [27].
  • Co-Training by Committee (CoBC) [28] extends the co-training concept by combining it with ensemble diversity. A committee of classifiers, each trained on randomly sampled labeled subsets of the feature space, iteratively collaborates to label unlabeled instances. Only pseudo-labels with high confidence and consensus among the committee members are added to the labeled dataset, reducing the risk of reinforcing incorrect predictions. This ensemble-based mechanism promotes robustness and minimizes noise propagation, making CoBC suitable for tasks involving complex, high-dimensional data representations.
  • Democratic Co-Learning, introduced by Zhou and Goldman [29], is a semi-supervised ensemble learning framework that generalizes the co-training paradigm by employing multiple diverse classifiers that collaboratively label unlabeled data through a majority-voting mechanism. Unlike traditional co-training, which typically relies on two views or classifiers, Democratic Co-Learning constructs a committee of classifiers, each potentially using different learning algorithms or feature subsets, that iteratively exchange pseudo-labels based on collective confidence rather than pairwise agreement. During each iteration, unlabeled instances receiving consistent predictions from the majority of the committee are incorporated into the labeled set, reinforcing reliable patterns while mitigating the influence of noisy or biased individual models. This democratic voting strategy enhances robustness, reduces overfitting to erroneous pseudo-labels and performs effectively even when the assumption of conditional independence between feature views does not hold.
  • Rasco (Random subspace co-training) [30] and Rel-Rasco (Relevant Random subspace co-training) [31] are multi-view co-training variants that exploit random subspace and feature selection strategies to create diverse learning views. In Rasco, multiple classifiers are trained on randomly generated feature subspaces, each representing a distinct view of the data. Their predictions on unlabeled instances are aggregated to determine consensus labels, facilitating complementary learning across subspaces. Rel-Rasco enhances this approach by constructing subspaces based on feature relevance scores, typically computed via mutual information between features and class labels. This systematic subspace generation increases interpretability and improves the reliability of pseudo-labels, particularly in structured and high-dimensional datasets.
  • Co-Forest (Co-training Random Forest) [32] integrates the co-training framework into the ensemble learning paradigm of Random Forests [33]. Multiple decision trees are trained on different labeled subsets and, during each iteration, each tree labels a portion of the unlabeled data. High-confidence predictions are shared among trees, effectively expanding their training sets. This collaborative labeling process enhances model diversity and generalization, preserving the strengths of Random Forests (robustness and low variance) while leveraging unlabeled data to further improve classification performance.
  • Tri-training [34] is a multiple-classifier SSL approach that eliminates the need for explicit multi-view assumptions. It employs three classifiers trained on the labeled data and iteratively refines them using unlabeled examples. When two classifiers agree on the label of an unlabeled instance, that instance is pseudo-labeled and added to the training set of the third classifier. This majority-voting mechanism reduces labeling errors and encourages consensus-driven learning. Tri-training is computationally efficient and has demonstrated competitive performance in settings where data are noisy, imbalanced, or limited, conditions frequently encountered in EDM and dropout prediction tasks.
  • Tri-Training with Data Editing (DE-Tri-training), proposed by Deng and Guo [35], extends the classical tri-training algorithm by introducing an additional data-cleaning mechanism to improve pseudo-label reliability and model stability. Similar to tri-training, the method employs three classifiers that iteratively label unlabeled data through a consensus-based strategy, where two agreeing classifiers assign pseudo-labels for retraining the third. The key innovation lies in its editing phase, during which potentially mislabeled or noisy instances are detected and removed using neighborhood-based criteria, such as k-NN consistency checks. This selective filtering prevents the accumulation of erroneous pseudo-labels that can degrade performance in later iterations.
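To make the self-labeled paradigm concrete, the following is a minimal sketch of the generic self-training loop using scikit-learn's SelfTrainingClassifier on synthetic data. It is an illustration only, assuming a 4% labeled ratio and a Random Forest base learner, and does not reproduce the exact configurations reported in Table 1.

```python
# Minimal self-training sketch using scikit-learn's SelfTrainingClassifier.
# Unlabeled instances are marked with the label -1, following the scikit-learn
# convention; the data, base classifier and threshold are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Simulate a 4% labeled ratio: hide the labels of ~96% of the instances.
rng = np.random.default_rng(42)
y_partial = y.copy()
unlabeled_mask = rng.random(len(y)) > 0.04
y_partial[unlabeled_mask] = -1

base = RandomForestClassifier(n_estimators=100, random_state=42)
self_training = SelfTrainingClassifier(base, threshold=0.9, max_iter=20)
self_training.fit(X, y_partial)

print(f"Pseudo-labeling terminated after {self_training.n_iter_} iterations")
```

In this generic loop, only predictions whose confidence exceeds the threshold are promoted to pseudo-labels at each round, mirroring the iterative enlargement of L described above.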

4. Dataset Description

The MOOC was developed within the framework of the Erasmus+ project “DevOps Competences for Smart Cities” (https://smartdevops.eu/dev, accessed on 10 October 2025). The overarching objective of the project was to enhance the professional capacity of current and prospective employees in municipalities and regional authorities by equipping them with the competences required to support and sustain emerging smart city initiatives. The DevOps approach, which emphasizes continuous integration, automation and collaboration between development and operations teams, provides a methodological foundation for improving the efficiency and responsiveness of smart city digital infrastructures [36].
The DevOps MOOC (https://smartdevops.eu/dev/about-the-mooc/, accessed on 10 October 2025) was designed and delivered as part of the project to support large-scale, open and flexible training opportunities. Registration for the course was open from 15 September to 15 October 2020 and the program officially commenced on 19 October 2020. The course spanned approximately three months and followed a weekly structure, offering one or two training modules (i.e., competence units) per week. Each module, delivered in English, comprised two to five learning units, each containing interactive materials and an automatically graded assessment activity.
The content was developed to align with the European Qualifications Framework (EQF) Level 5, corresponding to the level of autonomy and responsibility expected from professionals involved in smart city planning, implementation and management. To ensure quality monitoring and enable subsequent research, the registration form included a structured questionnaire collecting personal and demographic information. Participants were explicitly informed that all data would be collected, stored and processed in compliance with the General Data Protection Regulation, exclusively for educational evaluation and research purposes. Informed consent was required prior to data collection; participants who declined could still enroll in the MOOC by providing only their name and email address, thereby ensuring voluntary participation and adherence to ethical research standards.

Dataset Attributes

The dataset constitutes a comprehensive collection of data from 961 learners enrolled in the MOOC. Among these learners, 944 completed an extensive questionnaire that captured a wide range of demographic, performance-related and behavioral indicators. Following rigorous preprocessing, comprising the treatment of missing values, elimination of duplicates and standardization of responses, a final dataset of 936 consistent and reliable records was obtained for analysis [37]. To enable a multidimensional exploration of learner characteristics and behaviors, the dataset was systematically organized into three distinct subsets:
  • Demographics Subset: This subset comprises 10 primarily categorical variables describing participants’ backgrounds, including mother tongue, education level, employment status, current occupation, years of professional experience, working hours per day, advanced technical English skills, advanced digital skills, prior MOOC experience and average study time per week. These attributes offer valuable insights into the diversity and prior preparedness of the learner population, aligning with established findings on demographic influences in MOOC engagement and success [38].
  • Performance Subset: Containing 4 numerical variables, this subset reflects participants’ early academic outcomes, such as study hours per week, Quiz1 score (Week1-Unit1), Quiz2 score (Week1-Unit2) and average grade from the first two course modules. These indicators enable the exploration of correlations between initial performance and overall course completion, a relationship frequently highlighted in LA within MOOCs [39].
  • Activity Subset: This subset includes 7 numerical variables that quantify learners’ engagement on the MOOC platform, such as the number of module1 discussions, module1 posts, module1 forum views, module1 connections per day, announcements forum views, introductory forum views, discussion participations and total time (minutes) spent on learning activities in module1. These metrics serve as behavioral proxies for engagement intensity and learning strategies, which are strong predictors of persistence and dropout in online learning environments [40].
The output attribute, denoted as “dropout”, serves as the target variable in our analysis and represents the student’s course completion status. It is a binary variable, where a value of 1 indicates that the student successfully completed the course, while a value of 0 denotes that the student dropped out before completion. This dichotomous structure enables the formulation of the problem as a binary classification task, which is commonly adopted in EDM and LA research for modeling and predicting student retention and attrition behaviors [41].
Overall, the dataset offers a rich, multidimensional representation of MOOC learners, integrating demographic, behavioral and performance dimensions. It provides a robust foundation for research in EDM and LA, particularly for modeling dropout prediction and retention improvement using advanced ML techniques. The inclusion of diverse feature types also facilitates exploratory analyses capable of uncovering latent behavioral patterns and informing the design of more inclusive and engaging online learning experiences.

5. Experimental Design

The preprocessing was implemented using the scikit-learn library [42]. Missing values were imputed via SimpleImputer (mean or mode strategy), categorical features were transformed with OneHotEncoder and numerical features were standardized using StandardScaler within a unified Pipeline to avoid data leakage. During the experimental procedure, the dataset was partitioned using the 10-fold cross-validation procedure. Specifically, the dataset was divided into ten mutually exclusive folds, each containing approximately 10% of the total instances. For each iteration, nine folds were used to train the model (training partition), while the remaining fold served as the test partition. This process was repeated ten times so that each fold was used exactly once for testing. The test partitions were strictly reserved for performance evaluation to ensure an unbiased assessment of the learned hypotheses. Within each training partition, the data were further divided into labeled and unlabeled subsets to simulate realistic SSL conditions. Labeled instances were randomly selected from the training partition, while the class labels of the remaining examples were removed. To guarantee sufficient representation, at least one labeled instance per class was ensured in every training subset. Furthermore, to investigate the impact of the proportion of labeled data on model performance, three low-label ratios (i.e., the proportion of labeled instances in the training data) were examined: 2%, 3% and 4% of the training data; two higher ratios (20% and 40%) and the fully supervised setting (100%) were also included for comparison. For example, in a dataset containing 1000 instances, a 2% labeled ratio corresponds to 20 labeled and 980 unlabeled examples. This design allows for a systematic analysis of how the scarcity of labeled data affects the learning dynamics and generalization capability of self-labeled methods. A sketch of this preprocessing and label-masking setup is given below.
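The following sketch illustrates the preprocessing pipeline and the label-masking scheme described above. The column names are hypothetical placeholders, and the imputation strategy per feature in the actual experiments follows the dataset description in Section 4 rather than this simplified example.

```python
# Sketch of the preprocessing and label-masking setup described above,
# with hypothetical column names; not the exact experimental configuration.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["education_level", "employment_status"]       # hypothetical names
numerical = ["quiz1_score", "quiz2_score", "module1_posts"]   # hypothetical names

preprocess = ColumnTransformer([
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numerical),
])

def mask_labels(y_train, labeled_ratio, rng):
    """Keep a small fraction of labels and mark the rest as unlabeled (-1),
    guaranteeing at least one labeled instance per class."""
    y_masked = np.full_like(y_train, -1)
    for cls in np.unique(y_train):
        idx = np.flatnonzero(y_train == cls)
        n_keep = max(1, int(round(labeled_ratio * len(idx))))
        keep = rng.choice(idx, size=n_keep, replace=False)
        y_masked[keep] = y_train[keep]
    return y_masked

# 10-fold cross-validation; labels are masked only inside each training partition.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
rng = np.random.default_rng(42)
# for train_idx, test_idx in skf.split(X, y):
#     y_masked = mask_labels(y[train_idx], labeled_ratio=0.04, rng=rng)
#     ... fit a semi-supervised model on the preprocessed training fold ...
```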

5.1. Configuration Parameters

All algorithms were implemented using the sslearn library [43] version 1.1.0, a Python framework specifically developed to extend the functionality of scikit-learn [42] by incorporating a broad range of SSL algorithms. The library’s modular architecture, API consistency and full interoperability with scikit-learn facilitated the seamless integration of the selected SSL algorithms into our experimental workflow. This compatibility allowed for efficient experimentation, standardized evaluation procedures and reproducible comparative analysis across a wide range of self-labeled algorithms; an indicative usage sketch is given below. The configuration settings employed for each algorithm are summarized in Table 1, ensuring transparency and replicability of the experimental setup.
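As a rough illustration of how such wrappers plug into a scikit-learn workflow, the snippet below instantiates two of the evaluated methods. The import path and class names are assumptions about the sslearn API and may differ between library versions, and the default hyperparameters shown are not those reported in Table 1.

```python
# Hedged sketch of instantiating self-labeled wrappers from the sslearn library.
# The import path and class names (sslearn.wrapper.TriTraining, CoForest) are
# assumptions and may differ between sslearn versions.
from sslearn.wrapper import TriTraining, CoForest  # assumed names

models = {
    "Tri-Training": TriTraining(),  # default hyperparameters, for illustration only
    "Co-Forest": CoForest(),
}
# Each wrapper is expected to follow the scikit-learn interface: fit(X, y) with
# unlabeled instances encoded as -1, followed by predict(X_test).
```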

5.2. Evaluation Metrics

To assess the performance and reliability of the SSL methods, three widely recognized evaluation metrics were employed: Accuracy, F1-score and the Matthews Correlation Coefficient. These metrics were selected to provide a comprehensive and balanced evaluation of the models’ predictive capabilities across different aspects of classification quality. The combined use of these complementary metrics ensures a more reliable and interpretable evaluation of model performance across varying data conditions. These metrics are briefly described as follows:
Accuracy is one of the most widely used performance metrics in classification tasks, defined as the proportion of correctly predicted instances over the total number of samples. Mathematically, it is expressed as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively. Accuracy provides an intuitive measure of overall model correctness and is particularly informative when class distributions are balanced [44]. However, in imbalanced datasets—common in educational and dropout prediction tasks—it can be misleading, as it may obscure poor performance on minority classes [45]. Therefore, accuracy is often complemented with more discriminative metrics such as the F1-score or the Matthews Correlation Coefficient (MCC).
The F1-score is the harmonic mean of precision and recall, balancing the trade-off between correctly identifying positive instances and minimizing false negatives. It is defined as
$$\mathrm{F1\text{-}score} = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2\,TP}{2\,TP + FP + FN}$$
The F1-score is particularly valuable in scenarios with class imbalance, as it emphasizes the model’s ability to correctly detect minority classes while penalizing excessive false positives or false negatives [44]. In dropout prediction, for instance, where the number of students who drop out is typically smaller than those who complete the course, the F1-score offers a more balanced assessment of model performance than accuracy alone.
The MCC is a robust metric that measures the quality of binary classifications by considering all four elements of the confusion matrix. It is defined as:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
The MCC yields a value between −1 and +1, where +1 indicates perfect prediction, 0 indicates random performance and −1 indicates total disagreement between predictions and true labels. Unlike accuracy and F1-score, the MCC remains reliable even in highly imbalanced datasets, as it provides a balanced measure that accounts for both the true and false classifications in all classes. For this reason, MCC is increasingly recommended as a preferred metric in binary classification problems, particularly in EDM and health-related predictive modeling [46]. A short example of computing all three metrics is given below.
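For concreteness, all three metrics can be computed directly with scikit-learn's metrics module; the label vectors below are toy values used purely for illustration, not results from the study.

```python
# Illustrative computation of the three evaluation metrics with scikit-learn;
# the label vectors are toy examples, not outputs of the reported experiments.
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 1, 0]  # toy predictions

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"F1-score: {f1_score(y_true, y_pred):.3f}")
print(f"MCC:      {matthews_corrcoef(y_true, y_pred):.3f}")
```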

6. Results and Analysis

This section reports the results of applying ten self-labeled SSL algorithms to the MOOC dropout prediction task. The analysis focuses on their comparative performance under varying proportions of labeled data, emphasizing predictive accuracy and robustness against label scarcity.
The experimental results obtained from the application of the self-labeled methods, along with their corresponding performance metrics, are summarized in Table 2, Table 3 and Table 4. These tables present a comparative overview of each algorithm’s behavior across different labeled data ratios, highlighting variations in predictive Accuracy, F1-score and MCC. The inclusion of multiple evaluation measures provides a comprehensive assessment of each model’s robustness, consistency and ability to generalize under limited supervision conditions.
The performance results in Table 2, Table 3 and Table 4 aggregate the comparative behavior of the ten self-labeled algorithms across differing proportions of labeled data (2%, 3%, 4%, 20%, 40%) as well as the fully supervised baseline (100%). The combined assessment over Accuracy, F1-score, and MCC highlights each model’s ability to cope with label scarcity and class imbalance. Overall, the results reaffirm the strong potential of self-labeled SSL methods in addressing the limited annotation challenges inherent in MOOC dropout prediction.

6.1. Overall Performance Trends

Across all labeled ratios, most self-labeled methods achieved high predictive accuracy—typically above 90% when as little as 3–4% of the training data were labeled. The included 20% and 40% settings further confirm that performance continues to improve gradually with more supervision, though the marginal gains become smaller beyond 20%. This trend illustrates the diminishing semi-supervised benefit as the quantity of labeled data increases, which aligns with the typical behavior of SSL algorithms. Among all examined methods, Co-Forest, Tri-Training, CoBC, and Rel-RASCO remained stable and adaptable across all data conditions, emerging as the overall best performers in most evaluation metrics.
Conversely, Democratic Co-Learning and particularly De-Tri-Training underperformed relative to the other algorithms. The latter exhibited consistently low and stagnant scores across all labeled ratios, indicating sensitivity to noisy pseudo-labeling and possible instability during its data-editing phase. This may be attributed to the relatively homogeneous MOOC engagement features, which limit the gains achievable from aggressive noise filtering.

6.2. Accuracy Analysis

As shown in Table 2, all models exhibited an increasing trend in accuracy as the labeled ratio rose from 2% to 40%. Co-Forest achieved the highest accuracy at the 2% labeled ratio (90.97%), demonstrating strong bootstrap capability under weak supervision. As labeled data availability increased, Tri-Training reached the top accuracy at the 4% level (94.11%), closely followed by CoBC (93.46%) and Rel-RASCO (93.71%). The 20% and 40% results further show that accuracy gradually approaches saturation, with most algorithms achieving values comparable to the fully supervised (100%) case. This observation confirms that SSL can recover much of the predictive power of a fully labeled model while drastically reducing annotation requirements.

6.3. F1-Score Analysis

The F1-score results in Table 3 largely mirror the accuracy trends while offering additional insight into class balance. Rel-RASCO and Co-Forest achieved the highest F1-scores (92.79% and 94.58%, respectively) at 3–4% labeling, indicating robust detection of minority-class dropout cases. With 20% and 40% labeled data, F1-scores for most models exceeded 94%, underscoring consistent class-level performance as supervision increased. Tri-Training’s competitive F1 results reflect the effectiveness of its majority-vote mechanism in mitigating class bias. In contrast, Democratic Co-Learning and De-Tri-Training yielded lower F1-scores (80–86%), suggesting that their ensemble consensus or data-editing strategies were less effective given the feature homogeneity of the dataset.

6.4. Matthews Correlation Coefficient (MCC) Analysis

As shown in Table 4, MCC values provide a stringent measure of model robustness that is insensitive to class imbalance. The top-performing methods (Co-Forest, CoBC, and Rel-RASCO) consistently achieved high MCC values (92–93%) even at low label ratios, with steady improvements up to 40%. These results demonstrate that the models’ correlation between predicted and true labels remains strong across all supervision levels. Notably, the performance gap between the best SSL models and the fully supervised baseline remains within 2–3 percentage points, reaffirming that self-labeled SSL approaches can deliver nearly equivalent robustness while requiring only a small fraction of labeled data.
The next section discusses these findings in relation to algorithmic design and their implications for educational data mining practice.

7. Discussion

All iterative self-labeling algorithms were run for a maximum of 20 iterations or until fewer than 1% of the unlabeled examples changed their label from the previous iteration. In practice, convergence was achieved well within these limits: around 5–7 iterations for the Self-Training algorithm and approximately 8–12 iterations for the ensemble algorithms Co-Forest and CoBC, reflecting the influence of diversity and re-labeling processes in the latter. The performance curves indicate that improvements in both precision and recall levelled off at these points, suggesting that the algorithms converged to stable pseudo-label distributions without overfitting the noise from the self-labels.
Synthesizing the results across metrics (Figure 1), several comparative observations emerge:
  • CoBC demonstrated the most consistent performance, combining high mean accuracy with low variability. Rel-RASCO also maintained relatively stable results, though slightly below CoBC. Co-Forest, while competitive in some cases, exhibited greater variance across folds. Overall, ensemble and multi-view algorithms tend to outperform simpler approaches like Self-Training, as they can better leverage diversity across learners and feature representations [47].
  • Tri-Training and CoBC demonstrated exceptional adaptability to scarce labeled conditions, validating the effectiveness of consensus-based and committee-driven labeling strategies in minimizing error propagation.
  • SetRed maintained moderate but consistent performance, indicating the value of its noise-filtering component but suggesting limited scalability benefits in this dataset’s structure.
  • De-Tri-Training’s weak results emphasize that additional editing or data-pruning layers may not always be beneficial, especially when the base models already handle noise adequately.
Figure 1. Performance comparison of the top four self-labeled algorithms (Co-Forest, Rel-Rasco, Tri-Training, and CoBC) across Accuracy, F1-score, and MCC metrics at 4% labeled data.
Overall, the comparative analysis confirms that multi-view and ensemble self-labeled schemes tend to outperform single-view models, particularly in high-dimensional, multidisciplinary learning datasets where redundancy and feature heterogeneity are the rule. These findings corroborate the practicality of self-labeled SSL protocols in real MOOC environments, where label shortage and behavioral heterogeneity are the overriding concerns.
Taken together, these findings show that self-labeled SSL methods can efficiently exploit unlabeled data in learning settings. By combining ensemble diversity with agreement mechanisms, these models achieve high predictive accuracy even under extreme label sparsity. The implications of these findings extend beyond dropout prediction, offering practical applicability to other LA tasks such as engagement prediction or knowledge tracing.

7.1. Summary of Findings

In short, the experimental results show that self-labeled SSL algorithms can achieve competitive, frequently near-supervised performance in MOOC dropout prediction tasks with as few as 2–4% labeled data. The ensemble-based algorithms (Co-Forest, CoBC, Rel-Rasco) proved to be the most effective, providing the best trade-off among predictive accuracy, generalizability and stability across all evaluation criteria. These findings demonstrate the promise of self-labeled methods as cost-effective, scalable tools for educational analytics applications based on partially labeled behavioral traces.

7.1.1. Performance–Efficiency Trade-Off

Although ensemble-based SSL algorithms such as Co-Forest and CoBC generally achieve better predictive performance than single-model semi-supervised algorithms such as Self-Training, this improvement comes at the cost of substantially greater computation time. Since an ensemble method trains multiple base classifiers at each iteration, the training process can be computationally expensive, especially for very large training samples and complex base classifiers. In contrast, Self-Training requires only a single model-training step per iteration, which makes it preferable in computation-constrained environments.

7.1.2. Pedagogical Implications

Beyond the technical outcomes, the results have several implications for the pedagogical architecture of the DevOps for Smart Cities MOOC platform. The platform follows a competency-based approach, in which the primary focus lies on the step-by-step advancement of participants’ skills in automating DevOps processes within the context of smart cities. The observed performance gains also support the platform’s intelligent analytics and feedback tools, which can provide personalized support to learners based on their distinct competencies in DevOps process automation, such as continuous integration and deployment and the management of data in urban environments.

7.2. Methodological Significance

This comparative study provides one of the most extensive benchmarks of self-labeled SSL algorithms applied to EDM. Unlike prior studies that focused on a single algorithm or course, this research systematically evaluates ten representative models under controlled label-scarcity scenarios. The experimental design and metric-based evaluation offer a reproducible framework for assessing semi-supervised methods in similar LA applications.

8. Conclusions

This research provided an extensive comparative study of ten self-labeled SSL algorithms for learner dropout prediction in MOOCs. By making use of both labeled and unlabeled learner data, the examined methods address one of the most enduring challenges in EDM: the scarce availability of labeled examples, often accompanied by severe class imbalance. Through systematic experimentation across various labeled ratios, the study offers strong empirical evidence that self-labeled schemes can attain predictive performance similar, or in some cases superior, to fully supervised models trained with full label information.
The experiments showed that the ensemble-based and multi-view algorithms, especially Co-Forest, CoBC and Rel-Rasco, generally performed much better than the other self-labeled algorithms across all evaluation criteria. These models proved highly stable under few-label settings, maintaining high Accuracy, F1-score and MCC even when labeled training data comprised just 2–4% of the training set. Their ability to exploit unlabeled samples efficiently through iterative pseudo-labeling attests to the applicability of ensemble and consensus-based learning in high-dimensional, behaviorally diverse educational datasets. On the other hand, algorithms with aggressive editing mechanisms, such as De-Tri-Training, showed lower stability, indicating that over-filtering can compromise the generalization ability of models in MOOC settings.
In addition to the quantitative findings, the present work highlights the broader pedagogical potential of SSL for learning analytics. The demonstrated ability of self-labeling models to identify at-risk students at an early stage, with little annotation effort, points towards affordable, data-driven early-warning systems. These can enable instructors and MOOC providers to adopt proactive, individualized interventions designed to enhance learner engagement and reduce attrition. Despite its contributions, this study has some limitations. The analysis used a single MOOC dataset which, rich and diverse as it is, cannot capture the full variability of course designs, learning modalities, or platform-specific dynamics. In addition, the experiments dealt solely with traditional self-labeled algorithms, without incorporating the latest developments in graph-based or deep semi-supervised learning, which could add another dimension of representational power. Future studies can mitigate these limitations by (i) testing the SSL models on various MOOC platforms and subject areas, (ii) incorporating time- and sequence-aware features to capture more dynamic learner behavior and (iii) investigating hybrid frameworks that fuse SSL with explainable AI methods to add transparency as well as pedagogical interpretability to the models.
In short, the study confirms that self-labeled SSL algorithms are an emerging, cost-effective and scalable means of predicting MOOC dropout. By efficiently exploiting the abundance of unlabeled interaction data, they can form the foundation of the next generation of intelligent LA tools capable of creating more adaptive, inclusive and sustainable online learning environments.

8.1. Limitations

Although the research presents solid comparative evidence, the use of a single MOOC dataset restricts the generalizability of the inferences to other domains or platforms. Furthermore, the analysis considers only classical self-labeled SSL methodologies; newer graph-based or deep SSL architectures may exhibit different patterns of performance and interpretability.

8.2. Practical Implications

The findings of the study offer practical guidance to MOOC designers, instructors and platform managers aiming to maximize learner retention through evidence-based practices. Firstly, the proven effectiveness of self-labeled SSL models operating under low annotation rates implies that sparsely labeled datasets can produce accurate early signals of disengagement, opening promising opportunities for the large-scale deployment of predictive analytics in resource-lean learning settings. Secondly, the enhanced accuracy of consensus-based and multi-view ensemble algorithms such as Co-Forest and Rel-Rasco suggests that such designs should be kept at the forefront of institutional analytics pipelines in order to benefit most from robustness and interpretability in decision-making. Thirdly, the inclusion of such SSL-based predictive models in MOOC dashboards can facilitate the real-time monitoring of learner trajectories, thereby making it possible for instructors to provide timely feedback, adaptive learning advice and individualized help. These findings collectively contribute towards the vision of intelligent, evidence-based learning environments that combine automation with human-oriented pedagogy.

8.3. Future Work

Future studies should explore the combination of self-labeled SSL with explainable AI techniques to maximize the transparency of the reasoning underlying predictions. Applying these models to time-series learning traces could, in turn, reveal the temporal dynamics of disengagement and enable earlier, more interpretable intervention.
Although this research systematically evaluated ten classical self-labeled semi-supervised machine learning algorithms, most of the classical semi-supervised algorithms in the literature were proposed before the rise of the deep learning paradigm. Future research should therefore examine contemporary semi-supervised learning algorithms, such as those based on consistency regularization (e.g., Mean Teacher [48], FixMatch [49], MixMatch [50]) and graph-based or transformer-based frameworks [51] that exploit representation learning, and compare them against the existing self-labeling methods to gain a deeper insight into the evolution of these algorithms and their potential utility for prediction in educational settings.

Author Contributions

Conceptualization, G.R. and G.K.; methodology, G.R. and S.K.; software, G.R. and G.D.; validation, G.K., S.K. and A.K.; writing—original draft preparation, G.R.; writing—review and editing, G.K. and T.P.; supervision, S.K. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset analyzed in this study is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yuan, L.; Powell, S. MOOCs and open education: Implications for Higher Education: A white paper. JISC Cetis 2013, 1–21.
  2. Feng, W.; Tang, J.; Liu, T.X. Understanding dropouts in MOOCs. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 517–524.
  3. Kizilcec, R.F.; Piech, C.; Schneider, E. Deconstructing disengagement: Analyzing learner subpopulations in massive open online courses. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, Leuven, Belgium, 8–12 April 2013; pp. 170–179.
  4. Liyanagunawardena, T.R.; Adams, A.A.; Williams, S.A. MOOCs: A systematic study of the published literature 2008–2012. Int. Rev. Res. Open Distrib. Learn. 2013, 14, 202–227.
  5. Moreno-Marcos, P.M.; Alario-Hoyos, C.; Muñoz-Merino, P.J.; Estevez-Ayres, I.; Kloos, C.D. A learning analytics methodology for understanding social interactions in MOOCs. IEEE Trans. Learn. Technol. 2018, 12, 442–455.
  6. Khalil, H.; Ebner, M. MOOCs completion rates and possible methods to improve retention: A literature review. In Proceedings of the EdMedia 2014–World Conference on Educational Media and Technology, Tampere, Finland, 23–26 June 2014; Viteli, J., Leikomaa, M., Eds.; Association for the Advancement of Computing in Education (AACE): Waynesville, NC, USA, 2014; pp. 1305–1313.
  7. Chen, J.; Fang, B.; Zhang, H.; Xue, X. A systematic review for MOOC dropout prediction from the perspective of machine learning. Interact. Learn. Environ. 2024, 32, 1642–1655.
  8. Kloft, M.; Stiehler, F.; Zheng, Z.; Pinkwart, N. Predicting MOOC dropout over weeks using machine learning methods. In Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs, Doha, Qatar, 25–29 October 2014; pp. 60–65.
  9. Zhu, X.; Goldberg, A. Introduction to Semi-Supervised Learning; Morgan & Claypool Publishers: San Rafael, CA, USA, 2009.
  10. Goel, Y.; Goyal, R. On the effectiveness of self-training in MOOC dropout prediction. Open Comput. Sci. 2020, 10, 246–258.
  11. Dalipi, F.; Imran, A.S.; Kastrati, Z. MOOC dropout prediction using machine learning techniques: Review and research challenges. In Proceedings of the 2018 IEEE Global Engineering Education Conference (EDUCON), Canary Islands, Spain, 18–20 April 2018; pp. 1007–1014.
  12. Kostopoulos, G.; Kotsiantis, S.; Pintelas, P. Estimating student dropout in distance higher education using semi-supervised techniques. In Proceedings of the 19th Panhellenic Conference on Informatics, Athens, Greece, 1–3 October 2015; pp. 38–43.
  13. Li, W.; Gao, M.; Li, H.; Xiong, Q.; Wen, J.; Wu, Z. Dropout prediction in MOOCs using behavior features and multi-view semi-supervised learning. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 3130–3137.
  14. Zhou, S.; Cao, L.; Zhang, R.; Sun, G. M-ISFCM: A semisupervised method for anomaly detection of MOOC learning behavior. In Proceedings of the International Conference of Pioneering Computer Scientists, Engineers and Educators, Chengdu, China, 19–22 August 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 323–336.
  15. Cam, H.N.T.; Sarlan, A.; Arshad, N.I. A hybrid model integrating recurrent neural networks and the semi-supervised support vector machine for identification of early student dropout risk. PeerJ Comput. Sci. 2024, 10, e2572.
  16. Psathas, G.; Chatzidaki, T.K.; Demetriadis, S.N. Predictive modeling of student dropout in MOOCs and self-regulated learning. Computers 2023, 12, 194.
  17. Chi, Z.; Zhang, S.; Shi, L. Analysis and prediction of MOOC learners’ dropout behavior. Appl. Sci. 2023, 13, 1068.
  18. Chapelle, O.; Schölkopf, B.; Zien, A. (Eds.) Semi-Supervised Learning (Book Review). IEEE Trans. Neural Netw. 2009, 20, 542.
  19. Gui, J.; Chen, T.; Zhang, J.; Cao, Q.; Sun, Z.; Luo, H.; Tao, D. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9052–9071.
  20. Kostopoulos, G.; Kotsiantis, S. Exploiting semi-supervised learning in the education field: A critical survey. In Advances in Machine Learning/Deep Learning-Based Technologies: Selected Papers in Honour of Professor Nikolaos G. Bourbakis–Vol. 2; Springer: Cham, Switzerland, 2021; pp. 79–94.
  21. Triguero, I.; García, S.; Herrera, F. Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowl. Inf. Syst. 2015, 42, 245–284.
  22. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.E.; McGuinness, K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
  23. Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, USA, 26–30 June 1995; pp. 189–196.
  24. Zou, Y.; Yu, Z.; Liu, X.; Kumar, B.; Wang, J. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5982–5991.
  25. Li, M.; Zhou, Z.H. SETRED: Self-training with editing. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hanoi, Vietnam, 18–20 May 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 611–621.
  26. Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998; pp. 92–100.
  27. Zhou, Z.H.; Li, M. Semi-supervised learning by disagreement. Knowl. Inf. Syst. 2010, 24, 415–439.
  28. Hady, M.F.A.; Schwenker, F. Co-training by committee: A new semi-supervised learning framework. In Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, Pisa, Italy, 15–19 December 2008; pp. 563–572.
  29. Zhou, Y.; Goldman, S. Democratic co-learning. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, Boca Raton, FL, USA, 15–17 November 2004; pp. 594–602.
  30. Wang, J.; Luo, S.W.; Zeng, X.H. A random subspace method for co-training. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 195–200.
  31. Yaslan, Y.; Cataltepe, Z. Co-training with relevant random subspaces. Neurocomputing 2010, 73, 1652–1661.
  32. Li, M.; Zhou, Z.H. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2007, 37, 1088–1098.
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  34. Zhou, Z.H.; Li, M. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng. 2005, 17, 1529–1541.
  35. Deng, C.; Guo, M.Z. Tri-training and data editing based semi-supervised clustering algorithm. In Proceedings of the Mexican International Conference on Artificial Intelligence, Apizaco, Mexico, 13–17 November 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 641–651.
  36. Panagiotakopoulos, T.; Lazarinis, F.; Stefani, A.; Kameas, A. A competency-based specialization course for smart city professionals. Res. Pract. Technol. Enhanc. Learn. 2024, 19, 13.
  37. Panagiotakopoulos, T.; Kotsiantis, S.; Kostopoulos, G.; Iatrellis, O.; Kameas, A. Early dropout prediction in MOOCs through supervised learning and hyperparameter optimization. Electronics 2021, 10, 1701.
  18. Chapelle, O.; Scholkopf, B.; Zien, A. Semi-supervised learning (chapelle, o. et al., eds.; 2006) [book reviews]. IEEE Trans. Neural Netw. 2009, 20, 542. [Google Scholar] [CrossRef]
  19. Gui, J.; Chen, T.; Zhang, J.; Cao, Q.; Sun, Z.; Luo, H.; Tao, D. A survey on self-supervised learning: Algorithms, applications, and future trends. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9052–9071. [Google Scholar] [CrossRef]
  20. Kostopoulos, G.; Kotsiantis, S. Exploiting semi-supervised learning in the education field: A critical survey. In Advances in Machine Learning/Deep Learning-Based Technologies: Selected Papers in Honour of Professor Nikolaos G. Bourbakis–Vol. 2; Springer: Cham, Switzerland, 2021; pp. 79–94. [Google Scholar]
  21. Triguero, I.; García, S.; Herrera, F. Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowl. Inf. Syst. 2015, 42, 245–284. [Google Scholar] [CrossRef]
22. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.E.; McGuinness, K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
  23. Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, USA, 26–30 June 1995; pp. 189–196. [Google Scholar]
  24. Zou, Y.; Yu, Z.; Liu, X.; Kumar, B.; Wang, J. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5982–5991. [Google Scholar]
  25. Li, M.; Zhou, Z.H. SETRED: Self-training with editing. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hanoi, Vietnam, 18–20 May 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 611–621. [Google Scholar]
  26. Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998; pp. 92–100. [Google Scholar]
  27. Zhou, Z.H.; Li, M. Semi-supervised learning by disagreement. Knowl. Inf. Syst. 2010, 24, 415–439. [Google Scholar] [CrossRef]
  28. Hady, M.F.A.; Schwenker, F. Co-training by committee: A new semi-supervised learning framework. In Proceedings of the 2008 IEEE International Conference on Data Mining Workshops, Pisa, Italy, 15–19 December 2008; pp. 563–572. [Google Scholar]
  29. Zhou, Y.; Goldman, S. Democratic co-learning. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, Boca Raton, FL, USA, 15–17 November 2004; pp. 594–602. [Google Scholar]
30. Wang, J.; Luo, S.-W.; Zeng, X.-H. A random subspace method for co-training. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 195–200. [Google Scholar]
  31. Yaslan, Y.; Cataltepe, Z. Co-training with relevant random subspaces. Neurocomputing 2010, 73, 1652–1661. [Google Scholar] [CrossRef]
  32. Li, M.; Zhou, Z.H. Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2007, 37, 1088–1098. [Google Scholar] [CrossRef]
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  34. Zhou, Z.H.; Li, M. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng. 2005, 17, 1529–1541. [Google Scholar] [CrossRef]
  35. Deng, C.; Guo, M.Z. Tri-training and data editing based semi-supervised clustering algorithm. In Proceedings of the Mexican International Conference on Artificial Intelligence, Apizaco, Mexico, 13–17 November 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 641–651. [Google Scholar]
  36. Panagiotakopoulos, T.; Lazarinis, F.; Stefani, A.; Kameas, A. A competency-based specialization course for smart city professionals. Res. Pract. Technol. Enhanc. Learn. 2024, 19, 13. [Google Scholar] [CrossRef]
  37. Panagiotakopoulos, T.; Kotsiantis, S.; Kostopoulos, G.; Iatrellis, O.; Kameas, A. Early dropout prediction in MOOCs through supervised learning and hyperparameter optimization. Electronics 2021, 10, 1701. [Google Scholar] [CrossRef]
38. Kizilcec, R.F.; Halawa, S. Attrition and achievement gaps in online learning. In Proceedings of the Second (2015) ACM Conference on Learning@Scale, Vancouver, BC, Canada, 14–18 March 2015; pp. 57–66. [Google Scholar]
  39. Joksimović, S.; Poquet, O.; Kovanović, V.; Dowell, N.; Mills, C.; Gašević, D.; Dawson, S.; Graesser, A.C.; Brooks, C. How do we model learning at scale? A systematic review of research on MOOCs. Rev. Educ. Res. 2018, 88, 43–86. [Google Scholar] [CrossRef]
  40. Crossley, S.; Dascalu, M.; McNamara, D.S.; Baker, R.; Trausan-Matu, S. Predicting Success in Massive Open Online Courses (MOOCs) Using Cohesion Network Analysis; International Society of the Learning Sciences: Philadelphia, PA, USA, 2017. [Google Scholar]
  41. Romero, C.; Ventura, S. Educational data mining and learning analytics: An updated survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1355. [Google Scholar] [CrossRef]
  42. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  43. Garrido-Labrador, J.L.; Maudes-Raedo, J.M.; Rodríguez, J.J.; García-Osorio, C.I. SSLearn: A semi-supervised learning library for Python. SoftwareX 2025, 29, 102024. [Google Scholar] [CrossRef]
  44. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
  45. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  46. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef] [PubMed]
  47. Kostopoulos, G.; Fazakis, N.; Kotsiantis, S.; Dimakopoulos, Y. Enhancing Semi-Supervised Learning in Educational Data Mining Through Synthetic Data Generation Using Tabular Variational Autoencoder. Algorithms 2025, 18, 663. [Google Scholar] [CrossRef]
  48. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  49. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608. [Google Scholar]
  50. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  51. Song, Z.; Yang, X.; Xu, Z.; King, I. Graph-based semi-supervised learning: A comprehensive review. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 8174–8194. [Google Scholar] [CrossRef]
Table 1. Configuration Parameters of SSL Methods.

Method | Default Base Estimator(s) | Key Hyperparameters (Defaults)
Self-Training | DecisionTreeClassifier * | threshold = 0.75; criterion = threshold; max_iter = 10
Setred | kNeighborsClassifier (k = 3) | max_iterations = 40; poolsize = 0.25; rejection_threshold = 0.05
Co-Training | DecisionTreeClassifier * | max_iterations = 30; poolsize = 75; threshold = 0.5; force_second_view = True
CoBC | BaggingClassifier | max_iterations = 100; poolsize = 100; min_instances_for_class = 3
Democratic Co-Learning | DecisionTreeClassifier *, GaussianNB, kNeighborsClassifier (k = 3) | alpha = 0.95; q_exp = 2; expand_only_mislabeled = True
Rasco | DecisionTreeClassifier * | max_iterations = 10; n_estimators = 30; subspace_size = None
Rel-Rasco | DecisionTreeClassifier * | max_iterations = 10; n_estimators = 30; subspace_size = None
Co-Forest | DecisionTreeClassifier * | n_estimators = 7; threshold = 0.75; bootstrap = True
Tri-Training | DecisionTreeClassifier * | collection_size = 3; max_iterations = None
De-Tri-Training | DecisionTreeClassifier * (uses kNeighborsClassifier (k = 3) internally) | same as Tri-Training + depuration step

* DecisionTreeClassifier: split_criterion = gini.
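For illustration, the Self-Training defaults listed in Table 1 (a Gini decision tree, a confidence threshold of 0.75, and at most 10 iterations) coincide with the defaults of scikit-learn's SelfTrainingClassifier [42]. The snippet below is a minimal sketch rather than the experimental code of this study: the synthetic features, random seed, and 4% labeling mask are illustrative assumptions, and the remaining wrappers in Table 1 are provided by dedicated SSL libraries such as SSLearn [43].

```python
# Minimal sketch of the Self-Training configuration from Table 1, using
# scikit-learn's SelfTrainingClassifier. Unlabeled instances are marked with -1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the MOOC interaction features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Keep roughly 4% of the labels; the rest are treated as unlabeled (-1).
rng = np.random.default_rng(42)
y_semi = y.copy()
unlabeled = rng.random(len(y)) > 0.04
y_semi[unlabeled] = -1

# Base learner and self-training wrapper with the defaults of Table 1.
base = DecisionTreeClassifier(criterion="gini", random_state=42)
model = SelfTrainingClassifier(base, threshold=0.75, criterion="threshold", max_iter=10)
model.fit(X, y_semi)

# Number of previously unlabeled instances that received a pseudo-label.
pseudo_labeled = int((model.transduction_ != -1).sum() - (~unlabeled).sum())
print(f"Pseudo-labeled instances: {pseudo_labeled}")
```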
Table 2. Average Accuracy Results.

SSL Algorithms | 2% | 3% | 4% | 20% | 40% | 100%
Self-Training | 87.36% | 89.90% | 92.04% | 94.14% | 94.25% | 91.43%
Setred | 88.73% | 88.94% | 89.75% | 94.99% | 94.88% | 90.30%
Co-Training | 87.32% | 90.21% | 93.50% | 93.82% | 94.46% | 92.76%
CoBC | 90.46% | 92.60% | 93.46% | 95.10% | 95.20% | 94.83%
Democratic Co-Learning | 80.21% | 82.38% | 84.32% | 92.32% | 94.78% | 88.11%
Rasco | 86.91% | 91.73% | 93.05% | 94.56% | 94.46% | 93.82%
Rel-Rasco | 89.29% | 91.83% | 93.71% | 95.20% | 95.31% | 93.13%
Co-Forest | 90.97% | 90.72% | 93.76% | 95.20% | 94.99% | 94.03%
Tri-Training | 87.52% | 91.53% | 94.11% | 94.14% | 94.14% | 93.82%
De-Tri-Training | 72.83% | 72.83% | 72.88% | 72.70% | 72.80% | 72.75%
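The labeled ratios heading Tables 2–4 (2% up to 40%, with 100% as the fully supervised reference) imply that only a small, stratified fraction of the training labels is exposed to the learners. The sketch below illustrates one common way to build such splits, with unlabeled instances marked as -1; it is not claimed to be the authors' exact partitioning procedure, and the toy label vector is an assumption.

```python
# Illustrative construction of labeled/unlabeled splits at the ratios used as
# column headers in Tables 2-4 (stratified, remainder masked as -1).
import numpy as np
from sklearn.model_selection import train_test_split

def mask_labels(y, labeled_ratio, seed=42):
    """Return a copy of y with all but `labeled_ratio` of the labels set to -1,
    keeping the class proportions of the retained labels (stratified)."""
    labeled_idx, _ = train_test_split(
        np.arange(len(y)), train_size=labeled_ratio, stratify=y, random_state=seed
    )
    y_masked = np.full_like(y, -1)
    y_masked[labeled_idx] = y[labeled_idx]
    return y_masked

# Toy dropout (1) / completion (0) label vector; 100% corresponds to no masking.
y_demo = np.array([0] * 700 + [1] * 300)
for ratio in (0.02, 0.03, 0.04, 0.20, 0.40):
    y_semi = mask_labels(y_demo, ratio)
    print(f"{ratio:.0%} labeled -> {np.sum(y_semi != -1)} labeled instances")
```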
Table 3. Average F1-score Results.

SSL Algorithms | 2% | 3% | 4% | 20% | 40% | 100%
Self-Training | 85.17% | 89.04% | 92.75% | 94.04% | 94.25% | 94.40%
Setred | 88.99% | 88.94% | 90.13% | 95.00% | 94.88% | 95.27%
Co-Training | 88.77% | 90.29% | 92.92% | 93.72% | 94.14% | 94.48%
CoBC | 88.02% | 91.65% | 93.64% | 94.84% | 95.00% | 95.55%
Democratic Co-Learning | 80.85% | 83.83% | 85.82% | 93.10% | 94.68% | 94.41%
Rasco | 87.69% | 91.86% | 93.45% | 94.47% | 94.73% | 95.57%
Rel-Rasco | 88.11% | 92.79% | 93.51% | 95.11% | 95.31% | 95.41%
Co-Forest | 89.07% | 91.92% | 94.58% | 95.11% | 94.74% | 94.81%
Tri-Training | 89.45% | 89.64% | 93.96% | 94.47% | 94.26% | 94.67%
De-Tri-Training | 69.68% | 70.03% | 69.67% | 70.29% | 70.34% | 69.61%
Table 4. Average MCC Results.

SSL Algorithms | 2% | 3% | 4% | 20% | 40% | 100%
Self-Training | 81.07% | 86.30% | 89.64% | 91.50% | 92.31% | 91.96%
Setred | 83.97% | 84.25% | 85.77% | 93.15% | 92.98% | 93.20%
Co-Training | 82.29% | 87.23% | 88.05% | 91.48% | 91.95% | 92.04%
CoBC | 84.55% | 87.59% | 92.04% | 93.25% | 93.50% | 93.87%
Democratic Co-Learning | 72.42% | 76.32% | 80.24% | 91.10% | 92.67% | 92.25%
Rasco | 84.42% | 87.84% | 90.36% | 92.89% | 93.38% | 93.28%
Rel-Rasco | 83.94% | 88.56% | 90.88% | 93.47% | 93.90% | 93.39%
Co-Forest | 86.16% | 88.07% | 91.99% | 93.09% | 93.03% | 92.83%
Tri-Training | 82.33% | 86.88% | 91.50% | 92.16% | 92.01% | 92.13%
De-Tri-Training | 59.08% | 58.82% | 58.65% | 60.37% | 60.33% | 58.88%
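Accuracy (Table 2), F1-score (Table 3) [44], and the Matthews correlation coefficient (Table 4) [46] are all implemented in scikit-learn [42]. The helper below is a minimal sketch assuming a fitted binary classifier; model, X_test, and y_test are placeholders rather than objects defined in this study.

```python
# Minimal sketch of computing the three scores reported in Tables 2-4
# for a fitted classifier on a held-out test set.
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

def report_scores(model, X_test, y_test):
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),            # binary F1: dropout vs. completion
        "mcc": matthews_corrcoef(y_test, y_pred),  # robust under class imbalance [46]
    }
```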