1. Introduction
In recent years, student dropout has become an increasing concern, with significant financial and operational impacts on the educational sector. Australian university dropout rates have reached unprecedented levels. According to the
Australian Government Department of Education, Skills and Employment (
2019) and Institute of Public Affairs (IPA) (
McKee, 2024), only 62% of domestic students who commenced a bachelor’s degree in 2017 had completed their studies by 2022. This marks a record low completion rate. This high attrition rate not only leads to the loss of tuition income but also incurs additional costs to recruit and support new students, creating significant financial burdens for universities (
Seidman, 2016;
Norton & Cherastidtham, 2018;
Universities Australia, 2022). Moreover, high dropout rates can affect institutional reputations, influence future enrolment numbers, and impact public funding (
Pascarella & Terenzini, 2005;
Mitchell Institute, 2023;
Ross, 2023). From the students’ perspective, withdrawing from university often leads to substantial debts without the benefit of a degree, impacting their future employment prospects and contributing to personal financial strain (
Seidman, 2019;
National Centre for Vocational Education Research, 2023;
Fryer, 2024). As a result, addressing dropout rates has become a critical priority for educational institutions seeking to maintain both financial stability and academic excellence.
Identifying at-risk students—those likely to encounter academic or personal challenges that hinder their progress—is essential for reducing dropout rates (
Heisserer & Parette, 2002;
Osborne & Lang, 2023;
Ncube & Ngulube, 2024). At-risk students often face a combination of academic struggles, social alienation, financial stress, or personal crises, all of which heighten their chances of dropping out (
R. Liu et al., 2022;
Fryer, 2024;
National Centre for Vocational Education Research, 2023). Early detection of such students is crucial, as timely and targeted interventions can prevent challenges from escalating and improve student retention (
Jevons & Lindsay, 2018;
Sahni, 2023). Research consistently shows that interventions delivered after significant academic setbacks are often less effective, underscoring the importance of proactive identification and support (
McMillan & Reed, 2010;
Severson et al., 2007;
Ncube & Ngulube, 2024).
Traditional approaches for identifying at-risk students have relied primarily on academic performance metrics and manual interventions by educators (
Hung et al., 2015). Strategies such as monitoring grades, conducting one-on-one counselling sessions, or using simple statistical models have been widely adopted (
Oreopoulos et al., 2017). While these approaches provide some support, they tend to be reactive rather than proactive (
Osborne & Lang, 2023;
Ncube & Ngulube, 2024). Grades often indicate a problem only after it has become significant, making timely intervention difficult (
Linden et al., 2023). Furthermore, they largely overlook critical non-academic factors, such as student engagement, socio-economic status, and emotional well-being, which can significantly influence student success (
Matz et al., 2023;
Seidman, 2019;
Australian Institute of Health and Welfare, 2023). As educational institutions increasingly seek data-driven strategies to reduce dropout rates, it becomes critical to understand not just which students are at risk but also why they disengage or underperform (
Shoaib et al., 2024;
Xiong et al., 2024). Clustering-based approaches offer a solution by revealing hidden patterns in student behaviour, engagement, and academic progression that traditional models often miss (
Chen et al., 2023;
Mohamed Nafuri et al., 2022). These patterns can inform more personalised and timely interventions tailored to student subgroups, addressing the root causes of attrition more effectively (
T. Liu et al., 2022;
Romero & Ventura, 2020). Thus, the integration of clustering techniques directly supports the broader educational goal of reducing dropout and improving student retention (
Bahel et al., 2021;
Oyelade et al., 2010).
In recent years, Educational Data Mining (EDM) has gained significant attention as a powerful tool for understanding student behaviour and improving educational outcomes (
Romero & Ventura, 2020;
Trakunphutthirak & Lee, 2021;
Shoaib et al., 2024). EDM involves applying data mining techniques, such as clustering, classification, and regression, to analyse large and complex datasets, providing richer insights into student performance, engagement, and learning patterns (
Romero & Ventura, 2013;
Chen et al., 2023;
Xiong et al., 2024). EDM facilitates the early identification of at-risk students through examining student interactions, assessments, and socio-demographic factors (
K. S. Na & Tasir, 2017;
Shoaib et al., 2024;
Romero & Ventura, 2020). The concept of data literacy in education has evolved alongside advancements in data-driven decision-making. In earlier decades, data literacy primarily focused on understanding descriptive statistics and using basic academic metrics to inform teaching and learning (
Prinsloo & Slade, 2014;
Wolff et al., 2016). With the rise of digital learning environments and the increasing availability of behavioural data, the scope of data literacy expanded to include the interpretation of complex learning analytics and predictive models (
Ifenthaler & Yau, 2020). In parallel, the use of clustering techniques in educational contexts dates back to early applications in the 2000s, where clustering was used to segment students by performance levels or learning styles (
Oyelade et al., 2010;
Romero & Ventura, 2013). Over time, more sophisticated algorithms such as DBSCAN and hierarchical clustering were adopted to analyse behavioural patterns and personalise learning interventions (
Xiong et al., 2024). Today, as institutions seek to improve retention and equity through targeted support, cluster-based predictive models offer a promising way to leverage both academic and behavioural data for more nuanced risk identification. While surveys have extensively reviewed predictive analytics and EDM techniques, few have comprehensively examined the contributions of clustering-based approaches in identifying and supporting at-risk students (
Romero & Ventura, 2020;
Chen et al., 2023). The existing literature often focuses on classification models or general predictive systems, overlooking the unique advantages and challenges posed by clustering techniques (
Shoaib et al., 2024;
Trakunphutthirak & Lee, 2021). To fill this gap, this paper conducts a systematic literature review on clustering-based at-risk prediction models. The review aims to identify gaps and limitations in current approaches to at-risk student identification and to propose a roadmap for future research to improve model efficacy and practical implementation. The main contributions of this paper are: (1) a survey, systematisation, and analysis of the clustering-based prediction models applied to at-risk student identification in the academic literature to date; (2) insights into the challenges related to generalisability, data integration, and ethical considerations; and (3) recommendations for advancing clustering-based techniques to address real-world educational challenges.
The remainder of this paper is structured as follows:
Section 2 discusses related surveys, highlighting their contributions and limitations.
Section 3 presents the systematic review methodology.
Section 4 details the findings, interprets their implications, discusses challenges and proposes future directions, while
Section 5 concludes with actionable recommendations for practice and research.
2. Materials and Methods
2.1. Related Survey
Earlier reviews have extensively explored classification and regression models as core methods in EDM (
Romero & Ventura, 2020;
T. Liu et al., 2022;
Tosun & Kalaycıoğlu, 2024). These models are commonly used to analyse student behaviours, predict academic outcomes, and identify at-risk students. Recent studies continue to show the value of classification techniques, particularly in monitoring student performance and engagement (
Shoaib et al., 2024;
López-Meneses et al., 2024). Despite their effectiveness, classification and regression models are often applied as standalone predictive systems or combined with traditional statistical methods (
Xiong et al., 2024;
Zhang et al., 2023). These studies normally treat clustering as an additional tool, mostly for data preprocessing or feature engineering, rather than integrating it as a central component of predictive systems.
Clustering has been broadly acknowledged as an effective method for categorising students and courses based on shared attributes, which enables tailored support for at-risk students (
Shoaib et al., 2024;
Trakunphutthirak & Lee, 2021). Recent research demonstrates the application of clustering techniques, including k-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to identify student groups with similar behavioural or academic patterns (
Oyelade et al., 2010;
Mohamed Nafuri et al., 2022;
Bahel et al., 2021;
Chen et al., 2023;
López-Meneses et al., 2024). These clustering techniques help improve understanding of student diversity while facilitating targeted support strategies (
Y. Liu et al., 2022;
Zhang et al., 2023). Importantly, several studies have demonstrated that clustering can directly assist in reducing student dropout by revealing patterns of disengagement or declining academic performance (
Oyelade et al., 2010;
Chen et al., 2023;
Mohamed Nafuri et al., 2022). However, existing reviews often overlook how these insights translate into proactive interventions for at-risk groups, limiting the connection between clustering outcomes and dropout mitigation strategies (
Romero & Ventura, 2020;
Shoaib et al., 2024). The integration of clustering into prediction workflows still has room to develop, as most studies use clustering mainly for segmentation rather than using its insights to enhance prediction models (
Romero & Ventura, 2020;
Xiong et al., 2024).
The integration of clustering techniques with predictive modelling has been studied through hybrid approaches that aim to enhance prediction accuracy (
Aldowah et al., 2019;
Bholowalia & Kumar, 2014;
Y. Liu et al., 2022). Hybrid models combine clustering with classification, drawing on both unsupervised and supervised learning to improve the early identification of at-risk students (
Romero & Ventura, 2020;
Xiong et al., 2024). Recent studies showed the capability of hybrid approaches to increase predictive accuracy by reducing noise and enhancing feature representation (
Zhang et al., 2023;
Shoaib et al., 2024). Advanced models, those combining clustering and hybrid classification algorithms, have shown higher accuracy and generalisability compared to traditional methods (
Chen et al., 2023;
López-Meneses et al., 2024). However, many studies face limitations due to relying on specific datasets, reducing the generalisability of hybrid models across diverse educational contexts (
Jovanović et al., 2021;
K. Na & Tasir, 2017;
Zhang et al., 2023). This limitation highlights the need for stronger validation practices to ensure model applicability across different student populations and learning environments (
Y. Liu et al., 2022;
Romero & Ventura, 2020).
Generalisability and fairness continue to be major concerns for predictive systems in education (
Mathrani et al., 2021;
Prakash et al., 2014;
Xu et al., 2021). Predictive models should consider diverse student demographics and institutional characteristics to avoid biased or unfair outcomes (
Jovanović et al., 2021;
K. S. Na & Tasir, 2017;
T. Liu et al., 2022;
Mathrani et al., 2021). Ethical considerations including transparency, equity, and fairness are particularly important for cluster-based prediction models, where clustering results can greatly affect both accuracy and fairness aspects of later predictions (
Shoaib et al., 2024;
Xiong et al., 2024). Studies have raised concerns about the risk of biased outcomes when clustering features such as socio-economic status or demographic attributes are used without applying proper fairness constraints (
Romero & Ventura, 2020;
Zhang et al., 2023;
López-Meneses et al., 2024). Although existing reviews provide a basic understanding of EDM techniques, clustering methods, and hybrid approaches, they often fail to fully explore cluster-based prediction models as an organised framework (
López-Meneses et al., 2024;
Chen et al., 2023;
Zhang et al., 2023;
Romero & Ventura, 2020;
Shoaib et al., 2024). This paper addresses these gaps by systematically reviewing the literature on cluster-based prediction models, analysing their strengths and limitations, and highlighting their capability to create scalable, generalisable, and ethically appropriate solutions for identifying and supporting at-risk students.
2.2. Cluster-Based Prediction Models
Cluster-based prediction models are increasingly important in educational research for their ability to enhance predictive accuracy and provide tailored insights into student behaviours (
Aldowah et al., 2019;
R. Liu, 2022;
Shoaib et al., 2024). These behaviours often include early indicators of dropout, such as reduced engagement, inconsistent participation, or performance decline, patterns that clustering can help identify more effectively than traditional approaches (
Bahel et al., 2021;
Ramanathan et al., 2018;
Mohamed Nafuri et al., 2022). These models integrate clustering with supervised learning algorithms to identify at-risk students more effectively and enable timely interventions (
Oyelade et al., 2010;
Mohamed Nafuri et al., 2022;
Romero & Ventura, 2020;
Xiong et al., 2024). Clustering groups students or courses that share similar characteristics, allowing researchers to explore patterns and trends that are not visible through traditional methods (
López-Meneses et al., 2024;
Zhang et al., 2023;
Aldowah et al., 2019;
Bahel et al., 2021).
Cluster-based prediction models are increasingly used in educational research to improve predictive accuracy and gain deeper insights into student behaviours (
Aldowah et al., 2019;
R. Liu, 2022;
Iatrellis et al., 2020). These models generally involve two main phases: clustering and prediction, which together improve the identification of students at risk of dropout by translating raw behavioural and academic data into actionable profiles (
Xiong et al., 2024;
Oyelade et al., 2010). In the clustering phase, algorithms such as k-means and DBSCAN are used to find meaningful groups within educational data (
Sisovic et al., 2016;
Bahel et al., 2021;
Ramanathan et al., 2018). The clustering process typically uses behavioural data from virtual learning environments, academic metrics like grades, and demographic attributes such as age and regional background (
Romero & Ventura, 2020;
Xu et al., 2021). Following clustering, the prediction phase integrates these groupings into supervised learning models. This integration can involve using cluster labels as features or building separate predictive models for each cluster (
Xiong et al., 2024;
Shoaib et al., 2024). Algorithms like Random Forest, Logistic Regression, and Neural Networks are commonly employed, particularly for predicting dropout risks or academic success (
López et al., 2012;
Mohamed Nafuri et al., 2022;
Prakash et al., 2014;
Chen et al., 2023;
Viswanathan & Kumar, 2021;
Aldowah et al., 2019). This hybrid approach combines unsupervised and supervised learning methods, resulting in more robust predictions (
Francis & Babu, 2019;
Injadat et al., 2020;
Xu et al., 2021).
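The two phases described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the method of any reviewed study: the student features, the outcomes, the deterministic farthest-point seeding, and all function names are hypothetical, and the per-cluster "model" is simply the observed dropout rate in each cluster, standing in for the Random Forest or Logistic Regression models named in the text.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (an assumption made for reproducibility):
    start at the first point, then add the point farthest from the seeds so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

def kmeans(points, k, max_iter=100):
    """Phase 1: plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

def fit_cluster_rates(labels, outcomes, k):
    """Phase 2: one trivial 'model' per cluster, the observed dropout rate."""
    rates = []
    for c in range(k):
        ys = [y for l, y in zip(labels, outcomes) if l == c]
        rates.append(sum(ys) / len(ys) if ys else 0.0)
    return rates

def predict_risk(student, centroids, rates):
    """Route a new student to the nearest cluster and return that cluster's rate."""
    return rates[nearest(student, centroids)]

# Hypothetical data: (engagement, average grade), both scaled to 0..1.
students = [(0.10, 0.30), (0.20, 0.20), (0.15, 0.35), (0.25, 0.30),
            (0.80, 0.80), (0.90, 0.70), (0.85, 0.90), (0.75, 0.85)]
dropped_out = [1, 1, 0, 1, 0, 0, 1, 0]  # hypothetical outcomes

centroids, labels = kmeans(students, k=2)
rates = fit_cluster_rates(labels, dropped_out, k=2)
risk_low_engagement = predict_risk((0.20, 0.25), centroids, rates)
risk_high_engagement = predict_risk((0.80, 0.75), centroids, rates)
```

The alternative integration mentioned in the text, feeding the cluster label into a single classifier as an extra feature, would replace `fit_cluster_rates` with one model trained on the original features plus `labels`.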
Cluster-based prediction models offer several advantages over traditional predictive approaches. One key benefit is improved personalisation, as clustering enables models to tailor interventions to the specific needs of each student group (
T. Liu et al., 2022;
Shoaib et al., 2024;
Iatrellis et al., 2020). Research shows that grouping students based on similar engagement patterns not only helps tailor academic support but also enables early intervention to prevent potential dropout (
Romero & Ventura, 2020;
Xiong et al., 2024;
Oyelade et al., 2010;
Mohamed Nafuri et al., 2022). These models also offer greater adaptability, adjusting to diverse student populations and institutional contexts by dynamically organising students and courses (
Bahel et al., 2021;
Sisovic et al., 2016;
Ramanathan et al., 2018). Scalability is another advantage, as clustering methods like k-means are suitable for processing large datasets, enabling efficient analysis at scale (
Francis & Babu, 2019;
Viswanathan & Kumar, 2021;
Shoaib et al., 2024).
Despite their advantages, cluster-based prediction models face several challenges that can limit their effectiveness in educational settings:
Generalisability: Models developed on specific datasets often struggle to perform well when applied to different educational contexts because of differences in demographics, teaching methods, and institutional structures (
Jovanović et al., 2021;
Mathrani et al., 2021). These variations can reduce the reliability and accuracy of predictions, raising concerns about the robustness and applicability of these models across diverse educational contexts (
López-Meneses et al., 2024;
Xiong et al., 2024).
Cluster-based prediction models have proven useful in various educational contexts. These models effectively predict student dropout rates by clustering students with similar behavioural and academic performance patterns, enabling timely interventions to support at-risk students (
López et al., 2012;
Sisovic et al., 2016;
Iatrellis et al., 2020). Cluster-based models also support curriculum design by grouping courses according to difficulty levels and student performance, aiding resource allocation and instructional planning (
Mohamed Nafuri et al., 2022;
Bahel et al., 2021;
Ramanathan et al., 2018;
Injadat et al., 2020). These applications highlight the versatility and value of cluster-based prediction models in enhancing educational data mining and supporting student success (
Oyelade et al., 2010;
Jovanović et al., 2021;
Y. Liu et al., 2022;
Xiong et al., 2024).
2.3. Methodology
This study systematically reviews the literature on cluster-based prediction models in educational settings. It explores how clustering and predictive modelling techniques have been used in previous research, examines the types of datasets and methodologies involved, and identifies gaps in the literature to guide future studies. The aim is to understand how researchers have implemented and assessed cluster-based prediction models and to discover insights that can help develop the field further. This review is guided by the central research question: How have cluster-based prediction models been applied and evaluated in educational contexts? The review follows the PRISMA guidelines and the Joanna Briggs Institute Reviewer’s Manual to ensure comprehensive and unbiased coverage (
Figure 1).
The literature search was conducted using a combination of electronic databases, including IEEE Xplore, Scopus, ScienceDirect, and SpringerLink. These databases were chosen for their extensive coverage of research in educational data mining and machine learning. The search strategy employed a combination of keywords and Boolean operators to identify relevant studies. The primary search terms were: “cluster-based prediction models”, “clustering and supervised learning in education”, “educational data mining clustering”, “predictive modelling for at-risk students”. To broaden the scope, additional keywords such as “k-means,” “DBSCAN,” “student dropout prediction,” and “personalised learning” were also included. The search was limited to peer-reviewed journal articles and conference proceedings published between 2010 and 2025 to capture recent advancements in the field.
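As a rough illustration of how the listed terms might be combined with Boolean operators, the sketch below joins them into a single OR query. The exact search strings submitted to each database are not given in the text, so the OR-only structure and the `build_query` helper are assumptions; the terms and the year range are taken from the paragraph above.

```python
# Search terms as listed in the text.
PRIMARY_TERMS = [
    "cluster-based prediction models",
    "clustering and supervised learning in education",
    "educational data mining clustering",
    "predictive modelling for at-risk students",
]
ADDITIONAL_TERMS = [
    "k-means",
    "DBSCAN",
    "student dropout prediction",
    "personalised learning",
]
YEAR_RANGE = (2010, 2025)  # publication window stated in the text

def build_query(terms):
    """Hypothetical helper: quote each phrase and join the phrases with OR."""
    return " OR ".join(f'"{term}"' for term in terms)

query = build_query(PRIMARY_TERMS + ADDITIONAL_TERMS)
```

Each database has its own query syntax and field tags, so a string like this would normally need adapting per source.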
After removing duplicates, a total of 650 articles published between January 2010 and February 2025 were identified using the search terms. Exclusion criteria were then applied to filter out studies that were not relevant to the review. Papers were excluded if they met any of the following conditions:
The publication format was not a peer-reviewed journal article or conference proceeding.
The paper was not written in English.
A more recent version by the same authors was available, in which case only the latest version was included.
The study did not combine clustering with predictive modelling techniques.
The research focus was unrelated to educational contexts.
The study provided only a high-level description without sufficient detail to address the research question.
The paper was published prior to 2010.
Through a systematic process of screening titles, keywords, abstracts, and full texts against these criteria, 147 articles were shortlisted in the final stage. From this shortlist, 61 articles were identified as relevant for analysis of cluster-based prediction models in educational settings. To ensure both relevance and quality, the inclusion and exclusion criteria were carefully applied.
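The screening logic can be sketched by expressing each exclusion condition as a predicate over a bibliographic record. The `Record` fields, the sample records, and the `screen` helper are hypothetical; only the seven conditions and the counts 650, 147, and 61 come from the text.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical screening record; the field names are illustrative."""
    title: str
    year: int
    language: str
    peer_reviewed: bool              # journal article or conference proceeding
    superseded: bool                 # a newer version by the same authors exists
    combines_clustering_and_prediction: bool
    educational_context: bool
    sufficient_detail: bool

# One predicate per exclusion condition listed in the text.
EXCLUSION_CRITERIA = {
    "not a peer-reviewed article or proceeding": lambda r: not r.peer_reviewed,
    "not written in English": lambda r: r.language != "English",
    "superseded by a more recent version": lambda r: r.superseded,
    "does not combine clustering with prediction":
        lambda r: not r.combines_clustering_and_prediction,
    "unrelated to educational contexts": lambda r: not r.educational_context,
    "insufficient detail": lambda r: not r.sufficient_detail,
    "published prior to 2010": lambda r: r.year < 2010,
}

def screen(records):
    """Return the retained records and, per excluded title, the reasons."""
    kept, reasons = [], {}
    for record in records:
        failed = [name for name, rule in EXCLUSION_CRITERIA.items() if rule(record)]
        if failed:
            reasons[record.title] = failed
        else:
            kept.append(record)
    return kept, reasons

# Three made-up records to exercise the rules.
sample = [
    Record("A", 2021, "English", True, False, True, True, True),
    Record("B", 2008, "English", True, False, True, True, True),
    Record("C", 2019, "English", True, False, False, True, True),
]
kept, reasons = screen(sample)

# Counts reported in the text, with the share retained at the final stage.
funnel = {"identified": 650, "shortlisted": 147, "included": 61}
included_share = round(100 * funnel["included"] / funnel["identified"], 1)
```

Recording the reasons alongside the decision makes it straightforward to fill in the exclusion counts of a PRISMA-style flow diagram.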
Inclusion Criteria:
Studies focused on cluster-based prediction models applied in educational settings.
Research articles that combine clustering algorithms with predictive modelling.
Papers that offer actual data support, including datasets, methodologies, and results.
Publications in English to ensure they can be easily accessed.
Exclusion Criteria: any of the conditions for excluding papers listed earlier in this section.
The number of publications identified in the initial search has grown significantly over the past decade. Only a limited number of studies appeared between 2010 and 2014, but interest in the topic grew rapidly after 2018. A sharp rise is observed from 2020 onwards, reflecting a growing research emphasis on integrating clustering with predictive modelling techniques in response to the increased availability of educational data and the demand for more personalised, data-informed interventions (
Figure 2).
The study selection process followed a systematic and iterative approach. First, an initial screening of titles and abstracts excluded unrelated publications, with duplicates removed at this stage. Next, full-text articles were reviewed against the inclusion and exclusion criteria to ensure they met the study’s requirements. Finally, the set of included studies was confirmed for relevance and consistency with the scope of the systematic review.
Data extraction used a template to ensure consistency. For each included study, it captured the research objectives and research questions, the clustering techniques and predictive models used, the feature categories considered, the datasets (such as size, type, and context), the evaluation metrics applied, the educational context of application (such as online, blended, or traditional settings), and the key findings, limitations, and recommendations. The data analysed in this review consisted of information reported in the selected published studies. No primary data were collected, and all findings are based on the synthesis of secondary data extracted from peer-reviewed literature.
The extracted data were synthesised to identify common patterns, methodological trends, and gaps in the literature. A thematic analysis approach was employed to categorise findings into areas such as clustering methodologies, integration strategies, and educational applications. The synthesis followed a qualitative approach, where findings were grouped into thematic categories aligned with the research questions of this review.
These categories included clustering methodologies, integration strategies with predictive models, commonly used features, evaluation practices, and observed challenges and limitations. No formal meta-analysis was conducted, as the heterogeneity of study designs and reported outcomes did not allow for quantitative synthesis. Instead, this review presents a structured narrative synthesis of the current state of the literature.
4. Findings and Discussion
This section presents the findings of the review in response to the main research question and supporting sub-questions, focusing on how clustering-based predictive models are used in EDM to identify and support at-risk students.
Research Question:
How can clustering-based predictive models be effectively utilised to identify and support at-risk students in diverse educational contexts?
Clustering-based predictive models combine unsupervised learning techniques with supervised machine learning algorithms to uncover meaningful patterns in complex student datasets. Various clustering approaches, such as partitioning methods, hierarchical techniques, and density-based algorithms, have been applied in the reviewed studies, each chosen based on dataset structure, noise levels, and analytical goals (
Parack et al., 2012;
Sahni, 2023). The resulting groupings help machine learning models predict dropout risk, academic failure, or other learning outcomes (
T. Liu et al., 2022;
Dass et al., 2021).
One key advantage of this integrated approach is its ability to process multi-dimensional data, including assessment results, LMS activity, socio-demographic information, and behavioural logs. Hierarchical clustering, for instance, has been used to identify engagement patterns across learning timelines and contexts, with those patterns used to strengthen prediction models focused on identifying students at risk of failure (
Severson et al., 2007;
Mohamed Nafuri et al., 2022). Density-based approaches, which have proven effective in managing noisy and irregular data, have also been used for clustering, improving the stability and accuracy of dropout predictions in such contexts (
Linden et al., 2023;
Nayak et al., 2023). The flexibility of clustering-based predictive models allows them to operate effectively in diverse educational settings. In online learning environments, researchers have made extensive use of LMS-derived data, including login frequency, page access, and discussion forum activity, to support early risk identification (
K. Na & Tasir, 2017;
Aldowah et al., 2019). In more traditional classroom settings, predictive models have successfully incorporated academic records and demographic features to support student success when behavioural data is limited (
Marbouti et al., 2016;
Jovanović et al., 2021). This adaptability suggests that these models can be applied across different contexts.
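To make the point about density-based methods and noisy data concrete, the following is a compact standard-library sketch of DBSCAN. The activity values, `eps`, and `min_pts` are hypothetical choices for illustration, not parameters reported in the reviewed studies. The property of interest is that an irregular record is labelled as noise (-1) rather than being forced into a cluster.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbours(i):
        # Every point within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already processed
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # not dense enough to start a cluster
            continue
        cluster += 1                      # i is a core point: open a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # former noise becomes a border point
            if labels[j] is not None:
                continue                  # already assigned, do not expand again
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts: # j is also core: keep expanding
                queue.extend(reachable)
    return labels

# Hypothetical weekly LMS activity: (logins, forum posts).
activity = [(1, 1), (1, 2), (2, 1), (2, 2),    # low-activity group
            (8, 8), (8, 9), (9, 8), (9, 9),    # high-activity group
            (5, 20)]                            # irregular record
labels = dbscan(activity, eps=1.5, min_pts=3)
```

With these settings the two dense groups receive their own cluster ids and the isolated record is left as noise, so it does not distort either group's profile.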
The performance of clustering-based predictive models depends on the choice of clustering techniques, machine learning algorithms, and feature selection strategies. While some clustering techniques are valued for their speed and simplicity, they rely on assumptions about cluster shape or size that may not hold in real-world educational data (
R. Liu et al., 2022). Others offer more flexibility in handling outliers or uneven distributions but require careful parameter tuning or carry high computational costs, which can reduce scalability in larger datasets (
Park et al., 2016;
Linden et al., 2023;
Nayak et al., 2023). The choice of predictive model also influences effectiveness. Simpler models like logistic regression and decision trees are commonly used for their interpretability and alignment with educational decision-making needs, but they may struggle with complex, non-linear relationships in the data (
Romero & Ventura, 2020;
A. Khan & Ghosh, 2020). More advanced techniques like ensemble methods and gradient boosting algorithms are increasingly adopted for their strong performance in high-dimensional and heterogeneous datasets (
R. Liu et al., 2022;
Jovanović et al., 2021;
Zhang et al., 2023). The features used in these models also influence their performance. Academic metrics, such as grades, attendance, or assessment performance, are commonly included due to their direct correlation with student outcomes, but they alone are insufficient to capture the complexity of student success (
Marbouti et al., 2016;
Sahni, 2023). Engagement data from LMS platforms, behavioural indicators like procrastination or submission timing, and socio-demographic factors provide additional layers of insight (
Ben Soussia et al., 2021;
Han, 2023).
Although clustering-based predictive models demonstrate clear benefits, they still face challenges concerning their fairness and generalisability. Many studies focus on data derived from digital learning environments, particularly LMS logs, which limits their relevance to more traditional or blended contexts (
Romero & Ventura, 2020;
Jovanović et al., 2021). Furthermore, ethical concerns regarding fairness, privacy, and transparency arise when incorporating socio-demographic features that may inadvertently influence model decisions in ways that disadvantage specific student groups (
A. Khan & Ghosh, 2020;
Miguéis et al., 2018).
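One simple way to check for the kind of group-level disadvantage described above is to compare how often a model flags students as at-risk in each demographic group. The sketch below computes per-group flag rates and their largest gap (often called a demographic parity difference). The group labels, the flags, and the choice of this particular check are assumptions for illustration; the reviewed studies are not stated to use it.

```python
from collections import defaultdict

def flag_rates(groups, flagged):
    """Share of students flagged as at-risk within each group."""
    counts = defaultdict(lambda: [0, 0])   # group -> [flagged, total]
    for group, flag in zip(groups, flagged):
        counts[group][0] += int(flag)
        counts[group][1] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}

def parity_gap(rates):
    """Largest difference in flag rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical model output for eight students from two regional backgrounds.
groups = ["metro"] * 4 + ["regional"] * 4
flagged = [1, 0, 0, 0, 1, 1, 1, 0]

rates = flag_rates(groups, flagged)
gap = parity_gap(rates)
```

A large gap does not by itself show that a model is unfair, since underlying risk may differ between groups, but it indicates where closer inspection of features and cluster assignments is warranted.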
4.1. Clustering Application in EDM
This section addresses the first sub-research question: How can clustering techniques be used in educational data mining (EDM)? It examines how clustering has been applied to extract patterns from educational datasets, focusing on student profiling, personalised learning, and curriculum design.
4.1.1. Clustering for Academic and Behavioural Profiling
Clustering techniques have been used in educational data mining to reveal underlying performance patterns by analysing student scores, participation, and engagement trends. Studies have shown that clusters formed based on assessment data and attendance records often correspond to performance levels such as high-achieving, average, and at-risk, enabling timely support and targeted interventions (
Parack et al., 2012;
López et al., 2012). These clusters also capture students who consistently excel across different subjects, which helps institutions replicate effective teaching strategies and align instructional design with successful learner behaviours (
Linden et al., 2023;
Han, 2023).
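The mapping from clusters to performance levels can be sketched as follows. Students are grouped on assessment score and attendance, and each cluster is then named by ranking its mean score. Average-linkage agglomerative clustering is used here only as one possible algorithm; the data, the choice of three clusters, and the profile names are hypothetical.

```python
import math

def average_linkage(points, a, b):
    """Mean pairwise distance between the members of clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        a, b = min(pairs, key=lambda xy: average_linkage(
            points, clusters[xy[0]], clusters[xy[1]]))
        clusters[a].extend(clusters.pop(b))   # b > a, so index a stays valid
    return clusters

PROFILE_NAMES = ["at-risk", "average", "high-achieving"]  # lowest to highest

def name_profiles(points, clusters):
    """Name each cluster by the rank of its mean assessment score."""
    def mean_score(cluster):
        return sum(points[i][0] for i in cluster) / len(cluster)
    profile = {}
    for name, cluster in zip(PROFILE_NAMES, sorted(clusters, key=mean_score)):
        for i in cluster:
            profile[i] = name
    return [profile[i] for i in range(len(points))]

# Hypothetical (assessment score, attendance rate), both scaled to 0..1.
records = [(0.90, 0.95), (0.85, 0.90), (0.88, 0.92),
           (0.60, 0.70), (0.65, 0.75), (0.62, 0.72),
           (0.30, 0.40), (0.35, 0.30), (0.25, 0.35)]

clusters = agglomerative(records, k=3)
profiles = name_profiles(records, clusters)
```

Naming clusters after the fact in this way is a modelling decision: the algorithm only produces groups, and the labels reflect how an analyst chooses to interpret them.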
Beyond academic records, clustering behavioural data from learning management systems—such as login frequency, content interaction, and forum participation—has helped uncover engagement patterns that are not immediately visible through grades alone. Disengaged students, for example, can be grouped based on irregular access or last-minute submissions, both of which have been linked to higher dropout rates (
K. Na & Tasir, 2017;
Ben Soussia et al., 2021). Attendance clustering, too, has revealed trends where students with sporadic participation often show signs of academic struggle, reinforcing the link between behavioural habits and performance outcomes (
Han, 2023;
Linden et al., 2023).
Behavioural profiling through clustering is also applied to understand how students interact with different content types. Distinct engagement preferences—such as favouring quizzes over readings, or discussion forums over video lectures—have informed curriculum adjustments that align better with student learning preferences, increasing participation and satisfaction (
Mohamed Nafuri et al., 2022;
K. Na & Tasir, 2017). Similarly, course-level analyses have identified classes with persistently low engagement. Clustering these courses based on participation metrics has led to instructional redesigns that have improved subsequent student interaction and academic success (
Sahni, 2023;
Romero & Ventura, 2020). In many studies, behavioural clusters have been incorporated as features in predictive models, improving classification accuracy for identifying at-risk students. Groupings based on LMS behaviour, when used in combination with demographic and academic data, have improved the precision of models like decision trees and ensemble learners (
Jovanović et al., 2021;
R. Liu et al., 2022;
Shoaib et al., 2022). This integration allows for early intervention strategies grounded in actual student activity rather than retrospective performance alone.
Clustering for academic and behavioural profiling, when based on robust features and well-preprocessed data, contributes substantially to identifying both students who need support and those whose successful patterns can be scaled more broadly. The growing body of research reflects the value of these techniques in building richer, more actionable learner profiles (
López et al., 2012;
Dass et al., 2021;
Moubayed et al., 2020).
4.1.2. Clustering for Personalised Learning
Clustering has become a valuable tool for enabling personalised learning by identifying patterns in student engagement, resource preferences, and learning pace. These patterns allow educators to tailor instructional strategies and support mechanisms to better suit individual and group needs, which has been shown to improve participation, satisfaction, and performance in various learning environments (
Mohamed Nafuri et al., 2022;
Romero & Ventura, 2020).
Several studies have used clustering to examine how students engage with different types of learning materials. When students were grouped according to their dominant content format, such as video-based content, interactive quizzes, or reading materials, they responded more positively to content tailored to their engagement style (
Mohamed Nafuri et al., 2022;
K. S. Na & Tasir, 2017). This alignment between content type and learner preference not only increased interaction but also supported deeper comprehension and retention. Differences in students’ pacing have been captured through clustering based on time spent engaging with learning modules. This information helped educators differentiate instructional paths, providing faster learners with more advanced tasks and slower learners with additional support (
Han, 2023). Tailored materials not only improve engagement but also positively influence performance, especially when resources are aligned with students’ preferred learning modalities (
Han, 2023;
Mohamed Nafuri et al., 2022;
Shoaib et al., 2022).
Beyond content preferences, clustering has also been applied to identify behavioural and cognitive challenges. Some studies have identified student groups that consistently struggled with types of assessment tasks, prompting targeted interventions to address conceptual or procedural difficulties (
Sahni, 2023;
Y. Liu et al., 2022). Whether linked to conceptual misunderstandings, task-specific difficulties, or motivational barriers, these insights inform targeted instructional support (
Romero & Ventura, 2020;
A. Khan & Ghosh, 2020). Studies have demonstrated that providing extra support or engagement prompts to less active clusters can enhance participation and foster a more inclusive learning environment (
K. Na & Tasir, 2017;
Dass et al., 2021;
T. Liu et al., 2022). Supporting slower-paced learners with additional scaffolding and allowing advanced learners to progress independently has been shown to improve retention and reduce frustration (
Han, 2023;
Linden et al., 2023;
Santoso & Yulia, 2019).
Personalisation also extends to students’ collaborative preferences. Clustering has helped distinguish between learners who thrive in group discussions and those who prefer independent work (
Han, 2023;
Linden et al., 2023;
Santoso & Yulia, 2019). Educators have used these insights to assign roles in collaborative activities or offer alternative learning paths that align with students’ social learning styles, resulting in improved engagement and learner satisfaction (
Ben Soussia et al., 2021;
Romero & Ventura, 2020). This adaptability is particularly valuable in hybrid environments, where in-person and online participation patterns vary widely (
Ben Soussia et al., 2021).
The integration of clustering into adaptive learning systems further enhances personalisation by adjusting the content delivery sequence (
Y. Liu et al., 2022;
Yağcı, 2022). Students showing early mastery can be routed toward more advanced topics, while those needing reinforcement receive additional materials, leading to improvements in both learning efficiency and retention (
Jovanović et al., 2021).
4.1.3. Clustering for Curriculum and Course Design
Several studies have shown that clustering can guide curriculum and course development by analysing trends in student engagement, learning outcomes, and course delivery (
Sahni, 2023;
K. Na & Tasir, 2017;
Ben Soussia et al., 2021). These insights help institutions identify gaps, streamline course progression, and align curricula with student needs. Clustering methods have been used to classify courses based on participation levels, with consistently low-engagement courses flagged for redesign (
Romero & Ventura, 2020;
López et al., 2012). Redesigning such courses with clearer content and more interactive components has led to measurable improvements in participation and student satisfaction (
K. S. Na & Tasir, 2017;
Linden et al., 2023).
Cluster analysis of assessment data has also revealed repeated performance struggles in specific fields, such as quantitative reasoning or critical analysis, prompting targeted interventions like remedial workshops or redesigned learning modules (
K. Na & Tasir, 2017;
Romero & Ventura, 2020;
Santoso & Yulia, 2019). When these patterns are identified early, institutions have introduced targeted workshops or modified curricular content to bridge skill gaps (
Jovanović et al., 2021). When applied to curriculum alignment, clustering techniques have been used to identify overlapping content between adjacent courses or inconsistencies in learning objectives that may burden students with redundant materials (
Jovanović et al., 2021). Aligning outcomes across programs using clustering insights has helped streamline course progression and reduce student cognitive load (
López et al., 2012;
Yağcı, 2022). Other studies have extended this approach to ensure that program-level competencies match evolving industry needs, enhancing the employability of graduates (
Namoun & Alshanqiti, 2020). The selection and design of elective courses have similarly benefited from clustering applications. Student preferences, combined with academic performance and enrolment trends, have been analysed to identify popular thematic areas, such as artificial intelligence and sustainability, guiding institutions to expand offerings that resonate with both student interest and job market demand (
Ben Soussia et al., 2021;
Romero & Ventura, 2020). Clustering has also informed the evaluation of delivery formats by linking learning outcomes to students’ interaction with different modes of content (
A. Khan & Ghosh, 2020;
Yağcı, 2022). Higher achievement in courses with a mix of visual, textual, and interactive components has supported the shift toward blended models in several institutions (
Han, 2023;
Linden et al., 2023). These models allow flexibility while maintaining structure, improving participation in both synchronous and asynchronous components (
Jovanović et al., 2021).
4.2. Integration of Clustering and Predictive Models
This section addresses the second sub-research question: How are predictive models integrated with clustering techniques to identify at-risk students? It explores how clustering outputs are used to enhance predictive modelling through data preprocessing, feature engineering, and hybrid modelling frameworks, with applications across various learning environments and institutional contexts.
4.2.1. Impact of Integration
Integrating clustering with predictive modelling improves predictive accuracy by structuring raw data into cohesive groups. This preprocessing transforms unstructured or noisy educational data into meaningful groupings that serve as inputs for predictive algorithms, so that those algorithms work with more interpretable and homogeneous subsets of data (
Romero & Ventura, 2020;
López et al., 2012). Models built on cluster-informed data consistently perform better than those trained on unstructured features, with studies reporting improvements in precision, recall, and overall predictive stability (
Sahni, 2023;
Romero & Ventura, 2020;
Parack et al., 2012;
T. Liu et al., 2022).
Apart from accuracy, integrated models offer greater interpretability. Clusters derived from behavioural or academic data provide detailed profiles of student needs that help educators to trace prediction outcomes back to meaningful patterns such as disengagement, attendance fluctuations, or underperformance within specific subgroups (
López et al., 2012;
Linden et al., 2023;
Jovanović et al., 2021). Decision-making becomes more transparent when predictions are linked to group-level trends (
Y. Liu et al., 2022). Cluster-derived features have helped clarify why some students are flagged as at risk, such as irregular attendance, late submission patterns, or disengagement from digital platforms (
López et al., 2012;
Linden et al., 2023;
Jovanović et al., 2021). In many studies, these insights were essential for predicting and preventing dropout, especially in digital or hybrid environments where academic failure is often preceded by engagement decline (
Al-Shabandar et al., 2017). Some studies show that behavioural clusters have been used to uncover systemic patterns of non-participation that traditional models may miss, allowing institutions to prioritise support for those who show early signs of academic risk (
López et al., 2012;
Jovanović et al., 2021;
K. Na & Tasir, 2017).
Cluster-based predictive models also support personalised interventions. When predictive outputs are paired with insights about cluster characteristics, educators can design responses that match the learning pace, preferences, or behavioural needs of each group (
Sahni, 2023;
Parack et al., 2012). Studies in hybrid and online settings have demonstrated how clustering engagement data helps tailor interventions, such as offering time management support to late submitters or peer-based activities for students with collaborative tendencies (
Y. Liu et al., 2022;
Ben Soussia et al., 2021;
Romero & Ventura, 2020). When clustering was used to distinguish students by learning preference or motivation, it enabled personalised support strategies such as visual content for video-oriented learners or self-paced modules for independent learners (
Romero & Ventura, 2020;
Sahni, 2023).
Integrated models help uncover program-level and institutional insights. Clustering reveals patterns of engagement and performance that span across units or cohorts, which can be incorporated into predictive frameworks to inform decisions about curriculum design, resource allocation, or student support infrastructure (
Romero & Ventura, 2020;
Sahni, 2023). Clustering outputs such as group labels and centroids help predictive models deliver actionable insights, allowing educators to design targeted support strategies for specific student groups (
López et al., 2012;
Jovanović et al., 2021). Institutions could gain deeper insights into student behaviours and performance, enhancing both individual outcomes and institutional strategies (
Linden et al., 2023;
Namoun & Alshanqiti, 2020;
Santoso & Yulia, 2019).
4.2.2. Applications in Educational Contexts
Studies demonstrate that clustering outputs, such as student groupings based on engagement or performance, significantly improve the accuracy and interpretability of predictive models (
Romero & Ventura, 2020;
López et al., 2012). These methods combine clustering outputs with predictive algorithms to address diverse challenges in online, blended, and traditional learning environments, enabling more effective interventions and support systems (
K. S. Na & Tasir, 2017;
Linden et al., 2023;
T. Liu et al., 2022).
Online Learning Environment
Online education generates large volumes of behavioural data from learning management systems (LMS), including clickstream logs, time spent on resources, and forum participation. Clustering techniques are often used to structure this data before predictive modelling, improving dropout predictions and engagement analysis. Several studies have demonstrated that clustering enhances the detection of at-risk students by segmenting learners into distinct engagement categories, such as highly engaged, moderately engaged, and disengaged groups (
K. Na & Tasir, 2017;
Linden et al., 2023;
Romero & Ventura, 2020).
K. Na and Tasir (
2017) showed that hierarchical clustering could effectively group students into highly engaged, moderately engaged, and disengaged clusters based on LMS activity. Logistic regression models utilizing these clusters as input features achieved higher recall rates, identifying disengaged students early.
Sahni (
2023) applied DBSCAN to noisy LMS data, such as inconsistent login patterns and irregular forum participation, identifying clusters of disengaged students that would otherwise remain undetected. When these clusters were used in a Random Forest model, dropout prediction accuracy improved by 20%.
Romero and Ventura (
2020) reported similar findings, where clustering LMS engagement data enhanced the precision of machine learning models in predicting at-risk students. When integrated into predictive models, these engagement profiles significantly enhance dropout prediction accuracy, enabling targeted interventions like personalised reminders or tailored support materials to be implemented proactively (
Linden et al., 2023;
Romero & Ventura, 2020).
Blended and Hybrid Learning
Blended learning environments involve both online and in-person interactions, requiring models to handle diverse data types. Studies have shown that clustering methods help segment students based on combined behavioural, engagement, and attendance data, enabling predictive models to address the needs of combined online and face-to-face educational interaction. When these clusters were integrated into Random Forest models, prediction accuracy improved by 15%, and the models highlighted mixed engagement patterns often missed by standalone algorithms (
T. Liu et al., 2022;
Romero & Ventura, 2020). Predictive models leveraging these clusters enabled educators to provide targeted interventions, such as offering additional group activities for collaborative learners or providing self-paced modules for independent learners (
Santoso & Yulia, 2019;
Y. Liu et al., 2022).
Traditional Classroom Setting
In traditional classrooms, where digital engagement data is limited, clustering techniques focus on academic and attendance records to enhance predictive modelling.
Marbouti et al. (
2016) applied K-means clustering to segment students by early grades and attendance patterns, producing clusters that were used in logistic regression models to identify at-risk students before midterm exams. The study found that clustering improved prediction accuracy by structuring the data into more homogeneous subsets. Clustering also aids in addressing systemic inequities in traditional settings. Combining clustering with socio-economic data, such as parental education and household income, has revealed patterns of underperformance among students from disadvantaged backgrounds. Predictive models trained on these clusters identified students requiring financial aid or academic support, helping to reduce disparities in educational outcomes (
Jovanović et al., 2021;
Namoun & Alshanqiti, 2020).
Program and Curriculum Optimisation
Clustering techniques extend beyond individual student analysis to support institutional decision-making by analysing patterns across multiple courses or curricula.
Sahni (
2023) used K-means clustering to analyse engagement metrics across multiple courses, identifying clusters of courses with consistently low participation rates. These clusters were incorporated into predictive models, revealing systemic issues such as unclear objectives and inadequate support. Interventions based on these findings, such as redesigning assessments and enhancing course delivery, resulted in higher engagement and improved student outcomes in subsequent semesters.
Romero and Ventura (
2020) applied DBSCAN to cluster courses based on difficulty and dropout rates. The analysis revealed foundational courses with high attrition due to poor alignment with students’ prior knowledge. Predictive models leveraging these clusters helped institutions introduce preparatory modules and additional support for at-risk courses, reducing dropout rates by 18% in the following academic year.
López et al. (
2012) found similar applications in engineering programs, where clustering student performance and preferences informed the design of specialized elective tracks, aligning course offerings with emerging industry demands.
4.2.3. Advanced Integration Strategies
Recent work in EDM has increasingly moved toward hybrid modelling approaches, driven by the limitations of single-source predictive models and the need to capture the multifaceted nature of student engagement and risk (
R. Liu et al., 2022). These models combine academic records, behavioural indicators, engagement metrics, and socio-demographic variables to create hybrid models with enhanced predictive power (
Ben Soussia et al., 2021). Hybrid models use diverse data types to provide a comprehensive understanding of student risk profiles and enable more precise interventions (
R. Liu et al., 2022;
Santoso & Yulia, 2019). Feature engineering plays an important role in preparing educational data for clustering. In the reviewed studies, features commonly include academic metrics (grades, GPA, assessment outcomes), behavioural indicators (LMS activity patterns, clickstream data), engagement metrics (forum participation, video viewing duration), and socio-demographic attributes (age, gender, region). These features are often normalised or scaled to ensure compatibility with distance-based clustering algorithms such as k-means (
Romero & Ventura, 2020;
T. Liu et al., 2022). Dimensionality reduction techniques like PCA are applied in several studies to reduce noise and improve the stability of clustering outcomes, particularly for high-dimensional behavioural data (
Vora & Rajamani, 2019). Some studies also encode categorical features for hierarchical clustering or apply transformation steps to enhance cluster separation (
Sahni, 2023). In many cases, the cluster labels generated from these engineered features are themselves reintroduced as features for supervised prediction, acting as an additional layer of abstraction that captures group-level behaviours or risk patterns (
Jovanović et al., 2021;
K. Na & Tasir, 2017). This demonstrates that clustering not only benefits from well-engineered inputs but also serves as a feature engineering step for downstream predictive models. Integrating diverse data sources allows these models to identify at-risk students with greater precision, even in complex and varied learning environments (
Romero & Ventura, 2020;
Jovanović et al., 2021).
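The pipeline described above (scale the features, cluster, then reuse the cluster label as an input to a supervised model) can be sketched as follows. This is a minimal illustration in plain Python rather than the implementation used in any reviewed study; the student records, the choice of k = 2, and the deterministic seeding with the first k points are assumptions made for the example.

```python
import math

def zscore_columns(rows):
    """Standardise each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def kmeans(points, k, max_iter=50):
    """Lloyd's algorithm, seeded with the first k points so the run is repeatable."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Hypothetical records: [logins per week, forum posts, average grade]
students = [
    [12, 8, 82],
    [1, 0, 48],
    [11, 6, 78],
    [2, 1, 55],
    [10, 9, 88],
    [1, 1, 51],
]

scaled = zscore_columns(students)
labels, centroids = kmeans(scaled, k=2)

# Append the cluster label as an extra column for a downstream classifier.
augmented = [row + [label] for row, label in zip(scaled, labels)]
```

Without the scaling step the grade column, which has the largest numeric range, would dominate the Euclidean distances. With more than two clusters, the label would normally be one-hot encoded rather than appended as a single number, since cluster indices carry no order.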
Ben Soussia et al. (
2021) developed a hybrid model that integrated academic grades, attendance records, and engagement metrics. This model achieved 15% higher accuracy than models relying solely on academic data.
R. Liu et al. (
2022) combined clickstream data with assessment scores to identify at-risk students in online learning environments, reporting significantly improved accuracy. These multi-dimensional models provide a more complete picture of students’ risk profiles, allowing educators to tailor interventions more precisely (
Y. Liu et al., 2022;
Santoso & Yulia, 2019).
Effective hybrid modelling also depends on careful feature selection. Feature selection techniques, such as Principal Component Analysis (PCA) or feature importance rankings, help to refine the model by retaining the predictors that contribute most to its performance and removing less impactful features (
Marbouti et al., 2016;
Vora & Rajamani, 2019). This process ensures that models remain computationally efficient and focused on the predictors most strongly associated with academic risk. Studies have shown that when feature selection incorporates academic metrics, behavioural logs, and socio-economic indicators, the resulting models not only improve accuracy but also provide deeper contextual understanding (
Marbouti et al., 2016;
Namoun & Alshanqiti, 2020).
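To make the idea of ranking and pruning predictors concrete, the sketch below scores each feature by the absolute Pearson correlation between its values and the at-risk label, then keeps the top two. This simple filter is a stand-in chosen for brevity; it is not PCA and not the Random Forest importance ranking used in the cited studies, and the feature names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(table, target):
    """Order feature names from most to least correlated with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in table.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical feature table (one list per feature, one entry per student)
feature_table = {
    "logins_per_week": [12, 1, 11, 2, 10, 1],
    "avg_grade": [82, 48, 78, 55, 88, 51],
    "student_id_last_digit": [3, 7, 1, 4, 9, 2],  # deliberately uninformative
}
at_risk = [0, 1, 0, 1, 0, 1]

ranked = rank_features(feature_table, at_risk)
selected = ranked[:2]  # keep the two strongest predictors
```

A univariate filter of this kind ignores interactions between features, which is one reason the reviewed studies tend to prefer model-based importance scores.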
Vora and Rajamani (
2019) used Random Forest importance rankings to prioritize behavioural features in a hybrid model, ensuring that only the most relevant variables were included.
Jovanović et al. (
2021) combined socio-economic indicators, such as parental education and family income, with academic metrics to predict dropout risks. The study found that including socio-economic features improved the model’s ability to contextualize academic performance, enabling more targeted support.
Hybrid models also contribute to personalised learning by identifying patterns across academic, behavioural, and engagement data. When diverse features are integrated, models can generate recommendations aligned with students’ specific learning needs, such as offering targeted feedback or adapting instructional materials based on individual progress and preferences (
Santoso & Yulia, 2019;
Romero & Ventura, 2020). These outputs have been shown to enhance student outcomes and increase satisfaction, particularly when models are used to allocate support services to groups with similar learning profiles (
Ben Soussia et al., 2021;
Jovanović et al., 2021). Research also suggests that tailoring interventions through hybrid frameworks can reduce disengagement and improve course completion rates, especially in flexible or hybrid learning environments (
T. Liu et al., 2022;
Namoun & Alshanqiti, 2020).
Feature selection techniques improve the scalability and adaptability of hybrid models (
Matz et al., 2023). Reducing the number of features through dimensionality reduction techniques such as PCA allows models to process large volumes of institutional data more efficiently, making them suitable for use in environments with expanding digital infrastructures (
Matz et al., 2023;
Marbouti et al., 2016). Feature selection also addresses the risk of overfitting, a common issue in high-dimensional educational datasets, by retaining only the most relevant predictors (
Y. Liu et al., 2022;
Vora & Rajamani, 2019). Studies have shown that models with well-selected features perform more consistently across different cohorts and contexts, supporting the development of generalisable solutions for student support and risk identification (
Namoun & Alshanqiti, 2020;
Vora & Rajamani, 2019;
Yağcı, 2022).
4.3. Challenges and Limitations of Clustering-Based Predictive Models
This section addresses the third sub-research question: What are the challenges and limitations of applying clustering techniques and predictive models in EDM? Drawing from the reviewed literature, four interrelated dimensions are examined: technical challenges, generalisability and context-dependency, ethical and privacy considerations, and institutional barriers.
4.3.1. Technical Constraints
One major technical challenge is the computational complexity of clustering techniques. When working with large educational datasets, especially those containing fine-grained behavioural or engagement logs, clustering becomes increasingly resource-intensive (
Park et al., 2016). The cost is compounded for methods that rely on careful parameter tuning, such as the minimum number of points per cluster and the distance threshold, which significantly affect cluster quality and model outcomes. Improper parameter selection often leads to over-clustering or under-clustering, reducing the reliability of results, while the repeated runs needed to tune these parameters may not be feasible for institutions with limited computational infrastructure (
Linden et al., 2023;
Nayak et al., 2023).
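The sensitivity to the distance threshold and the minimum number of points can be seen in a small experiment. The sketch below is a minimal density-based clustering routine in the style of DBSCAN, written in plain Python for illustration only; the data points and the three threshold values are hypothetical, and the neighbour count includes the point itself.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN-style clustering; label -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels

def summarise(labels):
    """Return (number of clusters, number of noise points)."""
    return len(set(labels) - {-1}), labels.count(-1)

# Hypothetical 2-D engagement data: two tight groups and one outlier
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
    (9.0, 1.0),
]

results = {eps: summarise(dbscan(points, eps, min_pts=3)) for eps in (0.1, 0.5, 6.0)}
```

On this toy data a threshold of 0.5 recovers the two groups and isolates the outlier, 0.1 labels every point as noise, and 6.0 merges everything into a single cluster, which mirrors the under- and over-clustering problem described above.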
Imbalanced datasets present another challenge. In most educational datasets, the proportion of at-risk students is much smaller than that of non-at-risk students. This imbalance often leads to high accuracy rates that mask poor performance in detecting the very group of concern (
A. Khan & Ghosh, 2020;
Dass et al., 2021). Techniques such as generating synthetic samples or applying weighted learning algorithms are often employed, but they add complexity to model development and require careful validation to avoid overfitting (
Romero & Ventura, 2020;
Dass et al., 2021).
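A short worked example shows why accuracy alone is misleading here. In the hypothetical cohort below, 5 of 100 students are at risk, and a model that never flags anyone still scores 95% accuracy while detecting none of them. The class weights are computed with the common "balanced" heuristic, n_samples / (n_classes * class_count), which scikit-learn also documents for its class_weight="balanced" option; the cohort size and labels are assumptions made for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of truly positive cases that the model flagged."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in flagged) / len(flagged)

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * count) for label, count in counts.items()}

# Hypothetical cohort: 5 at-risk students (label 1) out of 100
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a trivial model that never flags anyone

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
weights = balanced_class_weights(y_true)
```

Here the at-risk class receives a weight of 10.0 and the majority class roughly 0.53, so each missed at-risk student costs the learner about nineteen times as much as a misclassified majority student.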
Another challenge is the presence of noise and missing values in educational data. Educational data, particularly behavioural records from LMS platforms, often include noise, inconsistencies, and missing values. These issues degrade the performance of clustering and reduce the reliability of predictive outcomes if not properly handled (
López et al., 2012;
K. S. Na & Tasir, 2017). Although preprocessing techniques, such as imputing missing values or filtering noise, improve data quality, they normally demand additional processing time. The integration of clustering output into predictive models also introduces additional layers of technical decision-making. Outputs like group labels or behavioural centroids must be carefully engineered into features that align with the assumptions of downstream algorithms (
Jovanović et al., 2021;
T. Liu et al., 2022). Without a structured feature selection process, these inputs may introduce irrelevant or redundant information, increasing the risk of overfitting where models perform well during training but generalise poorly to unseen data (
Marbouti et al., 2016;
Matz et al., 2023).
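As a minimal sketch of the imputation step mentioned above, the following routine replaces each missing value with the median of its column. Median imputation is only one of several possible strategies and is chosen here for simplicity; the records and the use of None to mark missing entries are assumptions for the example.

```python
from statistics import median

def impute_median(rows):
    """Replace None in each column with the median of that column's observed values."""
    columns = list(zip(*rows))
    medians = [median(v for v in col if v is not None) for col in columns]
    return [[m if v is None else v for v, m in zip(row, medians)] for row in rows]

# Hypothetical LMS records: [logins per week, hours online]; None marks a missing value
records = [
    [10, 4.0],
    [None, 2.0],
    [6, None],
    [2, 0.0],
]

imputed = impute_median(records)
```

Because distance-based clustering cannot handle missing entries directly, a step of this kind (or the removal of incomplete records) has to run before the clustering stage, which is part of the additional processing cost noted above.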
4.3.2. Generalisability and Contextual Limitations
Predictive models developed within specific educational contexts often struggle to generalise across diverse learning environments due to substantial differences in dataset characteristics and availability. Models trained using rich datasets from online or hybrid learning environments, which typically include detailed clickstream logs and engagement indicators, frequently fail to produce reliable outcomes when applied to traditional classroom settings where digital interaction data is sparse or unavailable (
Jovanović et al., 2021;
Romero & Ventura, 2020). Datasets collected from well-resourced institutions with comprehensive data collection infrastructures seldom reflect the realities of institutions with limited technological capabilities, further constraining model transferability (
K. Na & Tasir, 2017;
Namoun & Alshanqiti, 2020;
Yağcı, 2022).
Clustering techniques also face generalisability challenges due to their sensitivity to dataset structure, as many clustering techniques rely on data having specific structural characteristics, such as evenly distributed or clearly defined clusters (
Namoun & Alshanqiti, 2020;
Yağcı, 2022). These assumptions often fail in real-world educational datasets, resulting in inconsistent clustering outcomes when methods are applied across different institutions or contexts (
R. Liu et al., 2022). Certain clustering algorithms demand extensive parameter adjustments tailored specifically to individual datasets, complicating efforts to standardise their implementation and limiting their effectiveness in diverse educational scenarios (
Linden et al., 2023;
Nayak et al., 2023).
Features like parental education, family income, or geographic location further challenge generalisation. These socio-demographic factors, commonly included in hybrid predictive models, vary significantly across institutional and cultural boundaries (
Santoso & Yulia, 2019). Models developed in one socio-economic or regional context may therefore fail to accurately reflect student profiles in other settings, leading to potentially misleading predictions (
Jovanović et al., 2021;
Yağcı, 2022). This issue is particularly pronounced in cross-national applications, where differences in educational systems, grading standards, and engagement behaviours must be accounted for (
Santoso & Yulia, 2019;
Namoun & Alshanqiti, 2020).
Variability in institutional capacity to collect, process, store, and analyse educational data also restricts generalisability. Institutions with limited digital infrastructure frequently produce datasets that are fragmented, incomplete, or inconsistently formatted, hindering the direct application of models developed elsewhere (
Romero & Ventura, 2020;
López et al., 2012). These disparities hinder the transferability of models and require significant customization for successful implementation.
While public datasets provide benchmarks for model evaluation, their specific focus on online learning limits their generalisability to other contexts (
Y. Liu et al., 2022;
K. Na & Tasir, 2017). Private datasets, although rich in context-specific details, are often unavailable for broader validation due to privacy concerns, further restricting the scope of generalisable findings (
Jovanović et al., 2021). The integration of clustering and predictive models complicates generalisability due to the combined dependence on both clustering structure and predictive algorithm performance (
Romero & Ventura, 2020;
S. J. Kleter, 2022).
4.3.3. Ethical and Privacy Concerns
The use of sensitive data, such as socio-economic and demographic information, raises ethical concerns related to fairness and bias in predictive modelling. Models incorporating socio-economic indicators, like family income or parental education, often disproportionately classify students from disadvantaged backgrounds as at-risk, irrespective of their actual academic performance (
Jovanović et al., 2021;
Yağcı, 2022). This overrepresentation can reinforce negative stereotypes, stigmatise specific student groups, and amplify existing social inequalities within educational systems (
Namoun & Alshanqiti, 2020;
Santoso & Yulia, 2019). To address these issues, it is essential to design models that explicitly consider fairness constraints, ensuring predictions do not perpetuate biases or unintentionally disadvantage marginalised students (
Miguéis et al., 2018;
Dass et al., 2021).
Predictive accuracy improvements through inclusion of sensitive features must be balanced against the risk of embedding systemic inequities into the analytical processes (
Yağcı, 2022;
Namoun & Alshanqiti, 2020). Researchers caution that while demographic and socio-economic variables may enhance model performance, their uncritical use can obscure deeper, structural factors influencing student outcomes, potentially diverting attention from necessary institutional reforms (
Romero & Ventura, 2020;
A. Khan & Ghosh, 2020). Privacy risks also significantly increase as educational data becomes more detailed and interconnected. Integrating academic records, behavioural logs from LMS, and demographic details raises the potential for re-identifying individual students, even when data appears anonymised (
K. S. Na & Tasir, 2017;
López et al., 2012). Effective anonymisation, rigorous consent processes, and secure data management practices are critical in addressing these privacy risks (
Romero & Ventura, 2020). However, implementing robust privacy protections requires significant technical expertise and institutional resources, posing considerable challenges for smaller or resource-limited educational institutions (
K. Na & Tasir, 2017;
S. J. Kleter, 2022).
Several practical approaches can help mitigate demographic bias in cluster-based predictive models. These include using fairness-aware machine learning techniques, applying careful feature selection to reduce reliance on sensitive attributes, conducting regular bias audits across demographic subgroups, and reporting model performance separately for different student groups (
Holstein et al., 2019;
Binns, 2018;
Ifenthaler & Yau, 2020). Adopting these practices can improve the transparency and fairness of predictive models in educational contexts.
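As an illustration of the subgroup reporting and bias audit described above, the following is a minimal sketch rather than a method taken from the reviewed studies. It assumes binary labels (1 = at risk), binary predictions, and a single group attribute per student; the function names `subgroup_report` and `max_gap`, the group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def subgroup_report(y_true, y_pred, groups):
    """Per-group recall and false-positive rate for binary labels (1 = at risk)."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1 and pred == 1:
            key = "tp"
        elif truth == 0 and pred == 1:
            key = "fp"
        elif truth == 0 and pred == 0:
            key = "tn"
        else:
            key = "fn"
        counts[group][key] += 1

    report = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        report[group] = {
            "n": positives + negatives,
            # None marks a metric that is undefined for this group.
            "recall": c["tp"] / positives if positives else None,
            "fpr": c["fp"] / negatives if negatives else None,
        }
    return report


def max_gap(report, metric):
    """Largest difference in a metric between any two groups."""
    values = [r[metric] for r in report.values() if r[metric] is not None]
    return max(values) - min(values) if values else 0.0


# Hypothetical toy data: two groups of four students each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

report = subgroup_report(y_true, y_pred, groups)
for name, metrics in sorted(report.items()):
    print(name, metrics)
print("recall gap:", max_gap(report, "recall"))
```

In this toy example the model misses half of the at-risk students in group B while also flagging more of its students incorrectly, a disparity that a single aggregate accuracy figure would hide. What size of gap is acceptable is an institutional judgement and is not prescribed here.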
4.3.4. Institutional Barriers
Institutional barriers also affect the practical implementation of clustering and predictive models. Many institutions face resource constraints, such as limited technical infrastructure, inadequate funding, and insufficient technical expertise, which restrict their capacity to adopt advanced analytical techniques (
Romero & Ventura, 2020;
López et al., 2012). Smaller or resource-constrained institutions often lack the necessary resources to effectively collect, manage, and analyse comprehensive datasets required for sophisticated hybrid modelling approaches (
Marbouti et al., 2016;
Namoun & Alshanqiti, 2020).
Resistance to data-driven methods is another challenge, especially in traditional educational environments (
Ben Soussia et al., 2021;
Dass et al., 2021). Many stakeholders question the validity, interpretability, or fairness of model outcomes, particularly when predictions influence critical decisions related to student support or resource allocation (
Miguéis et al., 2018;
A. Khan & Ghosh, 2020). Without addressing these institutional barriers through both technical and organisational strategies, the potential benefits of clustering-based predictive models remain limited in educational practice (
Jovanović et al., 2021;
Matz et al., 2023).
5. Conclusions
Clustering techniques and predictive models have significant potential in EDM, particularly for improving early identification and tailored support for at-risk students, a key strategy to reduce student dropout and improve retention. This review has synthesised the current literature to clarify how these methods enhance student performance prediction, highlight key challenges, and suggest future directions for research. Clustering methods support the transformation of complex educational data into meaningful groups, which enables predictive models to achieve higher accuracy and more relevant insights. However, the effectiveness of clustering approaches often depends on the structure and characteristics of the dataset, requiring further refinement and adaptation to diverse educational contexts.
One notable contribution of this study is identifying how clustering can be leveraged not only for grouping students but also for improving prediction accuracy by tailoring predictive models to each distinct student group. Hybrid models that combine multiple data sources, including academic performance, engagement metrics, behavioural indicators, and socio-economic factors, provide a comprehensive understanding of student risk profiles and demonstrate improved prediction accuracy compared to single-source approaches. However, selecting the most relevant features remains a challenge due to computational complexity and variations in feature importance across datasets. Techniques such as PCA and Random Forest feature importance ranking have been applied in this context, but their scalability and adaptability to different educational datasets require further exploration.
While clustering and predictive modelling have demonstrated substantial benefits, challenges remain in technical implementation, generalisability, ethical considerations, and institutional adoption. Computational demands and scalability issues affect the application of clustering techniques to large datasets, while imbalanced and noisy data reduce prediction reliability. The generalisability of models trained in specific educational contexts is another concern, as datasets, feature distributions, and student behaviours vary widely across regions and institutions. Ethical considerations, such as bias in predictive models and privacy concerns regarding sensitive socio-economic data, further complicate the adoption of these techniques.
5.1. Limitations
This review has several limitations that should be acknowledged. First, the search was limited to peer-reviewed publications written in English, which may have excluded relevant studies published in other languages or in grey literature. Second, the review only included publications from 2010 to 2025, which may not fully capture earlier foundational work in the field. Third, although the search strategy was carefully designed and followed PRISMA guidelines, it is possible that some relevant studies were missed due to variations in terminology or indexing across databases. In addition, this review presents a qualitative synthesis of the literature rather than a formal meta-analysis, which limits the ability to statistically compare the effectiveness of different approaches. These limitations highlight the need for ongoing reviews as the field evolves and for future studies to consider broader data sources and more diverse methodological perspectives.
5.2. Practical Recommendations
To guide future research and practice, several key considerations can support the effective design of cluster-based prediction models in educational contexts. Researchers should carefully select clustering techniques that align with the structure and scale of the dataset. It is also important to assess clustering quality using multiple validation metrics to ensure robustness. When integrating clustering with predictive models, it is recommended to evaluate whether tailoring predictive models to each cluster improves performance over general models. Feature selection should prioritise variables that are both predictive and explainable. Threshold tuning should be performed in line with institutional priorities, with attention to balancing precision and recall. Finally, model evaluation should include both technical performance metrics and practical considerations, such as interpretability and alignment with support strategies. Following these guidelines can help ensure that cluster-based prediction models provide actionable insights that enable earlier interventions and reduce dropout rates through timely and personalised student support.
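The threshold-tuning recommendation above can be illustrated with a short sketch. It is an assumption-laden example, not a procedure drawn from the reviewed studies: it assumes the model outputs a risk score per student, expresses the institutional priority through the beta parameter of the F-beta score (beta > 1 favours recall, beta < 1 favours precision), and uses hypothetical function names and toy data.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when students with score >= threshold are flagged."""
    tp = fp = fn = 0
    for truth, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and truth == 1:
            tp += 1
        elif flagged and truth == 0:
            fp += 1
        elif not flagged and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_beta(precision, recall, beta):
    """F-beta score; returns 0.0 when both precision and recall are zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def tune_threshold(y_true, scores, beta):
    """Try each observed score as a threshold and keep the best F-beta."""
    best = None
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall(y_true, scores, threshold)
        value = f_beta(precision, recall, beta)
        # Strict '>' keeps the lowest threshold when several tie.
        if best is None or value > best["f_beta"]:
            best = {"threshold": threshold, "f_beta": value,
                    "precision": precision, "recall": recall}
    return best


# Hypothetical toy data: 1 = student later dropped out, scores = predicted risk.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.7, 0.9]

recall_first = tune_threshold(y_true, scores, beta=2.0)     # missing a student is costly
precision_first = tune_threshold(y_true, scores, beta=0.5)  # support capacity is limited
print(recall_first)
print(precision_first)
```

On this toy data the recall-oriented setting selects a lower threshold than the precision-oriented one, so more students are flagged at the cost of more false alarms. In practice the threshold would be tuned on held-out validation data rather than on the data used to train the model.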
Future research should prioritise addressing the challenges identified in this review to maximise the impact of clustering and predictive models in EDM. Techniques aimed at improving computational efficiency, such as optimised clustering algorithms and automated feature selection methods, require further development and testing. Efforts should also focus on enhancing model generalisability through expanded datasets collected from diverse institutions and regional contexts. Establishing transparent reporting standards and comprehensive ethical guidelines is critical for mitigating biases and safeguarding student privacy. Additionally, aligning institutional strategies and objectives with data-driven methodologies is essential for equitable and widespread adoption across educational settings. Future reviews could benefit from including broader datasets, multilingual studies, and additional modelling approaches, thus providing a richer and more inclusive understanding of how clustering and predictive techniques can enhance educational practices and outcomes. Such work would ultimately strengthen institutions’ ability to detect disengagement early and deliver interventions that prevent dropout, improving both academic outcomes and student wellbeing.