Systematic Review

Improve Student Risk Prediction with Clustering Techniques: A Systematic Review in Education Data Mining

1 School of Information and Communication Technology, University of Tasmania, Hobart, TAS 7000, Australia
2 Department of AI Convergence, Chonnam National University, Gwangju 61186, Republic of Korea
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(12), 1695; https://doi.org/10.3390/educsci15121695
Submission received: 24 April 2025 / Revised: 27 September 2025 / Accepted: 6 December 2025 / Published: 15 December 2025
(This article belongs to the Special Issue Technology-Enhanced Learning in Tertiary Education)

Abstract

Student dropout rates continue to present major difficulties for educational institutions, leading to academic, operational, and financial impacts. Educational Data Mining (EDM) methods, particularly those combining clustering techniques with predictive models, have demonstrated potential in identifying at-risk students early and accurately. This systematic review explores how cluster-based prediction models have been applied in educational contexts to enhance student performance prediction. A total of sixty-one relevant studies published between 2010 and 2025 were selected and analysed using PRISMA guidelines. The review focuses on the clustering techniques used, how these are integrated with predictive models, and what types of student data are involved. Key findings show that cluster-based models help capture behavioural and academic differences among students, which enables educational institutions to provide more adaptable support. The review also highlights challenges related to generalisability, scalability, and ethical concerns, especially when applying models across different institutions or datasets. The main contribution of this study is the identification of how clustering can be used not only to segment student populations but also to improve prediction accuracy by tailoring models to each subgroup. This review contributes to the literature by emphasising the practical benefits of cluster-based predictive modelling and providing clear directions for further studies aimed at reducing student dropout through targeted support.

1. Introduction

In recent years, student dropout has become an increasing concern, with significant financial and operational impacts on the educational sector. Australian university dropout rates have reached unprecedented levels. According to the Australian Government Department of Education, Skills and Employment (2019) and Institute of Public Affairs (IPA) (McKee, 2024), only 62% of domestic students who commenced a bachelor’s degree in 2017 had completed their studies by 2022. This marks a record low completion rate. This high attrition rate not only leads to the loss of tuition income but also incurs additional costs to recruit and support new students, creating significant financial burdens for universities (Seidman, 2016; Norton & Cherastidtham, 2018; Universities Australia, 2022). Moreover, high dropout rates can affect institutional reputations, influence future enrolment numbers, and impact public funding (Pascarella & Terenzini, 2005; Mitchell Institute, 2023; Ross, 2023). From the students’ perspective, withdrawing from university often leads to substantial debts without the benefit of a degree, impacting their future employment prospects and contributing to personal financial strain (Seidman, 2019; National Centre for Vocational Education Research, 2023; Fryer, 2024). As a result, addressing dropout rates has become a critical priority for educational institutions seeking to maintain both financial stability and academic excellence.
Identifying at-risk students—those likely to encounter academic or personal challenges that hinder their progress—is essential for reducing dropout rates (Heissrer & Parette, 2002; Osborne & Lang, 2023; Ncube & Ngulube, 2024). At-risk students often face a combination of academic struggles, social alienation, financial stress, or personal crises, all of which heighten their chances of dropping out (R. Liu et al., 2022; Fryer, 2024; National Centre for Vocational Education Research, 2023). Early detection of such students is crucial, as timely and targeted interventions can prevent challenges from escalating and improve student retention (Jevons & Lindsay, 2018; Sahni, 2023). Research consistently shows that interventions delivered after significant academic setbacks are often less effective, underscoring the importance of proactive identification and support (McMillan & Reed, 2010; Severson et al., 2007; Ncube & Ngulube, 2024).
Traditional approaches for identifying at-risk students have relied primarily on academic performance metrics and manual interventions by educators (Hung et al., 2015). Strategies such as monitoring grades, conducting one-on-one counselling sessions, or using simple statistical models have been widely adopted (Oreopoulos et al., 2017). While these approaches provide some support, they tend to be reactive rather than proactive (Osborne & Lang, 2023; Ncube & Ngulube, 2024). Grades often indicate a problem only after it has become significant, making timely intervention difficult (Linden et al., 2023). Furthermore, they largely overlook critical non-academic factors, such as student engagement, socio-economic status, and emotional well-being, which can significantly influence student success (Matz et al., 2023; Seidman, 2019; Australian Institute of Health and Welfare, 2023). As educational institutions increasingly seek data-driven strategies to reduce dropout rates, it becomes critical to understand not just which students are at risk but also why they disengage or underperform (Shoaib et al., 2024; Xiong et al., 2024). Clustering-based approaches offer a solution by revealing hidden patterns in student behaviour, engagement, and academic progression that traditional models often miss (Chen et al., 2023; Mohamed Nafuri et al., 2022). These patterns can inform more personalised and timely interventions tailored to student subgroups, addressing the root causes of attrition more effectively (T. Liu et al., 2022; Romero & Ventura, 2020). Thus, the integration of clustering techniques directly supports the broader educational goal of reducing dropout and improving student retention (Bahel et al., 2021; Oyelade et al., 2010).
In recent years, Educational Data Mining (EDM) has gained significant attention as a powerful tool for understanding student behaviour and improving educational outcomes (Romero & Ventura, 2020; Trakunphutthirak & Lee, 2021; Shoaib et al., 2024). EDM involves applying data mining techniques, such as clustering, classification, and regression, to analyse large and complex datasets, providing richer insights into student performance, engagement, and learning patterns (Romero & Ventura, 2013; Chen et al., 2023; Xiong et al., 2024). EDM facilitates the early identification of at-risk students through examining student interactions, assessments, and socio-demographic factors (K. S. Na & Tasir, 2017; Shoaib et al., 2024; Romero & Ventura, 2020). The concept of data literacy in education has evolved alongside advancements in data-driven decision-making. In earlier decades, data literacy primarily focused on understanding descriptive statistics and using basic academic metrics to inform teaching and learning (Prinsloo & Slade, 2014; Wolff et al., 2016). With the rise of digital learning environments and the increasing availability of behavioural data, the scope of data literacy expanded to include the interpretation of complex learning analytics and predictive models (Ifenthaler & Yau, 2020). In parallel, the use of clustering techniques in educational contexts dates back to early applications in the 2000s, where clustering was used to segment students by performance levels or learning styles (Oyelade et al., 2010; Romero & Ventura, 2013). Over time, more sophisticated algorithms such as DBSCAN and hierarchical clustering were adopted to analyse behavioural patterns and personalise learning interventions (Xiong et al., 2024). Today, as institutions seek to improve retention and equity through targeted support, cluster-based predictive models offer a promising way to leverage both academic and behavioural data for more nuanced risk identification. While surveys have extensively reviewed predictive analytics and EDM techniques, few have comprehensively examined the contributions of clustering-based approaches in identifying and supporting at-risk students (Romero & Ventura, 2020; Chen et al., 2023). The existing literature often focuses on classification models or general predictive systems, overlooking the unique advantages and challenges posed by clustering techniques (Shoaib et al., 2024; Trakunphutthirak & Lee, 2021). To fill this gap, this paper conducts a systematic literature review on clustering-based at-risk prediction models. The review aims to identify gaps and limitations in current approaches to at-risk student identification, highlight what is missing from existing studies, and propose a roadmap for future research to improve model efficacy and practical implementation. The main contributions of this paper are: (1) a survey, systematisation, and analysis of clustering-based prediction models applied to at-risk student identification in the academic literature to date; (2) insights into the different perspectives on challenges related to generalisability, data integration, and ethical considerations; and (3) recommendations for advancing clustering-based techniques to address real-world educational challenges.
The remainder of this paper is structured as follows: Section 2 discusses related surveys, highlighting their contributions and limitations. Section 3 presents the systematic review methodology. Section 4 details the findings, interprets their implications, discusses challenges and proposes future directions, while Section 5 concludes with actionable recommendations for practice and research.

2. Materials and Methods

2.1. Related Surveys

Numerous studies have examined how EDM techniques could be applied for predicting student performance and identifying at-risk students (Romero & Ventura, 2020; López-Meneses et al., 2024; Chen et al., 2023; Shoaib et al., 2024; Xiong et al., 2024). While these reviews provide valuable insights into the development of predictive models and clustering methods, they often fail to fully explore cluster-based prediction models as an integrated framework.
Early reviews explored classification and regression models extensively as core methods in EDM (Romero & Ventura, 2020; T. Liu et al., 2022; Tosun & Kalaycıoğlu, 2024). These models are commonly used to analyse student behaviours, predict academic outcomes, and identify at-risk students. Recent studies continue to show the value of classification techniques, particularly in monitoring student performance and engagement (Shoaib et al., 2024; López-Meneses et al., 2024). Despite their effectiveness, classification and regression models are often applied as standalone predictive systems or combined with traditional statistical methods (Xiong et al., 2024; Zhang et al., 2023). These studies typically treat clustering as an auxiliary tool, mostly for data preprocessing or feature engineering, rather than integrating it as a central component of predictive systems.
Clustering has been broadly acknowledged as an effective method for categorising students and courses based on shared attributes, which enables tailored support for at-risk students (Shoaib et al., 2024; Trakunphutthirak & Lee, 2021). Recent research demonstrates the application of clustering techniques, including k-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to identify student groups with similar behavioural or academic patterns (Oyelade et al., 2010; Mohamed Nafuri et al., 2022; Bahel et al., 2021; Chen et al., 2023; López-Meneses et al., 2024). These clustering techniques help improve understanding of student diversity while facilitating targeted support strategies (Y. Liu et al., 2022; Zhang et al., 2023). Importantly, several studies have demonstrated that clustering can directly assist in reducing student dropout by revealing patterns of disengagement or declining academic performance (Oyelade et al., 2010; Chen et al., 2023; Mohamed Nafuri et al., 2022). However, existing reviews often overlook how these insights translate into proactive interventions for at-risk groups, limiting the connection between clustering outcomes and dropout mitigation strategies (Romero & Ventura, 2020; Shoaib et al., 2024). The integration of clustering into prediction workflows still has room to develop, as most studies use clustering mainly for segmentation rather than using clustering insights to enhance prediction models (Romero & Ventura, 2020; Xiong et al., 2024).
The integration of clustering techniques with predictive modelling has been studied through hybrid approaches that aim to enhance prediction accuracy (Aldowah et al., 2019; Bholowalia & Kumar, 2014; Y. Liu et al., 2022). Hybrid models combine clustering techniques with classification techniques, using both unsupervised and supervised learning to improve the early identification of at-risk students (Romero & Ventura, 2020; Xiong et al., 2024). Recent studies have shown that hybrid approaches can increase predictive accuracy by reducing noise and enhancing feature representation (Zhang et al., 2023; Shoaib et al., 2024). Advanced models, particularly those combining clustering with hybrid classification algorithms, have shown higher accuracy and generalisability compared to traditional methods (Chen et al., 2023; López-Meneses et al., 2024). However, many studies face limitations due to relying on specific datasets, reducing the generalisability of hybrid models across diverse educational contexts (Jovanović et al., 2021; K. Na & Tasir, 2017; Zhang et al., 2023). This limitation highlights the need for stronger validation practices to ensure model applicability across different student populations and learning environments (Y. Liu et al., 2022; Romero & Ventura, 2020).
Generalisability and fairness continue to be major concerns for predictive systems in education (Mathrani et al., 2021; Prakash et al., 2014; Xu et al., 2021). Predictive models should consider diverse student demographics and institutional characteristics to avoid biased or unfair outcomes (Jovanović et al., 2021; K. S. Na & Tasir, 2017; T. Liu et al., 2022; Mathrani et al., 2021). Ethical considerations including transparency, equity, and fairness are particularly important for cluster-based prediction models, where clustering results can greatly affect both the accuracy and fairness of later predictions (Shoaib et al., 2024; Xiong et al., 2024). Studies have raised concerns about the risk of biased outcomes when clustering features such as socio-economic status or demographic attributes are used without applying proper fairness constraints (Romero & Ventura, 2020; Zhang et al., 2023; López-Meneses et al., 2024). While existing reviews provide a basic understanding of EDM techniques, clustering methods, and hybrid approaches, they often fail to fully explore cluster-based prediction models as an organised framework (López-Meneses et al., 2024; Chen et al., 2023; Zhang et al., 2023; Romero & Ventura, 2020; Shoaib et al., 2024). This paper addresses these gaps by systematically reviewing the literature on cluster-based prediction models, analysing their strengths and limitations, and highlighting their capability to create scalable, generalisable, and ethically appropriate solutions for identifying and supporting at-risk students.

2.2. Cluster-Based Prediction Models

Cluster-based prediction models are increasingly important in educational research for their ability to enhance predictive accuracy and provide tailored insights into student behaviours (Aldowah et al., 2019; R. Liu, 2022; Shoaib et al., 2024). These behaviours often include early indicators of dropout, such as reduced engagement, inconsistent participation, or performance decline; clustering can help identify such patterns more effectively than traditional approaches (Bahel et al., 2021; Ramanathan et al., 2018; Mohamed Nafuri et al., 2022). These models integrate clustering with supervised learning algorithms to identify at-risk students more effectively and enable timely interventions (Oyelade et al., 2010; Mohamed Nafuri et al., 2022; Romero & Ventura, 2020; Xiong et al., 2024). Clustering groups students or courses that share similar characteristics, allowing researchers to explore patterns and trends that are not visible through traditional methods (López-Meneses et al., 2024; Zhang et al., 2023; Aldowah et al., 2019; Bahel et al., 2021).
In practice, these models generally involve two main phases, clustering and prediction, which together improve the identification of students at risk of dropout by translating raw behavioural and academic data into actionable profiles (Aldowah et al., 2019; Iatrellis et al., 2020; Xiong et al., 2024; Oyelade et al., 2010). In the clustering phase, algorithms such as k-means and DBSCAN are used to find meaningful groups within educational data (Sisovic et al., 2016; Bahel et al., 2021; Ramanathan et al., 2018). The clustering process typically uses behavioural data from virtual learning environments, academic metrics like grades, and demographic attributes such as age and regional background (Romero & Ventura, 2020; Xu et al., 2021). Following clustering, the prediction phase integrates these groupings into supervised learning models. This integration can involve using cluster labels as features or building separate predictive models for each cluster (Xiong et al., 2024; Shoaib et al., 2024). Algorithms like Random Forest, Logistic Regression, and Neural Networks are commonly employed, particularly for predicting dropout risks or academic success (López et al., 2012; Mohamed Nafuri et al., 2022; Prakash et al., 2014; Chen et al., 2023; Viswanathan & Kumar, 2021; Aldowah et al., 2019). This hybrid approach combines unsupervised and supervised learning methods, resulting in more robust predictions (Francis & Babu, 2019; Injadat et al., 2020; Xu et al., 2021).
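As a minimal illustration of this two-phase pipeline, the following sketch (synthetic data; the feature names, toy risk label, and choice of k are assumptions for demonstration, not a reconstruction of any reviewed study) clusters students with k-means and appends each cluster label as an extra feature for a Random Forest classifier:

```python
# Hedged sketch of a cluster-then-predict pipeline; feature names and the
# toy at-risk label are illustrative assumptions, not from any reviewed study.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((500, 3))            # stand-ins for avg_grade, login_freq, forum_posts
y = (rng.random(500) < 0.3 + 0.4 * (1 - X[:, 0])).astype(int)  # noisy "at-risk" label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Phase 1: group students into behavioural/academic segments.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaler.transform(X_train))

# Phase 2: append each student's cluster label as a feature for a supervised model.
def with_cluster_label(X_part):
    labels = kmeans.predict(scaler.transform(X_part))
    return np.column_stack([X_part, labels])

clf = RandomForestClassifier(random_state=0).fit(with_cluster_label(X_train), y_train)
print("Held-out accuracy:", clf.score(with_cluster_label(X_test), y_test))
```

The alternative integration, training a separate predictive model per cluster, follows the same structure but replaces the label-appending step with per-segment model fitting.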
Cluster-based prediction models offer several advantages over traditional predictive approaches. One key benefit is improved personalisation, as clustering enables models to tailor interventions to the specific needs of each student group (T. Liu et al., 2022; Shoaib et al., 2024; Iatrellis et al., 2020). Research shows that grouping students based on similar engagement patterns not only helps tailor academic support but also enables early intervention to prevent potential dropout (Romero & Ventura, 2020; Xiong et al., 2024; Oyelade et al., 2010; Mohamed Nafuri et al., 2022). These models also offer greater adaptability, allowing them to adjust to diverse student populations and institutional contexts through dynamically organising students and courses (Bahel et al., 2021; Sisovic et al., 2016; Ramanathan et al., 2018). Scalability is another advantage, as clustering methods like k-means are suitable for processing large datasets, enabling efficient analysis at scale (Francis & Babu, 2019; Viswanathan & Kumar, 2021; Shoaib et al., 2024).
Despite their advantages, cluster-based prediction models face several challenges that can limit their effectiveness in educational settings, including generalisability across institutions and datasets, scalability, and ethical concerns; these challenges are examined in Section 4.
Cluster-based prediction models have nonetheless demonstrated their utility in various educational contexts. These models effectively predict student dropout rates by clustering students with similar behavioural and academic performance patterns, enabling timely interventions to support at-risk students (López et al., 2012; Sisovic et al., 2016; Iatrellis et al., 2020). Cluster-based models also support curriculum design by grouping courses according to difficulty levels and student performance, aiding resource allocation and instructional planning (Mohamed Nafuri et al., 2022; Bahel et al., 2021; Ramanathan et al., 2018; Injadat et al., 2020). These applications highlight the versatility and value of cluster-based prediction models in enhancing educational data mining and supporting student success (Oyelade et al., 2010; Jovanović et al., 2021; Y. Liu et al., 2022; Xiong et al., 2024).

2.3. Methodology

This study systematically reviews the literature on cluster-based prediction models in educational settings. It explores how clustering and predictive modelling techniques have been used in previous research, examines the types of datasets and methodologies involved, and identifies gaps in the literature to guide future studies. The aim is to understand how researchers have implemented and assessed cluster-based prediction models and to discover insights that can help develop the field further. This review is guided by the central research question: How have cluster-based prediction models been applied and evaluated in educational contexts? We conducted the review following PRISMA guidelines and the Joanna Briggs Institute Reviewer's Manual to ensure a comprehensive and unbiased process (Figure 1).
The literature search was conducted using a combination of electronic databases, including IEEE Xplore, Scopus, ScienceDirect, and SpringerLink. These databases were chosen for their extensive coverage of research in educational data mining and machine learning. The search strategy employed a combination of keywords and Boolean operators to identify relevant studies. The primary search terms were: “cluster-based prediction models”, “clustering and supervised learning in education”, “educational data mining clustering”, “predictive modelling for at-risk students”. To broaden the scope, additional keywords such as “k-means,” “DBSCAN,” “student dropout prediction,” and “personalised learning” were also included. The search was limited to peer-reviewed journal articles and conference proceedings published between 2010 and 2025 to capture recent advancements in the field.
After removing duplicates, a total of 650 articles published between January 2010 and February 2025 were identified using the search terms. Exclusion criteria were then applied to filter out studies that were not relevant. Papers were excluded if they met any of the following conditions:
  • The publication format was not a peer-reviewed journal article or conference proceeding.
  • The paper was not written in English.
  • A more recent version by the same authors was available, in which case only the latest version was included.
  • The study did not combine clustering with predictive modelling techniques.
  • The research focus was unrelated to educational contexts.
  • The study provided only a high-level description without sufficient detail to address the research question.
  • The paper was published prior to 2010.
Through a systematic process of screening titles, keywords, abstracts, and full texts against these criteria, 147 articles were shortlisted in the final stage. From this shortlist, 61 articles were identified as relevant for analysis of cluster-based prediction models in educational settings. To ensure both relevance and quality, the inclusion and exclusion criteria were carefully applied.
Inclusion Criteria:
  • Studies focused on cluster-based prediction models applied in educational settings.
  • Research articles that combine clustering algorithms with predictive modelling.
  • Papers that offer actual data support, including datasets, methodologies, and results.
  • Publications in English to ensure they can be easily accessed.
Exclusion Criteria:
  • Research not directly related to educational applications.
  • Articles lacking enough methodological detail or actual data.
The number of publications identified in the initial search has grown significantly over the past decade. Only a limited number of studies appeared between 2010 and 2014, but interest in the topic grew rapidly after 2018. A sharp rise is observed from 2020 onwards, reflecting a growing research emphasis on integrating clustering with predictive modelling techniques in response to the increased availability of educational data and the demand for more personalised, data-informed interventions (Figure 2).
The study selection process followed a systematic and iterative approach. First, initial screening of titles and abstracts was conducted to exclude unrelated publications, with duplicates removed at this stage. Next, full-text articles were reviewed against the inclusion and exclusion criteria to ensure they met the study's requirements. Finally, the final set of studies was chosen to ensure relevance and consistency with the scope of the systematic review.
Data extraction focused on gathering important information from each study: research objectives and research questions; methodologies applied (including clustering and predictive algorithms); datasets (such as size, type, and context); and key findings, limitations, and recommendations. For each included study, extraction captured key methodological and outcome variables, including the type of clustering techniques used, predictive models employed, feature categories considered, evaluation metrics applied, and the educational context of application (such as online, blended, or traditional settings). A data extraction template was used to ensure consistency. The data analysed in this review consisted of information reported in the selected published studies; no primary data were collected, and all findings are based on the synthesis of secondary data extracted from peer-reviewed literature.
The extracted data were synthesised to identify common patterns, methodological trends, and gaps in the literature. The synthesis followed a qualitative, thematic approach, in which findings were grouped into categories aligned with the research questions of this review: clustering methodologies, integration strategies with predictive models, commonly used features, evaluation practices, and observed challenges and limitations. No formal meta-analysis was conducted, as the heterogeneity of study designs and reported outcomes did not allow for quantitative synthesis. Instead, this review presents a structured narrative synthesis of the current state of the literature.

3. Results

This section presents the findings and analysis of the sixty-one (61) selected publications, focusing on the methodologies, outcomes, and gaps identified in the context of cluster-based prediction models for educational applications. The analysis is organised into thematic areas derived from the systematic review.

3.1. Clustering Techniques

Clustering techniques in EDM facilitate the grouping of students or courses based on shared characteristics, uncovering valuable insights into student performance, risk profiles, and behavioural patterns (K. S. Na & Tasir, 2017; Le Quy et al., 2023; Ikotun et al., 2023). Educational institutions use these techniques to explore hidden structures within datasets, which supports personalised learning and helps identify at-risk students early (Y. Liu et al., 2022; López et al., 2012; Balovsyak et al., 2023). Different clustering algorithms are applied across studies, each offering unique strengths and limitations depending on the dataset and analytical goals (Oyelade et al., 2010; Schubert, 2023). A review of the 61 studies shows that K-means clustering is the most frequently applied method, followed by hierarchical techniques and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), as detailed in Table 1. Each of these algorithms presents specific strengths and limitations depending on the dataset and analytical objectives. This subsection provides a structured breakdown of their applications in EDM, further outlined in Table 2.

3.1.1. K-Means Clustering

K-means is a common approach in EDM because of its relatively simple operation and its effectiveness in processing large datasets (Oyelade et al., 2010; Park et al., 2016; Ikotun et al., 2023; Le Quy et al., 2023). This technique groups students based on attributes like grades or engagement levels, helping educators spot those who might need extra support (Mohamed Nafuri et al., 2022; Arora et al., 2023; Parack et al., 2012).
K-means works best when clusters are compact and well separated; however, real-world educational data often present irregular patterns that require enhanced approaches (T. Liu et al., 2022). To address this, improved versions of K-means have been developed to better manage irregular data shapes and outliers, though these can require more computing power (T. Liu et al., 2022; I. K. Khan et al., 2024). While K-means is valued for its speed and ease of use, it is important to consider the specific characteristics of the data and the goals of the analysis when choosing this method (López et al., 2012).
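One simple way to probe whether K-means suits a given dataset is to compare several values of k using an internal validation metric. The hedged sketch below (synthetic, standardised features standing in for grades and engagement levels) uses the silhouette score, discussed further in Section 3.4.1:

```python
# Hedged sketch: comparing values of k for K-means on standardised features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = StandardScaler().fit_transform(rng.normal(size=(300, 2)))  # synthetic stand-ins

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```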

3.1.2. Hierarchical Clustering

Hierarchical clustering does not require setting the number of clusters in advance. This flexibility makes it useful for exploring complex and layered relationships in educational data (Mohamed Nafuri et al., 2022; Le Quy et al., 2023; Ikotun et al., 2023). This technique has been used to group students based on their learning behaviours and engagement patterns, which helps educators design targeted teaching strategies and interventions (Severson et al., 2007; Arora et al., 2023; Zhang et al., 2023).
The method provides detailed insights into how students interact with educational content and helps educators identify sub-groups with similar learning needs (Park et al., 2016; Ikotun et al., 2023; Schubert, 2023). It also helps in understanding layered relationships in educational data, such as categorising students by performance levels, engagement depth, or individual learning patterns (Arora et al., 2023; Almasri et al., 2020). However, the main challenge with hierarchical clustering is its high computational cost, especially when applied to the large datasets common in educational institutions (Park et al., 2016; Le Quy et al., 2023). This can limit its practical applicability compared to faster clustering methods (Y. Liu et al., 2022; Ikotun et al., 2023; Schubert, 2023). Recent studies have focused on refining hierarchical clustering algorithms to enhance their adaptability and operational effectiveness. These enhancements include reducing the complexity of distance calculations and improving the creation of tree-based structural representations (Balovsyak et al., 2023; Zhang et al., 2023).
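A minimal sketch of agglomerative clustering (synthetic data; the assumed features stand in for per-module engagement scores) illustrates both the flexibility of cutting the merge tree at any desired number of groups and the source of the computational cost, since the full pairwise merge structure must be built first:

```python
# Hedged sketch: Ward-linkage hierarchical clustering of assumed engagement scores.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(2)
X = rng.random((200, 4))           # assumed per-module engagement scores

Z = linkage(X, method="ward")      # builds the full merge tree; cost grows fast with n
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 sub-groups
print("Cluster sizes:", np.bincount(labels)[1:])
```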

3.1.3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN proves useful for handling noisy and irregular datasets, which are common in real-world educational environments (Sahni, 2023; Bhuyan & Borah, 2023; Schubert, 2023). DBSCAN does not require the number of clusters to be set in advance and can identify clusters of different shapes and sizes (Chamorro-Atalaya et al., 2023; Sharma et al., 2024). This flexibility makes it suitable for analysing complex student behaviour patterns that do not follow uniform or linear trends (Murphy et al., 2024; Schubert, 2023; Almasri et al., 2020). It is useful for clustering students based on how they interact, especially in online courses, helping educators identify struggling students, adapt teaching materials, and adjust support strategies (R. Liu et al., 2022; Zhang et al., 2023).
One of DBSCAN's main strengths is its ability to identify outliers as noise rather than forcing them into clusters, which enhances the accuracy of pattern detection (Sahni, 2023; Schubert, 2023; Sharma et al., 2024). However, its performance depends heavily on setting the right parameters, especially epsilon (ε), which determines the neighbourhood distance, and the minimum number of points required to form a cluster (Nayak et al., 2023; K. S. Na & Tasir, 2017; Jayaprakash et al., 2020; Schubert, 2023). Choosing the wrong parameters might split single clusters unnecessarily or merge distinct clusters accidentally, leading to misleading conclusions (Park et al., 2016; Chamorro-Atalaya et al., 2023; Zhang et al., 2023). Recent research has focused on adaptive parameter tuning methods to enhance DBSCAN's performance in different educational contexts (Le Quy et al., 2023; R. Liu et al., 2022).
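This parameter sensitivity is easy to demonstrate. In the hedged sketch below (synthetic data only), varying ε alone changes both the number of clusters found and the number of points labelled as noise, which DBSCAN marks with the label -1:

```python
# Hedged sketch: DBSCAN's sensitivity to the neighbourhood radius eps.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = StandardScaler().fit_transform(rng.normal(size=(300, 2)))  # synthetic data

for eps in (0.2, 0.5, 1.0):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))       # DBSCAN flags outliers with label -1
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```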

3.2. Integration with Predictive Models

The integration of clustering techniques with predictive models has become a key methodology in EDM, enhancing the accuracy of identifying at-risk students (Goren et al., 2024; Pradeep et al., 2015; Zhang et al., 2023). It is often organised through the Cluster–Predict (CP) framework, which consists of two main stages, as shown in Figure 3 (R. Liu et al., 2022). In the first stage, data is segmented into homogeneous groups using clustering techniques. This segmentation uncovers hidden patterns and relationships, reducing noise and improving the focus of predictive analysis (Parack et al., 2012; Linden et al., 2023; Pradeep et al., 2015). The second stage involves applying predictive models, such as Random Forests, Logistic Regression, or Neural Networks, to the clustered data. This approach leverages features derived from clustering, including cluster labels or distances from cluster centroids, to enhance predictive accuracy (Mohamed Nafuri et al., 2022; Namoun & Alshanqiti, 2020; Goren et al., 2024). Studies have shown that clustering improves predictive models' ability to identify at-risk students by segmenting the dataset into manageable and meaningful subsets (Romero & Ventura, 2020; Zhang et al., 2023).

3.2.1. Common Integration Approaches

Two main strategies integrate clustering with predictive models: clustering as a preprocessing step and clustering for feature engineering. Clustering as a preprocessing step segments the dataset into cohesive groups before applying predictive models, ensuring that machine learning algorithms operate on structured subsets with shared characteristics (Parack et al., 2012; Severson et al., 2007; Viswanathan & Kumar, 2021). Clustering for feature engineering involves extracting new features, such as cluster membership or density, which enhance a model's ability to identify complex relationships within the data (Mohamed Nafuri et al., 2022; Linden et al., 2023; Anjum & Badugu, 2020). A summary of the strengths and limitations of each integration technique is presented in Table 3.
Clustering as a Preprocessing Step
Clustering as a preprocessing step aims to improve the accuracy and interpretability of predictive models (López et al., 2012; K. S. Na & Tasir, 2017; Mohamed Nafuri et al., 2022). This method has been widely applied to organise students based on academic performance, learning behaviours, and participation in digital learning environments (López et al., 2012; K. S. Na & Tasir, 2017; Mohamed Nafuri et al., 2022). It helps refine input data, reduces noise, and improves the precision of predictive models used for academic performance analysis and student retention forecasting (Sarker et al., 2024; Viswanathan & Kumar, 2021; Romero & Ventura, 2020).
Different clustering techniques have been used for preprocessing, each suited to specific types of data. K-means clustering has been applied to divide students into performance groups based on academic results, where students are assigned to clusters representing high, medium, or low performance. This segmentation has been used to support early identification of students at risk of failing, enabling institutions to implement targeted interventions (López et al., 2012; Parack et al., 2012). Hierarchical clustering has been used to examine engagement trends across multiple courses, helping researchers track learning patterns and instructional effectiveness (Shaleena & Paul, 2015; Severson et al., 2007). The ability to group students based on multi-level engagement data allows for a better understanding of behavioural differences across disciplines and course structures (Shaleena & Paul, 2015; Severson et al., 2007). Some studies have used clustering to identify participation trends in online learning platforms, revealing variations in student interactions, such as the frequency and depth of engagement with course materials (K. Na & Tasir, 2017). This approach provided a deeper understanding of student behaviour across different courses and learning activities, enabling more precise and targeted interventions (K. Na & Tasir, 2017; Mohamed Nafuri et al., 2022).
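As a concrete but hypothetical illustration of this preprocessing strategy, the sketch below (synthetic data; the feature names and the choice of three clusters are assumptions, not drawn from any single reviewed study) trains a separate logistic regression model for each K-means segment and routes new students to their segment's model:

```python
# Hedged sketch: one predictive model per k-means segment (all values illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.random((600, 3))                   # assumed grades/logins/submissions features
y = (rng.random(600) < 0.3 + 0.4 * X[:, 1]).astype(int)   # noisy toy at-risk label

kmeans = KMeans(n_clusters=3, n_init=10, random_state=4).fit(X)
models = {c: LogisticRegression().fit(X[kmeans.labels_ == c], y[kmeans.labels_ == c])
          for c in range(3)}               # one classifier per student segment

# At prediction time, route each new student to their segment's model.
x_new = rng.random((1, 3))
segment = int(kmeans.predict(x_new)[0])
print("At-risk probability:", models[segment].predict_proba(x_new)[0, 1])
```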
Clustering for Feature Engineering
Clustering for feature engineering aims to extract meaningful representations from raw data, enhancing the capacity of predictive models (Romero & Ventura, 2020; Sarker et al., 2024). Clustering allows for the creation of additional features, such as distance from cluster centroids or patterns within clusters, that highlight similarities among students (Balovsyak et al., 2023). These new features provide predictive models with a more structured view of student behaviours, making it easier to identify at-risk students and tailor interventions accordingly (Romero & Ventura, 2020). Proper feature engineering is essential for successful clustering and varies depending on the algorithm. In many reviewed studies, behavioural and academic features were normalised or scaled for compatibility with distance-based clustering such as k-means, while categorical variables were encoded to suit hierarchical methods (Romero & Ventura, 2020; Y. Liu et al., 2022). Dimensionality reduction techniques like PCA were also used to handle high-dimensional engagement data, reducing noise and improving clustering quality (Vora & Rajamani, 2019). These preprocessing steps directly influence clustering outcomes and improve the interpretability and effectiveness of hybrid prediction models.
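As a hedged sketch of this kind of feature engineering (synthetic data; the dimensionality and cluster count are illustrative assumptions), the following derives centroid-distance features after scaling and PCA and appends them to the original feature matrix:

```python
# Hedged sketch: centroid-distance features after scaling and PCA (values illustrative).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.random((400, 10))                 # assumed high-dimensional engagement logs

X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3).fit_transform(X_scaled)   # denoise before clustering

kmeans = KMeans(n_clusters=4, n_init=10, random_state=5).fit(X_reduced)
centroid_dists = kmeans.transform(X_reduced)  # distance from each student to each centroid

# Enriched matrix: original scaled features + cluster label + centroid distances.
X_enriched = np.column_stack([X_scaled, kmeans.labels_, centroid_dists])
print(X_enriched.shape)                    # (400, 10 + 1 + 4)
```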
Several studies have explored the benefits of integrating clustering with predictive modelling. Mohamed Nafuri et al. (2022) combined hierarchical clustering with logistic regression, incorporating features like average engagement scores and the variation within clusters. Their findings showed that using these cluster-based features significantly improved the model's accuracy in detecting students at risk of academic failure. Sahni (2023) applied DBSCAN to student engagement data and used cluster density as an additional input for a random forest model. This hybrid approach enabled the model to identify disengaged students (Namoun & Alshanqiti, 2020; Nayak et al., 2023). Unlike conventional feature selection techniques, which rely on predefined variables, clustering helps to uncover hidden relationships that might not be immediately obvious (Han, 2023; Chaudhry et al., 2023). This is particularly valuable in educational settings where student data is diverse and influenced by multiple factors, such as engagement levels, academic history, and socio-economic background (Sarker et al., 2024).

3.2.2. Key Predictive Models Used

Predictive models are used to analyse diverse data sources, including academic performance, behavioural engagement, and socio-economic factors, to assess student risk (Jovanović et al., 2021; Romero & Ventura, 2020). Research in EDM has explored a broad range of predictive approaches, which can be grouped into traditional machine learning models and advanced machine learning models (Romero & Ventura, 2020; K. S. Na & Tasir, 2017; Y. Liu et al., 2022). These categories help distinguish between interpretable yet limited models and more complex, higher-performing techniques (Nayak et al., 2023; Marbouti et al., 2016). The effectiveness of these models depends on the type of data used and the algorithms employed to analyse student performance and engagement patterns (Sarker et al., 2024; Balovsyak et al., 2023). This section reviews the literature on predictive modelling for the identification of at-risk students, covering both traditional and advanced models and examining their strengths, limitations, and applicability.
Traditional Predictive Models
Traditional predictive models are used in EDM due to their simplicity, interpretability, and effectiveness in structured datasets (Y. Liu et al., 2022; Romero & Ventura, 2020). These models, including logistic regression, decision trees, and support vector machines (SVMs), have been extensively employed in classification problems within education (Marbouti et al., 2016; Jawthari & Stoffova, 2022). Their popularity stems from their ability to provide transparent decision-making processes, making them valuable for educators and administrators seeking to understand the factors influencing student outcomes (Marbouti et al., 2016; A. Khan & Ghosh, 2020).
Logistic regression has been widely applied in EDM because of its interpretability: it shows which academic metrics are most relevant in predicting student performance (Romero & Ventura, 2020; T. Liu et al., 2022). Studies have demonstrated that early academic performance metrics, such as initial coursework grades and assessment scores, are strong predictors of student retention and achievement (Marbouti et al., 2016; Nayak et al., 2023). This predictability allows educators to flag students in advance who are likely to need additional support (Romero & Ventura, 2020; T. Liu et al., 2022; Jawthari & Stoffova, 2022). However, relying only on academic metrics may present limitations. While academic grades can be a strong predictor, they may fail to capture socio-emotional or behavioural factors that influence student retention and engagement (Y. Liu et al., 2022; Jovanović et al., 2021). Students experiencing personal or socio-economic challenges may not exhibit immediate academic decline, leading to missed opportunities for early intervention if models depend only on grade-based indicators (Santoso & Yulia, 2019; Shoaib et al., 2022). Decision trees, another commonly used traditional model, offer hierarchical decision-making frameworks that facilitate clear rule-based interpretations (Romero & Ventura, 2020; Dass et al., 2021). However, they risk overfitting when applied to high-dimensional educational datasets, which can reduce their generalisability in diverse academic settings (Shoaib et al., 2022; A. Khan & Ghosh, 2020; Nayak et al., 2023). SVMs, on the other hand, are effective in handling binary classification problems, such as predicting dropout likelihood, but they often struggle with non-linear relationships in complex educational datasets (Miguéis et al., 2018; Namoun & Alshanqiti, 2020). While SVMs can achieve high accuracy, their lack of interpretability makes them less practical for educational applications where transparency in decision-making is essential (K. Na & Tasir, 2017; López et al., 2012).
Advanced Machine Learning Approaches
The limitations of traditional predictive models have led to the adoption of more advanced machine learning techniques, which aim to handle complex, multidimensional educational data more effectively (López et al., 2012; Romero & Ventura, 2020). These models offer enhanced accuracy in identifying at-risk students and predicting academic outcomes by capturing nonlinear relationships and interactions between features (A. Khan & Ghosh, 2020; Nayak et al., 2023). Random Forest remains a widely used method because of its ability to manage diverse datasets while reducing overfitting through ensemble learning. It works by aggregating multiple decision trees to produce stable predictions across various student performance indicators (Dass et al., 2021; Namoun & Alshanqiti, 2020). It also provides feature importance scores, which help educators identify variables, such as how often students access systems or engage in digital discussions, that signal disengagement (Linden et al., 2023; Shoaib et al., 2022). Neural networks are also popular because of their ability to detect nuanced patterns in large-scale educational datasets. They excel at mapping complex behaviours such as learning engagement or procrastination tendencies and are often used to predict dropout risk or classify student motivation levels (R. Liu et al., 2022; Nayak et al., 2023).
More recently, gradient boosting algorithms such as LightGBM, XGBoost, and CatBoost have been adopted in educational data mining for their high efficiency, scalability, and predictive performance. LightGBM is specifically designed for speed and accuracy in large datasets with numerous categorical variables, making it highly suitable for educational settings with behavioural, demographic, and academic data (Wang et al., 2022). Studies using LightGBM have shown improved prediction accuracy and faster model training compared to traditional methods (Zhang et al., 2023; R. Liu, 2022). XGBoost, with similar advantages, has been applied to predict dropout rates, especially in handling class imbalance and sparse input features (Fryer, 2024; Dass et al., 2021). CatBoost, with its effective handling of categorical features without extensive preprocessing, is becoming a strong candidate for predicting educational outcomes that involve socio-demographic variables (Shoaib et al., 2024).
In addition to individual models, ensembles of multiple algorithms have been shown to increase model stability and generalisability. Studies combining logistic regression, decision trees, and neural networks within ensemble frameworks have achieved higher prediction accuracy across diverse student populations (Linden et al., 2023; A. Khan & Ghosh, 2020). Hybrid approaches that integrate academic records, digital engagement, and socio-economic data also offer a comprehensive perspective on student risk. These models perform better than single-source approaches by capturing the multidimensional nature of student experiences (Y. Liu et al., 2022; Nayak et al., 2023).
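As a hedged illustration of boosting combined with imbalance handling (the reviewed studies used LightGBM, XGBoost, or CatBoost; this sketch substitutes scikit-learn's histogram-based gradient boosting, which is inspired by LightGBM, with balanced sample weights on synthetic data):

```python
# Hedged sketch: gradient boosting on an imbalanced dropout-style dataset, using
# scikit-learn's LightGBM-inspired estimator and balanced sample weights.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(6)
X = rng.random((2000, 5))
y = (rng.random(2000) < 0.08 + 0.15 * X[:, 0]).astype(int)  # ~15% minority "dropout" class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

# Up-weight the minority class so the booster does not simply ignore it.
weights = compute_sample_weight(class_weight="balanced", y=y_tr)
model = HistGradientBoostingClassifier(random_state=6)
model.fit(X_tr, y_tr, sample_weight=weights)
print("Held-out accuracy:", model.score(X_te, y_te))
```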

3.3. Key Features Identified

This section categorises and analyses the features commonly used in predictive models across the reviewed studies. These features are organised into four primary categories: academic performance metrics, engagement metrics, socio-economic and demographic factors, and behavioural indicators. Their frequency of use is also evaluated to provide a comprehensive understanding of their prevalence and importance. Beyond their predictive value, these features are also frequently used as inputs for clustering, especially in hybrid models. To support clustering methods such as k-means or DBSCAN, academic and engagement features are often scaled or normalised, while categorical variables like demographic attributes are encoded to suit hierarchical or density-based clustering techniques (Romero & Ventura, 2020; T. Liu et al., 2022). Proper feature engineering ensures that the selected features are compatible with the structure of the clustering algorithm and improves the interpretability of the resulting student groups. An analysis of the reviewed studies reveals distinct trends in feature usage as shown in Table 4.

3.3.1. Academic Performance Metrics

Academic performance metrics, such as grades and test scores, have been widely used to assess student success. Numerous studies highlight the importance of academic data in predicting at-risk students. Both Mohamed Nafuri et al. (2022) and Linden et al. (2023) found that students’ grades in early assessments were among the most powerful predictors of future performance. Early academic performance, especially within the first few weeks of a course, often signals students who may need interventions later (Marbouti et al., 2016; Nayak et al., 2023; Shoaib et al., 2022). Although grades and test scores are reliable predictors, relying on them alone can lead to an incomplete understanding of student risk (Jovanović et al., 2021; Shoaib et al., 2022). T. Liu et al. (2022) argued that models focusing only on grades tend to miss behavioural and socio-emotional factors that could signal risk long before academic decline becomes evident. This insight suggests that academic metrics should be part of a larger predictive framework rather than the sole focus of risk detection models (Santoso & Yulia, 2019; Shovon & Haque, 2012; Yağcı, 2022).

3.3.2. Engagement Metrics

Behavioural engagement is another critical predictor, as many studies show that disengagement often precedes poor academic performance (Tan et al., 2022; Moubayed et al., 2020; Tempelaar et al., 2020). K. Na and Tasir (2017) demonstrated that login frequency, time spent on learning platforms, and participation in online discussions were significant indicators of potential academic struggles. These metrics help identify students who may disengage from the course before their academic performance starts to be impacted (Aldowah et al., 2019; Ben Soussia et al., 2021). Integrating engagement metrics with academic performance data enhances the accuracy of predictive models (Moubayed et al., 2020; Cole et al., 2021). Marbouti et al. (2016) found that incorporating engagement metrics into models that also used academic data could predict academic failure with higher precision. Studies have shown that students who actively participated in discussion forums and consistently logged into their learning management system (LMS) were significantly less likely to fail, even if their initial grades were not particularly high (Y. Liu et al., 2022; Miguéis et al., 2018). However, the effectiveness of engagement data depends on the availability of digital footprints (Moubayed et al., 2020). While LMS data is accessible in online and hybrid environments, traditional settings often lack comparable behavioural data, limiting the broader applicability of these models (A. Khan & Ghosh, 2020; Matz et al., 2023).

3.3.3. Socio-Economic and Demographic Factors

Socio-economic status (SES) and demographic data provide valuable inputs for student performance prediction models (Asif et al., 2017). Research shows that students from lower-income backgrounds often face academic challenges due to financial pressures, limited access to learning resources, and reduced support networks (Asif et al., 2017; Bishop & Nasrabadi, 2006; Jovanović et al., 2021). Models can include data points like parental education levels, family income, and geographical location to account for disparities that purely academic and engagement data might miss (Hassan et al., 2019; Jovanović et al., 2021). Studies have demonstrated that integrating SES variables enhanced the predictions of dropout risk and learning outcomes in diverse educational settings (Miguéis et al., 2018; K. S. Na & Tasir, 2017; Jovanović et al., 2021). Socio-economic factors help uncover broader structural challenges that affect learning engagement and achievement (Santoso & Yulia, 2019; Shoaib et al., 2022). However, using socio-economic and demographic data raises ethical concerns about privacy and the potential for bias (A. Khan & Ghosh, 2020; Miguéis et al., 2018). Some studies highlight the risk of perpetuating stereotypes or disproportionately flagging students from disadvantaged backgrounds as “at-risk” without sufficient context (Yağcı, 2022; Jovanović et al., 2021).

3.3.4. Behavioural Indicators

Behavioural indicators have a significant effect on academic performance: students who delay coursework, submit assignments at the last minute, or show inconsistent study habits often struggle to maintain steady academic progress (Kim et al., 2021; Yao et al., 2019; Han, 2023). Studies have found that academic procrastination is linked to increased stress, lower motivation, and poorer learning outcomes, which makes it an important factor to consider in predictive models (Xu et al., 2021; Aldowah et al., 2019; Ben Soussia et al., 2021; Jovanović et al., 2021). Research analysing student activity revealed that students who maintain consistent study routines, structured daily schedules, and regular sleep patterns tend to perform better academically (Yao et al., 2019). Students whose work patterns are inconsistent and erratic during the semester are at a higher risk of academic failure, even if their initial grades are strong (Akçapınar et al., 2019). These findings suggest that behavioural patterns, when combined with academic performance and socio-economic factors, can enhance the accuracy of student risk predictions (Han, 2023; Miguéis et al., 2018; Nayak et al., 2023; Kim et al., 2021).

3.4. Evaluation Metrics

The evaluation of clustering and predictive models in EDM requires a range of performance metrics to ensure reliability and accuracy. These metrics help researchers compare different models and assess their effectiveness in identifying at-risk students (Romero & Ventura, 2020; A. Khan & Ghosh, 2020). This section provides an overview of common evaluation metrics, analyses their application in different studies, and discusses challenges in establishing standardised evaluation frameworks across EDM research.

3.4.1. Clustering-Specific Metrics

Evaluating clustering models in EDM requires internal validation metrics that assess both intra-cluster cohesion and inter-cluster separation. The Silhouette Score remains one of the most frequently used metrics, measuring how similar a data point is to its own cluster compared to other clusters. Studies such as Dass et al. (2021) reported moderate Silhouette Scores in educational datasets, indicating clusters that are meaningful but affected by overlapping learning behaviours. When applied to engagement or performance data, this score helps quantify how well a given clustering algorithm segments students based on patterns in learning activity or academic outcomes (López et al., 2012; Romero & Ventura, 2020).
The Davies-Bouldin Index, which calculates the average ratio of within-cluster distances to between-cluster distances, is often used to compare the compactness and separation of clusters. Sahni (2023) demonstrated its utility in identifying the relative performance of hierarchical clustering and DBSCAN, while Zhang et al. (2023) used the index to refine cluster parameter tuning in large educational datasets. Its sensitivity to cluster shape and density makes it particularly relevant for analysing student behaviour patterns where noise and irregular distributions are common (Jayaprakash et al., 2020; Schubert, 2023).
Another widely applied metric, the Dunn Index, emphasises the distance between clusters relative to their size. This measure has been leveraged in studies examining density-based clustering models, especially when applied to noisy environments or irregular attendance data (K. S. Na & Tasir, 2017; A. Khan & Ghosh, 2020). Though it can effectively indicate well-separated clusters in small to medium datasets, its performance degrades when the number of clusters increases or when high-dimensional features are involved (A. Khan & Ghosh, 2020; R. Liu et al., 2022).
While these metrics serve different purposes, their application varies depending on the clustering method, data type, and educational context. Studies incorporating multiple metrics report more reliable insights into clustering quality than those relying on a single measure (Mohamed Nafuri et al., 2022; Dass et al., 2021; Le Quy et al., 2023). Particularly in EDM, where behavioural and demographic data are often high-dimensional and noisy, cross-metric validation has become standard practice to ensure clusters reflect underlying patterns in student performance, engagement, or risk (Han, 2023; Linden et al., 2023; Schubert, 2023).
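The three metrics can be computed side by side. In the hedged sketch below (synthetic data), the Silhouette Score and Davies-Bouldin Index come from scikit-learn, while the Dunn Index is implemented by hand from its usual definition, since scikit-learn provides no built-in for it:

```python
# Hedged sketch: three internal validation metrics for one k-means solution.
# The Dunn index is hand-rolled (min inter-cluster distance / max cluster diameter)
# because scikit-learn has no built-in implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, pairwise_distances, silhouette_score

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

def dunn_index(X, labels):
    D = pairwise_distances(X)
    clusters = np.unique(labels)
    inter = min(D[np.ix_(labels == a, labels == b)].min()
                for a in clusters for b in clusters if a < b)
    intra = max(D[np.ix_(labels == c, labels == c)].max() for c in clusters)
    return inter / intra

print("Silhouette:    ", round(silhouette_score(X, labels), 3))
print("Davies-Bouldin:", round(davies_bouldin_score(X, labels), 3))
print("Dunn index:    ", round(dunn_index(X, labels), 3))
```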

3.4.2. Evaluating Predictive Models

Evaluating predictive models in EDM requires a balanced consideration of multiple performance metrics, particularly in the context of identifying at-risk students where data imbalance is common. Figure 4 summarises the approximate frequency of evaluation metrics used in the reviewed studies. Accuracy remains a widely used metric due to its simplicity and interpretability. It quantifies the proportion of correct predictions relative to all predictions made. In balanced datasets, it provides a reliable snapshot of model performance across student categories (Romero & Ventura, 2020; Marbouti et al., 2016). However, its reliability declines in imbalanced settings, where a high accuracy score can mask poor performance in detecting minority classes, such as students at risk of dropout (A. Khan & Ghosh, 2020; Jovanović et al., 2021).
Many studies used precision and recall to evaluate model performance in identifying at-risk students. Precision was valued for ensuring that flagged students were truly at risk, which helps allocate support resources effectively (Ben Soussia et al., 2021; López et al., 2012). Recall was important for capturing the full range of students needing intervention, particularly in studies addressing first-year dropout prevention (Dass et al., 2021; T. Liu et al., 2022). Trade-offs between these two measures were commonly discussed, as balancing precision and recall is critical for designing effective early intervention models. Higher recall ensures that more potentially at-risk students are identified, reducing the risk of missing students who may disengage or drop out; however, this often lowers precision, which increases false positives and may overwhelm academic support services (Linden et al., 2023; Romero & Ventura, 2020). Conversely, models tuned for higher precision help institutions allocate resources more efficiently, but can miss students who would benefit from early intervention (Ben Soussia et al., 2021; Jovanović et al., 2021).
Threshold adjustment is a widely used approach for managing this trade-off (Xing & Du, 2019). The choice of threshold should reflect institutional priorities: for example, it may be preferable to prioritise recall in first-year courses where dropout rates are higher, and to prioritise precision when resources for intervention are limited (Ifenthaler & Yau, 2020; Xing & Du, 2019). The impact of threshold tuning also depends on the underlying data distribution and the imbalance between classes, which adds complexity to model calibration (Ben Soussia et al., 2021; Y. Liu et al., 2022; Xing & Du, 2019). Regular evaluation and threshold adjustment are therefore important to keep predictive models aligned with the evolving needs and priorities of educational institutions (Ifenthaler & Yau, 2020; Linden et al., 2023). Changes in threshold selection directly influence performance metrics such as precision, recall, F1-score, and AUC-ROC: lowering the threshold generally increases recall but reduces precision, while raising the threshold improves precision at the cost of recall (Huang et al., 2019; Linden et al., 2023). These shifts affect model calibration and its suitability for different institutional priorities, which highlights the importance of incorporating threshold tuning into model evaluation and reporting.
The F1-score, by integrating both precision and recall into a single metric, provides a more stable evaluation of model effectiveness, especially in datasets where class distributions are skewed. Its use is prevalent in EDM applications focused on flagging at-risk learners, where the cost of false negatives is high (Miguéis et al., 2018; Jovanović et al., 2021). Models with a balanced F1-score demonstrate the capacity to minimise both over-warning and under-warning, which is vital in developing trustworthy intervention tools (A. Khan & Ghosh, 2020).
Beyond classification accuracy and balance, AUC-ROC offers insight into a model's discriminatory power across varying thresholds. This metric is essential in contexts where binary classification is not sufficient to reflect subtle differences in student performance trajectories. AUC scores approaching 1 reflect high discriminatory ability, while those near 0.5 indicate random classification. Comparative studies have shown that models such as Random Forest consistently outperform simpler algorithms like logistic regression on AUC, highlighting their advantage in capturing non-linear patterns and complex feature interactions (K. S. Na & Tasir, 2017; Dass et al., 2021; Y. Liu et al., 2022). ROC curve analysis also supports threshold tuning by helping researchers understand how model performance shifts across different sensitivity and specificity levels (Romero & Ventura, 2020; Linden et al., 2023).
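The sketch below illustrates, on synthetic imbalanced data, how the decision threshold shifts precision, recall, and F1 while AUC-ROC remains threshold-independent; the classifier, class ratio, and thresholds are illustrative assumptions rather than settings from any reviewed study.

```python
# Minimal sketch of threshold tuning for a dropout-risk classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 10% of students in the "at-risk" class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]  # predicted probability of risk

print(f"AUC-ROC: {roc_auc_score(y_te, probs):.3f}")  # threshold-independent

# Lowering the threshold raises recall (fewer missed at-risk students) at the
# cost of precision (more false alarms for support services), and vice versa.
for threshold in (0.2, 0.5, 0.8):
    pred = (probs >= threshold).astype(int)
    print(f"t={threshold}: precision={precision_score(y_te, pred):.2f} "
          f"recall={recall_score(y_te, pred):.2f} "
          f"F1={f1_score(y_te, pred):.2f}")
```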

3.5. Dataset Characteristics

The characteristics of datasets play a pivotal role in the development and evaluation of clustering and predictive models in EDM. The datasets used in the reviewed studies vary significantly in source, size, collection methods, and limitations, all of which influence the generalisability and applicability of the findings.
The Open University Learning Analytics Dataset (OULAD) is one of the most frequently used open-access datasets in EDM research (Kuzilek et al., 2017; Lima et al., 2020). It contains over 32,000 student records and includes a broad range of features, such as demographic information, assessment scores, and detailed behavioural logs from a virtual learning environment. These features are frequently used in clustering tasks, especially those that apply k-means and DBSCAN, after preprocessing steps like scaling continuous features (e.g., click counts) and encoding categorical variables (e.g., gender, region). Its structured nature and multi-module composition support model training, evaluation, and cross-study comparisons. Several studies have used OULAD to build and test dropout prediction models, assess digital engagement, or evaluate clustering techniques (Mohamed Nafuri et al., 2022; Shoaib et al., 2022; Adnan et al., 2021).
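As an illustration of the preprocessing pattern described above, the sketch below scales aggregated click counts and one-hot encodes demographic variables before k-means. The file names follow the public OULAD release, but the column choices, aggregation step, and cluster count are simplifying assumptions for illustration.

```python
# Hedged sketch: OULAD-style preprocessing feeding a k-means clustering.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

students = pd.read_csv("studentInfo.csv")  # demographics and outcomes
clicks = pd.read_csv("studentVle.csv")     # per-interaction VLE click logs

# Aggregate behavioural logs to one row per student (total VLE clicks).
total_clicks = (clicks.groupby("id_student")["sum_click"].sum()
                .rename("total_clicks").reset_index())
data = students.merge(total_clicks, on="id_student", how="left")
data["total_clicks"] = data["total_clicks"].fillna(0)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["total_clicks", "studied_credits"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "region"]),
])

# Segment students into engagement/demographic clusters.
pipeline = Pipeline([
    ("prep", preprocess),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=1)),
])
data["cluster"] = pipeline.fit_predict(data)
```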
In addition to OULAD, many researchers use institution-specific datasets, usually extracted from internal learning management systems. These datasets often include login frequency, session durations, assessment participation, and student forum activity (Ben Soussia et al., 2021; Sahni, 2023; Dass et al., 2021). Sample sizes vary, usually ranging from a few hundred to a few thousand students. These datasets are often designed to capture real-time behavioural data and are well-suited for clustering students based on engagement profiles. Clustering techniques such as hierarchical clustering or density-based models are often applied to these data due to their ability to detect irregular behavioural groups without predefined labels. Such datasets offer detailed insight into student behaviour and support the design of context-specific interventions, but they tend to be restricted to single institutions or courses, limiting the potential for generalisation (Jovanović et al., 2021; Moubayed et al., 2020).
A number of studies use private academic datasets collected through institutional collaboration with faculty or academic offices. These datasets usually cover grade reports, attendance logs, and demographic profiles (Marbouti et al., 2016; Hassan et al., 2019). These datasets typically support the use of distance-based clustering methods, as performance-related features like GPA, test scores, and attendance rates are well-suited to numeric similarity-based grouping (Marbouti et al., 2016; Shoaib et al., 2022). While these datasets often offer higher feature granularity and support longitudinal clustering to track student progress over time (Hassan et al., 2019; Jovanović et al., 2021), their limited accessibility and inconsistent formatting restrict reproducibility and cross-institutional comparisons (A. Khan & Ghosh, 2020; Moubayed et al., 2020).
A smaller but important group of studies collect data through surveys and interviews, targeting socio-emotional factors, motivation levels, or home environments (A. Khan & Ghosh, 2020; Miguéis et al., 2018; Shoaib et al., 2024). These datasets, though smaller in size (often <500 students), enrich clustering by introducing latent behavioural or affective dimensions that are difficult to capture from system logs alone (A. Khan & Ghosh, 2020; Han, 2023). These features are often transformed into ordinal or numerical scales before applying clustering techniques to ensure compatibility with distance-based algorithms (Miguéis et al., 2018; Shoaib et al., 2024). However, due to their small sample size and higher cost of collection, their use remains limited in large-scale studies (Hassan et al., 2019; Han, 2023).
The clustering techniques used in the reviewed studies are closely tied to the nature of the dataset and feature availability. Distance-based clustering (e.g., k-means) is commonly applied when datasets include numerical behavioural and academic features (Romero & Ventura, 2020; Mohamed Nafuri et al., 2022). Density-based and hierarchical methods are preferred when data involve heterogeneous engagement patterns or categorical variables, as these algorithms handle varied data structures more flexibly (K. S. Na & Tasir, 2017; Xiong et al., 2024). The scale of datasets typically ranges from a few hundred records in institution-specific studies (Ben Soussia et al., 2021; Dass et al., 2021) to tens of thousands in publicly available datasets like OULAD (Kuzilek et al., 2017; Lima et al., 2020), enabling a wide range of clustering applications depending on the research goal and data structure.

4. Findings and Discussion

This section presents the findings of the review in response to the main research question and supporting sub-questions, focusing on how clustering-based predictive models are used in EDM to identify and support at-risk students.
Research Question:
How can clustering-based predictive models be effectively utilised to identify and support at-risk students in diverse educational contexts?
Clustering-based predictive models combine unsupervised learning techniques with supervised machine learning algorithms to uncover meaningful patterns in complex student datasets. Various clustering approaches, such as partitioning methods, hierarchical techniques, and density-based algorithms, have been applied in the reviewed studies, each chosen based on dataset structure, noise levels, and analytical goals (Parack et al., 2012; Sahni, 2023). These groupings support machine learning models in predicting dropout risk, academic failure, or other learning outcomes (T. Liu et al., 2022; Dass et al., 2021).
One key advantage of this integrated approach is its ability to process multi-dimensional data, including assessment results, LMS activity, socio-demographic information, and behavioural logs. Hierarchical clustering, for instance, has been used to identify engagement patterns across learning timelines and contexts, with those patterns used to strengthen prediction models focused on identifying students at risk of failure (Severson et al., 2007; Mohamed Nafuri et al., 2022). Density-based approaches, which have demonstrated effectiveness in managing noisy and irregular data, have also been used to cluster students, improving the stability and accuracy of dropout predictions in such contexts (Linden et al., 2023; Nayak et al., 2023). The flexibility of clustering-based predictive models allows them to operate effectively in diverse educational settings. In online learning environments, researchers have made extensive use of LMS-derived data, including login frequency, page access, and discussion forum activity to support early risk identification (K. Na & Tasir, 2017; Aldowah et al., 2019). In more traditional classroom settings, predictive models have successfully incorporated academic records and demographic features to support student success when behavioural data is limited (Marbouti et al., 2016; Jovanović et al., 2021). This adaptability highlights the generalisability of these models across different contexts.
The performance of clustering-based predictive models depends on the choice of clustering techniques, machine learning algorithms, and feature selection strategies. While some clustering techniques are valued for their speed and simplicity, they rely on assumptions about cluster shape or size that may not hold in real-world educational data (R. Liu et al., 2022). Others offer more flexibility in handling outliers or uneven distributions but require careful parameter tuning or carry high computational costs, which can reduce scalability in larger datasets (Park et al., 2016; Linden et al., 2023; Nayak et al., 2023). The choice of predictive algorithm also influences model effectiveness. Models like logistic regression and decision trees are commonly used for their interpretability and alignment with educational decision-making needs, but they may struggle with complex, non-linear relationships in the data (Romero & Ventura, 2020; A. Khan & Ghosh, 2020). More advanced techniques like ensemble methods and gradient boosting algorithms are increasingly adopted for their strong performance in high-dimensional and heterogeneous datasets (R. Liu et al., 2022; Jovanović et al., 2021; Zhang et al., 2023). The features used in these models also influence their performance. Academic metrics, such as grades, attendance, or assessment performance, are commonly included due to their direct correlation with student outcomes, but they alone are insufficient to capture the complexity of student success (Marbouti et al., 2016; Sahni, 2023). Engagement data from LMS platforms, behavioural indicators like procrastination or submission timing, and socio-demographic factors provide additional layers of insight (Ben Soussia et al., 2021; Han, 2023).
Although clustering-based predictive models demonstrate clear benefits, they still face challenges concerning their fairness and generalisability. Many studies focus on data derived from digital learning environments, particularly LMS logs, which limits their relevance to more traditional or blended contexts (Romero & Ventura, 2020; Jovanović et al., 2021). Furthermore, ethical concerns regarding fairness, privacy, and transparency arise when incorporating socio-demographic features that may inadvertently influence model decisions in ways that disadvantage specific student groups (A. Khan & Ghosh, 2020; Miguéis et al., 2018).

4.1. Clustering Application in EDM

This section addresses the first sub-research question: How can clustering techniques be used in educational data mining (EDM)? It examines how clustering has been applied to extract patterns from educational datasets, focusing on student profiling, personalised learning, and curriculum design.

4.1.1. Clustering for Academic and Behavioural Profiling

Clustering techniques have been used in educational data mining to reveal underlying performance patterns by analysing student scores, participation, and engagement trends. Studies have shown that clusters formed based on assessment data and attendance records often correspond to performance levels such as high-achieving, average, and at-risk, enabling timely support and targeted interventions (Parack et al., 2012; López et al., 2012). These clusters also capture students who consistently excel across different subjects, which helps institutions replicate effective teaching strategies and align instructional design with successful learner behaviours (Linden et al., 2023; Han, 2023).
Beyond academic records, clustering behavioural data from learning management systems—such as login frequency, content interaction, and forum participation—has helped uncover engagement patterns that are not immediately visible through grades alone. Disengaged students, for example, can be grouped based on irregular access or last-minute submissions, both of which have been linked to higher dropout rates (K. Na & Tasir, 2017; Ben Soussia et al., 2021). Attendance clustering, too, has revealed trends where students with sporadic participation often show signs of academic struggle, reinforcing the link between behavioural habits and performance outcomes (Han, 2023; Linden et al., 2023).
Behavioural profiling through clustering is also applied to understand how students interact with different content types. Distinct engagement preferences—such as favouring quizzes over readings, or discussion forums over video lectures—have informed curriculum adjustments that align better with student learning preferences, increasing participation and satisfaction (Mohamed Nafuri et al., 2022; K. Na & Tasir, 2017). Similarly, course-level analyses have identified classes with persistently low engagement. Clustering these courses based on participation metrics has led to instructional redesigns that have improved subsequent student interaction and academic success (Sahni, 2023; Romero & Ventura, 2020). In many studies, behavioural clusters have been incorporated as features in predictive models, improving classification accuracy for identifying at-risk students. Groupings based on LMS behaviour, when used in combination with demographic and academic data, have improved the precision of models like decision trees and ensemble learners (Jovanović et al., 2021; R. Liu et al., 2022; Shoaib et al., 2022). This integration allows for early intervention strategies grounded in actual student activity rather than retrospective performance alone.
Clustering for academic and behavioural profiling, when based on robust features and well-pre-processed data, contributes substantially to identifying both students who need support and those whose successful patterns can be scaled more broadly. The growing body of research reflects the value of these techniques in building richer, more actionable learner profiles (López et al., 2012; Dass et al., 2021; Moubayed et al., 2020).

4.1.2. Clustering for Personalised Learning

Clustering has become a valuable tool for enabling personalised learning by identifying patterns in student engagement, resource preferences, and learning pace. These patterns allow educators to tailor instructional strategies and support mechanisms to better suit individual and group needs, which has been shown to improve participation, satisfaction, and performance in various learning environments (Mohamed Nafuri et al., 2022; Romero & Ventura, 2020).
Several studies have used clustering to examine how students engage with different types of learning materials. When students were grouped according to their dominant content preferences, such as favouring video-based content, interactive quizzes, or reading materials, they responded more positively to content tailored to their engagement style (Mohamed Nafuri et al., 2022; K. S. Na & Tasir, 2017). This alignment between content type and learner preference not only increased interaction but also supported deeper comprehension and retention. Differences in students' pacing have been captured through clustering based on time spent engaging with learning modules. This information helped educators differentiate instructional paths, providing faster learners with more advanced tasks and slower learners with additional support (Han, 2023). Tailored materials not only improve engagement but also positively influence performance, especially when resources are aligned with students' preferred learning modalities (Han, 2023; Mohamed Nafuri et al., 2022; Shoaib et al., 2022).
Beyond content preferences, clustering has also been applied to identify behavioural and cognitive challenges. Some studies have identified student groups that consistently struggled with certain types of assessment tasks, prompting targeted interventions to address conceptual or procedural difficulties (Sahni, 2023; Y. Liu et al., 2022). Whether linked to conceptual misunderstandings, task-specific difficulties, or motivational barriers, these insights inform targeted instructional support (Romero & Ventura, 2020; A. Khan & Ghosh, 2020). Studies have demonstrated that providing extra support or engagement prompts to less active clusters can enhance participation and foster a more inclusive learning environment (K. Na & Tasir, 2017; Dass et al., 2021; T. Liu et al., 2022). Supporting slower-paced learners with additional scaffolding and allowing advanced learners to progress independently has been shown to improve retention and reduce frustration (Han, 2023; Linden et al., 2023; Santoso & Yulia, 2019).
Personalisation also extends to students’ collaborative preferences. Clustering has helped distinguish between learners who thrive in group discussions and those who prefer independent work (Han, 2023; Linden et al., 2023; Santoso & Yulia, 2019). Educators have used these insights to assign roles in collaborative activities or offer alternative learning paths that align with students’ social learning styles, resulting in improved engagement and learner satisfaction (Ben Soussia et al., 2021; Romero & Ventura, 2020). This adaptability is particularly valuable in hybrid environments, where in-person and online participation patterns vary widely (Ben Soussia et al., 2021).
The integration of clustering into adaptive learning systems further enhances personalisation by adjusting the content delivery sequence (Y. Liu et al., 2022; Yağcı, 2022). Students showing early mastery can be routed toward more advanced topics, while those needing reinforcement receive additional materials, leading to improvements in both learning efficiency and retention (Jovanović et al., 2021).

4.1.3. Clustering for Curriculum and Course Design

Several studies have shown that clustering can guide curriculum and course development by analysing trends in student engagement, learning outcomes, and course delivery (Sahni, 2023; K. Na & Tasir, 2017; Ben Soussia et al., 2021). These insights help institutions identify gaps, streamline course progression, and align curricula with student needs. Clustering methods have been used to classify courses based on participation levels, with consistently low-engagement courses flagged for redesign (Romero & Ventura, 2020; López et al., 2012). Redesigning such courses with clearer content and more interactive components has led to measurable improvements in participation and student satisfaction (K. S. Na & Tasir, 2017; Linden et al., 2023).
Cluster analysis of assessment data has also revealed repeated performance struggles in specific fields, such as quantitative reasoning or critical analysis, prompting targeted interventions like remedial workshops or redesigned learning modules (K. Na & Tasir, 2017; Romero & Ventura, 2020; Santoso & Yulia, 2019). When these patterns are identified early, institutions have introduced targeted workshops or modified curricular content to bridge skill gaps (Jovanović et al., 2021). When applied to curriculum alignment, clustering techniques have been used to identify overlapping content between adjacent courses or inconsistencies in learning objectives that may burden students with redundant materials (Jovanović et al., 2021). Aligning outcomes across programs using clustering insights has helped streamline course progression and reduce student cognitive load (López et al., 2012; Yağcı, 2022). Other studies have extended this approach to ensure that program-level competencies match evolving industry needs, enhancing the employability of graduates (Namoun & Alshanqiti, 2020). The selection and design of elective courses have similarly benefited from clustering applications. Student preferences, combined with academic performance and enrolment trends, have been analysed to identify popular thematic areas, such as artificial intelligence and sustainability, guiding institutions to expand offerings that resonate with both student interest and job market demand (Ben Soussia et al., 2021; Romero & Ventura, 2020). Clustering has also informed the evaluation of delivery formats by linking learning outcomes to students’ interaction with different modes of content (A. Khan & Ghosh, 2020; Yağcı, 2022). Higher achievement in courses with a mix of visual, textual, and interactive components has supported the shift toward blended models in several institutions (Han, 2023; Linden et al., 2023). These models allow flexibility while maintaining structure, improving participation in both synchronous and asynchronous components (Jovanović et al., 2021).

4.2. Integration of Clustering and Predictive Models

This section addresses the second sub-research question: How are predictive models integrated with clustering techniques to identify at-risk students? It explores how clustering outputs are used to enhance predictive modelling through data preprocessing, feature engineering, and hybrid modelling frameworks, with applications across various learning environments and institutional contexts.

4.2.1. Impact of Integration

Integrating clustering with predictive modelling improves predictive accuracy by structuring raw data into cohesive groups. This preprocessing transforms unstructured or noisy educational data into meaningful groupings, which serve as inputs for predictive algorithms, ensuring they work with more interpretable and homogeneous subsets of data (Romero & Ventura, 2020; López et al., 2012). Models built on cluster-informed data consistently perform better than those trained on unstructured features, with studies reporting improvements in precision, recall, and overall predictive stability (Sahni, 2023; Romero & Ventura, 2020; Parack et al., 2012; T. Liu et al., 2022).
Apart from accuracy, integrated models offer greater interpretability. Clusters derived from behavioural or academic data provide detailed profiles of student needs that help educators to trace prediction outcomes back to meaningful patterns such as disengagement, attendance fluctuations, or underperformance within specific subgroups (López et al., 2012; Linden et al., 2023; Jovanović et al., 2021). Decision-making becomes more transparent when predictions are linked to group-level trends (Y. Liu et al., 2022). Cluster-derived features have helped clarify why some students are flagged as at risk, such as irregular attendance, late submission patterns, or disengagement from digital platforms (López et al., 2012; Linden et al., 2023; Jovanović et al., 2021). In many studies, these insights were essential for predicting and preventing dropout, especially in digital or hybrid environments where academic failure is often preceded by engagement decline (Al-Shabandar et al., 2017). Some studies show that behavioural clusters have been used to uncover systemic patterns of non-participation that traditional models may miss, allowing institutions to prioritise support for those who show early signs of academic risk (López et al., 2012; Jovanović et al., 2021; K. Na & Tasir, 2017).
Cluster-based predictive models also support personalised interventions. When predictive outputs are paired with insights about cluster characteristics, educators can design responses that match the learning pace, preferences, or behavioural needs of each group (Sahni, 2023; Parack et al., 2012). Studies in hybrid and online settings have demonstrated how clustering engagement data helps tailor interventions, such as offering time management support to late submitters or peer-based activities for students with collaborative tendencies (Y. Liu et al., 2022; Ben Soussia et al., 2021; Romero & Ventura, 2020). When clustering was used to distinguish students by learning preference or motivation, it enabled personalised support strategies such as visual content for video-oriented learners or self-paced modules for independent learners (Romero & Ventura, 2020; Sahni, 2023).
Integrated models help uncover program-level and institutional insights. Clustering reveals patterns of engagement and performance that span across units or cohorts, which can be incorporated into predictive frameworks to inform decisions about curriculum design, resource allocation, or student support infrastructure (Romero & Ventura, 2020; Sahni, 2023). Clustering outputs such as group labels and centroids support tailored interventions to help predictive models deliver actionable insights, allowing educators to design targeted support strategies for specific student groups (López et al., 2012; Jovanović et al., 2021). Institutions could gain deeper insights into student behaviours and performance, enhancing both individual outcomes and institutional strategies (Linden et al., 2023; Namoun & Alshanqiti, 2020; Santoso & Yulia, 2019).

4.2.2. Applications in Educational Contexts

Studies demonstrate that clustering outputs, such as student groupings based on engagement or performance, significantly improve the accuracy and interpretability of predictive models (Romero & Ventura, 2020; López et al., 2012). These methods combine clustering outputs with predictive algorithms to address diverse challenges in online, blended, and traditional learning environments, enabling more effective interventions and support systems (K. S. Na & Tasir, 2017; Linden et al., 2023; T. Liu et al., 2022).
Online Learning Environment
Online education generates large volumes of behavioural data from learning management systems (LMS), including clickstream logs, time spent on resources, and forum participation. Clustering techniques are often used to structure this data before predictive modelling, improving dropout prediction and engagement analysis; several studies have demonstrated that clustering enhances the detection of at-risk students by segmenting learners into distinct engagement categories, such as highly engaged, moderately engaged, and disengaged groups (K. Na & Tasir, 2017; Linden et al., 2023; Romero & Ventura, 2020). K. Na and Tasir (2017) showed that hierarchical clustering could effectively group students into highly engaged, moderately engaged, and disengaged clusters based on LMS activity; logistic regression models using these clusters as input features achieved higher recall rates, identifying disengaged students early. Sahni (2023) applied DBSCAN to noisy LMS data, such as inconsistent login patterns and irregular forum participation, identifying clusters of disengaged students that would otherwise remain undetected; when these clusters were used in a Random Forest model, dropout prediction accuracy improved by 20%. Romero and Ventura (2020) reported similar findings, where clustering LMS engagement data enhanced the precision of machine learning models in predicting at-risk students. When integrated into predictive models, these engagement profiles significantly enhance dropout prediction accuracy, enabling targeted interventions like personalised reminders or tailored support materials to be implemented proactively (Linden et al., 2023; Romero & Ventura, 2020).
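A minimal sketch of this integration pattern is shown below: hierarchical clustering produces engagement profiles whose one-hot labels are appended to a logistic regression's feature set. All data and feature names are synthetic stand-ins; the sketch illustrates the general approach rather than reconstructing any study's exact pipeline.

```python
# Sketch: cluster labels from LMS engagement data reused as classifier features.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical columns: weekly logins, forum posts, minutes on resources.
engagement = rng.poisson(lam=(5, 2, 120), size=(500, 3)).astype(float)
# Synthetic dropout label loosely tied to low login activity.
dropout = (engagement[:, 0] + rng.normal(0, 2, 500) < 4).astype(int)

X = StandardScaler().fit_transform(engagement)

# Step 1: group students into engagement profiles (e.g., high/moderate/low).
clusters = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Step 2: append one-hot cluster membership as extra classifier features.
X_aug = np.column_stack([X, np.eye(3)[clusters]])

baseline = cross_val_score(LogisticRegression(), X, dropout, cv=5,
                           scoring="recall").mean()
augmented = cross_val_score(LogisticRegression(), X_aug, dropout, cv=5,
                            scoring="recall").mean()
print(f"recall without clusters: {baseline:.2f}, with clusters: {augmented:.2f}")
```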
Blended and Hybrid Learning
Blended learning environments involve both online and in-person interactions, requiring models to handle diverse data types. Studies reveal that clustering methods help segment students based on combined behavioural, engagement, and attendance data, enabling predictive models to address the needs of mixed online and face-to-face educational interaction. In one such application, these clusters were integrated into Random Forest models, improving prediction accuracy by 15% and highlighting mixed engagement patterns often missed by standalone algorithms (T. Liu et al., 2022; Romero & Ventura, 2020). Predictive models leveraging these clusters enabled educators to provide targeted interventions, such as offering additional group activities for collaborative learners or providing self-paced modules for independent learners (Santoso & Yulia, 2019; Y. Liu et al., 2022).
Traditional Classroom Setting
In traditional classrooms, where digital engagement data is limited, clustering techniques focus on academic and attendance records to enhance predictive modelling. Marbouti et al. (2016) applied K-means clustering to segment students by early grades and attendance patterns, producing clusters that were used in logistic regression models to identify at-risk students before midterm exams. The study found that clustering improved prediction accuracy by structuring the data into more homogeneous subsets. Clustering also aids in addressing systemic inequities in traditional settings. Combining clustering with socio-economic data, such as parental education and household income, has revealed patterns of underperformance among students from disadvantaged backgrounds. Predictive models trained on these clusters identified students requiring financial aid or academic support, helping to reduce disparities in educational outcomes (Jovanović et al., 2021; Namoun & Alshanqiti, 2020).
Program and Curriculum Optimisation
Clustering techniques extend beyond individual student analysis to support institutional decision-making by analysing patterns across multiple courses or curricula. Sahni (2023) used K-means clustering to analyse engagement metrics across multiple courses, identifying clusters of courses with consistently low participation rates. These clusters were incorporated into predictive models, revealing systemic issues such as unclear objectives and inadequate support. Interventions based on these findings, such as redesigning assessments and enhancing course delivery, resulted in higher engagement and improved student outcomes in subsequent semesters. Romero and Ventura (2020) applied DBSCAN to cluster courses based on difficulty and dropout rates. The analysis revealed foundational courses with high attrition due to poor alignment with students’ prior knowledge. Predictive models leveraging these clusters helped institutions introduce preparatory modules and additional support for at-risk courses, reducing dropout rates by 18% in the following academic year. López et al. (2012) found similar applications in engineering programs, where clustering student performance and preferences informed the design of specialized elective tracks, aligning course offerings with emerging industry demands.

4.2.3. Advanced Integration Strategies

Recent work in EDM has increasingly moved toward hybrid modelling approaches, driven by the limitations of single-source predictive models and the need to capture the multifaceted nature of student engagement and risk (R. Liu et al., 2022). These models combine academic records, behavioural indicators, engagement metrics, and socio-demographic variables to create hybrid models with enhanced predictive power (Ben Soussia et al., 2021). Hybrid models use diverse data types to provide a comprehensive understanding of student risk profiles and enable more precise interventions (R. Liu et al., 2022; Santoso & Yulia, 2019).
Feature engineering plays an important role in preparing educational data for clustering. In the reviewed studies, features commonly include academic metrics (grades, GPA, assessment outcomes), behavioural indicators (LMS activity patterns, clickstream data), engagement metrics (forum participation, video viewing duration), and socio-demographic attributes (age, gender, region). These features are often normalised or scaled to ensure compatibility with distance-based clustering algorithms such as k-means (Romero & Ventura, 2020; T. Liu et al., 2022). Dimensionality reduction techniques like PCA are applied in several studies to reduce noise and improve the stability of clustering outcomes, particularly for high-dimensional behavioural data (Vora & Rajamani, 2019). Some studies also encode categorical features for hierarchical clustering or apply transformation steps to enhance cluster separation (Sahni, 2023). In many cases, the cluster labels generated from these engineered features are themselves reintroduced as features for supervised prediction, acting as an additional layer of abstraction that captures group-level behaviours or risk patterns (Jovanović et al., 2021; K. Na & Tasir, 2017). This demonstrates that clustering not only benefits from well-engineered inputs but also serves as a feature engineering step for downstream predictive models. Integrating diverse data sources allows these models to identify at-risk students with greater precision, even in complex and varied learning environments (Romero & Ventura, 2020; Jovanović et al., 2021).
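The sketch below illustrates this feature-engineering flow under stated assumptions: synthetic high-dimensional behavioural features are scaled, compressed with PCA to reduce noise, and clustered, with the resulting labels available for reuse in a downstream supervised model. The dimensions, component count, and cluster count are arbitrary choices for illustration.

```python
# Hedged sketch: scale -> PCA -> k-means, with cluster labels kept for reuse.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
clickstream = rng.random((1000, 60))  # e.g., 60 weekly activity counters

reducer = make_pipeline(
    StandardScaler(),                       # normalise for distance-based k-means
    PCA(n_components=10, random_state=0),   # denoise / compress to 10 components
)
X_reduced = reducer.fit_transform(clickstream)

risk_profile = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_reduced)
# `risk_profile` can now be appended to a supervised model's feature set,
# as in the logistic regression sketch earlier in this section.
```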
Ben Soussia et al. (2021) developed a hybrid model that integrated academic grades, attendance records, and engagement metrics. This model achieved 15% higher accuracy compared to models relying solely on academic data. R. Liu et al. (2022) combined clickstream data with assessment scores to identify at-risk students in online learning environments, reporting significantly improved accuracy. These multi-dimensional models provide a more complete picture of students' risk profiles, allowing educators to tailor interventions more precisely (Y. Liu et al., 2022; Santoso & Yulia, 2019).
Effective hybrid modelling also depends on careful feature selection. Techniques such as Principal Component Analysis (PCA) or feature importance rankings help refine models by retaining the predictors that contribute most to performance and removing less impactful features (Marbouti et al., 2016; Vora & Rajamani, 2019). This process ensures that models remain computationally efficient and focused on the predictors most strongly associated with academic risk. Studies have shown that when feature selection incorporates academic metrics, behavioural logs, and socio-economic indicators, the resulting models not only improve accuracy but also provide deeper contextual understanding (Marbouti et al., 2016; Namoun & Alshanqiti, 2020). Vora and Rajamani (2019) used Random Forest importance rankings to prioritise behavioural features in a hybrid model, ensuring that only the most relevant variables were included. Jovanović et al. (2021) combined socio-economic indicators, such as parental education and family income, with academic metrics to predict dropout risks. The study found that including socio-economic features improved the model's ability to contextualise academic performance, enabling more targeted support.
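The following sketch shows feature selection via Random Forest importance rankings in the spirit of the approach attributed to Vora and Rajamani (2019); the synthetic dataset and the mean-importance retention rule are illustrative assumptions, not details drawn from the cited studies.

```python
# Sketch: keep only features whose Random Forest importance exceeds the mean.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a mixed academic/behavioural feature matrix.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=6,
                           random_state=3)

forest = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)

# Retain features ranked above the mean importance across all features.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print(f"kept {X_selected.shape[1]} of {X.shape[1]} features")
```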
Hybrid models also contribute to personalised learning by identifying patterns across academic, behavioural, and engagement data. When diverse features are integrated, models can generate recommendations aligned with students’ specific learning needs, such as offering targeted feedback or adapting instructional materials based on individual progress and preferences (Santoso & Yulia, 2019; Romero & Ventura, 2020). These outputs have been shown to enhance student outcomes and increase satisfaction, particularly when models are used to allocate support services to groups with similar learning profiles (Ben Soussia et al., 2021; Jovanović et al., 2021). Research also suggests that tailoring interventions through hybrid frameworks can reduce disengagement and improve course completion rates, especially in flexible or hybrid learning environments (T. Liu et al., 2022; Namoun & Alshanqiti, 2020).
Feature selection techniques improve the scalability and adaptability of hybrid models (Matz et al., 2023). Reducing the number of features through dimensionality reduction techniques such as PCA allows models to process large volumes of institutional data more efficiently, making them suitable for use in environments with expanding digital infrastructures (Matz et al., 2023; Marbouti et al., 2016). Feature selection also addresses the risk of overfitting, a common issue in high-dimensional educational datasets, by retaining only the most relevant predictors (Y. Liu et al., 2022; Vora & Rajamani, 2019). Studies have shown that models with well-selected features perform more consistently across different cohorts and contexts, supporting the development of generalisable solutions for student support and risk identification (Namoun & Alshanqiti, 2020; Vora & Rajamani, 2019; Yağcı, 2022).

4.3. Challenges and Limitations of Clustering-Based Predictive Models

This section addresses the third sub-research question: What are the challenges and limitations of applying clustering techniques and predictive models in EDM? Drawing from the reviewed literature, four interrelated dimensions are examined: technical challenges, generalisability and context-dependency, ethical and privacy considerations, and institutional barriers.

4.3.1. Technical Constraints

One major technical challenge is the computational complexity of clustering techniques, especially for large datasets. When working with large educational datasets, particularly those containing fine-grained behavioural or engagement logs, clustering becomes increasingly resource-intensive (Park et al., 2016). This is especially problematic for methods that rely on careful parameter tuning, such as setting the minimum number of points per cluster and the distance threshold, since these choices significantly affect cluster quality and model outcomes. Improper parameter selection often leads to over-clustering or under-clustering, reducing the reliability of results, and the tuning process itself may not be feasible for institutions with limited computational infrastructure (Linden et al., 2023; Nayak et al., 2023).
Imbalanced datasets present another challenge. In most educational datasets, the proportion of at-risk students is much smaller than that of non-at-risk students. This imbalance often leads to high accuracy scores that mask poor performance in detecting the very group of concern (A. Khan & Ghosh, 2020; Dass et al., 2021). Techniques such as generating synthetic samples or applying weighted learning algorithms are often employed, but they add complexity to model development and require careful validation to avoid overfitting (Romero & Ventura, 2020; Dass et al., 2021).
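The sketch below contrasts a plain classifier with a class-weighted one on synthetic imbalanced data, and notes the synthetic-oversampling (SMOTE) alternative from the separate imbalanced-learn package; the data, class ratio, and model choice are illustrative assumptions.

```python
# Sketch: two common remedies for class imbalance in at-risk prediction.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 5% of students in the at-risk class.
X, y = make_classification(n_samples=3000, weights=[0.95, 0.05], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Class weighting typically raises recall on the minority (at-risk) class.
print("recall, plain:   ", recall_score(y_te, plain.predict(X_te)))
print("recall, weighted:", recall_score(y_te, weighted.predict(X_te)))

# Synthetic oversampling alternative (requires `pip install imbalanced-learn`):
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=5).fit_resample(X_tr, y_tr)
```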
Another challenge is the presence of noise and missing values in educational data. Educational data, particularly behavioural records from LMS platforms, often include noise, inconsistencies, and missing values. These issues degrade clustering performance and reduce the reliability of predictive outcomes if not properly handled (López et al., 2012; K. S. Na & Tasir, 2017). Preprocessing techniques, such as imputing missing values or filtering noise, improve data quality, but they normally demand additional processing time.
The integration of clustering outputs into predictive models also introduces additional layers of technical decision-making. Outputs like group labels or behavioural centroids must be carefully engineered into features that align with the assumptions of downstream algorithms (Jovanović et al., 2021; T. Liu et al., 2022). Without a structured feature selection process, these inputs may introduce irrelevant or redundant information, increasing the risk of overfitting, where models perform well during training but generalise poorly to unseen data (Marbouti et al., 2016; Matz et al., 2023).

4.3.2. Generalisability and Contextual Limitations

Predictive models developed within specific educational contexts often struggle to generalise across diverse learning environments due to substantial differences in dataset characteristics and availability. Models trained on rich datasets from online or hybrid learning environments, which typically include detailed clickstream logs and engagement indicators, frequently fail to produce reliable outcomes when applied to traditional classroom settings where digital interaction data is sparse or unavailable (Jovanović et al., 2021; Romero & Ventura, 2020). Datasets collected from well-resourced institutions with comprehensive data collection infrastructures seldom reflect the realities of institutions with limited technological capabilities, further constraining model transferability (K. Na & Tasir, 2017; Namoun & Alshanqiti, 2020; Yağcı, 2022).
Clustering techniques also face generalisability challenges due to their sensitivity to dataset structure, as many rely on data having specific structural characteristics, such as evenly distributed or clearly defined clusters (Namoun & Alshanqiti, 2020; Yağcı, 2022). These assumptions often fail in real-world educational datasets, resulting in inconsistent clustering outcomes when methods are applied across different institutions or contexts (R. Liu et al., 2022). Certain clustering algorithms demand extensive parameter adjustments tailored specifically to individual datasets, complicating efforts to standardise their implementation and limiting their effectiveness in diverse educational scenarios (Linden et al., 2023; Nayak et al., 2023).
Features like parental education, family income, or geographic location further challenge generalisation. These socio-demographic factors, commonly included in hybrid predictive models, vary significantly across institutional and cultural boundaries (Santoso & Yulia, 2019). Models developed in one socio-economic or regional context may therefore fail to accurately reflect student profiles in other settings, leading to potentially misleading predictions (Jovanović et al., 2021; Yağcı, 2022). This issue is particularly pronounced in cross-national applications, where differences in educational systems, grading standards, and engagement behaviours must be accounted for (Santoso & Yulia, 2019; Namoun & Alshanqiti, 2020).
Variability in institutional capacity to collect, process, store, and analyse educational data also restricts generalisability. Institutions with limited digital infrastructure frequently produce datasets that are fragmented, incomplete, or inconsistently formatted, hindering the direct application of models developed elsewhere (Romero & Ventura, 2020; López et al., 2012). These disparities hinder the transferability of models and require significant customisation for successful implementation.
While public datasets provide benchmarks for model evaluation, their specific focus on online learning limits their generalisability to other contexts (Y. Liu et al., 2022; K. Na & Tasir, 2017). Private datasets, although rich in context-specific details, are often unavailable for broader validation due to privacy concerns, further restricting the scope of generalisable findings (Jovanović et al., 2021). The integration of clustering and predictive models complicates generalisability due to the combined dependence on both clustering structure and predictive algorithm performance (Romero & Ventura, 2020; S. J. Kleter, 2022).

4.3.3. Ethical and Privacy Concerns

The use of sensitive data, such as socio-economic and demographic information, raises ethical concerns related to fairness and bias in predictive modelling. Models incorporating socio-economic indicators, like family income or parental education, often disproportionately classify students from disadvantaged backgrounds as at-risk, irrespective of their actual academic performance (Jovanović et al., 2021; Yağcı, 2022). This overrepresentation can reinforce negative stereotypes, stigmatise specific student groups, and amplify existing social inequalities within educational systems (Namoun & Alshanqiti, 2020; Santoso & Yulia, 2019). To address these issues, it is essential to design models that explicitly consider fairness constraints, ensuring predictions do not perpetuate biases or unintentionally disadvantage marginalised students (Miguéis et al., 2018; Dass et al., 2021).
Predictive accuracy improvements through inclusion of sensitive features must be balanced against the risk of embedding systemic inequities into the analytical processes (Yağcı, 2022; Namoun & Alshanqiti, 2020). Researchers caution that while demographic and socio-economic variables may enhance model performance, their uncritical use can obscure deeper, structural factors influencing student outcomes, potentially diverting attention from necessary institutional reforms (Romero & Ventura, 2020; A. Khan & Ghosh, 2020). Privacy risks also significantly increase as educational data becomes more detailed and interconnected. Integrating academic records, behavioural logs from LMS, and demographic details raises the potential for re-identifying individual students, even when data appears anonymised (K. S. Na & Tasir, 2017; López et al., 2012). Effective anonymisation, rigorous consent processes, and secure data management practices are critical in addressing these privacy risks (Romero & Ventura, 2020). However, implementing robust privacy protections requires significant technical expertise and institutional resources, posing considerable challenges for smaller or resource-limited educational institutions (K. Na & Tasir, 2017; S. J. Kleter, 2022).
Several practical approaches can help mitigate demographic bias in cluster-based predictive models. These include using fairness-aware machine learning techniques, applying careful feature selection to reduce reliance on sensitive attributes, conducting regular bias audits across demographic subgroups, and reporting model performance separately for different student groups (Holstein et al., 2019; Binns, 2018; Ifenthaler & Yau, 2020). Adopting these practices can improve the transparency and fairness of predictive models in educational contexts.
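As a minimal illustration of the subgroup-reporting practice recommended above, the sketch below computes recall separately for each socio-economic group; the group labels, outcomes, and predictions are synthetic placeholders, and a real audit would use a trained model's predictions on held-out data.

```python
# Sketch: a simple bias audit reporting recall per demographic subgroup.
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(11)
group = rng.choice(["low_ses", "mid_ses", "high_ses"], size=600)  # hypothetical
y_true = rng.integers(0, 2, size=600)   # placeholder at-risk labels
y_pred = rng.integers(0, 2, size=600)   # stand-in for model predictions

# Large gaps in recall between groups would flag potential model bias.
for g in np.unique(group):
    mask = group == g
    print(f"{g}: recall={recall_score(y_true[mask], y_pred[mask]):.2f} "
          f"(n={mask.sum()})")
```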

4.3.4. Institutional Barriers

Institutional barriers also affect the practical implementation of clustering and predictive models. Many institutions face resource constraints, such as limited technical infrastructure, inadequate funding, and insufficient technical expertise, which restrict their capacity to adopt advanced analytical techniques (Romero & Ventura, 2020; López et al., 2012). Smaller or resource-constrained institutions often lack the necessary resources to effectively collect, manage, and analyse comprehensive datasets required for sophisticated hybrid modelling approaches (Marbouti et al., 2016; Namoun & Alshanqiti, 2020).
Resistance to data-driven methods is another challenge, especially in traditional educational environments (Ben Soussia et al., 2021; Dass et al., 2021). Many stakeholders question the validity, interpretability, or fairness of model outcomes, especially when predictions influence critical decisions related to student support or resource allocation (Miguéis et al., 2018; A. Khan & Ghosh, 2020). Without addressing these institutional barriers through both technical and organisational strategies, the potential benefits of clustering-based predictive models remain limited in educational practice (Jovanović et al., 2021; Matz et al., 2023).

5. Conclusions

Clustering techniques and predictive models have significant potential in EDM, particularly for improving early identification and tailored support for at-risk students, a key strategy to reduce student dropout and improve retention. This review has synthesised the current literature to clarify how these methods enhance student performance prediction, highlight key challenges, and suggest future directions for research. Clustering methods support the transformation of complex educational data into meaningful groups, which enables predictive models to achieve higher accuracy and more relevant insights. However, the effectiveness of clustering approaches often depends on the structure and characteristics of the dataset, requiring further refinement and adaptation to diverse educational contexts.
One notable contribution of this study is identifying how clustering can be leveraged not only for grouping students but also for improving prediction accuracy by tailoring predictive models to each distinct student group. Hybrid models that combine multiple data sources, including academic performance, engagement metrics, behavioural indicators, and socio-economic factors, provide a comprehensive understanding of student risk profiles and demonstrate improved prediction accuracy compared to single-source approaches. However, selecting the most relevant features remains a challenge due to computational complexity and variations in feature importance across datasets. Techniques such as PCA and Random Forest feature importance ranking have been applied in this context, but their scalability and adaptability to different educational datasets require further exploration.
While clustering and predictive modelling have demonstrated substantial benefits, challenges remain in technical implementation, generalisability, ethical considerations, and institutional adoption. Computational demands and scalability issues affect the application of clustering techniques to large datasets, while imbalanced and noisy data reduce prediction reliability. The generalisability of models trained in specific educational contexts is another concern, as datasets, feature distributions, and student behaviours vary widely across regions and institutions. Ethical considerations, such as bias in predictive models and privacy concerns regarding sensitive socio-economic data, further complicate the adoption of these techniques.

5.1. Limitations

This review has several limitations that should be acknowledged. First, the search was limited to peer-reviewed publications written in English, which may have excluded relevant studies published in other languages or in grey literature. Second, the review only included publications from 2010 to 2025, which may not fully capture earlier foundational work in the field. Third, although the search strategy was carefully designed and followed PRISMA guidelines, it is possible that some relevant studies were missed due to variations in terminology or indexing across databases. In addition, this review presents a qualitative synthesis of the literature rather than a formal meta-analysis, which limits the ability to statistically compare the effectiveness of different approaches. These limitations highlight the need for ongoing reviews as the field evolves and for future studies to consider broader data sources and more diverse methodological perspectives.

5.2. Practical Recommendations

To guide future research and practice, several key considerations can support the effective design of cluster-based prediction models in educational contexts. Researchers should carefully select clustering techniques that align with the structure and scale of the dataset. It is also important to assess clustering quality using multiple validation metrics to ensure robustness. When integrating clustering with predictive models, it is recommended to evaluate whether tailoring predictive models to each cluster improves performance over general models. Feature selection should prioritise variables that are both predictive and explainable. Threshold tuning should be performed in line with institutional priorities, with attention to balancing precision and recall. Finally, model evaluation should include both technical performance metrics and practical considerations, such as interpretability and alignment with support strategies. Following these guidelines can help ensure that cluster-based prediction models provide actionable insights that enable earlier interventions and reduce dropout rates through timely and personalised student support.
Future research should prioritise addressing these identified challenges to maximise the impact of clustering and predictive models in EDM. Techniques aimed at improving computational efficiency, such as optimised clustering algorithms and automated feature selection methods, require further development and testing. Efforts should also focus on enhancing model generalisability through expanded datasets collected from diverse institutions and regional contexts. Establishing transparent reporting standards and comprehensive ethical guidelines is critical for mitigating biases and safeguarding student privacy. Additionally, aligning institutional strategies and objectives with data-driven methodologies is essential for equitable and widespread adoption across educational settings. Future reviews could benefit from including broader datasets, multilingual studies, and additional modelling approaches, providing a richer and more inclusive understanding of how clustering and predictive techniques can enhance educational practices and outcomes. Such work would ultimately strengthen institutions' ability to detect disengagement early and deliver interventions that prevent dropout, improving both academic outcomes and student wellbeing.

Author Contributions

Conceptualization, Y.L. and S.Y.; methodology, Y.L.; formal analysis, Y.L., J.M. and S.-H.K.; investigation, Y.L.; data curation, Y.L. and M.M.R.; writing—original draft preparation, Y.L.; writing—review and editing, S.Y.; supervision, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

The first author sincerely thanks Soonja Yeom for her careful guidance, theoretical support, thorough discussion, and constructive comments. The authors acknowledge the use of ChatGPT-4.1 to improve the clarity and flow of the writing in this paper. The tool was used only to support editing and to ensure the content was clear and concise; it was not involved in any part of the research process, and all findings and insights presented in the paper are the result of the authors' original work and intellectual contributions. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00219107).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Adnan, M., Habib, A., Ashraf, J., Mussadiq, S., Raza, A. A., Abid, M., Bashir, M., & Khan, S. U. (2021). Predicting at-risk students at different percentages of course length for early intervention using machine learning models. IEEE Access, 9, 7519–7539. [Google Scholar] [CrossRef]
  2. Akçapınar, G., Altun, A., & Aşkar, P. (2019). Using learning analytics to develop early-warning system for at-risk students. International Journal of Educational Technology in Higher Education, 16(1), 1–20. [Google Scholar] [CrossRef]
  3. Aldowah, H., Al-Samarraie, H., & Fauzy, W. M. (2019). Educational data mining and learning analytics for 21st century higher education: A review and synthesis. Telematics and Informatics, 37, 13–49. [Google Scholar] [CrossRef]
  4. Almasri, A., Alkhawaldeh, R. S., & Çelebi, E. (2020). Clustering-based EMT model for predicting student performance. Arabian Journal for Science and Engineering, 45(12), 10067–10078. [Google Scholar] [CrossRef]
  5. Al-Shabandar, R., Hussain, A., Laws, A., Keight, R., Lunn, J., & Radi, N. (2017, May 14–19). Machine learning approaches to predict learning outcomes in massive open online courses. 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 713–720), Anchorage, AK, USA. [Google Scholar]
  6. Anjum, N., & Badugu, S. (2020). A study of different techniques in educational data mining. In Advances in decision sciences, image processing, security and computer vision (pp. 562–571). Springer. [Google Scholar] [CrossRef]
  7. Arora, N., Chauhan, V., & Chaudhary, A. (2023). Recent approaches to high-dimension data grouping and specific hierarchical grouping techniques. Global Journal of Enterprise Information System, 15(1), 73–80. [Google Scholar]
  8. Asif, R., Merceron, A., Ali, S. A., & Haider, N. G. (2017). Analyzing undergraduate students’ performance using educational data mining. Computers & Education, 113, 177–194. [Google Scholar] [CrossRef]
  9. Australian Government Department of Education, Skills and Employment. (2019). Factors affecting higher education completions. Australian Government Department of Education, Skills and Employment.
  10. Australian Institute of Health and Welfare. (2023). Risk assessment approaches in child protection. Available online: https://aifs.gov.au/resources/resource-sheets/risk-assessment-approaches-child-protection (accessed on 26 October 2024).
  11. Bahel, V., Malewar, S., & Thomas, A. (2021, March 17–18). Student interest group prediction using clustering analysis: An EDM approach. 2021 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE), Dubai, United Arab Emirates. [Google Scholar]
  12. Balovsyak, S., Derevyanchuk, O., Kravchenko, H., Ushenko, Y., & Hu, Z. (2023). Clustering students according to their academic achievement using fuzzy logic. arXiv, arXiv:2312.10047. [Google Scholar] [CrossRef]
  13. Ben Soussia, A., Roussanaly, A., & Boyer, A. (2021). An in-depth methodology to predict at-risk learners. In Technology-enhanced learning for a free, safe, and sustainable world (pp. 193–206). Springer. [Google Scholar] [CrossRef]
  14. Bholowalia, P., & Kumar, A. (2014). EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications, 105(9), 17–24. [Google Scholar]
  15. Bhuyan, R., & Borah, S. (2023). A survey of some density based clustering techniques. arXiv, arXiv:2306.09256. [Google Scholar] [CrossRef]
  16. Binns, R. (2018, February 23–24). Fairness in machine learning: Lessons from political philosophy. 2018 Conference on Fairness, Accountability, and Transparency (pp. 149–159), New York, NY, USA. [Google Scholar]
  17. Bishop, C. M., & Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol. 4). Springer. [Google Scholar]
  18. Chamorro-Atalaya, O., Arévalo-Tuesta, J., Balarezo-Mares, D., Gonzáles-Pacheco, A., Mendoza-León, O., Quipuscoa-Silvestre, M., Tomás-Quispe, G., & Suarez-Bazalar, R. (2023). K-fold cross-validation through identification of the opinion classification algorithm for the satisfaction of university students. International Journal of Online & Biomedical Engineering, 19(11), 140–158. [Google Scholar]
  19. Chaudhry, M., Shafi, I., Mahnoor, M., Vargas, D. L. R., Thompson, E. B., & Ashraf, I. (2023). A systematic literature review on identifying patterns using unsupervised clustering algorithms: A data mining perspective. Symmetry, 15(9), 1679. [Google Scholar] [CrossRef]
  20. Chen, Y., Zhang, X., Li, H., & Xiong, Z. (2023). Educational data mining for early warning systems in higher education: A systematic review. Journal of Educational Technology Development and Exchange, 16(3), 15–32. [Google Scholar]
  21. Cole, A. W., Lennon, L., & Weber, N. L. (2021). Student perceptions of online active learning practices and online learning climate predict online course engagement. Interactive Learning Environments, 29(5), 866–880. [Google Scholar] [CrossRef]
  22. Dass, S., Gary, K., & Cunningham, J. (2021). Predicting student dropout in self-paced MOOC course using random forest model. Information, 12(11), 476. [Google Scholar] [CrossRef]
  23. Francis, B. K., & Babu, S. S. (2019). Predicting academic performance of students using a hybrid data mining approach. Journal of Medical Systems, 43(6), 162. [Google Scholar] [CrossRef]
  24. Fryer, P. (2024, October 30). Aussie uni dropouts hit record highs. UOWTV. Available online: https://www.uowtv.com/aussie-uni-dropouts-hit-record-highs/ (accessed on 11 January 2025).
  25. Goren, O., Cohen, L., & Rubinstein, A. (2024, July 14–17). Early prediction of student dropout in higher education using machine learning models. 17th International Conference on Educational Data Mining (pp. 349–359), Atlanta, GA, USA. [Google Scholar]
  26. Han, H. (2023). Fuzzy clustering algorithm for university students’ psychological fitness and performance detection. Heliyon, 9(8), e18550. [Google Scholar] [CrossRef]
  27. Hassan, H., Anuar, S., & Ahmad, N. B. (2019). Students’ performance prediction model using meta-classifier approach. In Engineering applications of neural networks (pp. 221–231). Springer. [Google Scholar] [CrossRef]
  28. Heissrer, D., & Parette, P. (2002). Advising at risk students in college and university settings. College Student Journal, 36(1), 69–83. [Google Scholar]
  29. Holstein, K., Wortman Vaughan, J., Daumé, H., III, Dudik, M., & Wallach, H. (2019, May 4–9). Improving fairness in machine learning systems: What do industry practitioners need? 2019 CHI Conference on Human Factors in Computing Systems (pp. 1–16), Glasgow, UK. [Google Scholar]
  30. Huang, S., Fang, N., & Huang, Y. (2019). Early identification of at-risk students using learning analytics data and performance indicators. International Journal of Educational Technology in Higher Education, 16(1), 1–20. [Google Scholar]
  31. Hung, J., Hsu, Y., & Rice, K. (2015). Integrating data mining in program evaluation of K-12 online education. Educational Technology & Society, 15(3), 27–41. [Google Scholar]
  32. Iatrellis, O., Savvas, I. K., Fitsilis, P., & Gerogiannis, V. C. (2020). A two-phase machine learning approach for predicting student outcomes. Education and Information Technologies, 26(1), 69–88. [Google Scholar] [CrossRef]
  33. Ifenthaler, D., & Yau, J. Y.-K. (2020). Utilising learning analytics for study success: Reflections on current empirical findings. International Journal of Learning Analytics and Artificial Intelligence for Education, 2(1), 1–17. [Google Scholar]
  34. Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., & Heming, J. (2023). K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences, 622, 178–210. [Google Scholar] [CrossRef]
  35. Injadat, M., Moubayed, A., Nassif, A. B., & Shami, A. (2020). Systematic ensemble model selection approach for educational data mining. Knowledge-Based Systems, 200, 105992. [Google Scholar] [CrossRef]
  36. Jawthari, M., & Stoffova, V. (2022). Weekly prediction of at-risk students using data mining. In EDULEARN22 Proceedings (pp. 9230–9236). IATED. [Google Scholar]
  37. Jayaprakash, S., Krishnan, S., & Jaiganesh, V. (2020, March 12–14). Predicting students academic performance using an improved random forest classifier. 2020 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India. [Google Scholar]
  38. Jevons, C., & Lindsay, S. (2018). An innovative multidisciplinary approach to identifying at-risk students in primary schools. Australian Journal of Guidance and Counselling, 13(2), 159–166. [Google Scholar] [CrossRef]
  39. Jovanović, J., Saqr, M., Joksimović, S., & Gašević, D. (2021). Students matter the most in learning analytics: The effects of internal and instructional conditions in predicting academic success. Computers & Education, 172, 104251. [Google Scholar] [CrossRef]
  40. Khan, A., & Ghosh, S. K. (2020). Student performance analysis and prediction in classroom learning: A review of educational data mining studies. Education and Information Technologies, 26(1), 205–240. [Google Scholar] [CrossRef]
  41. Khan, I. K., Daud, H. B., Zainuddin, N. B., Sokkalingam, R., Museeb, A., & Inayat, A. (2024). Addressing limitations of the K-means clustering algorithm: Outliers, non-spherical data, and optimal cluster selection. AIMS Mathematics, 9(9), 25070–25097. [Google Scholar] [CrossRef]
  42. Kim, A., Nikseresht, F., Dutcher, J. M., Tumminia, M., Villalba, D., Cohen, S., Creswell, K., Creswell, D., Dey, A. K., Mankoff, J., & Doryab, A. (2021). Understanding health and behavioural trends of successful students through machine learning models. arXiv, arXiv:2102.04212. [Google Scholar]
  43. Kleter, S. J. (2022). Investigating the generalizability of learning analytic models for predicting academic performance [Master’s thesis, Eindhoven University of Technology]. [Google Scholar]
  44. Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open university learning analytics dataset. Scientific Data, 4(1), 1–8. [Google Scholar] [CrossRef]
  45. Le Quy, T., Friege, G., & Ntoutsi, E. (2023). A review of clustering models in educational data science towards fairness-aware learning. arXiv, arXiv:2301.03421. [Google Scholar] [CrossRef]
  46. Lima, M., Soares, W. L., Silva, I., & Fagundes, R. (2020). A combined model based on clustering and regression to predicting school dropout in higher education institution. International Journal of Computer Applications, 176, 1–8. [Google Scholar] [CrossRef]
  47. Linden, K., van der Ploeg, N., & Roman, N. (2023). Explainable learning analytics to identify disengaged students early in semester: An intervention supporting widening participation. Journal of Higher Education Policy and Management, 45(6), 626–640. [Google Scholar] [CrossRef]
  48. Liu, R. (2022). Data analysis of educational evaluation using K-means clustering method. Computational Intelligence and Neuroscience, 2022(1), 3762431. [Google Scholar] [CrossRef]
  49. Liu, R., Ali, S., Bilal, S. F., Sakhawat, Z., Imran, A., Almuhaimeed, A., Alzahrani, A., & Sun, G. (2022). An intelligent hybrid scheme for customer churn prediction integrating clustering and classification algorithms. Applied Sciences, 12(18), 9355. [Google Scholar] [CrossRef]
  50. Liu, T., Wang, C., Chang, L., & Gu, T. (2022). Predicting high-risk students using learning behavior. Mathematics, 10(14), 2483. [Google Scholar] [CrossRef]
  51. Liu, Y., Fan, S., Xu, S., Sajjanhar, A., Yeom, S., & Wei, Y. (2022). Predicting student performance using clickstream data and machine learning. Education Sciences, 13(1), 17. [Google Scholar] [CrossRef]
  52. López, M. I., Luna, J. M., Romero, C., & Ventura, S. (2012, June 19–21). Classification via clustering for predicting final marks based on student participation in forums. 5th International Conference on Educational Data Mining (pp. 148–151), Chania, Greece. [Google Scholar]
  53. López-Meneses, E., Mellado-Moreno, P. C., Gallardo Herrerías, C., & Pelícano-Piris, N. (2024). Educational data mining and predictive modeling in the age of artificial intelligence: An in-depth analysis of research dynamics. Computers, 14(2), 68. [Google Scholar] [CrossRef]
  54. Marbouti, F., Diefes-Dux, H. A., & Madhavan, K. (2016). Models for early prediction of at-risk students in a course using standards-based grading. Computers & Education, 103, 1–15. [Google Scholar] [CrossRef]
  55. Mathrani, A., Susnjak, T., Ramaswami, G., & Barczak, A. (2021). Perspectives on the challenges of generalizability, transparency and ethics in predictive learning analytics. Computers and Education Open, 2, 100060. [Google Scholar] [CrossRef]
  56. Matz, S. C., Bukow, C. S., Peters, H., Deacons, C., Dinu, A., & Stachl, C. (2023). Using machine learning to predict student retention from socio-demographic characteristics and app-based engagement metrics. Scientific Reports, 13(1), 5705. [Google Scholar] [CrossRef] [PubMed]
  57. McKee, B. (2024). Refocus priorities as domestic university drop outs reach record high. Institute of Public Affairs. Available online: https://ipa.org.au/publications-ipa/media-releases/refocus-priorities-as-domestic-university-drop-outs-reach-record-high (accessed on 6 January 2025).
  58. McMillan, J. H., & Reed, D. F. (2010). At-risk students and resiliency: Factors contributing to academic success. The Clearing House: A Journal of Educational Strategies, Issues and Ideas, 67(3), 137–140. [Google Scholar] [CrossRef]
  59. Miguéis, V. L., Freitas, A., Garcia, P. J. V., & Silva, A. (2018). Early segmentation of students according to their academic performance: A predictive modelling approach. Decision Support Systems, 115, 36–51. [Google Scholar] [CrossRef]
  60. Mitchell Institute. (2023). Counting the costs of lost opportunity in Australian education. Victoria University. Available online: https://www.vu.edu.au/sites/default/files/counting-the-costs-of-lost-opportunity-in-Aus-education-mitchell-institute.pdf (accessed on 26 October 2024).
  61. Mohamed Nafuri, A. F., Sani, N. S., Zainudin, N. F. A., Rahman, A. H. A., & Aliff, M. (2022). Clustering analysis for classifying student academic performance in higher education. Applied Sciences, 12(19), 9467. [Google Scholar] [CrossRef]
  62. Moubayed, A., Injadat, M., Shami, A., & Lutfiyya, H. (2020). Student engagement level in an e-learning environment: Clustering using k-means. American Journal of Distance Education, 34(2), 137–156. [Google Scholar] [CrossRef]
  63. Murphy, K., López-Pernas, S., & Saqr, M. (2024). Dissimilarity-based cluster analysis of educational data: A comparative tutorial using R. In Learning Analytics Methods and Tutorials (pp. 231–283). Springer. [Google Scholar] [CrossRef]
  64. Na, K., & Tasir, Z. (2017, April 20–23). A systematic review of learning analytics intervention contributing to student success in online learning. 2017 International Conference on Learning and Teaching in Computing and Engineering (LaTICE) (pp. 62–68), Hong Kong, China. [Google Scholar]
  65. Na, K. S., & Tasir, Z. (2017, November 16–17). Identifying at-risk students in online learning by analysing learning behaviour: A systematic review. 2017 IEEE Conference on Big Data and Analytics (ICBDA), Kuching, Malaysia. [Google Scholar]
  66. Namoun, A., & Alshanqiti, A. (2020). Predicting student performance using data mining and learning analytics techniques: A systematic literature review. Applied Sciences, 11(1), 237. [Google Scholar] [CrossRef]
  67. National Centre for Vocational Education Research. (2023). Impact of financial stress on student outcomes. Available online: https://www.ncver.edu.au/__data/assets/file/0031/16789/impact-of-financial-stress-2732.pdf (accessed on 6 January 2025).
  68. Nayak, P., Vaheed, S., Gupta, S., & Mohan, N. (2023). Predicting students’ academic performance by mining the educational data through machine learning-based classification model. Education and Information Technologies, 28(11), 14611–14637. [Google Scholar] [CrossRef]
  69. Ncube, L., & Ngulube, P. (2024, February 14–18). Methodological considerations for predicting at-risk students. 24th Australasian Computing Education Conference, Virtual Event. [Google Scholar] [CrossRef]
  70. Norton, A., & Cherastidtham, I. (2018). Dropping out: The benefits and costs of trying university. Grattan Institute. Available online: https://grattan.edu.au/wp-content/uploads/2018/04/904-dropping-out-the-benefits-and-costs-of-trying-university.pdf (accessed on 6 January 2025).
  71. Oreopoulos, P., Patterson, R., Petronijevic, U., & Pope, N. G. (2017). Traditional approaches for identifying at-risk students. Available online: https://www.nber.org/papers/w22314 (accessed on 6 January 2025).
  72. Osborne, J. B., & Lang, A. S. (2023). Predictive identification of at-risk students: Using learning management system data. Journal of Postsecondary Student Success, 2(4), 108–126. [Google Scholar] [CrossRef]
  73. Oyelade, O. J., Oladipupo, O. O., & Obagbuwa, I. C. (2010). Application of k means clustering algorithm for prediction of students academic performance. arXiv, arXiv:1002.2425. [Google Scholar] [CrossRef]
  74. Parack, S., Zahid, Z., & Merchant, F. (2012, January 3–5). Application of data mining in educational databases for predicting academic trends and patterns. 2012 IEEE International Conference on Technology Enhanced Education (ICTEE), Amritapuri, India. [Google Scholar]
  75. Park, Y., Yu, J. H., & Jo, I.-H. (2016). Clustering blended learning courses by online behavior data: A case study in a Korean higher education institute. The Internet and Higher Education, 29, 1–11. [Google Scholar] [CrossRef]
  76. Pascarella, E. T., & Terenzini, P. T. (2005). How college affects students: A third decade of research (Vol. 2). ERIC. [Google Scholar]
  77. Pradeep, A., Das, S., & Kizhekkethottam, J. J. (2015, February 25–27). Students dropout factor prediction using EDM techniques. 2015 International Conference on Soft-Computing and Network Security (ICSNS-2015), Coimbatore, India. [Google Scholar]
  78. Prakash, B. R., Hanumanthappa, D. M., & Kavitha, V. (2014). Big data in educational data mining and learning analytics. International Journal of Innovative Research in Computer and Communication Engineering, 2(12), 7515–7520. [Google Scholar] [CrossRef]
  79. Prinsloo, P., & Slade, S. (2014). Educational triage in open distance learning: Walking a moral tightrope. International Review of Research in Open and Distributed Learning, 15(4), 306–331. [Google Scholar] [CrossRef]
  80. Ramanathan, L., Parthasarathy, G., Vijayakumar, K., Lakshmanan, L., & Ramani, S. (2018). Cluster-based distributed architecture for prediction of student’s performance in higher education. Cluster Computing, 22(S1), 1329–1344. [Google Scholar] [CrossRef]
  81. Romero, C., & Ventura, S. (2013). Data mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(1), 12–27. [Google Scholar]
  82. Romero, C., & Ventura, S. (2020). Educational data mining and learning analytics: An updated survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3), e1355. [Google Scholar] [CrossRef]
  83. Ross, J. (2023, September 1). Australia’s rich universities grow as others falter. Inside Higher Ed. Available online: https://www.insidehighered.com/news/global/2023/09/01/australias-rich-universities-grow-others-falter (accessed on 11 January 2025).
  84. Sahni, S. K. (2023). Re-envision of learning by integrating technology in higher education. In Innovation, leadership and governance in higher education (pp. 139–157). Springer. [Google Scholar] [CrossRef]
  85. Santoso, L. W., & Yulia. (2019). The analysis of student performance using data mining. In Advances in computer communication and computational sciences (pp. 559–573). Springer. [Google Scholar] [CrossRef]
  86. Sarker, S., Paul, M. K., Thasin, S. T. H., & Hasan, M. A. M. (2024). Analyzing students’ academic performance using educational data mining. Computers and Education: Artificial Intelligence, 7, 100263. [Google Scholar] [CrossRef]
  87. Schubert, U. (2023). En route from metal alkoxides to metal oxides: Metal oxo/alkoxo clusters. Journal of Sol-Gel Science and Technology, 105(2), 587–595. [Google Scholar] [CrossRef]
  88. Seidman, A. (2016). College student retention: A primer. A presentation. Available online: https://www.cscsr.org/docs/College_Student_Retention_APrimer_2016.pdf (accessed on 12 March 2025).
  89. Seidman, A. (2019). Minority student retention: The best of the “Journal of college student retention: Research, theory & practice”. Routledge. [Google Scholar]
  90. Severson, H. H., Walker, H. M., Hope-Doolittle, J., Kratochwill, T. R., & Gresham, F. M. (2007). Proactive, early screening to detect behaviorally at-risk students: Issues, approaches, emerging innovations, and professional practices. Journal of School Psychology, 45(2), 193–223. [Google Scholar] [CrossRef]
  91. Shaleena, K. P., & Paul, S. (2015, March 20). Data mining techniques for predicting student performance. 2015 IEEE International Conference on Engineering and Technology (ICETECH), Coimbatore, TN, India. [Google Scholar]
  92. Sharma, J., Shivani, Chatterjee, S., & Kumar, M. (2024, March). Enhancing IoT anomaly detection with DBSCAN—A data-driven approach. In International conference on computing and machine learning (pp. 107–118). Springer Nature. [Google Scholar]
  93. Shoaib, M., Sayed, N., Amara, N., Latif, A., Azam, S., & Muhammad, S. (2022). Prediction of an educational institute learning environment using machine learning and data mining. Education and Information Technologies, 27(7), 9099–9123. [Google Scholar] [CrossRef]
  94. Shoaib, M., Wang, M., & Benedict, N. (2024). Course success prediction and early identification of at-risk students using explainable artificial intelligence techniques. Electronics, 13(21), 4157. [Google Scholar] [CrossRef]
  95. Shovon, M. H. I., & Haque, M. (2012). An approach of improving students academic performance by using k means clustering algorithm and decision tree. arXiv, arXiv:1211.6340. [Google Scholar] [CrossRef]
  96. Sisovic, S., Matetic, M., & Bakaric, M. B. (2016, January 21–23). Clustering of imbalanced moodle data for early alert of student failure. IEEE 14th International Symposium on Applied Machine Intelligence and Informatics (pp. 21–23), Herlany, Slovakia. [Google Scholar]
  97. Tan, C. J., Lim, T. Y., Liew, T. K., & Lim, C. P. (2022). An intelligent tool for early drop-out prediction of distance learning students. Soft Computing, 26(12), 5901–5917. [Google Scholar] [CrossRef]
  98. Tempelaar, D., Rienties, B., & Nguyen, Q. (2020). Subjective data, objective data and the role of bias in predictive modelling: Lessons from a dispositional learning analytics application. PLoS ONE, 15(6), e0233977. [Google Scholar] [CrossRef] [PubMed]
  99. Tosun, S., & Kalaycıoğlu, D. B. (2024). Data mining approach for prediction of academic success in open and distance education. Journal of Educational Technology and Online Learning, 7(2), 168–176. [Google Scholar] [CrossRef]
  100. Trakunphutthirak, R., & Lee, V. C. S. (2021). Application of educational data mining approach for student academic performance prediction using progressive temporal data. Journal of Educational Computing Research, 39(3), 547–575. [Google Scholar] [CrossRef]
  101. Universities Australia. (2022). 2022 higher education facts and figures. Available online: https://universitiesaustralia.edu.au/wp-content/uploads/2022/09/220207-HE-Facts-and-Figures-2022_2.0.pdf (accessed on 11 January 2025).
  102. Viswanathan, S., & Kumar, S. V. (2021). Study of students’ performance prediction models using machine learning. Turkish Journal of Computer and Mathematics Education, 12(2), 3085–3091. [Google Scholar]
  103. Vora, D. R., & Rajamani, K. (2019). A hybrid classification model for prediction of academic performance of students: A big data application. Evolutionary Intelligence, 15(2), 1083–1096. [Google Scholar] [CrossRef]
  104. Wang, C., Chang, L., & Liu, T. (2022, April). Predicting student performance in online learning using a highly efficient gradient boosting decision tree. In International conference on intelligent information processing (pp. 508–521). Springer International Publishing. [Google Scholar]
  105. Wolff, C. E., Jarodzka, H., van den Bogert, N., & Boshuizen, H. P. (2016). Teacher vision: Expert and novice teachers’ perception of problematic classroom management scenes. Instructional Science, 44(3), 243–265. [Google Scholar] [CrossRef]
  106. Xing, W., & Du, D. (2019). Dropout prediction in MOOCs: Using deep learning for personalized intervention. Journal of Educational Computing Research, 57(3), 547–570. [Google Scholar] [CrossRef]
  107. Xiong, Z., Li, H., Liu, Z., & Chen, Z. (2024). A review of data mining in personalised education: Current trends and future prospects. arXiv, arXiv:2402.17236. Available online: https://arxiv.org/abs/2402.17236.
  108. Xu, F., Li, Z., Yue, J., & Qu, S. (2021). A systematic review of educational data mining. In Intelligent computing (pp. 764–780). Springer. [Google Scholar] [CrossRef]
  109. Yağcı, M. (2022). Educational data mining: Prediction of students’ academic performance using machine learning algorithms. Smart Learning Environments, 9(1), 11. [Google Scholar] [CrossRef]
  110. Yao, H., Lian, D., Cao, Y., Wu, Y., & Zhou, T. (2019). Predicting academic performance for college students: A campus behavior perspective. ACM Transactions on Intelligent Systems and Technology (TIST), 10(3), 1–21. [Google Scholar] [CrossRef]
  111. Zhang, Y., Li, M., & Wang, H. (2023). The impact of educational data mining on student performance and engagement: A meta-analysis. Educational Technology Research and Development, 71(4), 1187–1205. [Google Scholar]
Figure 1. PRISMA flow diagram showing the study selection process. * Databases included IEEE Xplore, Scopus, ScienceDirect, and SpringerLink. ** Records excluded after screening as not relevant.
Figure 2. Number of publications identified per year in the initial database search.
Figure 3. Conceptual workflow of cluster-based predictive modelling in educational data mining (adapted from R. Liu et al., 2022).
Figure 4. Approximate frequency of evaluation metrics used in reviewed studies.
Table 1. Frequency of Clustering Algorithms in Reviewed Studies.

| Clustering Method | Frequency | Percentage (%) |
|---|---|---|
| K-means | 34 | 59.6 |
| Hierarchical | 18 | 31.6 |
| DBSCAN | 14 | 24.6 |
| Hybrid or Custom Combinations | 9 | 15.8 |
| Others | 6 | 10.5 |

Note: Some studies used multiple algorithms, so the percentages exceed 100%.
Table 2. Summary of Clustering Methods.

| Clustering Method | Strengths | Limitations | Contexts Applied | Key References |
|---|---|---|---|---|
| K-means | Simple, fast, scalable | Assumes spherical clusters; sensitive to outliers | Imbalanced datasets, student performance grouping, engagement analysis | Mohamed Nafuri et al. (2022); López et al. (2012); Ikotun et al. (2023); Le Quy et al. (2023); Zhang et al. (2023); T. Liu et al. (2022) |
| Hierarchical | Flexible, no predefined cluster requirement; provides detailed insights into nested patterns | Computationally expensive; less scalable for large datasets | Multi-level engagement, learning behaviour segmentation, performance categorisation | Severson et al. (2007); Park et al. (2016); Arora et al. (2023); Balovsyak et al. (2023); Ikotun et al. (2023); Schubert (2023) |
| DBSCAN | Handles noise and outliers well; identifies clusters of varying shapes without assuming spherical structures | Parameter tuning is complex and impacts clustering quality; sensitive to epsilon and minimum-point settings | Noisy or irregular datasets, complex behaviour patterns, student interaction analysis | Sahni (2023); Nayak et al. (2023); Jayaprakash et al. (2020); Schubert (2023); T. Liu et al. (2022); Sharma et al. (2024) |
Table 3. Summary of Integration Techniques.

| Integration Approach | Advantages | Limitations | Key References |
|---|---|---|---|
| Clustering as Preprocessing Step | Improves model interpretability; helps simplify data structure before prediction | May miss nonlinear relationships; performance depends on initial clustering quality | Parack et al. (2012); Romero and Ventura (2020); Shoaib et al. (2022); R. Liu (2022) |
| Clustering for Feature Engineering | Captures multidimensional behavioural data; robust to noise | Computationally intensive; requires careful selection of features | Sahni (2023); Dass et al. (2021); Nayak et al. (2023); Schubert (2023) |
| Hybrid Clustering-Prediction | Enables use of multiple data sources; captures complex patterns across groups | Higher computational cost; reduced interpretability compared with simpler models | T. Liu et al. (2022); Nayak et al. (2023); Murphy et al. (2024); Namoun and Alshanqiti (2020) |
Table 4. Frequency of Features Used in EDM.

| Feature Category | Common Features | Frequency of Use | Key References |
|---|---|---|---|
| Academic Performance Metrics | Grades, test scores, attendance | ~88%. Academic metrics are widely used due to their direct correlation with student success and availability across diverse educational contexts. | Marbouti et al. (2016); Mohamed Nafuri et al. (2022); Romero and Ventura (2020); Linden et al. (2023); Shoaib et al. (2022); Nayak et al. (2023) |
| Engagement Metrics | LMS login data, discussion forum activity, time spent on learning platforms | ~63%. Engagement data provides real-time insights into student behaviour and complements academic performance indicators. | K. Na and Tasir (2017); Ben Soussia et al. (2021); Aldowah et al. (2019); Y. Liu et al. (2022); Moubayed et al. (2020) |
| Socio-Economic and Demographic | Family income, parental education, geographic location | ~35%. These features add valuable context but are used less frequently due to concerns over privacy and potential biases. | Jovanović et al. (2021); A. Khan and Ghosh (2020); Miguéis et al. (2018) |
| Behavioural Indicators | Procrastination, irregular study habits, inconsistent engagement | ~42%. Procrastination and erratic learning routines are increasingly recognised as key predictors of student risk. | Han (2023); Ben Soussia et al. (2021); Yao et al. (2019); Akçapınar et al. (2019) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
