1. Introduction
In the new era of high demand for talent, traditional curriculum assessments for environmental students, hindered by subjectivity and simplistic frameworks, fail to foster holistic growth. Environmental engineers now need practical skills, innovation, problem-solving ability, agility, and critical thinking for emerging industries [1,2]. Universities are advancing teaching and research reforms to meet these demands [3]. In 2018, China’s Ministry of Education launched “Emerging Engineering Education,” urging upgrades in environmental engineering to cultivate technological talent [4]. The course performance evaluation system, as the “baton” of reform, should be the priority. The purpose, design, implementation, and feedback of course assessment are key elements of university evaluation, promoting student autonomy and standardizing teacher evaluation [5]. However, traditional exam-focused assessments fail to fully reflect student learning: unclear criteria increase uncertainty and subjectivity, while their narrow scope stresses rote learning over innovation and teamwork. The system therefore urgently needs scientific refinement to foster autonomous learning and practical innovation. In recent years, scholars have proposed and applied diverse methods to optimize student achievement evaluation, including comprehensive, qualitative, data-driven, and structured approaches. Comprehensive evaluation assesses learning abilities via multiple forms and criteria, integrating project assessments, process-oriented evaluations, and multi-rater reviews. Gratchev replaced exams with project-based tasks, revealing that students’ average project scores exceeded exam averages [6]. Qualitative evaluation assesses students’ learning processes and abilities through subjective judgment and observation, including forms such as self-evaluation [7,8] and peer evaluation [9]. Data-driven methods use data analytics and technology tools to collect and analyze student performance and to provide data-based feedback and decision making. These approaches rely on learning analytics and predictive tools to identify student needs and potential problems. Accurate predictive modeling can be achieved by techniques such as regression, classification, and clustering, including artificial neural networks, decision trees, support vector machines, K-means clustering, K-nearest neighbors, Naive Bayes, and linear regression [10,11]. For example, Kukkar et al. [12] combined RNN, LSTM, and Random Forest techniques and achieved approximately 97% accuracy. Shoaib et al. [13] developed a Convolutional Neural Network feature-learning block to extract hidden patterns in student data, achieving a commendable 93% accuracy for student grade prediction and student risk prediction. Shi et al. [14] conducted a cluster analysis of students based on three behavioral attributes (effort regulation, self-assessment, and learner participation), identifying distinct learning strategy patterns that offer new insights for teaching.
Among the various methods, the Analytic Hierarchy Process (AHP) converts the decision-making process into a hierarchical structure of related criteria and is widely used to address complex decision-making problems across educational fields [15,16,17,18]. AHP is a structured decision-making approach that organizes complex problems into a hierarchical framework. The method begins by decomposing a decision problem into levels comprising the overall goal, subordinate objectives, evaluation criteria, and specific alternatives. Once the hierarchy is established, a judgment matrix is constructed, allowing decision-makers to compare the elements at each level according to their relative importance [19]. However, decision-making problems often involve many uncertainties, such as the subjective preferences of experts, incomplete information, and challenges of quantification. AHP is therefore often combined with classical uncertainty methods such as fuzzy sets [20], the Delphi method [21], interval numbers [19], and grey system theory [22]. Interval AHP (I-AHP) enhances traditional AHP by using interval numbers instead of single-point values when constructing pairwise judgment matrices. This approach mitigates the impact of subjective bias and more accurately captures the inherent uncertainty in the decision-making process. Qin et al. [19] used I-AHP to conduct a comprehensive evaluation of teaching quality in the mathematics classroom, and the results showed that the method handles uncertainty well. A more scientific and rational student course performance evaluation system can therefore be established with the help of I-AHP. However, I-AHP cannot easily mine latent information from the final evaluation data, and the current evaluation system lacks scientific methods to exploit such data. These data are of great significance in guiding teachers and students in their future learning and work and in improving the quality of education.
Cluster analysis is an important and active research field in data mining [23]. Clustering is an unsupervised learning method that groups similar objects in a dataset, aiming to make objects within the same group similar to each other while being distinctly different from objects in other groups. Commonly used clustering methods include partition-based, hierarchy-based, model-based, and density-based approaches [24]. The K-means clustering algorithm is simple, easy to understand, and computationally efficient, making it suitable for processing large-scale student performance data. For example, Kim et al. [25] conducted a statistical assessment of student engagement in online learning using K-means clustering, examining differences in attendance, assignment completion, discussion participation, and perceived learning outcomes. Their results divided students into two groups with significant differences in instructor–student interaction, student–student interaction, and perceived learning outcomes. Tuyishimire et al. [26] categorized 703 students into three clusters (more encouraged, encouraged, and less encouraged students) and analyzed them in conjunction with their proportional distribution. Owing to its simple operation, timely feedback, and enhanced specificity, K-means clustering effectively compensates for the shortcomings of I-AHP. However, these works cannot explain how much each feature contributes to the clustering result. To address this interpretability gap, Shapley Additive Explanations (SHAP) values derived from cooperative game theory are integrated into the clustering framework [27]. SHAP provides mathematically rigorous attribution values that quantify each feature’s marginal contribution to cluster assignments while maintaining consistency with the original model’s outputs.
Table 1 synthesizes the strengths and limitations of existing student performance evaluation methods. While these methods demonstrate promising results, they struggle to address the specific challenges of course evaluation, particularly in simultaneously handling multiple evaluation components, managing uncertainty, and mitigating subjective grading. On one hand, comprehensive evaluation tends to be time-consuming and complex, requiring multiple raters and criteria, a significant drawback for large-scale implementation. Qualitative evaluation, despite its effectiveness in capturing behavioral nuances through self- and peer-assessments, suffers from high subjectivity and potential bias. In contrast, data-driven methods such as regression, classification, and neural networks excel in predictive accuracy through objective data analysis [28]. However, their overreliance on quantifiable metrics means they not only overlook non-quantifiable aspects but also fail to account for the relative importance of different evaluation dimensions, ultimately compromising assessment precision. Therefore, this study adopts I-AHP to evaluate student course performance by deriving criterion weights for pedagogical sub-criteria from expert judgment, and uses K-means clustering to stratify students into performance cohorts based on multidimensional metrics. Random Forest classification with SHAP value analysis is employed to identify key discriminators of cluster membership and interpret decision boundaries, enabling attribution-guided interventions that address cohort-specific deficiencies. Implemented within a dual-channel ecosystem spanning pre-class, in-class, and post-class phases, this integrated framework aims to provide a comprehensive and accurate understanding of students’ performance patterns and learning differences. The approach gives teachers a tool for analyzing student differences: teachers can systematically assess data on different student groups, including knowledge mastery in final exams, participation in classroom activities, accuracy and timeliness of assignment completion, learning outcomes in chapter tests, self-directed learning during chapter self-study, and technical application and collaboration in online assignments, and then develop personalized learning support plans based on students’ specific needs and characteristics. The framework’s process is illustrated in Figure 1.
3. Case Study
This chapter presents a comprehensive case study to validate the proposed integrated assessment framework, combining I-AHP, K-means clustering, Random Forest classification, and SHAP value analysis, within the context of an undergraduate Environmental Monitoring course. Section 3.1 begins by outlining the course structure, pedagogical objectives, and existing challenges in student performance evaluation. Building on this foundation, Section 3.2 details the formulation of the I-AHP model, including the construction of the hierarchical indicator system, the expert-driven interval judgment matrices, and the derivation of interval weights. Section 3.3 then operationalizes these indicators through multi-source data collection, alongside rigorous preprocessing steps to ensure data integrity. Section 3.4 applies K-means clustering to categorize students based on normalized performance scores, followed by Random Forest classification and SHAP analysis to identify key discriminators of cluster membership and inform attribution-guided interventions.
3.1. Overview of Environmental Monitoring Course
The data presented in this paper are sourced from the Environmental Monitoring course at the School of Future Technology, Xi’an University of Architecture and Technology. Environmental Monitoring is a compulsory course for environmental majors, and its content framework is shown in Figure 3. The course is divided into five major chapters: Introduction, Monitoring of Water and Wastewater, Monitoring of Air and Exhaust Gas, Automatic Monitoring of Environmental Pollution, and Quality Assurance. This chapter structure is intended to help students systematically grasp both theoretical and practical skills in environmental monitoring. In addition, the course format is diverse, incorporating ideological and political education, flipped classrooms, and blended online and offline teaching to enhance student engagement and learning outcomes. Traditionally, student course performance in Environmental Monitoring has been evaluated through a final exam combined with online and offline score components.
However, as the student-centered teaching philosophy gradually took root, the traditional evaluation system increasingly proved unfavorable for students’ knowledge accumulation and skill development, making it difficult to meet the diverse talent cultivation needs of universities in the new era. First, scores carry high uncertainty. The content of the Environmental Monitoring course is extensive and highly integrative, so examinations contain few questions with a single definitive answer and a higher proportion of open-ended analytical questions. Grading therefore relies heavily on instructors’ subjective judgment, which further increases uncertainty. Second, scores are predominantly based on summative examinations, which fail to comprehensively reflect students’ learning attitudes, thought processes, and developmental capacities. This focus may lead students to prioritize exam results over daily learning, decrease their initiative, and ultimately leave knowledge insufficiently mastered. Third, it is difficult for teachers to adjust or develop scientific and effective learning strategies for students in a timely manner based on assessment results. Since each student’s learning situation differs, especially in large classes, teachers struggle to conduct an in-depth analysis and diagnosis of each student’s learning process and thus cannot accurately identify individual problems and needs; they often resort to a one-size-fits-all approach rather than targeted guidance and support. In addition, courses such as Environmental Monitoring not only teach professional knowledge but also cultivate industry ethics and a strong sense of social responsibility, which traditional assessment methods struggle to reflect. Therefore, this study optimizes course assessment through I-AHP and proposes an assessment model that spans “online and offline” dual channels and “pre-class, in-class, and post-class” stages. Finally, final grades were analyzed in depth using clustering analysis to facilitate personalized and precise instruction.
3.2. Formulation and Solution of I-AHP Model
In order to evaluate students’ learning performance comprehensively and accurately, the evaluation system should be designed from diverse angles: learning attitude, learning preparation, learning process, exam results, etc. The comprehensive evaluation indicator system for Environmental Monitoring course performance is divided by AHP into the goal level, the criteria level, and the indicator level. The criteria level consists of two factors: process assessment (B1) and final exam (B2). The indicator level is further subdivided into classroom performance (C1), assignment completion (C2), online study (C3), chapter self-study (C4), etc. The resulting four-level hierarchical structure is shown in Table 4, as well as in Figure 4 and Figure 5. Process assessment (B1) reflects students’ performance throughout the learning process, going beyond mere reliance on examination results; it effectively showcases students’ learning attitudes, participation, and effort, and allows educators to keep abreast of students’ learning conditions and provide guidance and assistance. Classroom performance (C1) includes signing in, discussion interactions, and thoughts and insights; this indicator reflects students’ engagement in the learning process and showcases their professional cognition. Assignment completion (C2) represents students’ self-discipline and consistency in learning, aiding the absorption and consolidation of newly acquired knowledge. Online study (C3), such as MOOCs, is not constrained by time and space, allowing students to organically integrate their interests and fragmented time; through online learning, students can enhance their self-directed learning abilities and improve in weaker areas. The assessment criteria for online learning are divided into chapter tests (D1) and online assignments (D2). The Environmental Monitoring course encompasses numerous knowledge points, and the teaching progress is relatively rapid; through chapter self-study (C4), students can promptly preview essential concepts, facilitating a rapid transition into an active learning state during class. Finally, the final exam (B2) is a comprehensive assessment at the end of the semester, reflecting students’ mastery of the knowledge points and their ability to apply them. The original expert judgment data, which comprise the judgment results of two specified experts, are presented in Table A1 of Appendix A.
After the hierarchy is established, the factors at each level must be compared pairwise to construct the judgment matrix, determine relative importance, and quantitatively solve for the weight of each index using an appropriate scale. Solving for the level-factor weights objectively reflects the importance of the different factors in the overall performance evaluation. The pairwise comparisons between factors were obtained by issuing a questionnaire to an expert committee. This study invited nine experts in the field of environmental engineering to participate in the questionnaire survey, three of whom had previously taught environmental monitoring courses. Each expert was required to meet at least one of three criteria: having served as chief editor of a textbook in the field of environmental engineering; having published papers on educational themes in core journals; or having presided over research projects related to curriculum reform at the provincial level or above. The expert group has rich experience in vocational education and significant research output on vocational education teaching and textbooks, which effectively ensures the authenticity and rationality of the evaluation index system. Finally, a questionnaire validity rate of 100% indicated a high level of engagement from the participating experts.
Saaty’s 1–9 importance scale was adopted by the experts to assess the significance of factors across levels B to D. The evaluation results are presented in Table 5. The rating of B1 relative to B2 is [1, 2], indicating that the importance of B1 compared to B2 lies between 1 and 2, i.e., B1 is slightly more important than B2. A detailed analysis of the weights for the other hierarchical levels is given in Section 4.1. Before this, it is essential to validate the weights through a consistency test to ensure the coherence of the judgment matrix. The interval reciprocal judgment matrices $\tilde{A}^{(x)}$ (x = B, C, D) for levels B to D provided by the experts are listed as follows:
Taking the interval judgment matrix $\tilde{A}^{(B)}$ as an example, $\tilde{A}^{(B)}$ can be divided into the lower bound matrix $A^{(B)-}$ and the upper bound matrix $A^{(B)+}$:
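As an illustration of this decomposition (a sketch built only from the B-level rating [1, 2] in Table 5, not the experts’ full C- or D-level matrices):

$$ \tilde{A}^{(B)} = \begin{pmatrix} [1,1] & [1,2] \\ [1/2,1] & [1,1] \end{pmatrix} \;\Rightarrow\; A^{(B)-} = \begin{pmatrix} 1 & 1 \\ 1/2 & 1 \end{pmatrix}, \quad A^{(B)+} = \begin{pmatrix} 1 & 2 \\ 1 & 1 \end{pmatrix}, $$

where reciprocity links the bounds via $a_{ij}^{-} = 1/a_{ji}^{+}$.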
Since different evaluation criteria often have different dimensions, which can distort comparisons in the data analysis, weight normalization is required to eliminate the dimensional influence between criteria. The specific formula is given in Equation (5).
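Equation (5) is not reproduced here; a standard sum normalization of the following form is assumed:

$$ w_i = \frac{w_i'}{\sum_{j=1}^{n} w_j'}, \qquad i = 1, \ldots, n, $$

where $w_i'$ denotes the unnormalized weight of criterion $i$.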
Next, the basic weights of the matrices $A^{(B)-}$ and $A^{(B)+}$ are calculated: the feature vectors of $A^{(B)-}$ and $A^{(B)+}$ can be computed with the arithmetic mean method shown in Equation (7). The resulting feature vectors $w^{(B)-}$ and $w^{(B)+}$ of $\tilde{A}^{(B)}$ are listed as follows:
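Equation (7) is referenced but not shown; the classical arithmetic (row-)mean approximation of the principal eigenvector is assumed:

$$ w_i = \frac{1}{n} \sum_{j=1}^{n} \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}, \qquad i = 1, \ldots, n, $$

applied separately to $A^{(B)-}$ and $A^{(B)+}$.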
As stated in Section 2.2, it is necessary to calculate the maximum eigenvalue and the consistency index. The calculation of the maximum eigenvalue $\lambda_{\max}$ is a crucial step in AHP; its primary purpose is to assess the consistency of the pairwise comparison matrix. Taking the upper bound matrix $A^{(B)+}$ as an example, $\lambda_{\max}$ can be calculated with Equation (9). Note that if the matrix is of order n ≤ 2, no consistency check is required, because such small matrices are inherently consistent. The consistency index (CI) is determined with Equation (10), and the consistency ratio (CR) of the comparison matrix, which measures how far it departs from perfect consistency, is calculated with Equation (11).
Since the computed value was below the limit (CR < 0.1), the experts’ comparisons were judged consistent and dependable. After confirming the reliability of the weights, students’ performance on the various indicators can be analyzed from their actual results, and their comprehensive scores can be calculated using the weight system.
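As a minimal sketch of this consistency check (assuming the standard Saaty random index table and the arithmetic-mean weight estimate described above; the example matrix is hypothetical, not the experts’ actual data):

```python
import numpy as np

# Saaty's random consistency index (RI) for matrix orders 1-9
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
      6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def consistency_ratio(A):
    """Consistency ratio CR = CI / RI of a pairwise comparison matrix."""
    n = A.shape[0]
    if n <= 2:                                  # order <= 2 is always consistent
        return 0.0
    w = (A / A.sum(axis=0)).mean(axis=1)        # arithmetic-mean weight vector, cf. Eq. (7)
    lam_max = ((A @ w) / w).mean()              # maximum eigenvalue estimate, cf. Eq. (9)
    CI = (lam_max - n) / (n - 1)                # consistency index, cf. Eq. (10)
    return CI / RI[n]                           # consistency ratio, cf. Eq. (11)

# Hypothetical upper-bound matrix for the four C-level indicators
A_plus = np.array([[1.0, 1/4, 1/4, 1/2],
                   [4.0, 1.0, 1.0, 3.0],
                   [4.0, 1.0, 1.0, 3.0],
                   [2.0, 1/3, 1/3, 1.0]])
print(f"CR = {consistency_ratio(A_plus):.4f}")  # accept the matrix if CR < 0.1
```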
Figure 5 illustrates the weight changes obtained after removing the evaluation of individual experts within the I-AHP. Data analysis shows that the magnitude of the weight changes in the graph does not exceed 0.02, indicating that the derived weights are insensitive to any single expert’s judgment.
3.3. Case Study Data Collection and Preprocessing
The evaluation of student performance under the proposed I-AHP framework required multi-source data aligned with the hierarchical indicators defined in Section 3.2 (Table 3). The performance evaluation of Environmental Monitoring for 98 undergraduates majoring in environmental science across the 2020–2025 academic years was taken as an example for comparative analysis. B1 was operationalized through four hierarchical indicators: C1, C2, C3, and C4. Data for C1 were derived from in-class activities, including verbal participation records, interactive Q&A sessions, and mini-quizzes of multiple-choice or true/false questions administered during lectures. C2 integrated objective scores from six routine assignments (e.g., problem-solving exercises) and instructor evaluations of four project reports, such as the formulation of a campus ambient air quality monitoring plan, a water quality monitoring scheme for rivers, and an environmental impact analysis of urban noise. C3 was assessed through two components: D1, administered via the course’s dedicated online learning platform, and D2, calculated from platform-generated logs documenting students’ interactions with instructional audio-visual materials, including quantitative measures of viewing frequency and duration. C4 relied on self-assessment surveys evaluating students’ autonomous study habits and conceptual understanding. B2, serving as a summative assessment, encompassed multiple-choice questions, conceptual analyses, computational problems, and open-ended essays, all rigorously graded by the course instructor. Following data collection, the raw dataset underwent a series of preprocessing steps to address quality issues and ensure analytical robustness. Platform-generated behavioral logs underwent sanity checks to remove implausible entries, such as anomalously prolonged activity durations exceeding 24 h.
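A minimal sketch of this sanity check and the D2 aggregation, assuming hypothetical column names (`student_id`, `resource_id`, `duration_min`) rather than the platform’s actual export schema:

```python
import pandas as pd

# Load platform-generated behavioral logs (source data for D2)
logs = pd.read_csv("platform_logs.csv")

# Sanity check: drop implausible sessions longer than 24 h
logs = logs[logs["duration_min"] <= 24 * 60]

# Aggregate viewing frequency and total viewing duration per student
d2_metrics = (logs.groupby("student_id")
                  .agg(view_count=("resource_id", "count"),
                       total_viewing_min=("duration_min", "sum"))
                  .reset_index())
```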
3.4. Educational Data Analysis
3.4.1. K-Means Clustering Analysis
Considering the practical needs of teachers and aiming to gain a deeper understanding of students’ performance patterns in the course, K-means clustering analysis was further employed. By clustering students’ performance across the factors in the I-AHP model, teachers can more effectively identify the needs of different student groups and implement personalized teaching interventions. Initial feature standardization employed Z-score normalization on six educational assessment dimensions (Classroom performance, Assignment completion, Online assignments, Chapter tests, Chapter self-study, and Final exam) to eliminate scale variance. Cluster cardinality was optimized through multi-metric validation across K = 2–5, systematically evaluating the silhouette coefficient, inertia elbow detection, and the Calinski-Harabasz index. The multi-metric evaluation results are shown in Figure 6 and Table 6: tripartite partitioning (K = 3) emerges as the optimal configuration by consensus of all metrics. This process groups students into cohorts with similar performance characteristics, enabling educators to identify distinct learning patterns and tailor instructional strategies. Figure 7 demonstrates the stability of this result across 100 restarts, in which tripartite partitioning (K = 3) again emerges as the optimal solution.
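A minimal sketch of this selection procedure, assuming a `scores` array of shape (98, 6) holding the six assessment dimensions (the variable name is illustrative):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = StandardScaler().fit_transform(scores)    # Z-score normalization

for k in range(2, 6):                         # candidate K = 2..5
    # n_init=100 mirrors the 100-restart stability check
    km = KMeans(n_clusters=k, n_init=100, random_state=42).fit(X)
    print(f"K={k}  silhouette={silhouette_score(X, km.labels_):.3f}  "
          f"inertia={km.inertia_:.1f}  "      # inspect across K for the elbow
          f"CH={calinski_harabasz_score(X, km.labels_):.1f}")
```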
3.4.2. Random Forest Classification and SHAP Analysis
To enhance model interpretability and support the precise implementation of personalized interventions, this study employs the Random Forest classification algorithm to model cluster membership, with careful hyperparameter tuning: the number of decision trees (n_estimators) is set to 100, the maximum tree depth (max_depth) is unlimited (i.e., None), the minimum number of samples required to split a node (min_samples_split) is 2, the minimum number of samples required at a leaf node (min_samples_leaf) is 1, and the square-root criterion (max_features = ‘sqrt’) is adopted for feature selection during node splitting. Additionally, to ensure reproducibility, a fixed random seed is set (random_state = 42). During model construction, SHAP (Shapley Additive Explanations) values are used to quantify the contribution of each feature. Regarding the stability concern that could arise from differences in the number of SHAP samples, the shap.TreeExplainer algorithm provides exact SHAP value calculations for tree-based models without relying on random sampling for approximate estimation; all calculated SHAP values are therefore deterministic, and their stability is not influenced by the sample size parameter, guaranteeing the uniqueness and reproducibility of the attribution results. In a robustness check across 10 retraining runs with different seeds, ‘assignment completion’ was identified as the most important feature for Cluster 1 in 9 out of the 10 runs, while for Cluster 2 ‘final exam’ consistently emerged as the most significant negative driving factor. The ranking of key features by SHAP value remained highly consistent across all runs, with a standard deviation below 0.01, demonstrating the robustness of the attribution conclusions. The Random Forest classifier, selected for its robust predictive accuracy, is trained on the clustered student data to predict cohort affiliation from the six assessment dimensions. SHAP values, grounded in cooperative game theory [32], quantify each feature’s marginal contribution to the classification outcome, identifying key discriminators of cluster membership (e.g., low engagement or weak test performance) and interpreting decision boundaries. SHAP analysis uses 100 samples per explanation, with mean absolute SHAP values computed for each cluster to quantify feature importance, as described in Section 2.3.2. These SHAP insights inform attribution-guided interventions, enabling educators to design targeted teaching strategies for cohort-specific deficiencies, such as enhancing online engagement or strengthening self-study skills.
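Under these settings, the model and its attribution can be sketched as follows (assuming `X` holds the six assessment dimensions and `y` the K-means cluster labels; with older SHAP releases, `shap_values` is a list of one array per class):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters exactly as specified above
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            min_samples_split=2, min_samples_leaf=1,
                            max_features="sqrt", random_state=42)
rf.fit(X, y)

# TreeExplainer computes exact, deterministic SHAP values for tree ensembles
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)

# Mean absolute SHAP value per feature, for each cluster (class)
for c, sv in enumerate(shap_values, start=1):
    print(f"Cluster {c}:", np.abs(sv).mean(axis=0).round(3))
```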
4. Results Analysis
This study presents an analysis of the evaluation results derived from the integrated framework combining I-AHP, K-means clustering, Random Forest classification, and SHAP value analysis, applied within a dual-channel ecosystem across pre-class, in-class, and post-class phases. Section 4.1 outlines the weighting distribution of the pedagogical sub-criteria to highlight their significance in student course performance evaluation. Section 4.2 conducts a comparative analysis of student scores before and after implementing the evaluation framework to assess its impact. Section 4.3 applies K-means clustering to categorize students based on normalized performance scores, followed by Random Forest classification and SHAP analysis to identify key discriminators of performance levels and inform attribution-guided interventions.
4.1. Analysis of Interval Weights
After the calculations in Section 3.2, the weights of all level factors are presented in Table 7. Taking B1 as an example, its upper weight is 0.67, ranking first, and its lower weight is 0.50. Regarding B1 and B2, some assessors stated that “evaluation of learning outcomes should be diversified, with process assessment and final grades being equally important”. The weights of C2 and C3 are the most significant, with lower and upper limits between 0.30 and 0.35 and consistently top rankings, indicating that these two factors are deemed highly important within the C level. Meanwhile, C1 ranks fourth, with upper and lower weights of 0.10 and 0.08, lower than the other factors. Some assessors noted that “students’ performance in the classroom may be related to their personalities; C2, C3 and C4 are more reflective of students’ learning states”. Regarding C4, some assessors remarked that “the teaching process should emphasize pre-studying of chapters, which helps students develop good study habits”, while others countered that “this factor is assessed in a single way, which makes it difficult to accurately reflect students’ pre-studying, with a high degree of uncertainty”. The weight of D1 is significantly higher than that of D2: the upper and lower weights of D1 are 0.80 and 0.75, while those of D2 are 0.25 and 0.20. Nearly all assessors stated that “factor D1 is more challenging than factor D2, testing students’ understanding, application skills, and the solidity of foundational knowledge”. Based on the weight distribution obtained from the I-AHP, students’ comprehensive scores can be calculated to assess their overall course performance: incorporating each student’s performance data and applying the weighting system to each indicator yields the final comprehensive evaluation result.
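A plausible form of this aggregation, consistent with the interval scores reported in Section 4.2 (the paper’s exact formula is not reproduced here), is:

$$ [S^-, S^+] = \Big[ \sum_{i} w_i^{-} x_i ,\; \sum_{i} w_i^{+} x_i \Big], $$

where $x_i$ is a student’s normalized score on indicator $i$ and $[w_i^-, w_i^+]$ is that indicator’s global interval weight, obtained by multiplying the interval weights along its branch of the hierarchy (e.g., D1 inherits the weights of B1 and C3).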
4.2. Analysis of Scores from the I-AHP Model
The implementation of the I-AHP model has provided a more comprehensive and nuanced assessment of students’ scores, while also facilitating the diversification of assessment functions in the teaching process. The comparison of student score distributions is shown in Figure 8. All students met the passing standard (scores of 60 points or above), among whom [50, 53] students fall within the middle score range (between 70 and 79 points). Compared to the previous assessment based on a single score point, the average score assessed by the I-AHP model increased by [1.88, 5.65] points. A detailed analysis of individual student scores reveals that the majority of students achieved higher scores than under the single-score-point assessment, suggesting that the I-AHP method may offer a more comprehensive evaluation of students’ abilities. The I-AHP method refines score estimates by providing a score range instead of a single score point, enabling a more detailed assessment of student performance. In past teaching practice, the assessment function was not fully realized: single examinations often devolve into summative evaluations, which amplify the evaluative function of exams while overshadowing and neglecting other crucial assessment functions. Course assessment should play a pivotal role in evaluation, differentiation, prediction, diagnosis, teaching feedback, and motivation. During the teaching process, educators can formulate more targeted instructional strategies based on existing assessment results, thereby enhancing the enthusiasm and initiative of both teachers and students.
Although the introduction of interval numbers in I-AHP enhances decision-making flexibility, it also presents certain limitations. First, the construction of judgment matrices in I-AHP relies heavily on the subjective judgments of experts; given that experts vary in professional background and experience, assessment results are prone to bias. Second, the incorporation of interval numbers significantly increases computational complexity, and when the number of indicators is large, the efficiency of iterative solution declines markedly. Third, the determination of interval widths requires a careful balance between information retention and decision-making flexibility: setting the interval too wide may obscure key information, while setting it too narrow may restrict flexibility, ultimately undermining the effectiveness of decisions. To address these issues, this study introduces a method that integrates the Random Forest algorithm with K-means clustering analysis to further enhance the accuracy and robustness of the decision analysis.
4.3. Results and Analysis of K-Means Clustering
An evaluation was carried out to examine how retraining the Random Forest model affects the SHAP attribution results: the model was retrained with 10 distinct random seeds, and the mean absolute SHAP values for each feature were computed. The results revealed that, for students in Cluster 1, ‘assignment completion’ stood out as the most important feature in 9 of the 10 retraining runs, while for Cluster 2, ‘final exam’ consistently acted as the most significant negative driving factor across all runs. Moreover, the ranking of SHAP values for the key features remained highly consistent throughout, with a standard deviation below 0.01, providing solid statistical support for the robustness of the attribution conclusions. The integration of SHAP interpretability with K-means clustering delineates three distinct student cohorts characterized by quantifiable divergence in learning patterns, as evidenced by the cluster centroids and Shapley value decomposition. The cluster centers are presented in Table 8, while Figure 9 visually summarizes the SHAP-based feature importance across all clusters. The student distribution reveals Cluster 1 as the largest cohort with 39 students (39.58% of the total), followed by Cluster 2 with 37 students (37.50%), and Cluster 3 as the smallest cohort with 24 students (24.38%).
As shown in Table 9, Cluster 1 displays a balanced performance profile with moderate scores across most indicators, achieving a final exam score of 65.84 and a classroom performance score of 80.11. SHAP analysis reveals that assignment completion is the primary driver of membership in this cluster, overshadowing the influence of final exam scores despite their moderate level. This indicates that Cluster 1 students rely heavily on completing assignments, likely compensating for weaker autonomous learning, as reflected in their chapter self-study score of 86.86. The dominance of assignment completion in Cluster 1’s SHAP profile suggests these students disproportionately rely on task compliance over genuine comprehension, as evidenced by their exam scores relative to their moderate performance in other domains. Cluster 2, with 37 students, demonstrates the lowest final exam performance despite relatively competent scores in other domains, such as chapter tests and online assignments. SHAP attribution identifies the final exam score as the primary differentiator for this cohort, with a high SHAP magnitude, underscoring critical underperformance in knowledge integration during high-stakes assessments. Their chapter self-study score (80.84) is the lowest among the clusters, suggesting deficiencies in independent learning that may contribute to their exam struggles. This cohort’s SHAP profile suggests a reliance on incremental task completion that does not translate effectively into exam outcomes, exposing a gap in knowledge retention. Cluster 3 performed well on all indicators, especially chapter self-study, showing that these students are very good at independent learning; they effectively mastered the course content and achieved better grades in the final exam (73.50). This high level of self-study likely gave them an advantage in preparing for the exam and, therefore, better overall learning outcomes. Notably, while self-study scores are numerically dominant, SHAP attribution ranks the final exam as the second-most influential feature, with exam scores exhibiting stronger differentiation power than chapter tests. This prioritization implies that exam performance better captures this cohort’s knowledge integration than incremental self-study gains, highlighting the model’s capacity to surface non-intuitive feature relationships.
In summary, Cluster 1 students, with average grades and engagement levels, could adopt a simplified flipped classroom model: pre-class micro-lectures (≤10 min) paired with reflection worksheets, in-class structured peer debates using scenario-based prompts, and post-class journals consolidating insights from both phases. Students in Cluster 2 exhibit weaker self-learning abilities and poorer final grades. To address this, instructors may implement a phased approach beginning with structured review tasks (e.g., annotated concept maps or summary tables) to scaffold independent learning, followed by biweekly peer-led study groups guided by instructor-provided discussion templates, with progress reinforced through periodic self-assessment checklists and personalized feedback. For Cluster 3 students, who demonstrate strong academic performance with high assignment completion and online assignment scores but a relatively lower final exam score, brief synthesis-oriented questions can be embedded directly into post-lecture assignments, with subsequent in-class discussions analyzing these questions to emphasize connections between assignment content and broader course concepts.
Table 10 presents the Kruskal–Wallis test results for six academic performance features across the three K-means clusters (k = 3), revealing statistically significant differences (p < 0.05) for all features. The H-statistic values, ranging from 82.65 (online assignments) to 95.25 (classroom performance), indicate varying degrees of discriminatory power among the features, with classroom performance demonstrating the strongest capacity to differentiate student groups, while online assignments exhibit relatively weaker but still significant clustering effects. These findings validate the effectiveness of the K-means clustering approach in identifying distinct student subgroups based on academic behaviors, suggesting that classroom performance serves as a critical indicator for educational differentiation, whereas online assignments may require additional contextual analysis to fully interpret their role in student stratification. The results collectively support the use of these features for targeted teaching interventions, with high-H-statistic features (e.g., classroom performance, assignment completion) warranting priority in designing personalized learning strategies.
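A minimal sketch of this test, assuming a `features` DataFrame with the six dimensions and a `labels` array of cluster assignments (names illustrative):

```python
from scipy.stats import kruskal

for col in features.columns:
    # Split the feature's values by cluster, then run the Kruskal-Wallis H-test
    groups = [features.loc[labels == c, col] for c in sorted(set(labels))]
    H, p = kruskal(*groups)
    print(f"{col}: H = {H:.2f}, p = {p:.4g}")
```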
4.4. Research Limitations
A significant research limitation of this study lies in the inadequate sample size. Specifically, the total number of samples included in this study is 98. A relatively small sample size may undermine the generalizability and robustness of the research findings. It could reduce the reliability of statistical results and limit the ability to accurately represent a broader population and draw far-reaching conclusions. When interpreting the research findings and considering their applicability, this limiting factor of sample size should be taken into account.
5. Conclusions
In this paper, I-AHP was used to evaluate student course performance, and differences in student performance factors were analyzed with K-means clustering and Random Forest classification with SHAP value analysis. On the one hand, I-AHP, based on pedagogical sub-criteria, reduced the influence of subjective factors in the evaluation process and clarified the weight of each evaluation criterion. On the other hand, K-means clustering, combined with Random Forest and SHAP analysis, revealed the internal structure of student performance and helped teachers identify groups of students with similar learning characteristics and achievement levels. Based on the analysis results, teachers can take appropriate measures to enhance students’ learning outcomes.
Based on the analysis, the following conclusions can be drawn: (1) Utilizing K-means clustering to categorize students into three cohorts, combined with Random Forest classification and SHAP analysis to explain cluster membership, helps educators develop targeted teaching strategies. The clustering results allow for the identification of differences in students’ learning abilities, enabling the design of tailored instructional activities that cater to varying ability levels. (2) The clustering results indicate that students in Cluster 1 demonstrate average final grades and classroom performance; Cluster 2 exhibits weaker self-learning abilities and lower final grades; and students in Cluster 3 achieve strong final grades and excel in chapter tests and online assignments, but may benefit from improved synthesis of knowledge for comprehensive assessments.
Currently, there is still a lack of empirical research on the application of this framework in disciplines other than the Environmental Monitoring course, but its potential for cross-disciplinary application is considerable. The interval analytic hierarchy process (I-AHP), with its mechanism of quantifying teaching criteria to reduce subjective evaluation bias, can flexibly cater to the teaching needs of courses in other disciplines, such as medicine and engineering. Through clustering and classification analysis, it can precisely identify differences in ability characteristics among students from different disciplines, in areas such as programming and algorithm skills or humanistic critical thinking. The study found that three types of student groups commonly occur: the balanced-development type, the type with weak autonomous learning abilities, and the type with high grades but insufficient knowledge integration. These findings provide a solid basis for implementing stratified teaching strategies.
In practical applications, the data collection methods must be adjusted to the specific characteristics of each course. For language courses, for example, text analysis can be added and the corresponding evaluation dimensions optimized; for engineering courses, the assessment of drawing and drafting abilities should be strengthened, while ensuring a sufficient amount of data to support in-depth analysis. In the future, multi-disciplinary comparative experiments will be conducted to further verify the framework’s adaptability across disciplines, with the aim of developing a widely applicable teaching optimization plan.