Learning Analytics with Scalable Bloom’s Taxonomy Labeling of Socratic Chatbot Dialogues
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In the manuscript “Learning Analytics with Scalable High-Confidence Bloom’s Taxonomy Labeling of Socratic Chatbot Dialogues”, the authors propose a semi-supervised machine learning pipeline to automate the classification of student-chatbot dialogues according to Bloom’s Taxonomy. Below are my comments to improve the manuscript.
- The models were trained and validated on examples from only three students (Students 1, 2, and 3) and then applied to 31 other students. Patterns of discourse, word choice, and sentence structure differ greatly from person to person, and there is no empirical evidence of the model's ability to generalize to the 31 unseen students. The "high confidence" in the title refers to mathematical probability estimated on the source domain; it is not verified on the target domain.
- The text states that the data were annotated by "a domain expert", in the singular. This is a concern because, in subjective tasks such as categorization according to Bloom's taxonomy, inter-rater reliability (IRR), typically measured with Cohen's kappa, is normally reported. A single annotator is prone to bias, which the machine learning algorithm will then reproduce (a brief illustration follows this list).
- In Section 4, the authors run chi-square tests on the pseudo-labelled (originally unlabelled) data to draw pedagogical conclusions. This is methodologically problematic: statistical significance is tested on predictions that have not been shown to be valid. Since the authors concede that the SVM is biased, these tests verify the model's bias rather than the students' learning behaviours.
- The fact that a lightweight Linear SVM achieves much better performance on minority classes (Macro-F1 0.630) than a fine-tuned GPT-4o-mini (Macro-F1 0.473) is an important contribution. It critically challenges the recent trend of assuming LLMs are the default solution for all NLP tasks, especially on imbalanced datasets. Add a table comparing specific text examples of classification errors/successes between the SVM and the fine-tuned LLM to explain the Macro-F1 divergence.
- It is useful to extend a literature that has so far focused mainly on static texts, such as exam questions, by applying semi-supervised learning to Socratic educational dialogues.
- The comparison of Micro-F1 and Macro-F1 is well chosen and particularly important for a problem with extreme class imbalance.
- The abstract introduces a "zero-shot" GPT-4o-mini baseline, yet the Results section and the discussion of Table 1 focus entirely on the fine-tuned variant. The performance metrics of the zero-shot LLM are not discussed, even though zero-shot LLMs often generalize better than models fine-tuned on small, biased datasets.
- Although these numbers are described well in the text, the log-scale inset in Figure 2 further emphasizes the severity of the class imbalance, casting doubt on the validity of the statistical tests being conducted.
- Good use of references.
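On the inter-rater reliability point above, agreement between two independent annotators is commonly quantified with Cohen's kappa. The following is a minimal sketch using scikit-learn; the Bloom labels and the two annotation sequences are purely illustrative and are not taken from the manuscript.

    from sklearn.metrics import cohen_kappa_score

    # Labels assigned independently by two annotators to the same six utterances
    # (illustrative values only, not data from the study).
    annotator_a = ["Understand", "Apply", "Understand", "Analyze", "Remember", "Understand"]
    annotator_b = ["Understand", "Apply", "Remember", "Analyze", "Remember", "Apply"]

    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # values above ~0.6 are often read as substantial agreement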
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Indeed, the method for automatically labeling student-chatbot dialogue exchanges with Bloom's taxonomy categories using a pseudo-labeling pipeline is relevant and needed, and the scale of the dataset (6,716 utterances) is promising. However, the manuscript in its current form suffers from fundamental methodological and conceptual flaws that undermine its validity and contribution. For example, the entire methodological pipeline (model training, threshold tuning, and evaluation) is built upon a "gold-labeled subset" consisting of only three student workbooks (8.8% of the cohort), selected based on "chronological availability" (i.e., the first three submitted). The authors acknowledge this as a limitation but profoundly underestimate its consequences. The models are trained and tuned on the linguistic patterns, cognitive styles, and potential idiosyncrasies of just three students. There is no evidence that these students are representative of the remaining 31. For instance, if these three students were highly engaged or particularly verbose, the model learns to associate their specific dialogue patterns with Bloom levels, which may not apply to others.
One major concern: the authors state, "The following are available for the three annotated workbooks... The remaining workbooks lack annotations and are labeled via pseudo-labeling. This setup reflects practical educational contexts..." (Page 5). While the setup may be "practical," it is scientifically unsound for making the broad claims that follow. The subsequent distribution analysis in Figure 2 and the claims about the "dominance of Understanding" are not analyses of the true dataset but analyses of labels generated by a model trained under a severe, non-random bias.
Moreover, the pipeline is designed to assign labels where the model is confident, based on its training on three students. Any distributional findings (e.g., "the dominance of Understanding (82–92% of labeled student utterance)" (Page 8)) are artifacts of the model's biases and the initial data skew, not verified properties of the dialogue. The authors perform statistical tests (χ², KL divergence) on these synthetic distributions, which test whether two models label differently, not whether the labels reflect reality. The conclusions about chatbot pedagogical limitations are built on this house of cards.
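To make concrete what such a test compares, the sketch below computes the KL divergence between Bloom-label distributions predicted by two models. The numbers are illustrative only; the point is that both inputs are model outputs, so the statistic measures how differently the two models label, not how accurate either labeling is.

    import numpy as np
    from scipy.stats import entropy

    bloom_levels = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]
    # Predicted label distributions from two different models (illustrative values).
    model_a_dist = np.array([0.05, 0.85, 0.05, 0.03, 0.01, 0.01])
    model_b_dist = np.array([0.04, 0.88, 0.04, 0.02, 0.01, 0.01])

    kl = entropy(model_a_dist, model_b_dist)  # KL(A || B), in nats
    print(f"KL divergence between the two predicted distributions: {kl:.3f}")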
Finally, the discussion states, "the dominance of Understanding... suggests that, in the case of this Socratic chatbot, the chatbot-mediated interactions remain concentrated at lower-order cognitive levels" (Pages 8-9). This is a profound claim about educational effectiveness, but it is based on unverified automated labels. An equally plausible explanation is that the model, trained on limited data, is simply better at recognizing surface patterns of "Understanding" and fails to identify nuanced higher-order thinking in other students' dialogues.
Comments on the Quality of English Language
Replace passive and hedging phrases with active, declarative statements. Use precise verbs and specific nouns.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript is generally very well done. In the introduction, the main contribution of this research and of the work itself should be stated in the last paragraph.
It would also be good to include more recent references. Reference 3 is incomplete.
Comments on the Quality of English Language
The English language is acceptable, but I am not competent to give the final word on that.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
This paper addresses a critical pain point in educational chatbot research—the lack of scalable tools for analyzing the cognitive depth of dialogues—by proposing a practical and well-designed solution. The methodology comprehensively compares two technical approaches: calibrated classical machine learning models and fine-tuned lightweight LLMs. The use of a high-confidence pseudo-labeling strategy to expand a small expert-annotated dataset, while explicitly avoiding iterative self-training to prevent confirmation bias, is methodologically sound. The evaluation using both Micro-F1 and Macro-F1 metrics clearly reveals the trade-offs between overall performance and class-wise balance. The discussion thoughtfully explores the distinct value of different models (LLMs for efficiency vs. classical models for balance) in educational applications and connects these findings to practical teaching needs, such as detecting rare higher-order thinking skills, thereby enhancing the study’s practical relevance. The paper provides a reproducible pipeline for scaling Bloom’s taxonomy-based analysis of educational dialogues, offering direct reference value for learning analytics and adaptive teaching systems.
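For orientation, a single-pass, high-confidence pseudo-labeling step of the kind described above might look roughly like the sketch below. All names and the 0.90 threshold are assumptions for illustration, not the authors' implementation; utterances whose top predicted probability falls below the threshold are left unlabeled rather than fed back into training.

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def pseudo_label(gold_texts, gold_labels, unlabeled_texts, threshold=0.90):
        # Train a calibrated linear SVM on the small expert-annotated set.
        model = make_pipeline(
            TfidfVectorizer(),
            CalibratedClassifierCV(LinearSVC()),  # calibration adds predict_proba
        )
        model.fit(gold_texts, gold_labels)

        # Assign a Bloom label only where the model is confident; otherwise keep
        # the utterance in the technical "Unlabeled" category.
        probabilities = model.predict_proba(unlabeled_texts)
        classes = model.classes_
        pseudo_labels = []
        for p in probabilities:
            best = int(np.argmax(p))
            pseudo_labels.append(classes[best] if p[best] >= threshold else "Unlabeled")
        return pseudo_labels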
Major Issues and Suggestions for Improvement
1. The gold-standard annotation set is derived from only the first three students who submitted their work (8.8% of the cohort), selected chronologically. This introduces a significant risk of sampling bias, as students who complete assignments early may systematically differ from others in terms of learning motivation, cognitive style, or dialogue patterns. The models may unintentionally overfit the dialogue characteristics of these specific individuals.
2. While the paper evaluates model performance on the labeled subset, it does not assess the quality of the pseudo-labels generated for the unlabeled data. Thus, the reliability of these pseudo-labels remains unclear.
3. Although chi-square tests and KL divergence calculations are performed, the reporting of KL divergence values (0.064, 0.031) is limited to numerical results. It would be beneficial to provide a more concrete interpretation of these values in the context of educational dialogue analysis, explaining what “small but systematic differences” imply in practice.
4. In Figure 2, the “Unlabeled” category constitutes a large proportion of the data, yet the analysis of cognitive distribution focuses primarily on the labeled portion. This may cause confusion for readers.
5. Table 1 reports only F1 scores. It is recommended to include a complete table with precision and recall for each label category, which is essential for understanding model performance on specific labels—especially rare higher-order categories (see the illustrative sketch after this list).
6. The literature review could be more tightly focused on Bloom’s taxonomy studies in “dialogue” contexts and draw a clearer contrast with studies on “static texts” (e.g., exam questions, learning objectives) to better highlight the innovation of working with dynamic, interactive data.
7. The term “prevalence-weighted performance” used in the abstract is referred to as “micro-F1” in the main text. It is advisable to consistently use “micro-F1,” which is a more standard term.
8. The phrase “Replacement paragraph for Section 3 (Methodology):” appears to be a drafting artifact and should be removed.
9. In Figure 1, “Mt Model Training” is likely a typo and should be corrected to “ML Model Training.”
10. Some long sentences, particularly in the abstract and introduction, could be broken down to improve readability and flow.
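As an illustration of the per-class reporting requested in point 5, scikit-learn's classification_report gives precision, recall, and F1 for each Bloom category alongside the micro and macro averages. The toy labels below are assumptions chosen only to show why the two averages diverge under heavy class imbalance.

    from sklearn.metrics import classification_report, f1_score

    # Nine majority-class utterances and one rare higher-order utterance (illustrative).
    y_true = ["Understand"] * 9 + ["Analyze"]
    y_pred = ["Understand"] * 10  # a classifier that always predicts the majority class

    print(classification_report(y_true, y_pred, zero_division=0))
    print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
    print("macro-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))

In this toy case micro-F1 is 0.90 while macro-F1 drops to roughly 0.47, because the completely missed rare class pulls the macro average down while barely affecting the micro average.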
Comments on the Quality of English Language
It is recommended to engage a native English-speaking expert for thorough language polishing of the entire manuscript.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I thank the authors for their diligent and comprehensive revision of the manuscript. The responses to the previous round of feedback are thoughtful, and the corresponding changes in the text have significantly strengthened the validity and clarity of the work. The remaining points below are minor and should be addressed for clarity.
- Line 217: The quotation marks are inconsistent (smart quotes vs straight quotes): “who,” “what,” vs "why" or "how".
- Line 471: "Moreover, the three labeled set were chosen..." --> Should be "the three labeled workbooks were chosen" or "the labeled set was chosen".
The paper is ready for publication.
Author Response
Line 217: The quotation marks are inconsistent (smart quotes vs straight quotes): “who,” “what,” vs "why" or "how".
Line 471: "Moreover, the three labeled set were chosen..." --> Should be "the three labeled workbooks were chosen" or "the labeled set was chosen".
We thank the reviewer for their careful reading of the manuscript and for the thoughtful, constructive comments. We have corrected the typographical errors raised and re-read the manuscript to remove other typographical errors.
Reviewer 2 Report
Comments and Suggestions for Authors
I appreciate the thoughtful and detailed responses to my comments, and the clarifications made throughout the manuscript significantly strengthen its methodological transparency. The revisions appropriately reframe the study as a proof-of-concept and explicitly acknowledge the limitations regarding pseudo-labels and generalizability. One minor suggestion: in the updated limitations section, consider briefly restating, in a single sentence, that the labeled subset of three students was selected based on anonymized ID order rather than submission time, to preempt any lingering reader assumptions about “early submitter” bias. This small clarification would reinforce the point made in your response and ensure consistency.
Author Response
In the updated limitations section, consider briefly restating, in a single sentence, that the labeled subset of three students was selected based on anonymized ID order rather than submission time, to preempt any lingering reader assumptions about “early submitter” bias. This small clarification would reinforce the point made in your response and ensure consistency.
Thank you very much for the positive feedback and for recognizing the improvements in methodological transparency. We appreciate your additional suggestion regarding clarifying how the labeled subset of three students was selected. (See lines 475-479)
Reviewer 4 Report
Comments and Suggestions for Authors
Thank you for your thorough revision and detailed point-by-point response. The manuscript has been substantially improved in clarity, methodological transparency, and scholarly framing. The following suggestions are offered to further strengthen the paper before final publication.
- Provide a clearer educational interpretation of the “Unlabeled” category. While the revised text clarifies that “Unlabeled” is a technical category (not a Bloom level), it would be valuable to briefly discuss what such utterances may represent in a learning context.
- Emphasize the exploratory nature of distributional findings more prominently in the Results/Discussion. Although the Limitations section acknowledges that pseudo-labels are not validated, the Results and Discussion sections could more explicitly remind readers that the reported label distributions (e.g., dominance of Understanding) are model-predicted estimates, not validated ground truth. Adding a short cautionary note when first presenting Figure 2 would prevent over-interpretation.
- Enhance interpretation of KL divergence values. The added explanation is helpful, but a more concrete educational interpretation would strengthen the discussion. For example: “A KL divergence of 0.064 suggests that the two models produce highly similar overall cognitive profiles, with differences equivalent to a slight redistribution of probability mass across Bloom categories—insufficient to alter the main finding of lower-order dominance, but indicative of model-specific biases in classifying certain higher-order levels.”
- Strengthen the Limitations section regarding sample representativeness. While the manuscript now clarifies that SIDs are not tied to submission order, the limitation of using only three students (8.8% of the cohort) as the gold set remains significant.
- Note the absence of per-class precision/recall metrics as a study limitation. Since per-class precision and recall could not be recomputed, the manuscript should briefly acknowledge this as a constraint on fine-grained performance analysis, especially for rare higher-order categories.
It is recommended to engage a native English-speaking expert for thorough language polishing of the entire manuscript.
Author Response
1. Provide a clearer educational interpretation of the “Unlabeled” category. While the revised text clarifies that “Unlabeled” is a technical category (not a Bloom level), it would be valuable to briefly discuss what such utterances may represent in a learning context.
2. Emphasize the exploratory nature of distributional findings more prominently in the Results/Discussion. Although the Limitations section acknowledges that pseudo-labels are not validated, the Results and Discussion sections could more explicitly remind readers that the reported label distributions (e.g., dominance of Understanding) are model-predicted estimates, not validated ground truth. Adding a short cautionary note when first presenting Figure 2 would prevent over-interpretation.
To address points 1 and 2, we have now made this clarification explicit in the caption of Figure 2, above and beyond what is stated in the main text.
3. Enhance interpretation of KL divergence values. The added explanation is helpful, but a more concrete educational interpretation would strengthen the discussion. For example: “A KL divergence of 0.064 suggests that the two models produce highly similar overall cognitive profiles, with differences equivalent to a slight redistribution of probability mass across Bloom categories—insufficient to alter the main finding of lower-order dominance, but indicative of model-specific biases in classifying certain higher-order levels.”
4. Strengthen the Limitations section regarding sample representativeness. While the manuscript now clarifies that SIDs are not tied to submission order, the limitation of using only three students (8.8% of the cohort) as the gold set remains significant.
5. Note the absence of per-class precision/recall metrics as a study limitation. Since per-class precision and recall could not be recomputed, the manuscript should briefly acknowledge this as a constraint on fine-grained performance analysis, especially for rare higher-order categories.
To address points 3-5 together, the suggested clarification and emphasis have now been added to the manuscript. (See Lines 388-392, 479-481, 487-488)
