Peer-Review Record

A Dual-Model Framework for Writing Assessment: A Cross-Sectional Interpretive Machine Learning Analysis of Linguistic Features

Data 2026, 11(1), 2; https://doi.org/10.3390/data11010002
by Cheng Tang 1, George Engelhard 1,*, Yinying Liu 2 and Jiawei Xiong 1
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 31 October 2025 / Revised: 13 December 2025 / Accepted: 19 December 2025 / Published: 21 December 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study investigated feature importance in automatic constructed-response item scoring, using lasso regression, XGBoost, SHAP, and Gain. Overall, the paper is well written and offers important insights into feature importance for automatic essay scoring. I have some suggestions and comments about the presentation of the study:

1) My biggest concern is the missing model performance for the models trained with the extracted linguistic features. Please consider adding a section explaining how the trained models performed in predicting students' essay scores (see the sketch after this list). This would allow readers to understand the utility of the features for predicting student performance, and low model performance may imply different conclusions about the features (e.g., student performance may not map well onto the extracted features).

2) Please clarify whether you trained the models separately for each grade level. The authors mention that they trained XGBoost separately for each grade level, but this was not explicitly stated for the lasso regression.

3) Please consider providing information about how essays were scored (e.g., the use of a rubric, total score available, the number of raters, etc.). 

4) If available, providing some demographic information about the students would be important, as models could be biased against certain subgroups. For example, the literature indicates that trained automatic essay scoring models can be biased against second-language learners.

5) I think providing a list of features in a table is great, but the tables become somewhat overwhelming for summarizing feature importance. Perhaps the authors could report feature importance using a horizontal bar plot, with each bar color-coded by linguistic category and the panels arranged in a grid, one per grade level (a plotting sketch follows this list).

6) Please clarify what is being predicted. Based on the IRT section, the authors seem to use learners' ability level as the outcome variable. I find this somewhat problematic: learners' ability levels across all items might be a good proxy for writing skill, but I do not think they can fully capture it. A stronger justification for using ability estimates instead of essay scores should be provided.

7) The authors discuss the multicollinearity issue but have not reported the correlations among the features (a sketch for flagging highly correlated pairs follows this list).

8) I think the paper would benefit from approaching the results from a theoretical/conceptual point of view in terms of how students develop their writing skills (some literature suggestions to check: https://doi.org/10.1016/j.acorp.2022.100026, https://doi.org/10.1017/S0047404510000254, https://doi.org/10.58680/ccc198115885). This might also enrich the discussion section by explaining why the important features differed across grade levels.
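
To illustrate the performance reporting requested in point 1, the sketch below shows one way it could be done per grade level. This is not the authors' code; the DataFrame `df`, the feature list `feature_cols`, and the columns `theta` and `grade` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def per_grade_performance(df, feature_cols, outcome="theta", seed=42):
    """Held-out R^2 and RMSE for LASSO and XGBoost, fitted separately per grade."""
    rows = []
    for grade, g in df.groupby("grade"):
        X_tr, X_te, y_tr, y_te = train_test_split(
            g[feature_cols], g[outcome], test_size=0.2, random_state=seed
        )
        for name, model in [
            ("lasso", LassoCV(cv=5)),
            ("xgboost", XGBRegressor(n_estimators=300, max_depth=4,
                                     learning_rate=0.05, random_state=seed)),
        ]:
            model.fit(X_tr, y_tr)
            pred = model.predict(X_te)
            rows.append({"grade": grade, "model": name,
                         "R2": r2_score(y_te, pred),
                         "RMSE": np.sqrt(mean_squared_error(y_te, pred))})
    return pd.DataFrame(rows)
```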
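
For point 5, a minimal sketch of the suggested bar-plot grid, assuming a hypothetical DataFrame `imp` with columns `grade`, `feature`, `importance`, and `category`:

```python
import matplotlib.pyplot as plt

def plot_importance_grid(imp, top_k=10, ncols=4):
    """One horizontal bar panel per grade; bars colored by linguistic category."""
    grades = sorted(imp["grade"].unique())
    categories = sorted(imp["category"].unique())
    cmap = plt.get_cmap("tab10")
    colors = {c: cmap(i % 10) for i, c in enumerate(categories)}
    nrows = -(-len(grades) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows),
                             squeeze=False)
    flat = axes.ravel()
    for ax, grade in zip(flat, grades):
        top = (imp[imp["grade"] == grade]
               .nlargest(top_k, "importance")
               .sort_values("importance"))
        ax.barh(top["feature"], top["importance"],
                color=[colors[c] for c in top["category"]])
        ax.set_title(f"Grade {grade}")
    for ax in flat[len(grades):]:  # hide any unused panels
        ax.set_visible(False)
    fig.tight_layout()
    return fig
```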
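
For point 7, one simple way to flag highly correlated feature pairs, assuming `X` is a hypothetical DataFrame holding only the linguistic feature columns:

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(X: pd.DataFrame, threshold=0.8):
    """Return feature pairs whose absolute Spearman correlation exceeds `threshold`."""
    corr = X.corr(method="spearman").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().sort_values(ascending=False)
    return pairs[pairs > threshold]
```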

Minor issues: 

  • Please clarify whether you have 6,861 unique students or whether this is a repeated-measures design.
  • How is it possible to have a minimum word count of 1? Does that mean these students did not complete the task and entered just one word?
  • Did you split the data into train, validation, and test sets twice, once for lasso and once for XGBoost? It would have been interesting to use the same splits for both models, making them more directly comparable in terms of model performance (see the shared-split sketch after this list).
  • Line 817: "The one small signal that did survive in both models [...] supports this." Here, "this" is ambiguous.
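
A minimal sketch of the shared-split suggestion above, with `X` and `y` as placeholder names for the feature matrix and the outcome; both models are scored on the identical held-out set:

```python
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def compare_on_shared_split(X, y, seed=42):
    """Fit both models on one identical train/test split so scores are comparable."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    lasso = LassoCV(cv=5).fit(X_tr, y_tr)
    xgb = XGBRegressor(random_state=seed).fit(X_tr, y_tr)
    return {"lasso_R2": lasso.score(X_te, y_te),
            "xgboost_R2": xgb.score(X_te, y_te)}
```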

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

In my opinion, this is an interesting paper, clear and with a logical structure. The authors look at how linguistic features can predict writing proficiency in students from Grades 6 to 12, using both a LASSO regression model and an XGBoost model. The dataset is rich, covering 5,638 essays, and the methodology is explained well.
Thank you for the opportunity to review this manuscript and please allow me to make a few constructive suggestions.
The introduction includes a good overview of the problem, but I think the theoretical background could be slightly deeper. For example, I believe it would help to link the observed linguistic trends to theories of language development or writing pedagogy; right now, the focus seems mostly on measurement and modeling. Also, the phrase "interpretive dual-model framework" is mentioned in the manuscript, but I suggest it be defined more clearly in conceptual terms (perhaps it would help to explain not just what it is technically, but why it matters for interpreting writing ability).
The data seem well prepared, but since all essays are of the informational type, I suggest the authors mention that the results might not generalize to narrative or argumentative writing.
I believe it might also help to discuss whether the prompts across grades could affect the results (since topics and writing tasks differ, some linguistic patterns may come from prompt difficulty rather than from proficiency itself).
The list of linguistic features is impressive in my opinion, but some of them may be conceptually or statistically related. I believe that it would be helpful to include a short note or table to show how inter-feature correlations were handled or checked for stability.
Also, the use of the Sentence-BERT MiniLM model is fine, but it would be good to justify why this model was chosen rather than a slightly larger one, since semantic representation is central to the results.
Also, perhaps the authors might want to show how sensitive the results are to text length, since longer essays naturally have more complex structures.
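
A rough sketch of such a length-sensitivity check, assuming a hypothetical DataFrame `df` with the feature columns, an outcome column, and a word-count column; feature-outcome correlations are recomputed after residualizing both sides on log word count:

```python
import numpy as np
import pandas as pd

def length_sensitivity(df, feature_cols, outcome="theta", length_col="word_count"):
    """Compare raw feature-outcome correlations with length-adjusted ones."""
    log_len = np.log1p(df[length_col])

    def residualize(series):
        slope, intercept = np.polyfit(log_len, series, 1)
        return series - (slope * log_len + intercept)

    y_res = residualize(df[outcome])
    rows = []
    for f in feature_cols:
        rows.append({"feature": f,
                     "raw_r": df[f].corr(df[outcome]),
                     "length_adjusted_r": residualize(df[f]).corr(y_res)})
    return pd.DataFrame(rows).sort_values("raw_r", key=abs, ascending=False)
```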
The methodology is solid, but I think it would be good to include a short baseline comparison (e.g., how these models perform versus a simple linear regression or a random forest). The R² values are modest (mostly 0.2-0.3 for LASSO and 0.3-0.4 for XGBoost), so it may be better to describe these as moderate or partial predictors rather than strong ones.
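
A minimal sketch of the suggested baseline comparison, with `X` and `y` as placeholders for the feature matrix and the outcome; the mean cross-validated R² is reported for each model:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def baseline_comparison(X, y, cv=5, seed=42):
    """Mean cross-validated R^2 for simple baselines and the two main models."""
    models = {
        "linear_regression": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=300, random_state=seed),
        "lasso": LassoCV(cv=cv),
        "xgboost": XGBRegressor(random_state=seed),
    }
    return {name: cross_val_score(m, X, y, cv=cv, scoring="r2").mean()
            for name, m in models.items()}
```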
The idea of combining LASSO and XGBoost is strong, but the comparison between them is mostly descriptive. I suggest that the authors add a small quantitative measure of overlap, e.g., how many top features are shared between the two models (a Jaccard similarity index or something similar).
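
A small sketch of the suggested overlap measure, assuming hypothetical dictionaries that map feature names to importance values (e.g., |coefficient| for LASSO, SHAP or gain for XGBoost):

```python
def top_k_jaccard(lasso_importance, xgb_importance, k=10):
    """Jaccard similarity between each model's top-k features."""
    top = lambda d: set(sorted(d, key=d.get, reverse=True)[:k])
    a, b = top(lasso_importance), top(xgb_importance)
    return len(a & b) / len(a | b)
```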
The results are clear and the tables are informative. In my opinion, the discussion could go a bit deeper into why these features matter for education. For example, if syntactic density predicts better writing in middle school, how can teachers use that insight in feedback or in curriculum design?
In high school, both models seem to fail to generalize, which is actually very interesting. Instead of describing this as a “collapse”, perhaps it could be reframed as a sign that writing at this level depends on higher-level discourse and argumentation features not captured by the current feature set.
I would also suggest defining all abbreviations the first time they appear (e.g., IQR).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for your careful revisions. The paper now seems to me clearer, better explained, and well balanced.
