Improving the Measurement of Students’ Composite Ability Score in Mixed-Format Assessments
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In the manuscript entitled “Improving the Measurement of Students’ Composite Ability Score in Mixed-Format Assessments”, the authors introduce a practical and easily implementable method that leverages MC item scores to derive empirical priors, leading to more accurate composite score estimates. Through empirical analyses of students in Grades 3 to 10 and two additional simulation studies based on real-world data, the authors demonstrate that this approach enhances composite ability score reliability, reduces reporting biases, and provides a valuable empirical evaluation tool for mixed-format assessments.
The manuscript is clear, relevant, and well structured. It is scientifically sound, and the experimental design is appropriate for testing the hypothesis. The investigation methods are described in detail and give a clear overview of the obtained results. The data, analysis of results, and discussion are presented clearly and understandably. The listed references are relevant to the study.
My recommendation is Minor Revision:
1. The text on Pages 3 to 6 and 9 to 13 should be justified;
2. The conclusion is too extensive. It should restate the research problem, summarize the arguments or findings, and discuss the implications. The conclusion should briefly summarize the key results presented in the body and state the significance of the study to the broader knowledge base of the field.
Author Response
Dear Reviewer 1,
Thank you for your time reviewing our work. Your comments are valuable to us, and we have responded to your comments with point-by-point replies in the attached document. Thank you again!
Best,
Authors
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper was interesting, well-written, and mostly easy to follow. I enjoyed reading it. I have a few suggestions that will improve the clarity and readability of the Results and Conclusions sections.
Line 205: Table 1. Consider adding a "points possible" column that can help readers easily see the MC plus CR item totals.
Line 214: Delete redundant text.
Line 221: Table 2. Consider adding a sentence or two about how you interpreted the factor analysis results for Grades 7-10, where the second factor was relatively larger/explained more variance than in Grades 3-6.
Lines 231-232: Consider explicitly stating "Grades 7-10" here, rather than the vague text "grades with larger numbers of items and students".
Line 238: Consider adding the mean, standard deviation, and reliability of a few more grades. Maybe include the grade with the lowest, middle, and highest.
One key piece of information that seems to be missing from the results is a comparison of the SEM/CSEM of the posterior estimates. In addition to total reliability, consider reporting the standard error of measurement at key points along the posterior distribution (around a cut score, or at the median and the 25th and 75th percentiles). This will help the reader get a sense of the precision of the scores along the continuum, not just a summary statistic for the group (total reliability).
Also, your readers will be interested to hear your thoughts on the relationships shown in Figure 2 for Grades 3-6 versus Grades 7-10 because they look different. For instance, what is the expected relationship for each grade? Does the expected relationship differ with the test design related to more CR item points in higher grades?
Line 245: Typo "each of the assessments has been calibrated"
Line 356: I may have missed it, but it is unclear how Table 3 supports the claim that EFB achieves "superior" estimation accuracy to the traditional expert-determined weights method. Please make this clearer.
Line 275: Consider explicitly stating which "parameters" are being referred to here (i.e., item or item and person).
Line 277: Likewise, consider changing "ability parameters" to "ability estimates" here to keep the language consistent with other sentences in this paragraph.
Lines 279 and 285: Consider adding text that discusses the thresholds for RMSE and reliability that are used to define substantive differences (e.g., an ANOVA methodology rather than a particular value for each metric).
Line 293: Consider adding the direction of the skew (e.g., negative skew like what was observed in the empirical data?).
Line 297: It may be helpful to add the total count of unique groups in the simulation so that readers don't have to do the math in their heads (e.g., 3x3x3=27). Similarly, consider adding a number count column to Table 5.
Lines 304-315: Consider adding a general interpretation of the size of the observed RMSEs. For instance, are they average for a sample/test of this size/length? Or if they are on the logit metric, are they close to 2x the average SEM?
Line 324: Is this referring to only the EFB method (uniform and skewed of the EFB method)? Consider making it clearer. Also, what about the normal distribution findings? Was this group difference significantly different from uniform and skewed?
Line 348: Here or earlier, consider mentioning any significant ANOVA interactions (if any) or why these were not of interest in this study.
Line 351: The word "Groups" in the title is unclear to me. Consider using Non-Normally Distributed "Achievement" or something similar instead.
Lines 353-355: I can't make the math work on the number of groups here (grades 3-12 = 10 grades vs. 7 science assessments). Consider adding a note that not all grades had an assessment.
Line 363: How the datasets were simulated is unclear. Were they sampled at the student level? With replacement? In the appendix, it makes it sound like only the items were selected? Consider making this clearer.
Line 395: With the exception of Assessment 7, the distributions of abilities in Figure 3 look normal-ish, considering the small sample sizes. Consider adding why these are considered non-normal.
Line 401: Figure 3. Consider reversing the order of the Density axis to be consistent with Table 7.
Lines 410-425: This section is a little redundant. If you are looking to reduce text (to compensate for some of the suggestions here), here may be a good place.
Line 424: "relatively low". Relative to what? Computational time was mentioned here and in the Introduction, but it was not a feature that was included in this study. Bayesian methods typically take longer than other traditional methods, so I'm not sure what is gained by adding this sentence again.
Line 462: Is there a word missing here? "applicability of polytomous ____ extends..."
Appendix: The mathematical equations did not appear in the MS Word doc. Maybe it was my copy.
Appendix: Consider labeling the item parameters a, b, and category thresholds in a column for Table C.
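As an illustration of the SEM/CSEM comment above, the following is a minimal sketch of how conditional standard errors could be summarized at the 25th percentile, median, 75th percentile, and a cut score. It assumes each examinee has a posterior ability estimate and a posterior standard deviation; the function names, the window width, the cut score of 1.0, and the simulated data are all hypothetical and are not taken from the manuscript.

```python
import numpy as np


def csem_at_points(theta_hat, psd, points, window=0.25):
    """Average posterior SD of examinees whose estimate lies within
    +/- window of each reference point (hypothetical helper)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    psd = np.asarray(psd, dtype=float)
    out = {}
    for label, point in points.items():
        mask = np.abs(theta_hat - point) <= window
        # Report NaN when no examinee falls inside the window.
        out[label] = float(psd[mask].mean()) if mask.any() else float("nan")
    return out


def csem_summary(theta_hat, psd, cut_score=None, window=0.25):
    """CSEM near the quartiles of the estimates and, optionally, a cut score."""
    q25, q50, q75 = np.percentile(theta_hat, [25, 50, 75])
    points = {"p25": q25, "median": q50, "p75": q75}
    if cut_score is not None:
        points["cut"] = cut_score
    return csem_at_points(theta_hat, psd, points, window)


# Simulated stand-in data (assumption): precision worsens away from the center.
rng = np.random.default_rng(0)
theta_hat = rng.normal(0.0, 1.0, 2000)
psd = 0.30 + 0.10 * theta_hat**2

summary = csem_summary(theta_hat, psd, cut_score=1.0)
for label, value in summary.items():
    print(f"{label}: {value:.3f}")
```

Averaging within a window is only one possible design choice; a smoothed curve of posterior SD against the ability estimate would convey the same information without choosing a window width.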
Author Response
Dear Reviewer 2,
Thank you for your time reviewing our work. Your comments are valuable to us, and we have responded to your comments with point-by-point replies in the attached document. Thank you again!
Best,
Authors
Author Response File: Author Response.pdf