Development of a Quantum Literacy Test for K-12 Students: An Extension of the Computational Thinking Framework
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The article is worthy of publication but only after some minor changes.
The field of quantum computing is certainly one of the most promising (although the horizon of promise seems not to be reachable yet, as is actually the case with every horizon). Research on the educational aspects of quantum computing is very important.
The authors have clearly invested a lot of thought and effort in the theory and in creating a literacy test for the field of quantum computing. They have also conducted a very important experiment in teaching and developing this quantum computing literacy in 819 high-school students (average age of approx. 17) over a two-month program.
My suggested changes concern the tension and balance between these aspects of the research.
The authors exemplify their test with a few test items and give a link to an online repository. Although the repository does include pictures of the cards used, it would benefit the reader to have at least the gist of the test (such as the questions asked or some systematic summary) in the article, perhaps in the appendix. Including all the items "as is" in the appendix seems even better.
The authors describe their program as:
"...conducted a two-month training program with the assistance of ten research assistants to introduce participants to quantum computing. The training was simple enough to match the participants' present abilities and broad enough to introduce a wide range of quantum literacy. To minimize disruptions to classroom routines, training sessions were scheduled in coordination with each school. The program consisted of two modules, with the first lasting a maximum of one month and the second, two months. In the first module, we introduced the concept of quantum computing, including a brief history, definition, and the fundamental principles. In the second module, the participants were exposed to different unplugged activities..."
The authors describe some fragments of the program here and there. However, as it sounds very interesting to educators, the authors would do well to dedicate much more space to a systematic and detailed account of the curriculum and content of the program.
Author Response
Comment 1:
The authors exemplify their test with a few test items and give a link to an online repository. Although the repository does include pictures of the cards used, it would benefit the reader to have at least the gist of the test (such as the questions asked or some systematic summary) in the article, perhaps in the appendix. Including all the items "as is" in the appendix seems even better.
Response:
We thank the reviewer for this suggestion. We would like to clarify that the manuscript already provides a detailed description of the test design, item structure, response alternatives, and cognitive tasks assessed, and includes four representative items in the main text to illustrate the test format. Given the graphical nature of all 27 validated items, reproducing the full item set in the appendix would substantially increase manuscript length without added interpretive value. Therefore, the complete item bank, including all visual stimuli and scoring keys, is provided in full via the OSF repository, which is explicitly linked in the manuscript.
Comment 2:
The authors describe their program as:
"...conducted a two-month training program with the assistance of ten research assistants to introduce participants to quantum computing. The training was simple enough to match the participants' present abilities and broad enough to introduce a wide range of quantum literacy. To minimize disruptions to classroom routines, training sessions were scheduled in coordination with each school. The program consisted of two modules, with the first lasting a maximum of one month and the second, two months. In the first module, we introduced the concept of quantum computing, including a brief history, definition, and the fundamental principles. In the second module, the participants were exposed to different unplugged activities..."
The authors describe some fragments of the program here and there. However, as it sounds very interesting to educators, the authors would do well to dedicate much more space to a systematic and detailed account of the curriculum and content of the program.
Response:
Thank you for drawing our attention to this. However, the training program was designed solely to support test development and validation and was not tied to any formal curriculum. We have clarified this in the manuscript. Please refer to lines 180-181.
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
First, I'd suggest tightening your theoretical claims about extending computational thinking to a quantum stage. Second, I noticed a couple of technical explanations in the unplugged activities that need correction or tempering. Third, the validity evidence section could be more complete and more standardized (particularly regarding concurrent validity and ML modeling details). Finally, there are a handful of internal inconsistencies and editorial issues to address.
Your abstract does a great job stating your aims, sample, and key indices—"high internal consistency (α = .87)… strong concurrent validity with the Computational Thinking Test (r = .65)… moderate validity with a Spatial Ability Test (r = .32)… machine learning models explained less than 40% of QLt score variance" (p. 1, lines 14–19). I'd love to see each of these headline claims mirrored in your Results section with a compact table that lists the external instruments (names, citations), N per correlation, timing of administration (same sitting vs separate), coefficient type (Pearson/Spearman/point-biserial), and 95% CIs. It would also be helpful to include a short note on whether the correlations remain similar after controlling for gender/grade, especially since you report a small gender difference in total scores. Right now, these figures only appear in the abstract narrative, so a dedicated validity table would make the evidence much more auditable.
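To illustrate the kind of interval requested here, the following minimal sketch computes an approximate 95% CI for a correlation through the Fisher z-transformation. It assumes the coefficients are Pearson correlations and, purely for illustration, that each is based on the full sample of N = 819; the actual N per correlation may be smaller because of missing data. The function name `pearson_ci` is hypothetical.

```python
import math


def pearson_ci(r, n, z_crit=1.959963984540054):
    """Approximate two-sided CI for a Pearson r via the Fisher z-transformation.

    z_crit defaults to the standard normal quantile for a 95% interval.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                 # Fisher z of the observed correlation
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# ASSUMPTION: N = 819 for both correlations (the report asks the authors
# to state the exact N per correlation, so this value is illustrative only).
ASSUMED_N = 819
for label, r in [("QLt vs CTt", 0.65), ("QLt vs Spatial Ability", 0.32)]:
    lower, upper = pearson_ci(r, ASSUMED_N)
    print(f"{label}: r = {r:.2f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the coefficients turn out to be Spearman or point-biserial, a different interval procedure (for example, bootstrapping) may be more appropriate.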
Your conceptual framing is promising, but occasionally it feels over-extended or imprecise. For example, in Table 1, the phrase "Type of logic involved: Diffuse logic (everywhere-nowhere, every time-no time)" (p. 3, line 114) risks conflating metaphors with formal descriptions, which might confuse education researchers using your framework. Similarly, when you write that "the quantum stage introduces a non-binary representation (0 and 1 simultaneously), which permits higher processing power and speed" (p. 3, lines 98–104), you're overstating performance in general terms. I'd suggest rephrasing this to discuss superposition as a linear combination of basis states, and avoiding claims of generic speed since that really depends on problem class and algorithm. These changes won't alter your instrument, but they'll improve conceptual accuracy.
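As one possible form for the suggested rephrasing, a single-qubit superposition can be written as a normalized linear combination of the two basis states. The symbols α and β below are standard notation and are not taken from the manuscript.

```latex
\[
  |\psi\rangle \;=\; \alpha\,|0\rangle + \beta\,|1\rangle,
  \qquad \alpha, \beta \in \mathbb{C},
  \qquad |\alpha|^{2} + |\beta|^{2} = 1 .
\]
```

A measurement in the computational basis then yields 0 with probability |α|² and 1 with probability |β|², which avoids the "0 and 1 simultaneously" shorthand.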
I really appreciate the strength of your design and item blueprinting—you clearly specify a 30-item pool across concepts (20), practices (3), and perspectives (7), and you explicitly describe the four cognitive task families (prediction, sequencing, completion, pattern recognition) with a clear mapping to visual answer formats (pp. 4–5, lines 139–147). I have two small requests here. First, could you reconsider the decision to "remove the qubit concept as this can be understood when addressing the concept of superposition" (p. 4, lines 131–133)? In many curricula, qubit is the foundational unit introduced before superposition. At minimum, I'd suggest justifying its removal in terms of redundancy in your item set—perhaps with content-overlap statistics or expert-panel consensus. Second, since you subsequently drop three items (qc7, qc17, qps7) after EGA (pp. 23–25), please restate the final blueprint counts (QC/QPr/QPs) for the retained 27-item form and re-report reliability (KR-20/α and preferably ω) for the final scale and subscales. Currently, α = .87 and subscale αs are reported before the item removals (pp. 15–16, lines 359–361).
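As a reminder of what the re-reported coefficient involves, here is a minimal sketch of KR-20 for dichotomously scored items; for 0/1 items it coincides with Cronbach's α when the same variance convention is used throughout. The toy response matrix and the function name `kr20` are illustrative assumptions, not the authors' data or code.

```python
def kr20(responses):
    """KR-20 reliability for rows of 0/1 item scores (one row per respondent)."""
    n_people = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("at least two items are required")

    # Population variance of the total scores.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    if var_total == 0:
        raise ValueError("total scores have zero variance")

    # Sum of item variances p * (1 - p), where p is the proportion correct.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / var_total)


# Hypothetical toy data: four respondents, three items.
toy = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]
print(f"KR-20 on toy data: {kr20(toy):.2f}")
```

Running the same function on the retained 27 items, and separately on each subscale, would give the post-removal estimates requested above.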
The unplugged activities are really engaging, but I spotted two technical overclaims that should be corrected to avoid propagating misconceptions. In the Grover search vignette, you write that "the probability of finding the correct answer increases exponentially with each step… the ball can be found in 1 to 3 attempts as opposed to 1 to 10 attempts" (p. 10, lines 253–256). Grover’s algorithm provides a quadratic (O(√N)) speedup, not exponential. The good news is that your pedagogy can remain intact if you adjust the wording to "quadratic amplification" and explain why the attempt count scales with √N in your stylized classroom task. In the Shor example, focusing on "prime numbers to systematically narrow down choices" (p. 10–11, lines 260–267) risks misrepresenting Shor's method, which uses period finding via QFT rather than pruning by primality. I'd suggest reframing this as a high-level analogy to periodic structure detection rather than a prime-filtering heuristic.
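The √N scaling can be shown with a short numerical sketch. It uses the textbook idealization of Grover's algorithm with exactly one marked item among N, where the success probability after k iterations is sin²((2k + 1)θ) with θ = arcsin(1/√N). Mapping this onto the ten-box classroom task is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def grover_success_probability(n_items, iterations):
    """Success probability after `iterations` Grover steps, one marked item."""
    theta = math.asin(1.0 / math.sqrt(n_items))
    return math.sin((2 * iterations + 1) * theta) ** 2


def optimal_iterations(n_items):
    """Iteration count that brings (2k + 1) * theta closest to pi / 2."""
    theta = math.asin(1.0 / math.sqrt(n_items))
    return max(0, round(math.pi / (4 * theta) - 0.5))


def classical_expected_checks(n_items):
    """Expected checks when opening boxes one by one without repetition."""
    return (n_items + 1) / 2


for n in (10, 100, 1000):
    k = optimal_iterations(n)
    p = grover_success_probability(n, k)
    print(
        f"N = {n}: about {k} Grover iterations (success probability {p:.3f}) "
        f"vs. {classical_expected_checks(n):.1f} expected classical checks"
    )
```

The iteration count grows roughly as (π/4)√N, so multiplying N by 100 multiplies the number of steps by about 10. That is a quadratic advantage over the classical count, not an exponential one.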
Thanks for the transparency on recruitment and consent (Ministry approval, parental consent, final N = 819; p. 6, lines 156–165). Table 2 is useful but needs two quick fixes. First, "School type" appears twice—the second occurrence ("Boys/Girls/Co-education") would be more naturally labeled "School composition" or "School gender composition" (p. 7, Table 2). Second, your level labels are inconsistent across the paper. Participants are introduced as Senior Secondary I/II (SSI/SSII; p. 6, lines 156–163), while Table 2 lists SSII/SSIII, and later analyses again use SSI/SSII (p. 13, Table 4). Please harmonize these level names throughout and double-check the counts per level.
Your CTT/IRT results are clearly laid out, and your combined use of item statistics, item-fit (Outfit/Infit), and information functions is appropriate. That said, two interpretive refinements would strengthen your measurement argument. First, the IRT difficulty estimates sit in a very narrow band ("approximately between −0.15 and +0.16," p. 16, lines 387–393), and the Test Information Function peaks squarely near θ ≈ 0 (pp. 19–20). This suggests your current pool targets average ability well, but may have limited precision at the tails. Consider retaining a couple of harder/easier items (from the dropped pool or new items) to expand coverage, particularly for QC, and show marginal reliability/SEM across θ to demonstrate the gain. Second, after removing qc7, qc17, qps7 for weak network loadings (<0.20; p. 23–24), please report how their removal affected subscale balance and whether any content domain became under-represented (Table 8 shows qps7 and qc7 among the weakest; p. 24).
Regarding your group comparisons and DIF: the t-test shows a small gender difference in total QLt (Mfemale = 15.41 vs Mmale = 14.79; t = −2.063, p = .039; p. 12–13), but your DIF analyses indicate no item-level bias beyond ±0.5 logits across gender/discipline/level/school type (pp. 21–22, lines 488–496). It would be helpful to add a one-sentence reconciliation explaining that "the small mean difference reflects true-score variation rather than item bias." I'd also suggest providing a table (or OSF supplement) of DIF statistics for each subgroup, since the text currently just points to OSF for additional plots (p. 21–22).
For concurrent and predictive validity: the abstract notes r = .65 with the Computational Thinking Test and r = .32 with Spatial Ability (p. 1, lines 14–19). Please document the specific instruments (e.g., CTt by Roman-Gonzalez with year, Spatial Ability Test reference), the administration order (to avoid criterion contamination after the two-month training; pp. 8–11), and the exact sample size for each correlation (accounting for any missingness). Since you mention ML models and list the R packages used (gbm, caret, e1071; p. 14, lines 346–347), I'd suggest adding a compact table summarizing your algorithm(s), cross-validation scheme, feature set, and variance explained on a held-out fold. This will bring the abstract's "< 40% variance explained" claim into the Results proper.
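To show what "variance explained on a held-out fold" means in practice, the sketch below computes a pooled out-of-fold R² with k-fold cross-validation. It deliberately swaps in ordinary least squares with a single predictor instead of the gbm/caret/e1071 models named in the manuscript, and it runs on synthetic data; the function names, the seed, and the data are all assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_r2(xs, ys, k=5, seed=0):
    """Pooled out-of-fold R^2: each case is predicted by a model that never saw it."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    predictions = {}
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            predictions[i] = intercept + slope * xs[i]

    mean_y = sum(ys) / len(ys)
    ss_res = sum((ys[i] - predictions[i]) ** 2 for i in indices)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data (hypothetical): a linear signal plus Gaussian noise.
rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + rng.gauss(0, 5) for x in xs]
print(f"5-fold out-of-fold R^2: {cross_validated_r2(xs, ys):.3f}")
```

Reporting the resampling scheme (number of folds, repeats, seed) next to the resulting R² is what would make the "< 40%" figure reproducible.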
Best wishes with the revisions!
Author Response
We appreciate your effort and must acknowledge that your comments were crucial.
Comment 1:
Your conceptual framing is promising, but occasionally it feels over-extended or imprecise. For example, in Table 1, the phrase "Type of logic involved: Diffuse logic (everywhere-nowhere, every time-no time)" (p. 3, line 114) risks conflating metaphors with formal descriptions, which might confuse education researchers using your framework.
Response:
We have revised the wording in Table 1 to replace metaphorical phrasing with a more precise and analytically grounded description.
Comment 2:
Similarly, when you write that "the quantum stage introduces a non-binary representation (0 and 1 simultaneously), which permits higher processing power and speed" (p. 3, lines 98–104), you're overstating performance in general terms. I'd suggest rephrasing this to discuss superposition as a linear combination of basis states, and avoiding claims of generic speed since that really depends on problem class and algorithm. These changes won't alter your instrument, but they'll improve conceptual accuracy.
Response:
We have revised this sentence to remove generalised performance claims and to describe superposition more precisely, in line with established quantum computing terminology. Please refer to lines 99-101.
Comment 3:
I really appreciate the strength of your design and item blueprinting—you clearly specify a 30-item pool across concepts (20), practices (3), and perspectives (7), and you explicitly describe the four cognitive task families (prediction, sequencing, completion, pattern recognition) with a clear mapping to visual answer formats (pp. 4–5, lines 139–147).
Response:
We appreciate your comment.
Comment 4:
I have two small requests here. First, could you reconsider the decision to "remove the qubit concept as this can be understood when addressing the concept of superposition" (p. 4, lines 131–133)? In many curricula, qubit is the foundational unit introduced before superposition. At minimum, I'd suggest justifying its removal in terms of redundancy in your item set—perhaps with content-overlap statistics or expert-panel consensus.
Response:
We have clarified that the qubit was not removed conceptually but treated as a foundational prerequisite embedded across multiple items. Its exclusion as a standalone construct was based on expert-panel consensus and concerns about content redundancy. Please refer to lines 131-136.
Comment 5:
Second, since you subsequently drop three items (qc7, qc17, qps7) after EGA (pp. 23–25), please restate the final blueprint counts (QC/QPr/QPs) for the retained 27-item form and re-report reliability (KR-20/α and preferably ω) for the final scale and subscales. Currently, α = .87 and subscale αs are reported before the item removals (pp. 15–16, lines 359–361).
Response:
We have restated the final item blueprint and re-reported reliability estimates for the refined 27-item scale following item removal. Please refer to lines 557-565.
Comment 6:
The unplugged activities are really engaging, but I spotted two technical overclaims that should be corrected to avoid propagating misconceptions. In the Grover search vignette, you write that "the probability of finding the correct answer increases exponentially with each step… the ball can be found in 1 to 3 attempts as opposed to 1 to 10 attempts" (p. 10, lines 253–256). Grover’s algorithm provides a quadratic (O(√N)) speedup, not exponential. The good news is that your pedagogy can remain intact if you adjust the wording to "quadratic amplification" and explain why the attempt count scales with √N in your stylized classroom task. In the Shor example, focusing on "prime numbers to systematically narrow down choices" (p. 10–11, lines 260–267) risks misrepresenting Shor's method, which uses period finding via QFT rather than pruning by primality. I'd suggest reframing this as a high-level analogy to periodic structure detection rather than a prime-filtering heuristic.
Response:
We have revised the Grover and Shor vignettes to correct technical overclaims, replacing exponential language with quadratic amplification for Grover and reframing the Shor example as a high-level analogy for period finding rather than prime-based filtering. Please refer to lines 250-260 and 263-275.
Comment 7:
Thanks for the transparency on recruitment and consent (Ministry approval, parental consent, final N = 819; p. 6, lines 156–165). Table 2 is useful but needs two quick fixes. First, "School type" appears twice—the second occurrence ("Boys/Girls/Co-education") would be more naturally labeled "School composition" or "School gender composition" (p. 7, Table 2).
Response:
Thank you for drawing our attention to this. This has been corrected in the table.
Comment 8:
Second, your level labels are inconsistent across the paper. Participants are introduced as Senior Secondary I/II (SSI/SSII; p. 6, lines 156–163), while Table 2 lists SSII/SSIII, and later analyses again use SSI/SSII (p. 13, Table 4). Please harmonize these level names throughout and double-check the counts per level.
Response:
Thank you for this. It has been corrected. Please refer to Table 2.
Comment 9:
Your CTT/IRT results are clearly laid out, and your combined use of item statistics, item-fit (Outfit/Infit), and information functions is appropriate.
Response:
Thank you so much.
Comment 10:
That said, two interpretive refinements would strengthen your measurement argument. First, the IRT difficulty estimates sit in a very narrow band ("approximately between −0.15 and +0.16," p. 16, lines 387–393), and the Test Information Function peaks squarely near θ ≈ 0 (pp. 19–20). This suggests your current pool targets average ability well, but may have limited precision at the tails. Consider retaining a couple of harder/easier items (from the dropped pool or new items) to expand coverage, particularly for QC, and show marginal reliability/SEM across θ to demonstrate the gain.
Response:
We agree that the current item pool provides the greatest precision around average ability levels. However, we opted not to retain psychometrically weak items or introduce new items at this stage, as this would require additional data collection and would conflict with the EGA-based item purification strategy. We have clarified this design boundary and highlighted future extensions to broaden the ability coverage in the limitation section (lines 725-729).
Comment 11:
Second, after removing qc7, qc17, qps7 for weak network loadings (<0.20; p. 23–24), please report how their removal affected subscale balance and whether any content domain became under-represented (Table 8 shows qps7 and qc7 among the weakest; p. 24).
Response:
We have clarified how item removal affected subscale balance and confirmed that no content domain became under-represented. Please refer to lines 557-559.
Comment 12:
Regarding your group comparisons and DIF: the t-test shows a small gender difference in total QLt (Mfemale = 15.41 vs Mmale = 14.79; t = −2.063, p = .039; p. 12–13), but your DIF analyses indicate no item-level bias beyond ±0.5 logits across gender/discipline/level/school type (pp. 21–22, lines 488–496). It would be helpful to add a one-sentence reconciliation explaining that "the small mean difference reflects true-score variation rather than item bias." I'd also suggest providing a table (or OSF supplement) of DIF statistics for each subgroup, since the text currently just points to OSF for additional plots (p. 21–22).
Response:
Thank you for this comment. We have included the sentence in the main text and also provided additional DIF statistics in the OSF. Please refer to lines 509-517.
Comment 13:
For concurrent and predictive validity: the abstract notes r = .65 with the Computational Thinking Test and r = .32 with Spatial Ability (p. 1, lines 14–19). Please document the specific instruments (e.g., CTt by Roman-Gonzalez with year, Spatial Ability Test reference), the administration order (to avoid criterion contamination after the two-month training; pp. 8–11).
Response:
Thank you for this comment. Please note, however, that this information was already provided in the previous version. Please refer to Section 8 (lines 600-601) for the references for the CTt and the SAt, and to Section 8.2 (specifically, Table 10) for the test administration procedure.
Comment 14:
...and the exact sample size for each correlation (accounting for any missingness).
Response:
We have included the exact sample size for each correlation. Please refer to Table 11.
Comment 15:
Since you mention ML models and list the R packages used (gbm, caret, e1071; p. 14, lines 346–347), I'd suggest adding a compact table summarizing your algorithm(s), cross-validation scheme, feature set, and variance explained on a held-out fold. This will bring the abstract's "< 40% variance explained" claim into the Results proper.
Response:
Thank you for this. We have included this information in Table 13.
