Review Reports - From Abstract to Domain-Specific: Development and Validation of Matrix Reasoning Tasks for Students in Biology

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The present study investigates psychometric properties of a matrix reasoning test with stimuli that resemble biology-related content. Based on a sample of 3rd to 6th graders, the authors report reliability estimates along with validity coefficients for an established reasoning IQ test and for a biology-related reasoning measure. Furthermore, test scores of a pre-selected group of students from an enrichment program are compared with test scores from a general student sample.

The majority of the manuscript provides straightforward information regarding the new biology-related reasoning test's psychometric properties. However, I struggle to find in the text the actual purpose of the new test, and consequently I struggle to deduce the purpose of the present manuscript.

Incidentally, the introduction begins with a lengthy recap of intelligence theories. The intention of the new test development remains difficult to grasp. Consequently, the various analyses of item properties and concurrent validity lack a clear motivation.

The construct "performance validity" is mentioned twice in the abstract, and twice in the manuscript text, but it is never explained. Also, the second sentence of the abstract addresses factors like test-taking motivation (line 6), yet the remainder of the study evades this topic completely.

The rationale behind the new matrix reasoning task seems to be a tailored assessment for students that are interested in biology. It remains unclear, though, whether this aim was achieved. Test attitude has been widely researched (e.g., Arvey, R. D., Strickland, W., Drauden, G., & Martin, C. (1990). Motivational components of test taking. Personnel Psychology 43, 695–716. doi:10.1111/j.1744-6570.1990.tb00679.x as well as various newer studies).

RQ2 seems vague. From a construction point of view, I would expect the new test to correlate strongly with other matrix reasoning tests. The weak correlation in the first sample seems unexpected to me. Or is it possible that there was a restriction of range due to ceiling effects? (Reporting the descriptive statistics (number of items solved, SD, median) might shed some light on this.) The number of skipped/unanswered items throughout the test might also yield additional insights. In addition, the "abilities in scientific inquiry processes in biological research contexts" need to be introduced in the introduction, as readers might be unfamiliar with this particular construct. The resulting correlations are therefore difficult to interpret, as I cannot know what to expect from this somewhat clandestine construct.

Finally, the authors should add details regarding how to obtain the new test instrument and its terms of use.

Minor points:

l. 81: "developed" instead of "Developed"
l. 97-99: It would help readers if the samples of Benit & Söllner (2012) were summarized briefly here (and possibly for other studies elsewhere in the introduction, as well).
l. 104-106: Please specify between which grades the transition (typically) occurs. In Germany, for example, it can be after grade 4 in most states, but after grade 6 in a few states.
l. 126-127: Please report in which country the study was conducted. Also consider reporting the sampling procedure (sampling of schools, sampling of classes, sampling of students) along with the rate of participation in the selected sample(s).
l. 127: The mean age of 10.1 years does not match the combined mean age of the two subsamples (10.12 years and 10.9 years - which would yield 10.5 years on average).
l. 153: The duration of the test is reported to be 20 minutes - for the 24 item test as well as for the 60 item test? This seems implausible.
Figure 1: Please consider indicating the correct answer in each of the four examples. (Pertains also to lines 177-179.)
l. 224: What happened to the fourth and sixth grades?
l. 228: "rule out" instead of "exclude"
l. 308: Here (and also in line 128), the authors mention "revising" the test instrument. How were the initial items revised? The correlation between the item difficulty parameters of r = 0.95 suggests that items 2 to 24 remained rather unchanged.
l. 312: What does "a fully adequate structure" mean? Please consider rewriting this point.
l. 363-366: Consider deleting the specific results from the discussion at this point (as these have already been reported in the results).
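
To illustrate the comment on l. 127: the mean age of a combined sample is the size-weighted mean of the subsample means, so it always lies between them. The sketch below is only an illustration. The two subsample means are the ones quoted in the comment, while the function name `pooled_mean` and the subsample sizes are hypothetical and not taken from the manuscript.

```python
def pooled_mean(means, sizes):
    """Return the sample-size-weighted mean of several subsample means."""
    if len(means) != len(sizes) or not means:
        raise ValueError("means and sizes must be non-empty and of equal length")
    total = sum(sizes)
    if total <= 0:
        raise ValueError("total sample size must be positive")
    return sum(m * n for m, n in zip(means, sizes)) / total


# Subsample mean ages as quoted in the comment on l. 127.
subsample_means = [10.12, 10.9]

# Equal subsample sizes reproduce the simple average (about 10.5 years).
equal = pooled_mean(subsample_means, [1, 1])

# Hypothetical, clearly unequal sizes (assumed for illustration only).
unequal = pooled_mean(subsample_means, [300, 100])

print(round(equal, 2), round(unequal, 3))
```

Whatever the subsample sizes are, the pooled value cannot fall below the smaller subsample mean, which is why a combined mean of 10.1 years would only be plausible if the second subsample were very small.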

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for the invitation to review this very interesting work. The authors emphasize the importance of more domain-specific assessment and provide some initial evidence for validating a domain-specific assessment of biology-based skills in relation to scientific inquiry skills and giftedness identification. I applaud the authors for their thorough reporting and descriptions of the assessment design and analysis. The paper is well written as drafted; however, I’d appreciate additional information and clarifications in order to further consider the work for publication.

Introduction

I'd be curious to know whether you have considered if these tasks are fair and valid across different groups. Suggested reading:

Holden, L. R., & Tanenbaum, G. J. (2023). Modern Assessments of Intelligence Must Be Fair and Equitable. Journal of Intelligence, 11(6), 126.

What do you mean by time efficient?

Have you considered the Process Overlap Theory? Suggested reading:

McGrew, Kevin S., W. Joel Schneider, Scott L. Decker, and Okan Bulut. 2023. A Psychometric Network Analysis of CHC Intelligence Measures: Implications for Research, Theory, and Interpretation of Broad CHC Scores “Beyond g”. Journal of Intelligence 11: 19.

What do you mean by task interactivity?

What does this measure of complex problem solving entail? How is this different from a matrix reasoning task? Is it just that this measure is domain specific, whereas those other reasoning tasks are domain general?

You make a great point about how these other personality factors could be impacting the measurement and validity of scores on matrix reasoning tasks, so it would be important for you to consider forms of measurement bias for your task as well.

When using specific jargon, please explain what it means to the reader. What is a design-based research framework? Here and throughout.

Materials and methods

Diversity? Demographics? Do you have racial/ethnic identity and socioeconomics for these students?

How was the sample recruited? Were the students not randomly selected?

What does it mean that the biology students were not randomly selected to participate from the teachers’ suggestions? Also, how do school type and location come into play here?

How does the fact that the enrichment students outnumber the controls impact the results?

Was the decision to include 4 distractors based on previous research? How do you know that the cognitive demands of this task are age appropriate across all grades of students included in your sample?

Were the items administered on computer/tablet with administrator assistance? How did you ensure the students understood the example before moving on? Please provide more information about how the students were assisted by the research staff, and about how comprehension difficulties were handled. Did each student have to solve a trial item or items correctly before continuing?

For Figure 1, I suggest including the correct answers for the example items.

Being validated on a European sample is interesting. What about participants outside of Europe? Why aren’t there norms for other groups? What does it mean to use this task on US students?

How did you ensure that the scientific reasoning items are age appropriate for this group? Has this scale been validated for a range of students of different developmental stages and ages regardless of giftedness?

How do the locations of task administration impact the results, also considering the differences in group sizes across locations and program types? Why was this not more controlled across groups?

Results

Do you have data on the racial/ethnic identity of the students? If so, why was a DIF analysis not conducted for that as well?

Please provide the reader with additional detail about the Stocking-Lord procedure and why this was used.

Why was DIF not run on the first test version?

On p. 8, the sentence referencing Table 3 and the subsequent sentence seem to contradict each other. Please revise for clarity.

So even with nonsignificant or marginal differences in items based on the DIF analyses, it seems that the longer test version was better for correcting what might be forms of DIF for these groups. Again, DIF should also be examined for other demographic factors if the information is available.

What is the typical amount of variance accounted for in the development of these kinds of new assessments, or even in comparable assessments with similar reliability? Why do you think your R-squared was so low?

Discussion

Please provide more information, from a developmental perspective, on why MR correlated more similarly with IQ and SI at secondary grade in the results.

Please discuss suboptimal targeting in more detail and how it applies here, also considering the comments above.

In terms of test-taking motivation, there was a lot of framing of its importance in the introduction. Please say more about how you think it applies based on your results, the implications of the work, and future work.

You talk a good amount about the importance of domain-specific ability assessment in the intro, but the discussion did not feel balanced in terms of the theoretical contribution of the assessment or the implications of the work as a whole. Please revise the paper to better address these issues.

Minor

Please provide the reader with more information regarding CNSVS and TTMQ.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The study provides a comprehensive introduction with a description of general intelligence and the support that this construct enjoys in the literature. There is some recognition that subsets of general intelligence, such as fluid reasoning, are contested in terms of how they relate to domain-specific achievement, but I would have liked to see a greater exploration of the critique that exists in the literature about the limits of this model of intelligence for understanding achievement more broadly. Further, while factors such as personality and test conditions were acknowledged to impact performance on tests that claim to measure this form of intelligence (especially fluid reasoning), there was little consideration of the vast literature that debates the usefulness of these tests (e.g., cultural relevance, the Flynn effect, etc.). I think it perfectly acceptable, and defensible given the plethora of literature that supports 'g', to use this construct in the way it is used in this study, but I do not think the tensions in the field can be completely ignored. I would expect to see them addressed, followed by an explanation as to why such criticisms are not relevant to the position adopted in the study on these issues.

I was heartened to see a discussion in which the authors were frank about the results and the limits of instrument development, as well as proposals for how the instrument might be strengthened in the future. I did feel that this section was under-referenced. The suggestions made would feel more solid if they were better grounded in the literature related to instrument design. I think what is also missing, and this could be for the introduction or the conclusion, is a greater rationale for why this research has merit. I am not sure that the need for domain-specific measures comes through as strongly as it could. The work is certainly situated within the field of intelligence testing and a research gap is established, but I am left wondering about its practical applications: who will benefit from such research? Is student need misunderstood in this domain and, if so, how do you know, and what literature supports this? Exactly how will this work potentially have an impact in practical terms? This is touched upon, but I am sure it could be developed further with greater links to relevant literature.

Overall, I think this is a strong study with a good contribution to make to the field of intelligence testing. I have reviewed this with the caveat that I am taking the statistical analysis to be solid (from what I am able to review, this does seem to be the case, but I also caution that it needs a more thorough review from someone who has a more solid background in these processes). In terms of the conceptualization within the study, I think there are a few areas that could be strengthened, and this would serve to bolster the study's purpose and practical implications and demonstrate that the authors have navigated the tensions in the field. Otherwise, once adequate statistical review is conducted, I believe this to be a publishable and interesting study.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revision successfully addresses various (often technical) issues. Overall, the study provides insights into the newly developed scale. However, the manuscript still deals heavily with test-taking motivation, going as far as stating in the abstract: "The findings suggest that tailoring matrix reasoning tests to specific domains can enhance test-taking motivation." (l. 22-24) Yet, test-taking motivation was not assessed in the two testing sessions. Such a claim is therefore unfounded.

Also, the new matrix reasoning test is supposed to improve the diagnostics of gifted students. The manuscript still lacks a proper definition of giftedness and a clear narrative of how the new test suits the assessment of giftedness.

On the other hand, long parts of the introduction and the discussion dwell on general aspects of intelligence testing. The bloated list of references (now 100+) does not align with the narrow scope of the present study. I recommend shortening the paragraphs containing details regarding established intelligence theories.

Minor points:

l. 245-247: Participants were given 20 minutes to complete the test in both versions. Please indicate whether both tests had the same number of items. If that is correct, I recommend clarifying that, in the second sample, students worked on 24 items from the 60-item pool. (The numbers in Table 1 suggest tests of similar length across the two samples.)

While non-response might indicate (a lack of) test-taking motivation, among other things, the authors write that "we treated no-response answers as missing data" (l. 560-561). This seems insufficient for assessing motivation; still, it might be of interest whether the missing responses were scattered throughout the test or occurred mostly towards the end, which would suggest a speed component in the reasoning test.
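
One simple way to examine this is to tabulate the proportion of missing responses at each item position. The following is a minimal sketch under stated assumptions, not the authors' analysis: it assumes that each student's responses are stored in presentation order with `None` marking a no-response answer, and both the function name `missing_rate_by_position` and the toy data are hypothetical.

```python
def missing_rate_by_position(responses):
    """Proportion of missing answers at each item position.

    responses: list of per-student lists in presentation order,
    where None marks a no-response answer (assumed coding).
    """
    if not responses:
        return []
    n_items = len(responses[0])
    if any(len(row) != n_items for row in responses):
        raise ValueError("all students must have the same number of item positions")
    n_students = len(responses)
    return [
        sum(row[i] is None for row in responses) / n_students
        for i in range(n_items)
    ]


# Toy data (invented for illustration): 4 students, 4 item positions.
toy = [
    [1, 0, 1, None],
    [1, 1, None, None],
    [0, 1, 1, 1],
    [1, None, 1, None],
]

rates = missing_rate_by_position(toy)
print(rates)
```

Rates that rise towards the last positions would be consistent with students running out of time, whereas rates that are roughly flat across positions would point to other causes of non-response.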

Author Response

Please see the attachment.

Author Response File: Author Response.pdf