Describing the Development of the Assessment of Biological Reasoning (ABR)
Round 1
Reviewer 1 Report
This article reports on the development and validation of a scientific reasoning assessment in the domain of biology. The research aligns well with the theme of this special issue and I recommend publication once the authors have addressed the feedback below.
Introduction and theoretical framework
The authors should give a specific definition of what scientific reasoning is and which skills it comprises. Merely criticizing previous definitions (p.3, lines 109-116) will not do in an article that examines the validity of a new test of scientific reasoning, simply because readers should know beforehand exactly which skills are assessed and why.
The argument presented in section 2.1 relies heavily on the works of Kind and Osborne. The authors could strengthen their view by referring to one or more chapters from Fischer et al. (2018) that address the association between scientific reasoning and disciplinary content knowledge from various perspectives. Likewise, the paragraphs on model-based reasoning would benefit from a broader framing beyond Zangori et al., for instance by addressing the research of Hmelo-Silver, Van Joolingen, Mulder, and Reiser.
Fischer, F. et al. (Eds.) (2018). Scientific reasoning and argumentation: The roles of domain-specific and domain-general knowledge. New York: Routledge.
What I missed in the Introduction and/or Discussion is a critical account of the pros and cons of written tests that make use of a multiple-choice item format. Both test characteristics are a known threat to internal validity—that is, it is not self-evident that such tests measure scientific reasoning as accurately as practical inquiry tasks would. This issue is touched on in section 2.3 where possible disadvantages of existing assessments are described; this discussion should be extended by addressing the ecological and construct validity of multiple-choice tests. The authors could, for instance, attend to the book by Harlen (2013), which contains an insightful overview of possible disadvantages of written tests of scientific knowledge and scientific reasoning.
Harlen, W. (2013). Assessment & inquiry-based science education: Issues in policy and practice. Trieste, Italy: Global Network of Science Academies.
Method
The ABR was developed through iterative rounds of testing, which is a highly recommended procedure. The authors describe their approach rather well, but some details should be added to ease the readers’ understanding of what exactly happened in the various testing rounds. For instance, the initial testing described in section 3.1.1 is ambiguous; is this the same as the initial testing described in section 3.1.2? Please clarify, and if it is a separate activity then please give some more demographic information on the participants.
Secondly, please explain the acronyms AP and IB (section 3.1.4). Additionally, be more specific here about the procedures used to determine whether there were problematic items that needed to be reconsidered.
Thirdly, there is no information about the expert screening that served to determine translational validity. Please add to the Method section.
Finally, the sample for the second (final) round of data collection is rather small given the quantitative nature of the analyses. I assume that this is the reason why the authors relied on classical test theory measures instead of performing item-response theory (IRT) modeling. This advanced technique has some obvious advantages the authors should acknowledge in the Discussion. In doing so, they may be more modest about their own validation efforts and should perhaps present their work as an ‘initial validation’.
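To make the CTT/IRT contrast concrete, the sketch below shows the core of the Rasch (1PL) model, the simplest IRT model; the ability and difficulty values are purely illustrative and are not drawn from the ABR data.

```python
import numpy as np

def rasch_probability(theta, b):
    """P(correct) under the Rasch (1PL) model.

    Unlike classical test theory, the model separates person ability
    (theta) from item difficulty (b), which is what lets item parameters
    generalize beyond the particular sample that was tested.
    """
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Illustrative values only: a moderately able student meets an easy item.
print(rasch_probability(theta=0.5, b=-1.0))  # about 0.82
```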
Results
The detailed descriptions of how the qualitative analyses informed the refinement of the ABR are really insightful. The quantitative analyses, however, are mostly presented in narrative form with minimal descriptive statistics. I strongly recommend that the authors include item characteristics derived under the classical test theory paradigm (p-values indicating item difficulty and point-biserial item-test correlations indicating item discrimination).
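To make this recommendation concrete, here is a minimal sketch of how both statistics are typically computed from a dichotomous (0/1) response matrix; the data and variable names below are placeholders, not the ABR responses.

```python
import numpy as np

# Placeholder 0/1 response matrix: rows are students, columns are items.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(120, 20))

total = responses.sum(axis=1)
for j in range(responses.shape[1]):
    item = responses[:, j]
    # CTT "p-value": proportion of students answering the item correctly.
    difficulty = item.mean()
    # Corrected point-biserial: correlate the item with the rest-score
    # (total minus the item) so the item is not correlated with itself.
    discrimination = np.corrcoef(item, total - item)[0, 1]
    print(f"item {j + 1:2d}: p = {difficulty:.2f}, r_pb = {discrimination:.2f}")
```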
Minor issue (p.12, line 522): ‘screen plot’ should be ‘scree plot’. More importantly, the ‘elbow’ in the scree plot is at the two-factor solution, so the authors cannot claim that the results lend support to a one-factor solution! I therefore recommend skipping the follow-up exploratory factor analyses (EFA) that tested two- and three-factor solutions, and instead running a confirmatory factor analysis with two factors (despite the modest sample size; this can be commented on in the Discussion). Item characteristics (e.g., eigenvalues) resulting from the EFA should be presented to indicate which items belong to which factors.
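For readers following the factor-analysis discussion, the sketch below shows where a scree plot's eigenvalues come from; the response matrix is simulated and merely stands in for the ABR data.

```python
import numpy as np

# Simulated stand-in for the ABR response matrix (students x items).
rng = np.random.default_rng(1)
responses = rng.integers(0, 2, size=(150, 20)).astype(float)

# A scree plot graphs the eigenvalues of the inter-item correlation
# matrix in descending order; the "elbow" marks where additional
# factors stop explaining much variance, and factors to its left
# are the candidates for retention.
corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(np.round(eigenvalues, 2))
```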
Please present the means and standard deviations in Tables 6 and 7 instead of percentages.
Discussion
The overall conclusion in the first line of the discussion is too strong. The authors present initial validity evidence in this paper, which should be backed up by future research with larger samples that enable more advanced validation methods such as IRT.
In addition, this section would benefit from a discussion on the uptake of the ABR in educational practice, which seems limited to classes that have studied the exact same disciplinary content as addressed in the tests.
Author Response
Dear Reviewer 1,
We very much appreciate the review of our manuscript. We have incorporated these comments and suggestions, as outlined point by point below. We feel that the revised manuscript now has a much stronger focus and message.
We look forward to your feedback and thank you for the valuable reviews you have provided.
Kind regards,
The authors
Introduction and theoretical framework

Reviewer Comment: The authors should give a specific definition of what scientific reasoning is and which skills it comprises.

Authors’ Response: A definition of scientific reasoning has been added to the first paragraph of the introduction.

Reviewer Comment: The argument presented in section 2.1 relies heavily on the works of Kind and Osborne. The authors could strengthen their view by referring to one or more chapters from Fischer et al. (2018) that address the association between scientific reasoning and disciplinary content knowledge from various perspectives.

Authors’ Response: Section 2.1 has been expanded to include a discussion of views of scientific reasoning related to broad applicability versus domain specificity, drawing on work presented in Fischer et al. (2018).

Reviewer Comment: The paragraphs on model-based reasoning would benefit from a broader framing beyond Zangori et al., for instance by addressing the research of Hmelo-Silver, Van Joolingen, Mulder, and Reiser.

Authors’ Response: The argument on page 4 has been broadened and bolstered by citing a wider array of research and by providing another example (e.g., Swartz et al., 2009).

Reviewer Comment: What I missed in the Introduction and/or Discussion is a critical account of the pros and cons of written tests that make use of a multiple-choice item format. Both test characteristics are a known threat to internal validity—that is, it is not self-evident that such tests measure scientific reasoning as accurately as practical inquiry tasks would. This issue is touched on in section 2.3, where possible disadvantages of existing assessments are described; this discussion should be extended by addressing the ecological and construct validity of multiple-choice tests. The authors could, for instance, attend to the book by Harlen (2013), which contains an insightful overview of possible disadvantages of written tests of scientific knowledge and scientific reasoning.

Authors’ Response: A discussion of the pros and cons of multiple-choice tests compared to other types of assessments has been added to the limitations section.

Method

Reviewer Comment: The ABR was developed through iterative rounds of testing, which is a highly recommended procedure. The authors describe their approach rather well, but some details should be added to ease the readers’ understanding of what exactly happened in the various testing rounds. For instance, the initial testing described in section 3.1.1 is ambiguous; is this the same as the initial testing described in section 3.1.2? Please clarify, and if it is a separate activity then please give some more demographic information on the participants.

Authors’ Response: Clarification has been added to section 3.1.1 by referencing the read-aloud section (3.1.2) and the focus group section (3.1.3).

Reviewer Comment: Please explain the acronyms AP and IB (section 3.1.4).

Authors’ Response: AP and IB have been spelled out as Advanced Placement and International Baccalaureate in the paper.

Reviewer Comment: Be more specific here about the procedures used to determine whether there were problematic items that needed to be reconsidered.

Authors’ Response: Clarifying text about problematic items has been added to the third paragraph of section 3.1.4.

Reviewer Comment: There is no information about the expert screening that served to determine translational validity. Please add to the Method section.

Authors’ Response: Information about the experts has been added to the third paragraph of the Method section.

Reviewer Comment: The sample for the second (final) round of data collection is rather small given the quantitative nature of the analyses. I assume that this is the reason why the authors relied on classical test theory measures instead of performing item-response theory (IRT) modeling. This advanced technique has some obvious advantages the authors should acknowledge in the Discussion. In doing so, they may be more modest about their own validation efforts and should perhaps present their work as an ‘initial validation’.

Authors’ Response: We agree with the reviewer that this paper presents preliminary evidence for the instrument's validity, and we chose to run CTT analyses this round to get the instrument into a more stable state before beginning data collection on a sample large enough for IRT models. We have clarified in the paper our reasoning for selecting CTT analyses for this round, and have emphasized throughout the paper that the validity evidence presented here is preliminary and that more rigorous testing is planned for the future.

Results

Reviewer Comment: The quantitative analyses are mostly presented in narrative form with minimal descriptive statistics. I strongly recommend that the authors include item characteristics derived under the classical test theory paradigm (p-values indicating item difficulty and point-biserial item-test correlations indicating item discrimination).

Authors’ Response: We have added tables with item difficulty and discrimination, as requested.

Reviewer Comment: Minor issue (p.12, line 522): ‘screen plot’ should be ‘scree plot’. More importantly, the ‘elbow’ in the scree plot is at the two-factor solution, so the authors cannot claim that the results lend support to a one-factor solution! I therefore recommend skipping the follow-up exploratory factor analyses (EFA) that tested two- and three-factor solutions, and instead running a confirmatory factor analysis with two factors (despite the modest sample size; this can be commented on in the Discussion). Item characteristics (e.g., eigenvalues) resulting from the EFA should be presented to indicate which items belong to which factors.

Authors’ Response: We have added information to the text to clarify how to interpret the presented scree plot. Only factors to the left of the elbow point should be considered significant.

Reviewer Comment: Please present the means and standard deviations in Tables 6 and 7 instead of percentages.

Authors’ Response: We have updated Tables 6 and 7 (now Tables 7 and 8) with means and standard deviations, as requested.

Discussion

Reviewer Comment: The overall conclusion in the first line of the discussion is too strong. The authors present initial validity evidence in this paper, which should be backed up by future research with larger samples that enable more advanced validation methods such as IRT.

Authors’ Response: We have reworded this line to indicate that the evidence is preliminary.

Reviewer Comment: In addition, this section would benefit from a discussion on the uptake of the ABR in educational practice, which seems limited to classes that have studied the exact same disciplinary content as addressed in the tests.

Authors’ Response: A discussion of the limitation of the ABR’s applicability to biology classrooms has been added to the Limitations and Implications sections.
Reviewer 2 Report
This study can contribute to science education in general, and particularly to school science. As the authors point out, multiple-choice assessment tools have limitations, and qualitative assessment may be more reliable for obtaining rich, detailed results. However, teachers in science classrooms need a reliable, convenient, and scalable assessment tool. Using the ABR can help teachers gauge their students’ understanding and plan for the next step.
A few points in the Method section need to be clarified.
1) Each data set was collected for a specific purpose, in sequential order: qualitative data for wording and coherence issues, a first large-sample round for item adjustment, and a second large-sample round for the final factor analysis. The Results section describes what was found and how the items were revised. However, it is not clear whether the result of each analysis was used to revise the items for the next data collection. For example, if the wording was improved based on the interview results, were the revised items used for the large data collection? What differences were made to the items between the first and second large data collections? The sequential revision is implied but not clearly described.
2) What was the teachers’ role? It is assumed that these teachers administered the ABR and collected data. Is that their only role, or were they involved further in the research?
3) The interview data show participants’ gender and race, but the data were neither discussed nor analyzed in these terms. Is presenting this demographic information necessary for this study? If the purpose is to show the diversity of the sample, Table 1 does not show it.
Author Response
Dear Reviewer 2,
We very much appreciate the review of our manuscript. We have incorporated these comments and suggestions, as outlined point by point below. We feel that the revised manuscript now has a much stronger focus and message.
We look forward to your feedback and thank you for the valuable reviews you have provided.
Kind regards,
The authors
Reviewer Comment: The Results section describes what was found and how the items were revised. However, it is not clear whether the result of each analysis was used to revise the items for the next data collection. For example, if the wording was improved based on the interview results, were the revised items used for the large data collection? What differences were made to the items between the first and second large data collections? The sequential revision is implied but not clearly described.

Authors’ Response: Clarifying text on the iterative nature of the revisions after each round of analysis, and before each subsequent data collection, has been added to section 3.2.1.

Reviewer Comment: What was the teachers’ role? It is assumed that these teachers administered the ABR and collected data. Is that their only role, or were they involved further in the research?

Authors’ Response: A description of teachers #1 and #2’s roles as experts has been added to the third paragraph of the Method section.

Reviewer Comment: The interview data show participants’ gender and race, but the data were neither discussed nor analyzed in these terms. Is presenting this demographic information necessary for this study? If the purpose is to show the diversity of the sample, Table 1 does not show it.

Authors’ Response: Demographic information has been removed from the text, and the associated tables have been deleted.
Round 2
Reviewer 1 Report
The authors have addressed all of my comments in this resubmission, and I was satisfied by the detailed additions to the text.
While reading the article I noticed two truly minor issues that should be fixed before the article goes to press:
- Line 136: Shavelson (2018) argued
- Placement of Tables 3-5: these tables present results and should therefore be presented not in the Method section but in the Results section. Please also consider combining the results of the three tiers into a single table, as was done in Table 6.
Author Response
Dear Reviewer 1,
Thank you for your comments on our manuscript. We have changed “argue” to “argued” in line 136. Additionally, we have combined Tables 3-5 into Table 3 and renamed subsequent tables accordingly. Table 3 can be found on page 12 starting at line 620.
Kind regards,
The Authors