The Child-Focused Injury Risk Screening Tool (ChildFIRST) Demonstrates Greater Reliability When Using a Dichotomous Scale vs. a Seven-Point Likert Scale, and Is Preferred by Raters
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This is an interesting article, well-presented, but it needs some refinement.
Which software did you use for computing ICC?
You state:
"The dichotomous scale showed greater overall inter-rater reliability scores for each movement skill when looking at the composite scores for both days of testing."
Where are the composite scores for both days of testing? They are missing from the tables.
You should compare statistically (perhaps with a non-parametric method, given the small number of raters) the discrepancies between the evaluations of each rater (on each criterion/item) to conclude that the dichotomous evaluation had higher reliability than the seven-point scale evaluation.
How did you define the required number of raters?
Author Response
C1: This is an interesting article, well-presented, but it needs some refinement.
R1: Thank you.
C2: Which software did you use for computing ICC?
R2: We added that we used Python 3.11 (Python Software Foundation, https://www.python.org/) for all statistical analyses.
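For readers wishing to reproduce the reliability analysis, the sketch below shows one way a two-way random-effects ICC could be computed in Python with numpy. It is an illustrative assumption only, not the authors' actual analysis code: the ICC(2,1) form, the icc_2_1 helper, and the example data are all hypothetical.

```python
# Minimal sketch (not the authors' exact code): ICC(2,1), two-way random
# effects, absolute agreement, single rater, computed from scratch with numpy.
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1) for an (n_targets, k_raters) matrix of scores."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-target (child) means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_error = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-targets mean square
    msc = ss_cols / (k - 1)               # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    # Shrout & Fleiss (1979) ICC(2,1)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Fabricated example: composite scores for 8 children from 3 raters
rng = np.random.default_rng(1)
ability = rng.normal(10, 3, size=(8, 1))          # latent child ability
scores = ability + rng.normal(0, 1, size=(8, 3))  # rater noise around it
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```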
C3: You state: "The dichotomous scale showed greater overall inter-rater reliability scores for each movement skill when looking at the composite scores for both days of testing."
Where are the composite scores for both days of testing? They are missing from the tables.
R3: In the tables (Table 1, dichotomous; Table 2, seven-point Likert), each skill has reliability values. The inter-rater ICCs for each day and the intra-rater ICCs are provided for the composite scores and for the evaluation criteria.
C4: You should compare statistically (perhaps with a non-parametric method, given the small number of raters) the discrepancies between the evaluations of each rater (on each criterion/item) to conclude that the dichotomous evaluation had higher reliability than the seven-point scale evaluation.
R4: We performed this evaluation across all movement skill evaluation criteria; the results are in the second-to-last line of each table.
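To make the reviewer's suggestion concrete, a minimal sketch of such a non-parametric comparison is given below, assuming paired per-criterion reliability values for the two scales; the icc_dichotomous and icc_likert7 arrays are fabricated placeholders, not values from the paper.

```python
# Sketch only (not the authors' analysis): Wilcoxon signed-rank test on
# paired per-criterion ICC values for the two scales. Data are fabricated.
from scipy.stats import wilcoxon

icc_dichotomous = [0.81, 0.74, 0.69, 0.77, 0.83, 0.72, 0.79, 0.75]
icc_likert7     = [0.72, 0.70, 0.66, 0.74, 0.71, 0.68, 0.73, 0.69]

stat, p = wilcoxon(icc_dichotomous, icc_likert7)  # paired, non-parametric
print(f"Wilcoxon W = {stat:.1f}, p = {p:.4f}")
```

A paired test of this kind suits a small number of raters or items because it makes no normality assumption about the reliability differences.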
C5: How did you define the required number of raters?
R5: We used prior studies to determine the number of raters. We clarified this process in the paper by adding: "Our decision on the number of raters and the ICC procedures aligns with the prior validation of the ChildFIRST tool, allowing direct comparison with earlier findings while maintaining methodological consistency [11]."
Reviewer 2 Report
Comments and Suggestions for Authors
The article presents a timely, relevant, and well-structured investigation into the reliability and usability of two scoring systems, dichotomous and seven-point Likert scales, for the ChildFIRST movement screening tool. The title is informative and clearly states the focus of the paper, though it could benefit from slightly more concise phrasing. The abstract effectively summarizes the study’s aims, methods, and main findings, but it is somewhat dense and would be more effective with simpler, more direct language and fewer embedded statistics. As it stands, it risks overwhelming the reader before the main text begins.
The introduction provides a strong contextual foundation by outlining the increasing rates of inactivity in children and the associated health risks. However, the narrative could be more focused. The hypothesis suggesting that the Likert scale would improve reliability is not fully aligned with the background discussion, which more accurately emphasizes Likert scales’ potential to improve sensitivity or resolution rather than consistency. The difference between sensitivity and reliability, and their possible trade-off, is an important nuance that should be more clearly explained. Additionally, the transition from the general background on physical inactivity and physical literacy to the specific focus on the ChildFIRST tool and the comparison of scoring systems could be made more seamless. For instance, after introducing the concept of physical literacy and the importance of movement competence for injury prevention, the manuscript shifts rather abruptly into a description of the ChildFIRST tool and its scoring method. The reader would benefit from a clearer connective paragraph that explicitly frames why evaluating the scoring scale is important in this context. A concrete example of this would be to insert a bridging sentence such as: “Given the need for efficient and accurate assessment tools to identify movement deficits that may contribute to injury risk, it is critical to evaluate not only what is being assessed but how it is scored.” This would help link the conceptual discussion of physical literacy and injury risk with the methodological question at the heart of the study, namely, whether a dichotomous or Likert scale format is better suited to reliably capturing movement competence in children. Currently, that logical step is implied but not fully articulated, which weakens the narrative flow and may leave readers unclear on why this comparison of scoring systems is both timely and necessary.
The methods section is clear and appropriate but the justification for using a seven-point Likert scale formatted around “agreement” rather than performance quality is weak. This decision could have introduced unnecessary subjectivity, and the authors should have provided a stronger rationale or clearer operational definitions for how raters were instructed to interpret each point on the scale. Although the study design is sound, it suffers from an ordering effect: the dichotomous scale was always used first on the first day of testing, which may have affected results due to fatigue or learning effects.
Results. The paper presents extensive tables with intraclass correlation coefficients (ICCs), comparing reliability across both scales and both testing sessions. The data are robust and well-analyzed using appropriate statistical methods, but the presentation is quite dense and could be improved by summarizing key differences more visually—for instance, using graphs or comparative summary tables to highlight which movement skills showed the greatest discrepancies between scales.
The discussion is comprehensive and reflects thoughtfully on the findings. It appropriately acknowledges the mixed reliability results and considers why certain movement skills—such as the single-leg hop—might benefit from a more nuanced scale. However, the language is at times repetitive and informal, and the core insights could be more tightly presented. There is an over-reliance on speculative phrasing (“could be due to,” “might explain”), which weakens the strength of the conclusions. While the authors correctly identify that raters overwhelmingly preferred the dichotomous scale, they miss an opportunity to reflect more deeply on the practical trade-offs between precision and ease of use. Additionally, the discussion could be enriched by engaging more directly with related tools or studies in the field, offering broader comparisons.
The limitations section could be better structured to separate procedural limitations from broader conceptual concerns, and the conclusion repeats much of what has already been stated in the discussion. A more impactful conclusion would emphasize the practical implications for coaches, educators, or health professionals using the tool in real-world settings. Furthermore, while the authors briefly mention the potential value of exploring three- or five-point scales, this idea is underdeveloped and could be more clearly framed as a recommendation for future work.
The formatting of the materials in the final part is inconsistent in places, which can hinder readability. In Table A2, for instance, some cells include full sentences (e.g., “Keep the heels down all the time”), while others use more telegraphic phrases (e.g., “Hips, knees and ankles aligned”), and the punctuation style is uneven. Moreover, the use of terminology could be standardized. The phrase “bend to land softly in a controlled fashion” appears repeatedly but is sometimes phrased slightly differently across movements, which may confuse readers or raters. Clarifying and unifying this language would enhance the usability of the tool and improve the clarity of training materials for future users.
Line 149
Confusing formatting of Likert options: “Strongly Agree” is listed twice in the 7-point scale.
The scale uses agreement (e.g., "Strongly Agree") rather than performance quality (e.g., "Excellent" to "Poor"). This introduces semantic confusion for raters. It could be redesigned or rephrased to match performance-assessment logic, e.g., using a "No achievement" to "Full achievement" scale, which would be more intuitive. Were raters trained to interpret agreement levels in relation to performance? What anchoring examples were given? This could be clarified in the article.
In summary, this is a valuable contribution to the field of pediatric movement assessment and injury prevention. With few changes, the manuscript would be well-suited for publication.
Author Response
C1: The article presents a timely, relevant, and well-structured investigation into the reliability and usability of two scoring systems, dichotomous and seven-point Likert scales, for the ChildFIRST movement screening tool. The title is informative and clearly states the focus of the paper, though it could benefit from slightly more concise phrasing. The abstract effectively summarizes the study’s aims, methods, and main findings, but it is somewhat dense and would be more effective with simpler, more direct language and fewer embedded statistics. As it stands, it risks overwhelming the reader before the main text begins.
The introduction provides a strong contextual foundation by outlining the increasing rates of inactivity in children and the associated health risks. However, the narrative could be more focused. The hypothesis suggesting that the Likert scale would improve reliability is not fully aligned with the background discussion, which more accurately emphasizes Likert scales’ potential to improve sensitivity or resolution rather than consistency. The difference between sensitivity and reliability, and their possible trade-off, is an important nuance that should be more clearly explained. Additionally, the transition from the general background on physical inactivity and physical literacy to the specific focus on the ChildFIRST tool and the comparison of scoring systems could be made more seamless. For instance, after introducing the concept of physical literacy and the importance of movement competence for injury prevention, the manuscript shifts rather abruptly into a description of the ChildFIRST tool and its scoring method. The reader would benefit from a clearer connective paragraph that explicitly frames why evaluating the scoring scale is important in this context. A concrete example of this would be to insert a bridging sentence such as: “Given the need for efficient and accurate assessment tools to identify movement deficits that may contribute to injury risk, it is critical to evaluate not only what is being assessed but how it is scored.” This would help link the conceptual discussion of physical literacy and injury risk with the methodological question at the heart of the study, namely, whether a dichotomous or Likert scale format is better suited to reliably capturing movement competence in children. Currently, that logical step is implied but not fully articulated, which weakens the narrative flow and may leave readers unclear on why this comparison of scoring systems is both timely and necessary.
R1: The Child Focused Injury Risk Screening Tool (ChildFIRST) is a process-based physical literacy assessment tool used to measure physical competence that was developed to screen for high-risk technique errors that predispose children aged 8-12 to lower limb injury.
Recent studies have shown Likert scales to be susceptible to middle and extreme response bias, which can reduce inter-rater reliability [20], and these elements need to be considered when choosing an evaluation method. Given the need for efficient and accurate assessment tools to identify movement deficits, it is critical to evaluate not only what is being assessed but how it is scored. (added italicized text)
C2: The methods section is clear and appropriate but the justification for using a seven-point Likert scale formatted around “agreement” rather than performance quality is weak. This decision could have introduced unnecessary subjectivity, and the authors should have provided a stronger rationale or clearer operational definitions for how raters were instructed to interpret each point on the scale. Although the study design is sound, it suffers from an ordering effect: the dichotomous scale was always used first on the first day of testing, which may have affected results due to fatigue or learning effects.
R2: We clarified that we in fact tested days 1 and 2 in different orders, not, as the reviewer assumed, with the dichotomous scale first on both days.
C3: Results. The paper presents extensive tables with intraclass correlation coefficients (ICCs), comparing reliability across both scales and both testing sessions. The data are robust and well-analyzed using appropriate statistical methods, but the presentation is quite dense and could be improved by summarizing key differences more visually—for instance, using graphs or comparative summary tables to highlight which movement skills showed the greatest discrepancies between scales.
R3: We would be pleased to adjust Table 3 (rater satisfaction) to a bar graph if the editors would prefer this.
C4: The discussion is comprehensive and reflects thoughtfully on the findings. It appropriately acknowledges the mixed reliability results and considers why certain movement skills—such as the single-leg hop—might benefit from a more nuanced scale. However, the language is at times repetitive and informal, and the core insights could be more tightly presented. There is an over-reliance on speculative phrasing (“could be due to,” “might explain”), which weakens the strength of the conclusions.
R4: We revised the text to reduce the use of speculative language.
C5: While the authors correctly identify that raters overwhelmingly preferred the dichotomous scale, they miss an opportunity to reflect more deeply on the practical trade-offs between precision and ease of use. Additionally, the discussion could be enriched by engaging more directly with related tools or studies in the field, offering broader comparisons.
R5: Given the similarity of reliability using each scale, we relied on the user preference to drive the decision to retain the dichotomous scale. As the ChildFIRST is a process-based tool and uniquely attempts to bring injury risk into the assessment, we did not believe it was of value to relate our reliability to other tools that are designed with a different purpose.
C6: The limitations section could be better structured to separate procedural limitations from broader conceptual concerns, and the conclusion repeats much of what has already been stated in the discussion. A more impactful conclusion would emphasize the practical implications for coaches, educators, or health professionals using the tool in real-world settings. Furthermore, while the authors briefly mention the potential value of exploring three- or five-point scales, this idea is underdeveloped and could be more clearly framed as a recommendation for future work.
R6: We improved the conclusion to be less repetitive and added emphasis on the practical implications.
We had made the suggestion for future research to determine whether a 3- or 5-point scale might improve consistency while giving more options than a dichotomous choice. However, the reliability between 3-, 5-, and 7-point scales is similar, so this comparison may not be of value. We removed the suggestion.
C7: The formatting of the materials in the final part is inconsistent in places, which can hinder readability. In Table A2, for instance, some cells include full sentences (e.g., “Keep the heels down all the time”), while others use more telegraphic phrases (e.g., “Hips, knees and ankles aligned”), and the punctuation style is uneven. Moreover, the use of terminology could be standardized. The phrase “bend to land softly in a controlled fashion” appears repeatedly but is sometimes phrased slightly differently across movements, which may confuse readers or raters. Clarifying and unifying this language would enhance the usability of the tool and improve the clarity of training materials for future users.
R7: We appreciate the overlapping language of the tool; however, we are using the language of the original tool, based on the Delphi model published by Jimenez-Garcia et al. (MPEES 2020).
C8: Line 149
Confusing formatting of Likert options: “Strongly Agree” is listed twice in the 7-point scale.
The scale uses agreement (e.g., "Strongly Agree") rather than performance quality (e.g., "Excellent" to "Poor"). This introduces semantic confusion for raters. It could be redesigned or rephrased to match performance-assessment logic, e.g., using a "No achievement" to "Full achievement" scale, which would be more intuitive. Were raters trained to interpret agreement levels in relation to performance? What anchoring examples were given? This could be clarified in the article.
R8: We corrected the twice-listed Likert option. The descriptor ("Excellent" to "Poor") suggestion is an interesting option. While we cannot go back and change this, we see how those 'achievement' (vs. 'agreement') anchors could be more intuitive. However, we did describe agreement in relation to performance, adding the line: “We instructed raters to score each evaluation criteria using the seven-point scale based on their agreement with the level of achievement of each evaluation criteria.” During the training sessions, there were anchoring-style videos and examples of achieved and unachieved skill criteria.
C9: In summary, this is a valuable contribution to the field of pediatric movement assessment and injury prevention. With few changes, the manuscript would be well-suited for publication.
R9: Thank you. We hope the changes are agreeable and we can move forward to publication.
Reviewer 3 Report
Comments and Suggestions for Authors
This article evaluates the reliability and usability of two scoring scales—dichotomous and seven-point Likert—for the Child Focused Injury Risk Screening Tool (ChildFIRST), a process-based tool designed to assess movement competence and injury risk in children aged 8–12. Fourteen trained raters evaluated videos of children performing movement tasks. The findings indicate that while both scales provide comparable inter-rater reliability, the dichotomous scale offers slightly better intra-rater consistency and is strongly preferred by raters for its practicality and ease of use. The authors conclude that the dichotomous scale is better suited for field deployment of the ChildFIRST tool.
The study addresses an important issue in pediatric physical health—identifying children at risk of injury due to poor movement competence, a key component of physical literacy. The use of video-based evaluation and ICC (Intraclass Correlation Coefficient) analysis to compare reliability across scales is methodologically sound and appropriate. By directly comparing two scales within the same cohort, the study offers practical insights into real-world applicability. Generally, the article has a clear and logical flow from introduction to conclusion, making it easy to follow.
Areas for improvement:
The study suffers from limited generalizability due to a small and homogeneous sample—only eight children (from a single Taekwondo club) and 14 raters (all from a kinesiology program), with no discussion of cultural, gender, or physical diversity. Additionally, the tool's predictive validity remains unverified, as there is no linkage between ChildFIRST scores and actual injury outcomes.
The seven-point Likert scale may not be well-suited for evaluating physical performance, as it relies on subjective agreement rather than clear behavioral criteria, introducing ambiguity.
The study design also introduces potential rater fatigue bias, since the Likert scale was always administered after the dichotomous scale in the first session. Furthermore, while informed consent is stated, there is insufficient information on how participant confidentiality and data security were ensured—an essential ethical consideration when involving minors.
Recommendations
Accept with Minor Revisions.
The article is a solid contribution to the field of pediatric physical assessment and injury prevention. It is especially valuable for practitioners who need reliable, quick-to-administer tools. The study is methodologically sound, practically relevant, and well-executed despite some limitations in sample diversity and scale design.
Required Revisions:
- Address the limitations of sample diversity and potential rater fatigue more explicitly.
- Reconsider or expand discussion around the structure of the Likert scale and whether a 3- or 5-point version might offer a more feasible middle ground.
- Perform a thorough language edit to correct grammar inconsistencies and improve flow.
This article should be published after these minor corrections, particularly because it provides both empirical and practical guidance for optimizing injury risk assessments in children using a reliable and user-friendly tool.
Comments for author File: Comments.pdf
Author Response
Areas for improvement:
C1: The study suffers from limited generalizability due to a small and homogeneous sample—only eight children (from a single Taekwondo club) and 14 raters (all from a kinesiology program), with no discussion of cultural, gender, or physical diversity. Additionally, the tool's predictive validity remains unverified, as there is no linkage between ChildFIRST scores and actual injury outcomes.
R1: We previously used these sample sizes (Miller et al., 2020); however, we recognise that the raters all had a general understanding of anatomy and basic movement skills. The training completed with the raters, we believe, helped them reach a consistent knowledge base for rating the skills. The training is a key component that aids raters' ability to accurately use the tool.
C2" The seven-point Likert scale may not be well-suited for evaluating physical performance, as it relies on subjective agreement rather than clear behavioral criteria, introducing ambiguity.
R2: This is an important point, which we feel is a partial rationale for the study. The dichotomous scale is closer (or should be) to judging a skill as achieved or not. We believe the dichotomous scale was preferred because it makes achievement easier to judge.
C3: The study design also introduces potential rater fatigue bias, since the Likert scale was always administered after the dichotomous scale in the first session.
R3: We are unsure how this confusion arose; our paper states: “On day 1, raters watched the videos of participants performing the movements and scored them using the dichotomous scale. We played each video twice with a 20 second pause between recordings to allow raters enough time to score the movement skills. After a five-minute rest period, the raters watched the same videos in a counterbalanced order and scored the movement skills with the seven-point scale. The first day of testing lasted 2.5 hours in total. On the second day, the raters participated in a refresher course of the ChildFIRST skills, evaluation criteria, and scoring options. The raters then scored the movement skills first using the seven-point scale, followed by the dichotomous scale.” (added emphasis)
C4: Furthermore, while informed consent is stated, there is insufficient information on how participant confidentiality and data security were ensured—an essential ethical consideration when involving minors.
R4: Under ‘study design’ we indicate:
Following the institutional ethics approval, the children in the recordings provided assent and their guardians provided informed consent prior to using their videos in the study. Raters provided informed consent prior to participation. The raters had no access to names or other demographic information of the children. The videos were shown on a screen, and the videos were not provided to the raters. (italicized sentences added)
Under ‘participants and videos’
The identity of the participants in the recordings remained anonymous and all procedures were approved by the institutional human research ethics committee.
Recommendations
Accept with Minor Revisions.
The article is a solid contribution to the field of pediatric physical assessment and injury prevention. It is especially valuable for practitioners who need reliable, quick-to-administer tools. The study is methodologically sound, practically relevant, and well-executed despite some limitations in sample diversity and scale design.
Required Revisions:
C5: Address the limitations of sample diversity and potential rater fatigue more explicitly.
R5: Added to paper: The children were representative of a variety of races, shapes, and sizes, but we did not record such information.
Regarding fatigue, we state: “The poor reliability on the first day of testing for the seven-point scale could be in part due to rater fatigue and decreased attention. The first testing session included a training session, and the seven-point scale was used second, after the dichotomous scale. The attention span in university students decreases steadily after 20 minutes of exposure to the same subject [26]. The second day of testing had no training, only a refresher, which reduced the total time of the session. Therefore, the length of the session may not have affected the dichotomous scale as much, even though it was in the second part of the session.”
C6: Reconsider or expand discussion around the structure of the Likert scale and whether a 3- or 5-point version might offer a more feasible middle ground.
R6: We do not have information that suggests a 3- or 5-point scale would be better, simply that it could be tested. Given the similarities in reliability of the 7-point and dichotomous scales, this suggestion seems unwarranted. We removed this suggestion.
C7: Perform a thorough language edit to correct grammar inconsistencies and improve flow.
R7: We reviewed the text of the paper and attempted to improve flow and grammar, considering all suggested edits from the reviewers. We used grammar and spelling checks to avoid errors and inconsistencies. The first and last authors are native English users, and the other author's first language is Spanish, though they are trilingual. If there are errors, please be specific in identifying them and we will be happy to review/correct.
C8: This article should be published after these minor corrections, particularly because it provides both empirical and practical guidance for optimizing injury risk assessments in children using a reliable and user-friendly tool.
R8: Thank you. We hope the changes are agreeable and we can move forward to publication.
Reviewer 4 Report
Comments and Suggestions for Authors
Dear Authors,
thank you for submitting this interesting manuscript. I believe that the type of research you conducted (comparing different measures) is generally very worthy. For the present case, I have a couple of reservations, though. Please see below for details. I will not make a specific recommendation but suggest either a substantial revision or a rejection.
Major Comments:
My main concerns are about the number of subjects and the seeming lack of videos from children with clear movement problems -- the latter to assess the merit of the scales for cases where children may show serious problems. Both concerns, of course, could be resolved with additional data.
The rest of my comments are comparatively unimportant. I report them below.
- I had a problem following the first part of the introduction. In Line 55 a new paragraph starts that comes a bit out of nothing for me. Suggestion: Either integrate things so that it is clear why you report them or delete everything up to line 68 and simply start with "The Child Focused..."
- Generally you might consider splitting some paragraphs (long ones only).
- Section 2.2: Did you consider (and test for) order effects?
- Section 2.3: In 2.2. you directly mention the number of participants. Why not here?
- Section 3: I believe here the data problem arises. That needs at least to be addressed at length.
- Just above 3.2: mean experience is 3.71 +/- 4.11. I suppose you have some outliers with rather long experience? Negative values at least appear odd. It would be helpful if you could describe the data in a bit more detail.
- You mention that subjects could be more experienced for later assessments. Would that not suggest having them do some test sessions beforehand? In applications, I would assume that practitioners are experienced, so that differences due to learning should not matter. That could at least be commented on more explicitly.
- I would have liked a table with a brief description of the tasks in the main body of the paper not just in the appendix.
- The discussion could be more focused and perhaps benefit from an illustration/table.
- In 4.1. you mention that some videos may be difficult to assess due to the clothing of the children and that this may have affected the result. This could be easily tested for by just running the same analysis without the data points for the respective children and movements.
Author Response
C1: My main concerns are about the number of subjects and the seeming lack of videos from children with clear movement problems -- the latter to assess the merit of the scales for cases where children may show serious problems. Both concerns, of course, could be resolved with additional data.
R1: Thank you for this comment. We have clarified the paper to demonstrate the variety of children used (race, shape, and size); however, it is true that all were somewhat active. There were a variety of skill achievement levels as well, but with a convenience sample, the main point remained whether the raters could demonstrate the ability to discern the skill of the child consistently. We added to the limitations section: “Additionally, while children were of a variety of skill levels, all were active. Increasing the pool of participants in future studies to include less skilled or elite athletes would aid the generalizability of the tool.”
The rest of my comments are comparatively unimportant. I report them below.
C2: I had a problem following the first part of the introduction. In Line 55 a new paragraph starts that comes a bit out of nothing for me. Suggestion: Either integrate things so that it is clear why you report them or delete everything up to line 68 and simply start with "The Child Focused..."
R2: We feel it is important to introduce the aspects of product vs process in assessment tools, which is why we list two tools. We did create a new paragraph to describe the ChildFIRST which we hope will be less abrupt (and which addresses the reviewer’s 2nd point).
C3: Generally you might consider splitting some paragraphs (long ones only).
R3: Some paragraph breaks added.
C4: Section 2.2: Did you consider (and test for) order effects?
R4: We clarified that we in fact tested days 1 and 2 in different orders, not, as the reviewer assumed, with the dichotomous scale first on both days.
C5: Section 2.3: In 2.2. you directly mention the number of participants. Why not here?
R5: We have this information in the results/rater demographics section.
C6: Section 3: I believe here the data problem arises. That needs at least to be addressed at length.
R6: We believe the added clarity in our procedures and methods helps to explain our decisions regarding the data and participant numbers.
C7: Just above 3.2: mean experience is 3.71 +/- 4.11. I suppose you have some outliers with rather long experience? Negative values at least appear odd. It would be helpful if you could describe the data in a bit more detail.
R7: We indicate that the PhD student, an outlier among the raters, skewed the mean.
C8: You mention that subjects could be more experienced for later assessments. Would that not suggest having them do some test sessions beforehand? In applications, I would assume that practitioners are experienced, so that differences due to learning should not matter. That could at least be commented on more explicitly.
R8: The raters may only have been more experienced because the Day 2 session was not their first time. The raters all had similar experience, with three (two MSc candidates and one PhD student) having more. However, our point was that reliability generally improved for Day 2, indicating that training and practice with the tool help to improve consistency.
C9: I would have liked a table with a brief description of the tasks in the main body of the paper not just in the appendix.
R9: We are happy to have this table in the main paper, if the editors wish this as well. The tool and movement criteria can be found in other papers in which the tool was developed. We wished to maintain that distinction.
C10: The discussion could be more focused and perhaps benefit from an illustration/table.
R10: We felt the tables were suitable to demonstrate the distinction for reliability and preference. We can change the preference table to a bar graph or some other visual if the editors prefer.
C11: In 4.1. you mention that some videos may be difficult to assess due to the clothing of the children and that this may have affected the result. This could be easily tested for by just running the same analysis without the data points for the respective children and movements.
R11: We wished to make our testing session as consistent with a true experience as possible, so we left these videos in our analysis. While we mention it as a possible limitation, each rater still had to score the skills, and we believe including these videos adds real-world consistency.
Round 2
Reviewer 4 Report
Comments and Suggestions for Authors
Dear Authors,
I have read the revised version and your comments. My conclusion is that I strongly dislike the way you reject and ignore important questions by just saying that you wanted it like that. The way you deal with crucial questions is highly unscientific and, in my view, disrespectful to the reviewer.
In fact, my impression is that in your revised manuscript you try to drop issues that might need further work to clarify (and make the claims a bit more scientifically sound) instead of dealing with them. I therefore will recommend to reject the paper.
I will provide some further brief comments below in case the editor does not follow my suggestion. Should I come to see the paper again without all (!) these issues (and the ones you avoided from the first report) being addressed properly, you can be sure I will recommend rejection again without further comment.
Major comments:
Section 2.2:
You recruited 8 (in words eight) children of age 8-12 from a Taekwondo club. Then you say they were representative of a variety of this and that and in addition refer to observations you did not record. That's how you conceive of proper scientific method?
As you want to test your scales, I am willing to accept the highly special group. But the problems need to be acknowledged and not talked away. Apparently, these children are all used to regular exercising.
Section 2.3
There is loads of information lacking about raters.
Section 2.4
Differences in procedures between day 1 and 2 vanish in the text. You should be open about the change in order and that should be easily visible even on a quick reading. I suggest two paragraphs for the two days.
Section 3.1
14 raters is extremely few. That MUST be acknowledged before presenting the results.
How can only the PhD student skew the mean if you also have two MSc candidates?
Section 3.2/3.3
Days should be visible (paragraphs?)
What became of the trousers and possible differences in ratings? It may be that real life is like that. But if you want to assess the scales, results should not depend on such chance events. Or do you intend to claim that the clothing was representative of children in Taekwondo clubs? Your area? Western countries?
I still think that a description of the tasks ought to be brief and in the main body of the text (and giving credit to original papers in the table).
Minor comments:
I find it strategically not very clever to keep the first paragraph extensively long after the reviewer has asked for more clarifications through more concise paragraphs. Line 43 offers itself for a split.
Author Response
Comment 1: I have read the revised version and your comments. My conclusion is that I strongly dislike the way you reject and ignore important questions by just saying that you wanted it like that. The way you deal with crucial questions is highly unscientific and, in my view, disrespectful to the reviewer.
In fact, my impression is that in your revised manuscript you try to drop issues that might need further work to clarify (and make the claims a bit more scientifically sound) instead of dealing with them. I therefore will recommend to reject the paper.
I will provide some further brief comments below in case the editor does not follow my suggestion. Should I come to see the paper again without all (!) these issues (and the ones you avoided from the first report) being addressed properly, you can be sure I will recommend rejection again without further comment.
Response 1: We apologize for coming across as disrespectful; it was not intentional. We did try to address the comments of all four reviewers consistently and fully. It is true that we made some changes but did not follow all suggestions of all reviewers. Generally, for the points raised as "major comments" we tried to clarify and/or edit the manuscript to the reviewers' suggestions. When the comments were deemed minor, we still addressed them, at various depths, to retain the spirit of the message. We did take these comments seriously and used them to improve the manuscript.
Major comments:
Section 2.2:
Comment 2: You recruited 8 (in words eight) children of age 8-12 from a Taekwondo club. Then you say they were representative of a variety of this and that and in addition refer to observations you did not record. That's how you conceive of proper scientific method?
As you want to test your scales, I am willing to accept the highly special group. But the problems need to be acknowledged and not talked away. Apparently, these children are all used to regular exercising.
Response 2: We did have a very specialised cohort in that they were active, and it was a sample of convenience. We were attempting to address another reviewer’s comment about sample diversity. We removed that line, added that they were active, and addressed it in the limitations section. We added in Limitations: We did not collect race or ethnicity data of our participants, which may also reduce generalizability.
Section 2.3
Comment 3: There is loads of information lacking about raters.
Response 3: We realise that paragraph is limited; however, we include additional information in subsequent sections (see below). We are happy to adjust this paragraph. Would the reviewer be willing to suggest what information would be helpful for the reader that fits this portion of the manuscript? For example, we could combine some or all of the information below, but we think it should remain in the respective sections. We are willing to change this if the editor deems it appropriate.
Note: We did misunderstand the comment about age vs experience and apologize for this error. We note our response was specific to the age standard deviation and not the experience standard deviation. We have corrected the text to address the experience range. Please see below the information about participants in the results section.
To give examples of the information we provide, we first quote the general background from this section, followed by quotes from subsequent sections:
We recruited raters using convenience sampling from a health, kinesiology, and exercise-oriented university program. All raters successfully completed university-level courses in musculoskeletal and systemic anatomy, physiology, and strength & conditioning, giving them a good understanding of fundamental movement skills.
In section 2.4 (Procedures), we wrote of the training for the raters.
On the first day of testing, each rater filled out a demographic questionnaire. We then gave the raters a 45-minute ChildFIRST training session. During these sessions, the raters were exposed to a variety of skill achievement levels to anchor a child’s success on the evaluation criteria. We explained each movement skill and how to score each evaluation criteria for the dichotomous and seven-point versions of the ChildFIRST.
In section 2.5 (Statistical Analysis) we made the justification for the number of raters.
Our decision on the number of raters and the ICC procedures aligns with the prior validation of the ChildFIRST tool, allowing direct comparison with earlier findings while maintaining methodological consistency [11].
In the Results section, 3.1 (Rater Demographics) we wrote:
A total of 15 raters were initially recruited. One rater dropped out, and 14 raters completed the study. The mean age of the raters was 24.92 ± 4.02 years, with the age of the PhD student skewing the mean. Nine of the raters were seniors; one rater was a sophomore; one rater was a junior; two raters were MSc candidates; and one rater was a PhD student. The mean years of experience working with children was 3.71 (range 0 to 7.82 years).
Comment 4: Section 2.4
Differences in procedures between day 1 and 2 vanish in the text. You should be open about the change in order and that should be easily visible even on a quick reading. I suggest two paragraphs for the two days.
Response 4: This reviewer was not the only one to miss this aspect of the study design in the first review, so clearly we need to find a way to better distinguish day 1 from day 2 and highlight the altered test order. We followed the reviewer's advice and made a paragraph for each day.
Comment 5: Section 3.1
14 raters is extremely few. That MUST be acknowledged before presenting the results.
Response 5: We believe the best spot to acknowledge the number of raters, as a potential shortcoming, is in the limitations section. We do not believe, however, that the number of raters is a limitation, because we followed previous procedures from published papers, and with this number of raters we established favorable reliability values. Despite that belief, we have included the following in the procedures section: Having a larger number of raters could have improved our results.
Comment 6: How can only the PhD student skew the mean if you also have two MSc candidates?
Response 6: The MSc students enrolled in the program directly after their undergraduate degrees. The PhD student is a 'mature' student who enrolled several years after their undergraduate degree. In the paper we previously included the additional text: “The PhD student was a mature student with an age to skew the mean.” However, we made an error by confusing the age and the experience in the comment. As such, we removed this line, reported the age standard deviation as it was, and corrected the negative impression of the years of experience (addressed above, with our apologies).
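As a side note on why the original mean ± SD read oddly: with a single long-experience outlier, the standard deviation can exceed the mean, implying impossible negative years, whereas a median and range stay interpretable. The sketch below illustrates this with fabricated values, not the study's raw data.

```python
# Fabricated illustration (not the study's raw data): one long-experience
# outlier pushes the SD above the mean, so mean - SD falls below zero.
import numpy as np

years = np.array([0.0, 0.5, 1.0, 1.0, 1.5, 2.0, 2.0,
                  2.5, 3.0, 3.0, 3.5, 4.0, 5.0, 15.0])

mean, sd = years.mean(), years.std(ddof=1)
print(f"mean +/- SD: {mean:.2f} +/- {sd:.2f}")                 # lower bound < 0
print(f"median     : {np.median(years):.2f}")
print(f"range      : {years.min():.2f} to {years.max():.2f}")  # clearer here
```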
Comment 7: Section 3.2/3.3
Days should be visible (paragraphs?)
Response 7: We described the procedures for day 1 and day 2, with that section now split into two paragraphs.
Comment 8: What became of the trousers and possible differences in ratings? It may be that real life is like that. But if you want to assess the scales, results should not depend on such chance events. Or do you intend to claim that the clothing was representative of children in Taekwondo clubs? Your area? Western countries?
Response 8: The primary purpose of the study was to compare the intra-rater and inter-rater reliability of the ChildFIRST using a dichotomous and a seven-point Likert scale. While we recognize the limitation of the child wearing pants, we believe it did not interfere with the results. The only support for that opinion is that the reliability remained consistent throughout the measures, including a poor rating for the vertical jump. We edited the limitations section to help clarify this problem. We added: It is difficult to determine if the poor reliability scores of the vertical jump are a function of the seven-point scale, as the inter-rater reliability values on Day 1 were also poor using the dichotomous scale. However, the Day 2 inter-rater values, as well as the intra-rater values using the dichotomous scale, were moderate. These values give rise to the limitation of our videos.
We also added paragraph breaks to this section to improve flow.
Comment 9: I still think that a description of the tasks ought to be brief and in the main body of the text (and giving credit to original papers in the table).
Response 9: We have added, as Table 1, the description of the ChildFIRST skills.
Minor comments:
Comment 10: I find it strategically not very clever to keep the first paragraph extensively long after the reviewer has asked for more clarifications through more concise paragraphs. Line 43 offers itself for a split.
Response 10: We have revised the first paragraph, to shorten and clarify.
Deleted: Exercise deficit disorder describes a condition of reduced levels of physical activity, where achieving 60 minutes of daily physical activity is associated with improved health outcomes [3]. Pediatric dynapenia is a condition characterized by decreased levels of muscle strength causing functional limitations not associated with any neurological or muscular disease [3]. Lastly, physical illiteracy...
…and now reads: This triad includes physical illiteracy, which refers to the lack of confidence, competence, and motivation to engage in meaningful physical activities with interest and enthusiasm [3].
Deleted and changed: the pediatric inactivity triad with physical illiteracy
Deleted: Significant deficits in youth physical activity levels highlight the importance of promoting exercise to mitigate negative health outcomes. In recent years, studies looking...
Edited: Efforts to improve physical activity levels are using the concept of physical literacy [8].
Round 3
Reviewer 4 Report
Comments and Suggestions for Authors
--