Intra-Rater (Live vs. Video Assessment) and Inter-Rater (Expert vs. Novice) Reliability of the Test of Gross Motor Development—Third Edition

The Test of Gross Motor Development (TGMD) is one of the most common tools for assessing fundamental movement skills (FMS) in children between 3 and 10 years. This study aimed to examine the intra-rater and inter-rater reliability of the TGMD-Third Edition (TGMD-3) between expert and novice raters using live and video assessment. Five raters [2 experts and 3 novices (one of them with a BSc in Physical Education and Sport Science)] assessed and scored the TGMD-3 performance of 25 healthy children [female: 60%; mean (standard deviation) age 9.16 (1.31) years]. The schoolchildren attended a public elementary school in Santiago de Compostela (Spain) during the 2019–2020 academic year. Raters scored each child's performance in two viewing modes (live and slow-motion). The intraclass correlation coefficient (ICC) was used to determine the agreement between raters. Our results showed moderate-to-excellent intra-rater reliability for the overall score and the locomotor and ball skills subscales; moderate-to-good inter-rater reliability for the overall score and ball skills; and poor-to-good inter-rater reliability for the locomotor subscale. Higher intra-rater reliability was achieved by the expert raters and the novice rater with a physical education background than by the other novice raters. However, inter-rater reliability was more variable across all raters, regardless of their experience or background. No significant differences in reliability were found when comparing live and video assessments. For clinical practice, it is recommended that raters reach an agreement before the assessment to avoid subjective interpretations that might distort the results.


Introduction
Fundamental movement skills (FMS) consist of basic organized movements involving the combination of movement patterns of two or more parts of the body [1]. FMS are considered "building blocks" for the more advanced and complex movements necessary to participate in different sports, games and other physical activities. Commonly, FMS are classified into locomotor skills (e.g., run, jump, hop, slide), object control/ball skills (e.g., catch, kick, strike, throw) and balance/stability skills (e.g., static balance, dynamic balance) [1][2][3].
Strong existing evidence suggests positive associations between FMS competency and physical activity [4], physical fitness [4,5] and health-related benefits [6,7] such as healthy weight status [4,5], cardiorespiratory fitness [4,5] or positive cognitive and academic outcomes [8] among others. However, the acquisition of FMS proficiency levels does not occur naturally and the implementation of structured physical education programs in pre-school and school-aged children is required [9].
In light of these reported benefits, existing literature supports the importance of assessing FMS using valid and reliable tools [10], which is also necessary to provide valuable information about children's motor performance and progress [11]. Such assessment is essential for the early identification of possible delays or disorders that could affect motor competence and cognitive [12] and affective [13] development.
There are several assessment tools to evaluate FMS, which can be broadly classified into two categories: quantity/product-oriented assessment, which evaluates the outcome of the movement (e.g., velocity, trajectory) and quality/process-oriented assessment, which evaluates the pattern of the movement (e.g., how a person throws) [14,15]. Moreover, assessments may examine gross motor skills, movements that require the use of large motor groups (e.g., running, jumping) [16] or fine motor skills, movements that involve small motor groups (writing, eating) [17].
One of the most common assessments used to examine FMS in children is the Test of Gross Motor Development (TGMD) [3] and its variants TGMD-Second Edition (TGMD-2) [18] and TGMD-Third Edition (TGMD-3) [19]. The TGMD-3 is a validated and reliable [20] process-oriented assessment applied to evaluate gross motor competence in children between 3 years-0 months and 10 years-11 months. The TGMD-3 assesses thirteen FMS divided into two subscales, locomotor and ball skills.
Reliability is one of the most fundamental features of performance assessment in research. On the one hand, reliability lets researchers replicate studies. On the other hand, from a practical perspective, it is important to assess accurately the variables under study; within the scope of the present study, an example is the evaluation of FMS by schoolteachers or clinical staff. In this sense, a recent systematic review [20] of 23 studies assessing the reliability of the TGMD showed good-to-excellent inter-rater reliability, good-to-excellent intra-rater reliability and moderate-to-excellent test-retest reliability. Most studies assessed the reliability among experienced raters using video evaluation. However, the TGMD is sometimes used by researchers or practitioners (such as physical education teachers) with little or no training [21], since the examiner's manual only recommends training with at least three children before a diagnostic evaluation [18,19]. Thus, it is necessary to know the reliability among raters with different backgrounds and experience. Only one previous study has compared differences in scores between expert and novice coders [22]. The results indicated that novice raters could not score the TGMD-2 in a significantly similar manner to the experts [22]. However, to date, no study has examined this issue in the TGMD-3.
Although the TGMD examiner's manual does not refer to assessing children through video recordings, many studies have used videotaped assessment because of its advantages over live evaluation. It allows the evaluation of each skill's criteria in more detail and repeatedly, even in slow-motion if necessary [22]. Intra- and inter-rater reliability between expert and novice raters considering different types of video viewing, live and slow-motion, has not yet been investigated. Thus, the purpose of this study was to analyze the intra- and inter-rater reliability of the TGMD-3 using live and video assessments between expert and novice raters.

Participants
This study was conducted with 25 healthy children (60% females, n = 15; 100% Hispanic, n = 25) between 8 and 10 years old (mean ± SD: 9.16 ± 1.31 years), attending a public elementary school during the academic year 2019-2020 in Santiago de Compostela, Galicia, Spain. Anthropometric measurements were obtained for each child (height: 1.37 ± 0.11 m; weight: 34.57 ± 8.49 kg; and body mass index: 18.26 ± 2.95 kg·m⁻²). Participation in the study was voluntary, and parents or guardians signed the informed consent beforehand. All children were provided with verbal information and gave verbal assent before the test. This study respected the Helsinki Convention's ethical principles and was approved by the Ethical Committee of the University of Vigo, Spain.

Raters
Five raters (convenience sample), two experts (ER) and three novices (NR), were responsible for assessing the participants' video recordings. The three NR had no previous experience rating the TGMD-3. One of the novices had a BSc in Physical Education and Sports Science (NR-PE); the other two were a primary schoolteacher and a nurse, neither of whom had a physical education or sports science background or experience in working with children. However, all of them might have to assess the FMS of children in the future due to their academic/professional background. Before coding children's performance, the NR studied the content of the TGMD manual [19], consulted the expert coders about their doubts and practiced administering the TGMD-3 to three children. The two ER held a BSc in Physical Education and Sports Science, and both had more than 5 years of experience testing the TGMD.

Test of Gross Motor Development-Third Edition [TGMD-3]
The TGMD-3 assesses gross motor skill performance in children between 3 years and 0 months and 10 years and 11 months. It consists of two subscales: locomotor and ball skills. The locomotor subscale measures the gross motor skills that involve fluid coordination of the body while the child moves in one direction or another and includes: run, gallop, hop, skip, horizontal jump and slide. The ball skills subscale assesses the gross motor skills that require effectiveness in intercepting and propelling objects and involves: two-hand strike of a stationary ball, forehand strike, stationary dribble, catch, kick, overhand throw and underhand throw. Each skill, which includes several behavioral components, is evaluated through three to five criteria, each scored 1 or 0 depending on its presence or absence [19]. Independent scores for each skill, each subscale and the total test are obtained from the criteria scores.
After a verbal description and a practical demonstration, the children perform three trials of each skill. The first trial is a practice trial and is not scored. The examiner scores the remaining two trials as follows: if the child performs a behavioral component or criterion correctly, the examiner scores a 1, and if the child does not perform it correctly, a 0. Accordingly, the maximum score that each child can achieve is 100 points (46 for the locomotor subscale and 54 for the ball skills subscale).
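The scoring rule described above can be illustrated with a short sketch. The per-skill criterion counts in the code are assumptions made for illustration only: the text states just that each skill has three to five criteria and that the subscale maxima are 46 and 54, so the counts were chosen to be consistent with those totals and should be checked against the examiner's manual [19]. The function and variable names are hypothetical.

```python
# Hypothetical sketch of TGMD-3 scoring: each criterion is marked 0 or 1
# on each of the two scored trials (the practice trial is not scored).

# ASSUMED criterion counts per skill (illustrative; verify against the manual).
LOCOMOTOR_CRITERIA = {
    "run": 4, "gallop": 4, "hop": 4, "skip": 3,
    "horizontal jump": 4, "slide": 4,
}
BALL_CRITERIA = {
    "two-hand strike": 5, "forehand strike": 4, "stationary dribble": 3,
    "catch": 3, "kick": 4, "overhand throw": 4, "underhand throw": 4,
}
SCORED_TRIALS = 2


def max_score(criteria_per_skill):
    """Maximum subscale score: one point per criterion per scored trial."""
    return SCORED_TRIALS * sum(criteria_per_skill.values())


def skill_score(trial_1, trial_2):
    """Sum the 0/1 criterion marks of the two scored trials for one skill."""
    marks = list(trial_1) + list(trial_2)
    if len(trial_1) != len(trial_2):
        raise ValueError("both trials must cover the same criteria")
    if any(mark not in (0, 1) for mark in marks):
        raise ValueError("each criterion is scored 0 or 1")
    return sum(marks)


if __name__ == "__main__":
    print(max_score(LOCOMOTOR_CRITERIA), max_score(BALL_CRITERIA))
    # A child meeting 3 of 4 run criteria on trial 1 and 2 of 4 on trial 2:
    print(skill_score([1, 1, 0, 1], [1, 0, 0, 1]))
```

Under these assumed counts, the subscale maxima add up to the 100 points stated above.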

Procedures
A physical education teacher and a nurse (PhD student) conducted the assessments. All data were collected in March 2020. The tests were carried out in a sports hall, during the school's regular schedule and always in the presence of at least one teacher at the school. The testers provided the children with a verbal description of each skill, followed by a video showing the skill's correct performance (shown twice: at normal speed and in slow-motion) following the TGMD-3 guidelines [19]. Each child performed a practice trial followed by two consecutive trials, which were video-recorded (Nikon D5300 camera) for later reliability analysis. The administration of the test to each child took approximately 20-30 min.
The five raters independently assessed the recorded videos to analyze the inter- and intra-rater reliability. Raters assessed the videos in two viewing modes: live (once, at normal speed and without pause) and slow-motion (watching the video as many times as they needed, in slow-motion and with pauses). The children, the order of skills and the type of visualization (live/slow-motion) for each video were randomized for each rater. For the study of intra-rater reliability, each rater re-assessed the videos after 2 weeks, following the corresponding viewing mode according to the randomization. Thus, intra-rater reliability was analyzed according to the viewing mode (live vs. slow-motion) and inter-rater reliability was studied between NR and ER in both conditions, live and slow-motion.

Statistical Analysis
Reliability was assessed by intraclass correlation coefficients (ICC) for both intra- and inter-rater reliability. The selection of the ICC form was based on the flowchart of Koo et al. [23]. ICC values and their 95% confidence intervals for intra-rater reliability were based on a single-measurement, absolute-agreement, 2-way mixed-effects model. For inter-rater reliability, they were based on a single-measurement, consistency, 2-way random-effects model. Values less than 0.5 indicate poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.9 indicate good reliability and values greater than 0.90 indicate excellent reliability.
Additionally, repeated-measures ANOVA was performed to analyze the overall scores and the locomotor and ball skills subscales' scores; means and standard deviations were provided. Two factors were integrated into the model: one intra-group (live vs. slow-motion) and one inter-group (raters). Raters were categorized as follows: NR (Rater A and Rater B), NR-PE (Rater C) and ER (Rater D and Rater E).
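To make the two ICC definitions and the interpretation bands concrete, the following is a minimal sketch in plain Python. It is not the SPSS procedure used in this study. It computes single-measurement ICC point estimates from the mean squares of a two-way layout, using the standard consistency and absolute-agreement formulas (the point estimate is the same under random- and mixed-effects models; only the interpretation differs). Confidence intervals are not computed. The function names are hypothetical, and assigning values exactly on a band edge (0.5, 0.75, 0.9) to the higher or lower band is an assumption.

```python
def icc_two_way_single(scores, definition="agreement"):
    """Single-measurement ICC from an n-subjects x k-raters table.

    scores: list of rows, one row per subject, one column per rater/occasion.
    definition: "agreement" (absolute agreement) or "consistency".
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-subjects mean square
    ms_cols = ss_cols / (k - 1)               # between-raters mean square
    ms_error = ss_error / ((n - 1) * (k - 1)) # residual mean square

    if definition == "consistency":
        return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    if definition == "agreement":
        return (ms_rows - ms_error) / (
            ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
        )
    raise ValueError("definition must be 'agreement' or 'consistency'")


def classify(icc):
    """Map an ICC value to a reliability band (edge handling is assumed)."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


def classify_interval(lower, upper):
    """Describe reliability from the bounds of the 95% confidence interval."""
    low_label, high_label = classify(lower), classify(upper)
    return low_label if low_label == high_label else f"{low_label}-to-{high_label}"


if __name__ == "__main__":
    # Rater 2 always scores 2 points higher than rater 1: perfectly
    # consistent ranking, but imperfect absolute agreement.
    table = [[1, 3], [2, 4], [3, 5], [4, 6]]
    print(icc_two_way_single(table, "consistency"))
    print(icc_two_way_single(table, "agreement"))
    print(classify_interval(0.55, 0.93))
```

The toy table shows why the choice of definition matters: a constant offset between two raters leaves the consistency ICC at its maximum while lowering the absolute-agreement ICC.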
All analyses were performed using the SPSS statistical package version 23 (SPSS Inc, Chicago, IL). A significance level of p < 0.05 was considered.

Intra-Rater Reliability: Live vs. Slow-Motion
Results of intra-rater reliability are shown in Table 1. Good-to-excellent or excellent reliability was found for most of the skills, mainly in the ER and NR-PE evaluations. ICC values for the subscales' and overall scores also showed good-to-excellent or excellent reliability in the case of ER and NR-PE. NR showed moderate-to-excellent, good-to-excellent or excellent intra-rater reliability in both subscales' and overall scores.
ER and NR-PE achieved at least moderate intra-rater reliability in all skills (except ER (Rater E): slide). In the case of NR, run (Rater A and B), hop (Rater B), horizontal jump (Rater A and B), kick (Rater A) and underhand throw (Rater B) were the skills in which reliability was poor-to-moderate/good.
Descriptive statistics of the overall score and the locomotor and ball skills subscales' scores are presented in Table 2. Significant differences in the test factor (live vs. slow-motion) were found in the locomotor subscale in the repeated-measures analysis (live: 33.7 ± 4.0; slow-motion: 33.2 ± 3.9; p = 0.010). The post-hoc Bonferroni analysis revealed significant differences between live and slow-motion evaluations in both NR (Rater A: p = 0.014/Rater B: p = 0.036) in the locomotor subscale. The same differences were found in the ball subscale in one of the NR (Rater B: p = 0.001) and in the overall score in the same NR (Rater B: p < 0.001) and one ER (Rater D: p = 0.038).

Inter-Rater Reliability: Novice vs. Experts
In general, poor-to-moderate inter-rater reliability was found in both locomotor and ball skills, with no changes when comparing live vs. slow-motion between each pair of raters. Tables 3 and 4 present inter-rater ICC results for each skill. Figure 1 presents the ICC associations with locomotor and ball subscales and overall scores.
Regarding locomotor skills, reliability was moderate-to-good in most cases, independent of the rater's experience or background. Gallop was the skill with the best inter-rater reliability; horizontal jump and skip reached ICC values over 0.5 in most comparisons. Hop was the locomotor skill with the lowest inter-rater reliability.
Regarding ball skills, the two-hand strike was the skill with the highest inter-rater reliability, ranging between poor-to-good and moderate-to-excellent. The two-hand catch was the skill with the lowest inter-rater reliability.
Ball skills subscale and overall scores were slightly higher in ER compared with NR and NR-PE (Table 2). The repeated-measures analysis showed no differences in the interactions between the viewing mode and rater factors for the locomotor subscale. However, regarding ball skills, one of the experts (Rater D) coded significantly higher in live evaluation than one NR (Rater B: p = 0.024) and NR-PE (p = 0.001). The same differences were found in slow-motion evaluation between Rater D and both NR (Rater A: p = 0.029/Rater B: p < 0.001) and NR-PE (p = 0.001). Finally, overall scores showed significant differences only between one ER (Rater D) and NR-PE in slow-motion evaluation (p = 0.031).

Discussion
The purpose of this study was to analyze the intra-rater (live and slow-motion) and inter-rater [expert and novice (one of them with a physical education background)] reliability of the TGMD-3. Briefly, the locomotor and ball skills subscales and overall scores showed moderate-to-excellent intra-rater reliability. The inter-rater reliability was moderate-to-good for the ball skills subscale and overall scores in both live and slow-motion, and poor-to-good for the locomotor subscale, also in both assessment modes.
In the stricter analysis performed by skill, all the raters achieved good-to-excellent or excellent intra-rater reliability for gallop, two-hand strike and two-hand catch. Nevertheless, more differences between raters were observed in the intra-rater ICC values for individual skills than in those for the overall score or subscale scores. Previous studies also found variations in intra-rater agreement across raters [31], especially when skills are examined individually [10], obtaining results ranging from poor to excellent. In the present study, intra-rater reliability varied from moderate-to-excellent to excellent for ER and NR-PE in all skills (except slide for Rater E). Variability in ICC values was higher in NR, with moderate-to-good or moderate-to-excellent intra-rater reliability in half of the skills.
Regarding inter-rater reliability, our results show that the degree of consistency varies considerably between raters on the TGMD-3 skills. As observed, the variations in inter-rater reliability were more visible in the analysis by skill. The ICC results for the overall test and the locomotor and ball skills subscales' scores were over 0.5 in almost all comparisons. The inter-rater reliability for locomotor skills, among the five raters, was poor-to-moderate for run, slide and hop (both live and slow-motion); poor-to-good for the skip in slow-motion and the horizontal jump (both live and slow-motion); and moderate-to-good for the skip in live viewing and the gallop in both viewing modes. Overall, the skill with the highest reliability among raters was the gallop, achieving an ICC > 0.5 in all comparisons and reaching moderate reliability in 15/20 comparisons between raters. In contrast, the lowest reliability results were found for the hop and run skills. Kim et al. [34] also reported the gallop as the skill with the highest reliability and the run as the skill with the lowest reliability coefficients. In contrast, Maeng et al. [31] and Ayán et al. [35] found the most robust reliability in the hop and the weakest in the gallop, contrary to our findings.
Regarding the ball skills, the inter-rater reliability was poor-to-moderate for one-hand stationary dribble (live), overhand throw (slow-motion), kick, forehand strike, two-hand catch and underhand throw (both live and slow-motion); poor-to-good for two-hand strike (slow-motion), one-hand stationary dribble (slow-motion) and overhand throw (live); and moderate-to-good for the two-hand strike in live viewing. In general, the two-hand strike was the skill with the best reliability among raters, reaching an ICC > 0.5 and moderate reliability in 19/20 comparisons. The lowest reliability coefficients were found for the two-hand catch. This result is in concordance with previous studies [36] but again contrary to others that found the two-hand catch to be the skill with the strongest inter-rater reliability [10].
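The denominator of 20 in the "15/20" and "19/20" figures is not derived in the text. One reading consistent with the design, offered here as an assumption, is that each skill was compared across every pair of the five raters in each of the two viewing modes. The sketch below only enumerates those combinations; the rater labels follow the Statistical Analysis section and the variable names are hypothetical.

```python
from itertools import combinations

# Rater labels as used in the Statistical Analysis section.
RATERS = ["A", "B", "C", "D", "E"]
VIEWING_MODES = ["live", "slow-motion"]

# Every unordered pair of raters: C(5, 2) = 10 pairs.
rater_pairs = list(combinations(RATERS, 2))

# ASSUMPTION: one inter-rater comparison per rater pair and viewing mode.
comparisons = [
    (first, second, mode)
    for first, second in rater_pairs
    for mode in VIEWING_MODES
]

if __name__ == "__main__":
    print(len(rater_pairs), len(comparisons))
```

Under this reading, 10 rater pairs in 2 viewing modes give 20 comparisons per skill.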
Our results show lower agreement on the locomotor skills than on the ball skills. Rintala et al. [10] and Palmer and Brian [22], who reported similar results for the TGMD-2, suggested that further training with the locomotor skills may be needed before using the TGMD; our results support this contention. On the other hand, it seems that the raters' background and experience did not influence the inter-rater reliability, given that it varied not only between NR and ER but also among the NR themselves and between the two ER. In contrast, Palmer and Brian [22] found that novice raters could not achieve significant agreement with expert raters when assessing the TGMD-2. However, it has been shown that novice raters reached agreement significantly similar to that of experts when assessing movements with the Landing Error Scoring System [37] and the Functional Movement Screen [38].
The variation found in inter-rater reliability in our study could be due to different interpretations of each skill's evaluation criteria. Some criteria give rise to subjectivity and vary according to each rater's interpretation. For example, in the hop skill, the second criterion, "Foot of non-hopping leg remains behind hopping leg", might be interpreted differently: one rater might score 1 only if the non-hopping foot always remained behind the hopping leg, whereas another might score 1 if the child, although not always, executed the criterion correctly most of the time. Hence, it becomes necessary to reach pre-assessment agreements between raters or professionals who intend to evaluate FMS in children. It has been suggested that more training may be necessary for those skills that are more difficult to assess and even that some of the criteria might be redefined to make them more objective [10,20,31,34,36]; our results provide further support for this recommendation.
In addition, a recent systematic review of TGMD-2 and TGMD-3 reliability showed at least good-to-excellent ICC values for inter-rater reliability in both subscales and the overall score, higher than those obtained in this study. However, these differences might be due to the methodology adopted or the statistical analysis used [20]. In the case of the ICC, which is used in many studies, several aspects make it challenging to compare results. First, investigators have to choose an ICC model (One-Way Random-Effects Model, Two-Way Random-Effects Model or Two-Way Mixed-Effects Model), an ICC definition (absolute agreement or consistency) and an ICC type (single or mean measurement) according to the study. Second, there are no standard values for reporting reliability. Third, some studies did not report the ICC model, type or definition, or reported reliability based on ICC values with no mention of the confidence interval. Our study followed the flowchart proposed by Koo et al. [23] for selecting the model, type and definition. Many studies have used different models, in most cases less demanding ones, which might explain the higher reliability results obtained compared to our study.
Regarding ICC values and 95% confidence intervals, interpretation was also based on the recommendations of Koo et al. [23]. In this respect, an ICC point estimate alone does not describe reliability; it is necessary to express reliability in terms of the lower and upper bounds of the confidence interval. Finally, we considered reliability excellent if an ICC equal to or higher than 0.9 was obtained, while other rating scales consider it excellent above 0.75 [39,40]. All this makes it difficult to compare the reliability reported in different studies, since this variability causes an interpretation problem.

Limitations
This study is not free of limitations. Our reliability study was carried out assessing 25 healthy children. The sample was small, so the results should be interpreted with caution. Besides, we tried to simulate the live assessment by displaying the digital records only once and playing the video at a normal speed. This procedure does not represent precisely the reality of a live assessment by professionals who have to score the children's performance. However, from a research point of view, it is a useful way for different evaluators to carry out the evaluation at any time.

Conclusions
The TGMD-3 battery showed moderate-to-excellent intra-rater reliability for the overall score and the locomotor and ball skills subscales' scores, moderate-to-good inter-rater reliability for the overall and ball skills subscale scores and poor-to-good inter-rater reliability for the locomotor subscale. The expert raters and the novice rater with a physical education background achieved stronger intra-rater reliability than the other novice raters, and inter-rater reliability did not seem to be influenced by the raters' background and experience. The viewing modes also did not seem to influence reliability, but further investigation is needed in this regard in light of the study limitations. Higher variability in both intra- and inter-rater reliability was found when analyzing the skills separately.
In conclusion, our high intra-rater reliability results, together with the lower inter-rater results, suggest that each skill's criteria can be interpreted differently. For clinical practice, it is recommended that raters reach an agreement before the assessment to avoid subjective interpretations that might distort the results. In addition, a revision of some criteria might be needed to allow research replication.

Institutional Review Board Statement: The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of the University of Vigo, Spain (protocol code: 17-0320; date of approval: 11 May 2020).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.