Reliability Study of the Items of the Alberta Infant Motor Scale (AIMS) Using Kappa Analysis

Purpose: We evaluated the interrater and intrarater reliabilities of the Korean version of the Alberta Infant Motor Scale (K-AIMS). Methods: For the interrater reliability test, six raters scored the K-AIMS using video clips of 70 infants (aged between 0 and 18 months). One rater participated in the intrarater reliability test. Among the 70 infants, 46 were born preterm and 24 were born full term. A total of 58 AIMS items were evaluated for the supine, prone, sitting, and standing positions. Reliability was analyzed using the intraclass correlation coefficient (ICC) and Fleiss’ kappa. Results: The highest Fleiss’ kappa (K) values were found in the 4–7 months group for sitting (K = 0.701–1.000) and standing (K = 0.721–1.000), while the lowest were found in the 3 months or under group for standing (K = 0.153–1.000). In the interrater analysis, Fleiss’ kappa was higher when all infants were evaluated without grouping for three positions (K = 0.727–1.000), except standing (K = 0.192–1.000). Conclusion: Our results demonstrate good reliability of the Korean version of the AIMS for Korean infants (preterm and full term).


Introduction
Preterm infants are defined as babies born before 37 weeks of gestation. The incidence rate of preterm delivery is estimated to be between 5% and 18% worldwide [1]. Children born preterm often demonstrate significantly poorer motor functions, cognitive outcomes, and language skills during development than those born full term [2]. Preterm infants often demonstrate atypical postures and movements, frequently manifesting as a hyper-extended neck and trunk, because they lack active flexor power compared to full-term infants [3,4]. Of note, the alignment of a preterm neonate's musculoskeletal system while in neonatal intensive care plays a crucial role in determining their later postural shape and motor control [5,6]. In preterm infants, muscle tone, head control, and motor control, including that of the lower and upper extremities, can be significantly less developed until 18 months corrected age [7]. Therefore, periodic monitoring of the developmental process is highly recommended, i.e., performing a motor developmental evaluation at the corrected full term, third month, sixth month, ninth month, twelfth month, and eighteenth month [8,9].
Regularly administered tests are valuable for detecting developmental disorders and for predicting and preparing for further development in preterm infants [10,11]. If administered within an appropriate timeframe, an infant motor development test can measure general motor performance, help both clinicians and parents make a more effective treatment plan and intervention strategy, and monitor motor development to predict neurodevelopmental outcome [4,12].
One of the most widely used infant motor development evaluation tools in recent years is the Alberta Infant Motor Scale (AIMS) [13]. The AIMS evaluates 58 gross motor skills in the supine, prone, sitting, and standing positions. The AIMS can be used from 7 days (corrected age (CA)) to 18 months of age to monitor motor development [14]. The AIMS has been used to define, predict, and classify the level of motor development of preterm infants [15,16]. In one study, at eight months CA, only 56% of the preterm infants could sit briefly with no arm support, while over 90% of the term infants could perform the task [17]. The AIMS has demonstrated validity for assessing preterm infants' movement quality [18]. Additionally, the AIMS can detect the imbalance between flexor and extensor muscle roles in the trunk and the lack of rotation in movements [19]. The postural control differences between preterm and full-term infants [20], and their gross motor trajectories [21], were also evaluated with the AIMS.
The AIMS was developed and standardized using data from 2,202 infants aged one week to 18 months in Alberta, Canada, in 1994 [14], and is used in many countries, including Brazil [22], Spain [23], China [24], Taiwan [25], and Greece [26]. Interrater reliability, consistency, and intra-class reliability were also assessed for each language by researchers in each country [1,27,28]. The AIMS evaluates qualitative movements, such as segment posture, weight-bearing, and anti-gravity movement, as well as quantitative performance. It uses a motor development window that distinguishes a fully developed motor skill from a newly developing motor skill. In this study, we evaluated the interrater and intrarater reliabilities of the Korean version of the AIMS (K-AIMS) for each item, the subtotal scores, and the total score. Six raters scored video clips of 70 infants, and agreement was analyzed using ICC and kappa statistics.

Infant Subjects
We recruited 70 healthy infants and parent volunteers using a social networking service, including 46 preterm infants under 18 months (CA), from three cities from 2017 to 2018. All infants and parents were Korean and monolingual. All parents signed consent forms before participation (IRB no. 2-1040781-AB-N-01-2017101HR). We excluded infants with congenital anomalies, acute illnesses, musculoskeletal disorders (fracture, peripheral neuropathy, and muscular system infection), and intraventricular hemorrhages of grades 3 and 4. The general characteristics of the infant subjects are summarized in Table 1 (values are mean ± standard deviation or n (%); CA: corrected age).

Raters
Six physical therapists (raters A, B, C, D, E, and F) participated in the study for the interrater and intrarater reliability tests of the K-AIMS. Rater A had less than one year of experience in child evaluation and pediatric physical therapy, and the others had 3 to 10 years of experience. No rater had used the AIMS before participating in this study.

Evaluation Tool (AIMS)
The AIMS evaluates 58 items for four basic functional positions: 21 items for prone, 9 for supine, 12 for sitting, and 16 for standing. We did not intervene in the infants' performance, but observed their natural posture and movements during daily activities. Observers changed an infant's body position only when the infant could not change position unaided. The total test time was 20 min. All items were observed considering weight-bearing and antigravity movements. One point was given for each item that the infant demonstrated as defined by the AIMS for a given posture, and zero points were given when the item was not observed. Zero points were also given to an item located within the motor development window that was not observed. Note that we did not modify the original AIMS at all, because it could be translated and adapted into Korean well.
AIMS raw scores ranged from 0 to 58. The raw score was then converted into a percentile rank. The percentile rank was used so that parents and clinicians could understand the result easily, because they are accustomed to percentiles for anthropometric data, such as height, weight, and head circumference. A higher score represents more well-developed gross motor skills, while a lower score represents less developed gross motor skills. Infants scoring under the fifth percentile are at high risk for developmental differences [14].
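The scoring rule described above (one point per observed item, zero otherwise, summed into position subtotals and a raw total out of 58) can be sketched as follows. This is a minimal illustration only: the function name `aims_raw_score` and the dictionary layout are hypothetical, and the percentile conversion is omitted because it requires the published normative tables.

```python
from typing import Dict, List

TOTAL_ITEMS = 58  # number of AIMS items stated in the text


def aims_raw_score(observed: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum 0/1 item scores per position and add an overall raw total.

    `observed` maps a position name to its list of item scores, where
    1 means the item was observed and 0 means it was not (hypothetical layout).
    """
    n_items = sum(len(items) for items in observed.values())
    if n_items != TOTAL_ITEMS:
        raise ValueError(f"expected {TOTAL_ITEMS} item scores, got {n_items}")
    for position, items in observed.items():
        if any(score not in (0, 1) for score in items):
            raise ValueError(f"item scores for {position!r} must be 0 or 1")

    # Subtotal per position, then the raw score across all positions.
    scores = {position: sum(items) for position, items in observed.items()}
    scores["total"] = sum(scores.values())
    return scores
```

The subtotals returned here correspond to the per-position scores analyzed with the ICC, and `total` corresponds to the raw score that ranges from 0 to 58.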

Interrater and Intrarater Reliability Analysis
All raters attended a four-hour education session on the AIMS, including motor development theory and setting the motor development window. They could only begin participating in the main evaluation test when they demonstrated a higher than 90% interrater correlation during the preliminary training session. We did not use the same video clips in the preliminary training sessions and the main test. For the main reliability evaluation, raters completed AIMS evaluations of 70 video clips, recorded under standard conditions. One of the authors exclusively performed the AIMS administration in a spacious and comfortable room in the presence of parents. One assistant physical therapist (PT) recorded the whole process for the reliability scoring. Clips for the four positions (supine, prone, sitting, and standing) were prepared from the recording after eliminating portions of the video unnecessary for the AIMS reliability test. The six raters scored the infants on the AIMS by watching the video recordings of 70 infants in four positions in a training room. To avoid potential bias and ensure independent scoring, raters were not allowed to exchange opinions on their findings. Video recordings were played three times each, and one additional play was allowed at the rater's request. For the intrarater reliability test (rater A), the AIMS tests were repeated four weeks apart with the same video clips [25].

Data Analysis
We divided all infants' data into three groups: 0-3 months, 4-7 months, and 8 months or over. Group membership was based on corrected age (CA) for preterm infants and chronological age for full-term infants. For instance, a prematurely delivered infant's chronological age (age from date of birth) may be nine months, but if the infant's corrected age (age from original due date) was seven months, the infant was included in the 4-7 months group.
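The grouping rule above can be expressed as a short sketch. The function name `age_group` is hypothetical, and we assume ages are recorded in whole completed months, since the text does not state how fractional months were handled.

```python
from typing import Optional


def age_group(chronological_months: int,
              corrected_months: Optional[int] = None) -> str:
    """Return the analysis group for one infant.

    Preterm infants are grouped by corrected age (age from the original due
    date); full-term infants have no corrected age and are grouped by
    chronological age. Whole completed months are an assumption of this sketch.
    """
    age = chronological_months if corrected_months is None else corrected_months
    if age < 0:
        raise ValueError("age in months must be non-negative")
    if age <= 3:
        return "0-3 months"
    if age <= 7:
        return "4-7 months"
    return "8 months or over"
```

For the example in the text, `age_group(9, corrected_months=7)` places the infant in the 4-7 months group, whereas a full-term infant of the same chronological age, `age_group(9)`, falls in the 8 months or over group.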
Total AIMS scores per position were analyzed statistically. Interrater reliability among the six raters (A, B, C, D, E, and F) and intrarater reliability by rater A for evaluations conducted four weeks apart were analyzed using Fleiss' kappa analysis and the intraclass correlation coefficient (ICC) with 95% confidence intervals. We used a Bland-Altman plot to assess the intrarater reliability of the AIMS total score. In Fleiss' kappa analysis, the following definitions were used: 0 = no agreement, 0.10-0.19 = poor agreement, 0.20-0.39 = fair agreement, 0.40-0.59 = moderate agreement, 0.60-0.79 = substantial agreement, and 0.80-1.00 = almost perfect agreement [29]. ICCs were interpreted as excellent, good, moderate, and poor for >0.90, 0.75-0.90, 0.50-0.75, and <0.50, respectively [30]. We used IBM SPSS Statistics 27 at a significance level of 0.05.
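For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Fleiss' kappa formula together with the agreement bands listed above. It is not the SPSS procedure used in this study. The function names are hypothetical, and two choices are assumptions of the sketch: values at or below zero are labeled "no agreement", and values between 0 and 0.20 are labeled "poor agreement", because the definitions above do not cover every value in that range.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for one item; each row holds every rater's score for one infant."""
    n_subjects = len(ratings)
    if n_subjects == 0:
        raise ValueError("at least one rated subject is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this subject.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_subjects
    total_ratings = n_subjects * n_raters
    expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if expected == 1.0:
        # Every rating falls in one category: kappa is undefined (0/0).
        return float("nan")
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa: float) -> str:
    """Map kappa to the agreement bands listed in the text (boundaries assumed)."""
    if kappa != kappa:
        raise ValueError("kappa is undefined (NaN)")
    if kappa <= 0.0:
        return "no agreement"
    if kappa < 0.20:
        return "poor agreement"
    if kappa < 0.40:
        return "fair agreement"
    if kappa < 0.60:
        return "moderate agreement"
    if kappa < 0.80:
        return "substantial agreement"
    return "almost perfect agreement"
```

As an invented example (not study data), four infants scored on one item by six raters as `[[1]*6, [0]*6, [1, 1, 1, 1, 1, 0], [0]*6]` give a kappa of 119/143, approximately 0.83, which falls in the almost perfect agreement band.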

Interrater and Intrarater Reliability of AIMS for Each Item
For the interrater reliability, the six raters showed Fleiss' kappa (K) statistics ranging from 0.153 to 1.0 for the four positions. In general, the highest Fleiss' kappa values were found in the 4-7 months group for sitting (K = 0.701-1.000) and standing (K = 0.721-1.000), while the lowest were found in the 3 months or under group for the standing position (K = 0.153-1.000). We found higher Fleiss' kappa when all infants were evaluated without grouping for three positions (K = 0.727-1.000), except standing (K = 0.192-1.000). See the summaries in Table 2 for further detail.
For the intrarater reliability, rater A showed Fleiss' kappa (K) statistics ranging from 0.250 to 1.0 for four positions. See the summaries in Table 3 for further detail.

Interrater and Intrarater Reliability of AIMS for Subtotal and Total Scores
ICC analysis results from the six raters for each position and the total score were 0.80 to 1.00 (AVE = 0.97, STD = 0.05) for the interrater reliability (Table 4). The minimum ICC (0.796) was found for the standing position in infants 3 months or under. Otherwise, all ICCs were greater than 0.96 for interrater reliability.
ICC analysis results from rater A for each position and the total score were 0.75 to 1.00 (AVE = 0.93, STD = 0.08) for intrarater reliability. An ICC of less than 0.8 was only found for the supine (0.749) and sitting (0.776) positions in the 3 months or under group. All other ICCs were greater than 0.85. The Bland-Altman analysis for the total score gave an average difference of 0.42 and a standard deviation (SD) of 1.11, with upper and lower limits of 2.60 and −1.77, for the infants 3 months or under (Figure 1). The average difference and SD were 0.86 and 2.79, with upper and lower limits of 6.33 and −4.61, for the 4-7 months infants. The average difference and SD were 0.18 and 2.37, with upper and lower limits of 4.82 and −4.46, for the infants 8 months or over. In summary, the average difference and SD for all infants recruited in this study were 0.43 and 2.09, with upper and lower limits of 4.53 and −3.68.
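The Bland-Altman limits reported above are consistent with the conventional definition of the limits of agreement, the mean difference ± 1.96 SD. The sketch below assumes that convention; the text does not state the multiplier explicitly, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def limits_from_summary(bias: float, sd: float,
                        z: float = 1.96) -> Tuple[float, float]:
    """Lower and upper limits of agreement from a mean difference and its SD."""
    return bias - z * sd, bias + z * sd


def bland_altman(first: Sequence[float], second: Sequence[float],
                 z: float = 1.96) -> Tuple[float, float, float, float]:
    """Return (mean difference, SD, lower limit, upper limit) for paired scores.

    `first` and `second` are the total scores from the two rating sessions,
    in the same infant order. The sample SD (n - 1 denominator) is used.
    """
    if len(first) != len(second):
        raise ValueError("both sessions must score the same infants")
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)
    sd = stdev(diffs)  # raises StatisticsError if fewer than two pairs
    lower, upper = limits_from_summary(bias, sd, z)
    return bias, sd, lower, upper
```

Applied to the rounded summary values for the 4-7 months group, `limits_from_summary(0.86, 2.79)` reproduces the reported limits of −4.61 and 6.33. For the other groups the same calculation agrees with the reported limits to within about 0.02, a gap that is plausibly due to rounding of the mean and SD.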

Discussion
We evaluated the interrater and intrarater reliability of the Korean version of the AIMS using six raters. One rater's repeated evaluations were analyzed for intrarater reliability. For the study, we used video recordings of 70 infants in four positions (prone, supine, sitting, and standing). Fleiss' kappa values showed highly acceptable agreement in infants with a corrected or chronological age older than 3 months. In addition, the ICC values of interrater and intrarater reliabilities for the subscales and total scores of the K-AIMS were good to excellent.
Reliability, consistency, and stability studies of a newly introduced scale are fundamental prerequisites before its implementation [28,31]. The AIMS has been studied for its improved ability to detect delayed motor development in preterm infants [32] compared with other existing scales, such as the Bayley-III [4,18]. The AIMS's interrater reliability has also been studied [31]. Of note, the AIMS has demonstrated moderate to strong correlation (r = 0.78-0.9) with the Bayley Motor Scale [25].
The AIMS has also been translated into many languages and studied for its reliability in many countries, including Taiwan [25], China [24], Thailand [33], Serbia [28], Brazil [22], and Japan [34]. It also underwent cross-country validation with Brazilian infants [35], and norm comparison between Canadian and Turkish infants [13] and Dutch and Canadian infants [36]. As language and cultural context may affect the AIMS's validity, reliability tests should be conducted after translation [37], because they may elucidate significant differences between the cultural context and normative sample [38]. This study was, therefore, necessary, because reliability studies on the AIMS translated into Korean had not yet been conducted.
Many previous studies performed multi-rater reliability tests using ICC but not kappa. In our interrater analysis with six raters, Fleiss' kappa showed substantial or better agreement for 98.3% and 91.4% of the 58 AIMS items in the 4-7 months and ≥8 months groups, respectively (AVE = 94.9%). We also found lower reliability for infants 3 months or under, as previous studies have found [14,24]. The lowest reliability was found for supported standing (item 2) in the standing subscale for infants 3 months or under. Regardless of infant age, interrater reliability showed moderate to substantial agreement for forearm support (item 6) in the prone subscale. Pull to sit (item 3; 3 months or under) and sitting to prone (item 10; 8 months or over) from the sitting subscale showed low interrater reliability in this study. No significant difference between novice and experienced PTs was found in our study. In a similar study of a Japanese population [34], the raters also included both non-expert and expert PTs, and the ICC results for the interrater and intrarater reliabilities were good to high, similar to our results.
We found the lowest Fleiss' kappa (0.25) for prone mobility (item 5) in the prone subscale in the intrarater analysis of 0-3 months old infants, suggesting that this is the most challenging infant age at which to evaluate motor control. We found substantial or better agreement (>90%) in the 4-7 months and ≥8 months groups. The AIMS gave a higher Fleiss' kappa when infants were older. Previous reliability studies of motor skills have used either live observation or video recordings [39,40], and no significant difference was observed between the two [41]. We used video clips for the K-AIMS assessment for convenience.

Conclusions
Our study demonstrates that the K-AIMS item scores, subscale scores, and total scores are reasonably reliable when screening for motor development delay and monitoring infants' progress in South Korea.

Limitations
We could not fully investigate sensitivity to temporal variations, which should be evaluated and compared for the same infant with an appropriate time gap to understand motor development.