Reliability and Structural Validity of the Movement Assessment Battery for Children-2 in Croatian Preschool Children

Monitoring and assessing the development of motor skills is an important goal for practitioners in many disciplines as well as for researchers interested in motor development. A well-established tool for this purpose is the Movement Assessment Battery for Children, Second Edition (MABC-2), which covers three age ranges and contains eight motor items in each range related to manual dexterity, aiming and catching, and balance. The main aim of the study was to investigate the reliability and validity of the MABC-2 age band one in a sample of Croatian preschool children. Structural validity was assessed using confirmatory factor analysis (CFA). Measures of relative and absolute reliability were established by computing the intraclass correlation coefficients (ICC), standard error of measurement (SEM), and smallest detectable change (SDC). About 17% of the children in the total sample fell into the categories of motor impairment or risk for impairment, while 83% were in the category of normally developing children. The intraclass correlation coefficient for the total standard score was 0.79, while individual items, all except one, ranged from 0.70 to 0.83. The drawing trail, throwing beanbag, and one-leg balance items presented large SEM and SDC values. CFA initially yielded a model with questionable fit to the data. After re-specification, excellent model fit was attained, confirming the proposed three-factor model. Satorra–Bentler χ²(26) reached 38.56 (p = 0.054), root mean square error of approximation (RMSEA) was 0.028, non-normed fit index (NNFI) was 0.98, adjusted goodness of fit (AGFI) was 0.97, and standardized root mean square residual (SRMR) was 0.030. All the variables loaded significantly, and only two significant standardized residuals were found. Correlations between the factors were weak, supporting discriminant validity of the test.
We found MABC-2 to be an appropriate instrument to assess the development of motor competences of preschool children.


Introduction
Motor competence or "the acquisition and refinement of skillful performance in a variety of movement activities" [1] (p. 158) is thought to be an important aspect of children's engagement and disengagement, not only in physical activity [2] but also in children's social and academic development [3]. Lack of children's social cooperation could possibly result in emotional difficulties, poor social skills, lower academic achievement [4], reduced success within peer groups, or even experience of anxiety and depression [5]. Therefore, monitoring and assessment of the development of [...] √2 × SEM. The above-described approach to reliability has already been adopted in at least two MABC-2 psychometric studies [11,12]. Structural validity is also an important issue in test evaluation. Since MABC-2 was published in 2007, to our knowledge, only two other studies have considered factor structure of age band one, which were evaluated in a normative sample of 431 British children [13] and a sample of 183 Greek children [14]. Apart from establishing the factor structure, several other psychometric studies have been conducted, though only a few of them have been concerned with preschool children, and only two of those evaluated all four age groups in age band one separately [13,15].
Additionally, the current study is also concerned with the issue of test evaluation, in a particular national and cultural context, already raised by Brown and Lalor [16]. The aim of the present study is to assess reliability and validity of MABC-2 age band one, for the sample of Croatian preschool children.

Participants
Participants were 683 children (366 boys and 317 girls) aged 3 to 6 years who attended kindergartens in North-West Croatia. The assessment was conducted during 2017. The sample, subdivided into age groups, is presented in Table 1. Only children without known health issues were included in the sample, and written informed consent was obtained from the children's parents. All of the children were assessed individually according to the test rules. For various reasons (mostly refusal or failed items), 33 children did not complete the whole test, but their completed individual items were used in descriptive statistics. However, since the total test score cannot be obtained unless all of the tasks are done, only the children who completed all of the tasks (n = 650) were included in the reliability and validity analyses.

Ethical Considerations
The parents or guardians of the children signed informed consent before the study began. The consent form clearly stated that the child participated of free will and could withdraw from participation at any time without giving a reason. It also stated that anonymity was guaranteed and that the data would be protected. The subjects were treated according to the Declaration of Helsinki, paying special attention to the paragraphs on vulnerable groups and individuals (§19-20).

Instrument
MABC-2 age band one comprises eight motor tasks which belong to three domains of motor performance: manual dexterity, aiming and catching, and balance. Manual dexterity contains three tasks: drawing trail, posting coins, and threading beads. The first task is scored by the number of errors the child makes, while the latter two are scored as the time in seconds taken to complete them. Aiming and catching consists of throwing and catching the beanbag, which are both scored by the number of successful attempts. Balance includes one-leg balance, which is scored as the time recorded, and walking heels raised and jumping on the mats, which are scored as the number of correct attempts registered.
For two items (threading beads and one-leg balance), the testing included the preferred and non-preferred hand and leg, respectively, and consequently ten raw scores and total test scores were obtained.

Procedure (Assessment)
The test protocol was translated into Croatian in order to standardize instructions for children and to enhance consistency in scoring among raters. Children were assessed individually in kindergartens in quiet and isolated rooms exactly according to the directions provided in the manuals. The data collection was done in 2017.
During the first assessment, four raters independently rated 36 children, thus providing the data for inter-rater reliability. Retest reliability was established by assessing 183 children repeatedly, in a period of 12-16 days after the first evaluation.

Data Analysis
Descriptive statistics were calculated for the whole sample and for the subsamples divided according to age. As noted in the introduction, more than one statistic should be reported in reliability studies [7,8,9]. Therefore, intraclass correlation coefficients (ICC), standard error of measurement (SEM), and smallest detectable change (SDC) were calculated. SEM (as an estimate of absolute reliability) indicates the expected error in the measurement of an individual's score, expressed in real units of measurement, while SDC reflects the confidence interval around that error.
For ICC calculation we adopted the form described by Shrout and Fleiss [10] as ICC(2,1), that is, the two-way random-effects model for absolute agreement. This is expressed in de Vet et al. [17] as: $\mathrm{ICC}_{\mathrm{agreement}} = s^2_{\mathrm{between\,subj.}} / (s^2_{\mathrm{between\,subj.}} + s^2_{\mathrm{trials}} + s^2_{\mathrm{residual}})$. SEM is typically estimated by multiplying the standard deviation by $\sqrt{1 - \mathrm{ICC}}$, but the form of ICC used could substantially affect the result [8]. In addition, it is recommended [7] that the error term from a two-way model be employed, because the one-way model combines random and systematic error. This strategy is adopted in the present research, and SEM is calculated as stated in de Vet et al. [17]: $\mathrm{SEM}_{\mathrm{agreement}} = \sqrt{s^2_{\mathrm{trials}} + s^2_{\mathrm{residual}}}$. Because of the different metrics of the MABC-2 items, SEM was also expressed as a percentage of the mean: $\mathrm{SEM\%} = (\mathrm{SEM}/\mathrm{Mean}) \times 100$.
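The reliability statistics described above can be illustrated with a minimal sketch. It assumes the variance components are estimated from the mean squares of a two-way ANOVA without replication, and that negative variance estimates are truncated at zero; the function name `agreement_reliability` and the test-retest scores are hypothetical and are not taken from the study data.

```python
from math import sqrt


def agreement_reliability(scores):
    """Estimate ICC(2,1), SEM_agreement and SEM% from repeated measurements.

    scores: list of rows, one row per child, one column per trial.
    """
    n = len(scores)       # number of children
    k = len(scores[0])    # number of trials (e.g., test and retest)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for the two-way ANOVA without replication
    ss_subj = k * sum((m - grand) ** 2 for m in row_means)
    ss_trials = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_subj - ss_trials

    ms_subj = ss_subj / (n - 1)
    ms_trials = ss_trials / (k - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))

    # Variance components (assumption: negative estimates are set to zero)
    var_resid = ms_resid
    var_subj = max((ms_subj - ms_resid) / k, 0.0)
    var_trials = max((ms_trials - ms_resid) / n, 0.0)

    icc = var_subj / (var_subj + var_trials + var_resid)
    sem = sqrt(var_trials + var_resid)  # systematic plus random error
    return {"icc": icc, "sem": sem, "sem_pct": sem / grand * 100}


# Hypothetical test-retest raw scores for six children
example = [[12, 13], [15, 15], [9, 11], [14, 13], [10, 12], [16, 17]]
result = agreement_reliability(example)
print(result["icc"], result["sem"], result["sem_pct"])
```

Because the trial variance enters both the ICC denominator and the SEM, a systematic shift between test and retest lowers the agreement ICC and raises the SEM, which is the reason given in the text for preferring the two-way model.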
For the subsamples assembled on the basis of age, all reliability parameters were calculated using raw scores, because we assumed that it is more meaningful to obtain SEM and SDC values in the real unit of measurement than in standard scores.
In order to check the internal structure of the three-factor model proposed by Henderson et al. [6], confirmatory factor analysis was performed in LISREL 8.8 [18]. Due to the intention to directly compare the results, the same fit indexes used in the validity study of the authors of the test [13] were chosen.
We used the Satorra-Bentler chi-square because the data were not normally distributed (Mardia's kappa = 116.0). In concordance with the aforementioned validity study of Schulz et al. [13], we also used the root mean square error of approximation (RMSEA), the non-normed fit index (NNFI), the adjusted goodness of fit (AGFI), and the standardized root mean square residual (SRMR). Criteria for the acceptance of the fit indexes were based on the relevant references from the field of structural equation modeling.

Descriptive and Classification Results
Raw scores obtained in individual MABC-2 tasks are shown in Table 2. The results are not directly comparable between age groups because particular items are performed and/or scored differently at different ages.
While checking the frequencies, we observed that 60% of the children obtained the maximum score in walking heels raised, and 75% performed jumping on the mats with no error. In the subsample of 5- and 6-year-olds, the percentage of maximal results in the jumping task was even higher, with 84% of the children reaching the maximum, thus pointing to a ceiling effect.
Henderson et al. [6] provided a categorization of children according to the traffic light system. In the proposed classification, which is based on percentile scores obtained in a normative sample of British children, children below the 5th percentile (red category) are considered motor impaired, children between the 5th and 15th percentile (yellow category) are at risk for impairment, and children above the 15th percentile are normally developing. Table 3 shows the classification of the present study sample, in which only the children without failed items were considered. About 17% of the children in the total sample fell into the categories of motor impairment or risk for impairment, while 83% were in the category of normally developing children.
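The traffic light classification described above can be sketched as a simple lookup. The function name `traffic_light` is hypothetical, and the handling of scores falling exactly on the 5th and 15th percentiles is an assumption, since the text gives the boundaries only as "below", "between", and "above".

```python
def traffic_light(percentile):
    """Map a total-score percentile to a traffic light category.

    Assumption: exactly the 5th and exactly the 15th percentile
    are both treated as 'yellow' (at risk for impairment).
    """
    if percentile < 5:
        return "red"      # motor impaired
    if percentile <= 15:
        return "yellow"   # at risk for impairment
    return "green"        # normally developing


# Hypothetical percentiles for three children
for p in (2, 9, 50):
    print(p, traffic_light(p))
```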

Reliability
Although 183 children were retested, the children who refused to perform on the retest (n = 1), or those who had fallen into the "red" category on one test occasion and in the "green" category on the other test occasion (n = 9) were excluded from the analysis. We presumed that such a shift from one category to another was a motivational and not ability issue.
Total-sample intraclass correlation coefficients based on the standard scores for individual tasks (Table 4) all ranged, except for one (jumping on mats), between 0.70 and 0.83, while the ICC for the total standard score was 0.79. According to Koo and Li [19], ICC values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.90, and greater than 0.90 are indicative of poor, moderate, good, and excellent reliability, respectively. Intraclass coefficients for each task and for the total test score for single age groups were calculated on raw scores, and their ranges across the items were somewhat wider (Table 4). The ICC for the total test score for 3-year-olds was 0.53, while the ICCs for the other age groups ranged from 0.75 to 0.85. Interestingly, almost all ICC values of the different age groups for fine motor tasks were above 0.70, some of them even above 0.80, while for gross motor tasks some ICC values (catching and throwing) were below 0.60 and 0.50, respectively (Table 4). Similar moderate values were obtained for some balance tasks.
Considering that ICC is highly sample-dependent, we also calculated two sample-independent measures, the standard error of measurement (SEM) and the smallest detectable change (SDC). Since we used raw scores, SEM and SDC are expressed in the units of the original measures. It should be noted that SEM rises as the value of the measure rises; in the current research, however, both measures showed a large range of values because the metrics of the MABC-2 tasks differ between items. To make the SEM values comparable between the items, they were also expressed relatively as a percentage of the mean (SEM%). As shown in Table 5, drawing trail had the largest SEM% in all age groups, along with the one-leg balance items and throwing beanbag. Accordingly, SDC values were also higher in those tasks.
During the first assessment, four raters independently rated 36 children (Table 6). ICC values of 0.88 and 0.86 were obtained for drawing trail and catching beanbag, respectively, while all other values were 0.94 or higher. Using the total test score as an example, the SEM was 3.16, which is relatively low. The SDC95 for the total score was 8.75, meaning that a change in the total score larger than this value is needed to be 95% certain that the change reflects a real change in score rather than variability or measurement error of the tester.
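The relation between the two reported values can be checked with a short sketch. It assumes the conventional definition SDC95 = 1.96 × √2 × SEM, which matches the "√2 × SEM" fragment in the introduction; the function name `sdc95` is hypothetical.

```python
from math import sqrt


def sdc95(sem, z=1.96):
    """Smallest detectable change at the 95% level.

    The factor sqrt(2) reflects that a change score is the difference
    of two measurements, each carrying its own measurement error.
    """
    return z * sqrt(2) * sem


# SEM for the total test score as reported in the text
print(round(sdc95(3.16), 2))
```

With the rounded SEM of 3.16 this gives a value of about 8.76, close to the reported 8.75; the small difference is presumably due to rounding of the SEM before publication.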

Confirmatory Factor Analysis (CFA)
The hypothesized set of relations in the model is shown in the path diagram (Figure 1). The initial model had a simple structure, thus not allowing double loadings. At the beginning of the confirmatory factor analysis (CFA), we checked for outliers; 44 cases (6.62%) with a significant p-value (p < 0.05) for Mahalanobis d-square were found and subsequently removed from the data, which left 606 cases available for the analysis. According to Kline [20], the remaining sample may be regarded as sufficiently large under the rule of thumb according to which the "minimum sample size should be no less than 200 (preferably no less than 400, especially when observed variables are not multivariate normally distributed; p. 111) or 5-20 times the number of parameters to be estimated, whichever is larger" (p. 178).
The multivariate distribution was also assessed, and a Mardia's kappa of 116.10 was obtained, with a standardized value of −3.03 (p = 0.002). As suggested by Byrne [21], kappa values greater than 30 imply a significant departure from multivariate normality; therefore, estimation was carried out using the robust maximum likelihood method with the Satorra-Bentler scaled chi-square (S-Bχ²).
Although some indices of fit were acceptable, we found that only half of the items loaded significantly on their respective factors, and that there were many large standardized residuals. Both indicators of dynamic balance loaded very little on their latent factors and showed large error variance not accounted for by their latent construct. Moreover, drawing seems not to be related to the manual dexterity factor in this particular sample.
After re-specification, which included adding a general motor factor to the model and allowing three error terms to correlate, model fit improved substantially. The re-specified model yielded an S-Bχ²(26) of 38.56 (p = 0.054), an RMSEA of 0.028 (<0.05 preferred; C.I. = 0.0 to 0.046), an NNFI of 0.98 (≥0.95 preferred), an AGFI of 0.97 (>0.90 preferred), and an SRMR of 0.030 (<0.05 indicating good fit). All of the variables loaded significantly on their respective factors, with the exception of drawing and jumping, which loaded directly onto the general motor factor. Only two standardized residuals larger than 2 (2.88, 2.40) were found.
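As a rough plausibility check of the reported fit, the RMSEA point estimate can be recomputed from the chi-square, its degrees of freedom, and the sample size. The sketch below assumes the textbook formula with N − 1 in the denominator; the exact computation LISREL applies under Satorra-Bentler scaling may differ. The function names `rmsea` and `check_fit` are hypothetical, and the cutoffs are those quoted in the text.

```python
from math import sqrt


def rmsea(chi2, df, n):
    """RMSEA point estimate from chi-square, degrees of freedom and sample size."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def check_fit(rmsea_value, nnfi, agfi, srmr):
    """Compare fit indices against the cutoffs quoted in the text."""
    return {
        "RMSEA < 0.05": rmsea_value < 0.05,
        "NNFI >= 0.95": nnfi >= 0.95,
        "AGFI > 0.90": agfi > 0.90,
        "SRMR < 0.05": srmr < 0.05,
    }


# Values reported for the re-specified model (N = 606 after outlier removal)
estimate = rmsea(38.56, 26, 606)
print(round(estimate, 3))
print(check_fit(estimate, nnfi=0.98, agfi=0.97, srmr=0.030))
```

Under these assumptions the recomputed RMSEA rounds to the reported 0.028, and all four indices satisfy the stated criteria.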
The correlations between the factors of aiming and catching and balance (r = 0.33), and manual dexterity and balance (r = 0.19) were moderate and small, respectively, while manual dexterity and aiming and catching were not correlated (r = 0.01).
Correlations were computed using the standard item scores and the total standard score (Table 7). All the items were significantly correlated with the total standard score, with coefficients ranging from r = 0.40 to r = 0.56 at p < 0.01. Most of the inter-item correlations were also significant, ranging from 0.10 to 0.69.

Discussion
The aim of the present study was to investigate the psychometric properties of the MABC-2 age band one in a sample of Croatian preschool children aged 3 to 6 years.
In terms of the categorization of children by motor development, we found the ratio of normally developing children to children at risk or motor impaired to be similar to that in the Greek study [22], where 88% of the children were placed in the normally developing category, while 6.3% and 5.4% were placed in the risk for impairment and impairment categories, respectively.
In contrast, a Brazilian study [15] identified a somewhat lower percentage of children in the normally developing category: 66%, 60%, 69%, and 88% at the ages of 3, 4, 5, and 6 years, respectively. At the descriptive level of our data, we also found that more than half of the children obtained the highest scores on balance tasks. That could be a scoring issue of the scale, but also an issue of intercultural differences [15,16,23].
Confirmatory factor analysis (CFA) was carried out to test the three-factor structure of the original motor domains conceptualization of Henderson et al. [6]. Since the fit of our hypothesized model to the data was rather questionable, model re-specification was conducted. An approach from the study by Schulz et al. [13] was adopted, which included introducing a general motor factor to the model. After several modifications of the model, excellent model fit was attained. Only drawing and jumping did not load on their respective factors, but they both loaded significantly on the general motor factor. All indices of fit were very close to those reported by Schulz et al. [13]. Apart from that study, the work conducted by Ellinoudis et al. [14] was the only one available for comparison. In their study, a clear factor structure and higher loadings were reported. Yet, their sample was rather small (N = 183); apart from chi-square, they reported only two indices of fit, and they did not provide information about residual variance.
Correlations between the factors in the present study were weak, which supported discriminant validity of the test. On the other hand, significant correlations found between items and total test score further confirmed the validity of the test. Inter-tester reliability was found to be excellent, yielding very high consistency among raters.
Retest reliability was calculated using raw scores for individual items because SEM and SDC are more meaningful when expressed in natural units of measurement than in standard scores. Considering the total sample, retest reliability expressed in terms of ICC was more than acceptable with only one item positioned below 0.70. The total standard score ICC was somewhat lower than the levels of reliability obtained in the study by Ellinoudis et al. [14], who reported ICC for the total score of 0.85, while individual items ranged from 0.66 to 0.96. Nonetheless, when compared to the Holm et al. study [11], where raw scores were also used, ICC's were higher. Unfortunately, values of items' ICCs were not stated by the authors of those studies, thus making the comparison with current values impossible.
When the sample in the present study was divided into age groups, the range of ICC coefficients became somewhat wider. However, when comparing the total test scores' ICCs between age groups in our study, the ICC of 0.53 obtained for 3-year-olds was the only unsatisfactory one. Smits-Engelsman et al. [12] obtained a greater ICC coefficient (0.94) for the same age, but individual item scores' ICCs varied in their study as well, namely between 0.67 and 0.85. They also argued that some MABC-2 tasks were more challenging for 3-year-olds than for older children. It may also be presumed that the children's social unfamiliarity with the raters had a certain impact on their motivation for maximal performance and concentration. Thus, changes over time may be more pronounced in the youngest group than in older preschool children.
Nevertheless, from the view of psychometric theory, both high and low ICC values should be taken with caution, because, as stated in Weir [8] "large ICC can mask poor trial-to-trial consistency when between subjects variability is high" and "conversely, a low ICC can be found even when trial-to-trial variability is low if the between subjects variability is low" (p. 237). Weir [8] also pointed out the importance of the source of the error which should be briefly addressed. Namely, error term in ANOVA expresses the interaction of the subjects and trials, where a small error may reflect that scores change similarly in the repeated trials which may result in a significant trial effect, meaning that there is some systematic error present. On the contrary, random error may exist in data when changes between trials are not consistent (some scores fall, some rise). Since a two-way model allows the error to be partitioned, we checked the mean squares in ANOVA output, finding significant trial effect for total test score, throwing the ball, drawing, threading beads, posting coins non-preferred hand, jumping on the mats, and leg balance on the non-preferred leg. This indicates the possible systematic error, caused most likely by the familiarization of the children with the test, known also as "learning effect".
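The distinction between systematic and random error drawn above can be illustrated for the two-occasion case. The sketch assumes exactly two trials (test and retest), for which the F statistic of the trial effect in the two-way ANOVA equals the square of the paired t statistic; the function name `trial_effect` and both data sets are hypothetical and not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def trial_effect(test, retest):
    """Summarise systematic change between two occasions.

    Assumes the change scores are not all identical (non-zero SD).
    """
    diffs = [b - a for a, b in zip(test, retest)]
    n = len(diffs)
    bias = mean(diffs)       # systematic shift, e.g. a learning effect
    spread = stdev(diffs)    # inconsistency of change, i.e. random error
    t = bias / (spread / sqrt(n))
    return {"mean_change": bias, "sd_change": spread, "t": t, "F": t ** 2}


test = [10, 12, 9, 14, 11]

# Every child improves by a similar amount: mostly systematic error
learning = trial_effect(test, [12, 13, 11, 15, 14])

# Some scores rise and some fall: mostly random error
noisy = trial_effect(test, [12, 10, 11, 12, 12])

print(learning["mean_change"], learning["F"])
print(noisy["mean_change"], noisy["F"])
```

In the first data set the mean change is large relative to its spread, so the trial effect is pronounced; in the second the changes largely cancel out, so the trial effect is small even though individual scores are less stable.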
In some of the remaining items, examination of SEM and SDC indicated high random error. Large values of SEM and SDC have been found in all age groups for drawing, one-leg balance and throwing, while aiming and catching seems to be problematic for 3 and 4-year-olds and jumping on the mats for 5-year-olds.
Drawing presented the highest measurement error in our study, and it was also the item with the lowest ICC in the studies by Ellinoudis et al. [14] and Smits-Engelsman et al. [12]. Moreover, drawing has also shown validity issues in the study by Schulz et al. [13], as well as in the present study, which may suggest that drawing at preschool age is more related to other abilities (e.g., visual motor integration) than to fine motor skills alone.
Findings of the present study showed acceptable overall evidence of validity and reliability of the MABC-2 for age band one, suggesting that it can be a useful tool to assess motor competences in Croatian preschool children.

Conclusions
We found MABC-2 to be an appropriate instrument to assess the development of motor competences of preschool children. MABC-2 tasks are intuitive and easy to perform for the children, and the test could differentiate the level of attained motor skills. Dilemmas about the possible cultural limitation of MABC-2 raised here, but also in other studies, should be further investigated.