During the past 15 years, several studies have examined the reliability of clinical measures of the foot. The reliability of these measures has received due attention, as clinicians and researchers have questioned the value of the measures on which interventions are often based. Repeatable measures are essential for clinicians and researchers alike. Both need baseline measures that are reliable so that intervention effects can be accurately assessed. It is essential that measures with demonstrable reliability are adopted for use in clinical and research settings.
Podiatric physicians the world over embraced the Root paradigm of biomechanics until the late 1980s, when critical inquiry of these clinically accepted measures began. [1-4] Developed in the 1970s, the work of Root et al [5] became the fundamental principles for undergraduate podiatric medicine students for at least 20 years. The more recent questioning of these principles has resulted in both healthy debate and investigations into the reliability of these measures. [6-10]
Some studies [11-13] have supported the reliability of traditional foot measures as described by Root et al. [5] Intratester reliability has been consistently found to be stronger than intertester reliability. [7, 11, 14] However, closer scrutiny of these results reveals large error estimates associated with measures, which reduces confidence in measurement precision. [8, 15] Overall, the reliability of the traditional Root measures of foot posture has been found to be lacking. Some individual measures, for example, navicular height, display acceptable reliability, [7, 10] whereas the two “stalwart” measures (resting calcaneal stance position and neutral calcaneal stance position) are less reliable, with the latter involving the elusive concept of the subtalar joint neutral position. [1, 10, 15-17] The subtalar joint neutral position is the crux of the Root theories, about which all feet are said to optimally function and from which all measurements are made. However, this concept has also been a significant barrier to reliability, lacking definition [1] as a starting point for measuring the foot. [10] The unacceptable reliability of the neutral calcaneal stance position is a critical finding and must be appreciated by podiatric physicians who rely on this measure for orthotic prescription.
It is important to distinguish between intrarater and interrater reliability. Intrarater reliability is the consistency of an individual examiner’s repeated measurements, and interrater reliability refers to the consistency of measurements between examiners measuring the same subjects. A measurement found to be reliable for one examiner repeatedly (intrarater) will not necessarily be reliable for all examiners (interrater), which limits the usefulness of such measures for other examiners. Interrater reliability provides greater confidence in the measures and indicates that other examiners would have found similar results. [18]
The absence of acceptable reliability for several common clinical measures has posed a problem for podiatric physicians. However, to date there has been no real alternative presented for clinical and research settings. How can the foot be best assessed and “measured” if many traditional measures are demonstrably unreliable? A variety of observational assessment scales have been developed to try to address this issue. Most scales acknowledge the triplanar nature of the foot and include scaled observations in all three cardinal planes. [11, 19, 20] The most recent scale to be developed is the Foot Posture Index (FPI) (AC Redmond, unpublished booklet, 2000). The FPI is a system for observing and rating foot posture features that includes eight criteria that sum to a total score. The FPI criteria require the examiner to observe and rate foot morphology in all cardinal planes, ultimately scaling it along the customary continuum of pronated to supinated. The reliability of this index has not yet been established. The aim of this study was to determine and compare the interrater and intrarater reliability of the FPI and selected traditional measures of foot position.
Methods
A single-subject repeated-measures design was used to determine both intrarater and interrater reliability. Three groups of subjects were recruited: 29 children (aged 4 to 6 years; 13 girls and 16 boys), 30 adolescents (aged 8 to 15 years; 12 girls and 18 boys), and 30 adults (aged 20 to 50 years; 15 women and 15 men). Mean group ages, overall and by gender, were as follows: children, 4.8 years (girls, 5.0 years; boys, 4.6 years); adolescents, 10.9 years (girls, 10.6 years; boys, 11.1 years); and adults, 35.6 years (women, 37.4 years; men, 33.4 years).
Participants were asymptomatic; consenting individuals were sampled by convenience from a kindergarten, a school, and a university campus. Individuals with a history of foot surgery were excluded. To measure interrater reliability, children were assessed by three examiners each and adolescents and adults were assessed by four examiners each. The examiners had 11 to 15 years of clinical experience. Ethical approval was obtained from the University of South Australia Human Research and Ethics Committee and from the school or kindergarten ethics bodies, as applicable. Informed parental consent was obtained for all participating children.
The clinical foot measures examined were the FPI and a selection of traditional measures of foot position, described in the following sections.
Procedure
This reliability study was conducted on three samples: adults aged 20 to 50 years, adolescents aged 8 to 15 years, and children aged 4 to 6 years. The procedure used for adults, adolescents, and children was the same.
A research assistant was responsible for randomization of examination times, reception of subjects, numbering of subjects, and collection of examination sheets to ensure anonymity and confidentiality.
Subjects entered the examination area and stood on a raised platform. To ensure consistency of the stance position across all testing sessions, a template outline of the feet was made for each subject. Color-coded data-collection sheets (examiner-specific and subject-numbered) were placed next to each limb of each subject. Each examiner performed all measures on each limb of each subject, with no subject’s limbs being examined consecutively by any examiner. Nonconsecutive limb examination was adopted so that an examiner’s rating of the second foot would not be biased by the results for the first foot. Once the data sheets were completed, the examiners placed them in a central collection box. The subjects remained in their positions while the examiners rotated at random between subjects, repeating the procedure until all examiners had performed all weightbearing measures on all subjects. Any pen marks used for measuring were removed by the examining podiatric physician at the end of testing.
Subjects then moved to the plinths, where they lay prone. The examiners measured the forefoot-to-rearfoot relationship in turn, again examining individual subjects’ limbs nonconsecutively. At completion of these measures, these examination sheets were placed in the collection box. For intrarater reliability, subjects were reexamined no less than 4 hours later.
Data Management and Analysis
Data were entered and all analyses were performed using constructed data sets in SPSS version 10 (SPSS Science, Chicago, Illinois) and Microsoft Excel 2000 (Microsoft Inc, Redmond, Washington) software packages.
The FPI assessments yielded categorical (ordinal) data for each individual criterion. Summation of the eight criteria to a total score resulted in continuous (interval) data. Traditional measures yielded only continuous (interval and ratio) data. Paired t-tests were applied to check for systematic differences between repeated measures for each rater.
To determine both intrarater and interrater reliability, categorical data (FPI criterion data) were analyzed using the nonparametric statistical analysis of Spearman’s rank correlation (ρ). Interrater analysis of FPI criteria was performed by calculating the interrater ρ for each examiner pair for each session and then averaging the results for the examiner pairs (adults and adolescents: four pairs and sessions; children: three pairs and sessions).
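The pairwise averaging described above can be sketched as follows. This is a minimal illustration, not the study's SPSS procedure: the function names and the ratings are hypothetical, tied ranks are given their average rank, and the sketch averages over all unordered examiner pairs, which may differ from the pairing scheme the study used.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(a, b):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def mean_pairwise_rho(ratings_by_rater):
    """Average rho over every unordered pair of raters (an assumed pairing scheme)."""
    pairs = list(combinations(ratings_by_rater, 2))
    total = sum(
        spearman_rho(ratings_by_rater[p], ratings_by_rater[q]) for p, q in pairs
    )
    return total / len(pairs)


if __name__ == "__main__":
    # Hypothetical ordinal ratings of one FPI criterion on six limbs.
    session = {
        "rater_1": [0, 1, 2, 1, -1, 0],
        "rater_2": [0, 1, 1, 2, -1, 0],
        "rater_3": [1, 1, 2, 1, 0, -1],
    }
    print(round(mean_pairwise_rho(session), 2))
```

A criterion that every rater scores identically on all limbs has no rank variation, so ρ is undefined for it; a production analysis would need to handle that case explicitly.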
To determine both intrarater and interrater reliability for continuous data (FPI total scores and traditional measures), parametric statistical analyses were used. To determine intrarater agreement, two approaches were used: intraclass correlation coefficients (ICCs) were calculated (model [3,1] based on two-way analysis of variance, mixed effect with consistency) and the standard error of measurement (SEM) with 95% limits of agreement was determined for each rater. To determine average intrarater ICCs, a form of the standardized (z) score was used. Individual rater ICC (r) values were transformed to z scores, the resulting z scores were averaged, and the average z score was then transformed back to an r value. Intraclass correlation coefficients (model [2,k] based on two-way analysis of variance, random effect with absolute agreement) were calculated to determine interrater agreement, and the mean SEMs with 95% limits of agreement were determined across raters. The ICC, widely used for reliability analyses, reflects both correlation and agreement and provides a single index among two or more ratings, which was a requirement of this study. Calculating ICCs also made the results comparable with those of other studies. The SEM was calculated to enhance the clinical application of results. The SEM displays measurement error in the units in which original measurement occurred and hence is related more directly to the clinical setting. Each SEM was calculated with 95% limits of agreement, which contain the measures expected to fall within two standard deviations above and below the mean of the difference scores.
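The two ICC models and the z-score averaging can be sketched as below. This is an illustration under stated assumptions rather than the SPSS analysis itself: the formulas are the standard two-way analysis-of-variance forms for ICC(3,1) and ICC(2,k), the z transformation is assumed to be Fisher's (inverse hyperbolic tangent), and the function names and the 6-by-4 ratings table are hypothetical.

```python
import math


def two_way_mean_squares(table):
    """Return (n, k, BMS, JMS, EMS) for n subjects (rows) rated by k raters (columns)."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ss_error = ss_total - ss_subjects - ss_raters
    bms = ss_subjects / (n - 1)            # between-subjects mean square
    jms = ss_raters / (k - 1)              # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return n, k, bms, jms, ems


def icc_3_1(table):
    """ICC(3,1): two-way mixed effects, consistency, single measure."""
    n, k, bms, jms, ems = two_way_mean_squares(table)
    return (bms - ems) / (bms + (k - 1) * ems)


def icc_2_k(table):
    """ICC(2,k): two-way random effects, absolute agreement, mean of k raters."""
    n, k, bms, jms, ems = two_way_mean_squares(table)
    return (bms - ems) / (bms + (jms - ems) / n)


def average_icc_via_z(iccs):
    """Average coefficients on the z scale (Fisher transform assumed), then back-transform."""
    z_scores = [math.atanh(r) for r in iccs]
    return math.tanh(sum(z_scores) / len(z_scores))


if __name__ == "__main__":
    # Hypothetical ratings: six limbs (rows) measured by four raters (columns).
    ratings = [
        [9, 2, 5, 8],
        [6, 1, 3, 2],
        [8, 4, 6, 8],
        [7, 1, 2, 6],
        [10, 5, 6, 9],
        [6, 2, 4, 7],
    ]
    print(round(icc_3_1(ratings), 2), round(icc_2_k(ratings), 2))
    print(round(average_icc_via_z([0.62, 0.81, 0.90]), 3))
```

For intrarater use, the columns would be one rater's two sessions rather than different raters. Averaging on the z scale rather than averaging the raw coefficients reduces the distortion that arises because correlation-type coefficients are bounded at 1.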
Acceptable levels of reliability were defined, acknowledging that such limits are essentially arbitrary. However, if adopted by convention, such definitions provide useful “benchmarks” for discussion. [23] Intraclass correlation coefficient values greater than 0.75 indicated good reliability, 0.50 to 0.75 indicated moderate reliability, and less than 0.50 represented poor reliability. [18]
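The benchmarks above amount to a simple lookup, sketched here. The function name is hypothetical, and assigning a value of exactly 0.75 to the moderate band is an assumption based on the "greater than 0.75" wording.

```python
def classify_icc(icc):
    """Map an ICC value to the reliability band defined in this study."""
    if icc > 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Interrater ICCs quoted later in the text.
    for label, value in [
        ("normalized navicular height, adults", 0.76),
        ("forefoot-to-rearfoot relationship, adults", 0.70),
        ("navicular drop, adolescents", 0.47),
    ]:
        print(label, classify_icc(value))
```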
Results
Paired t-tests were applied to check for possible systematic differences between repeated measures for each rater. A few were statistically significant (P < .05) in the children’s study (FPI, raters 1 and 2; navicular height, rater 1; and neutral calcaneal stance position, rater 3) and adults’ study (FPI, rater 4; navicular height, rater 4; and forefoot-to-rearfoot relationship, rater 2). No paired-samples tests were statistically significant in the adolescents’ study.
Descriptive statistics were used for each of the measures to indicate the range of foot types in each of the study groups. The children’s study provided the following mean (range) values for each measure: FPI, 6.65 (–1 to +14); navicular height, 31.91 mm (20 to 42 mm); navicular drop, 6.23 mm (0 to 15 mm); resting calcaneal stance position, 4.02° everted (15° everted to 3° inverted); neutral calcaneal stance position, 0.79° inverted (5° everted to 10° inverted); and forefoot-to-rearfoot relationship, 3.43° varus (1° valgus to 11° varus).
In the adolescents’ study, mean (range) values for each measure were as follows: FPI, 5.43 (–4 to +11); navicular height, 39.40 mm (30 to 55 mm); navicular drop, 6.23 mm (0 to 15 mm); resting calcaneal stance position, 3.36° everted (10° everted to 7° inverted); neutral calcaneal stance position, 1.62° inverted (4° everted to 15° inverted); and forefoot-to-rearfoot relationship, 3.15° varus (4° valgus to 15° varus).
In the adults’ study, mean (range) values for each measure were as follows: FPI, 4.98 (–2 to +13); navicular height, 44.68 mm (29 to 64 mm); navicular drop, 7.21 mm (0 to 20 mm); resting calcaneal stance position, 1.88° everted (8° everted to 7° inverted); neutral calcaneal stance position, 1.79° inverted (4° everted to 10° inverted); and forefoot-to-rearfoot relationship, 2.01° varus (13° valgus to 20° varus).
Homogeneity of continuous data for each leg was analyzed to determine whether single-leg data could be pooled (Table 1). From this analysis (using the type [1,1] ICC in its most conservative form), it can be seen that the data were suitable to be pooled. The slight aberrance of navicular height in the children’s data and of resting calcaneal stance position in both the adolescents’ and adults’ data is acknowledged. Pooling had the effect of doubling the sample size, that is, data could reasonably be analyzed on an individual limb basis rather than on the basis of subject numbers. Hence, after this analysis, the children’s study included 3 examiners, 29 subjects, and 58 limbs and the adolescents’ and adults’ studies included 4 examiners, 30 subjects, and 60 limbs.
The children’s study provided less reliable results overall (Table 2). The three examiners involved all demonstrated good intrarater reliability results with the FPI. One examiner demonstrated good intrarater reliability for resting calcaneal stance position (ICC = 0.79), and another examiner demonstrated good intrarater reliability for navicular drop (ICC = 0.89). None of the examiners demonstrated good intrarater reliability for normalized navicular height, neutral calcaneal stance position, or forefoot-to-rearfoot relationship (ICC < 0.75). Interrater results were all moderate to poor. The FPI SEM for raters averaged 1.3 points (±2.6 points), which approximates 4% (±7.5%) of the 33-point FPI scale. The mean SEMs for traditional measures were large, especially for navicular drop (2.5 mm or 16.7% [±32.7%]; normal range, 10 to 15 mm). These results suggest that young children’s feet are less reliably assessed (especially between raters) than are those of older children and adults.
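The percentages quoted for the SEM can be approximated with the arithmetic below. This is a sketch under assumptions, not the authors' stated calculation: the denominator for navicular drop is taken as the 15-mm upper bound of the normal range, and the 95% limits are taken as 1.96 times the SEM, choices that reproduce the navicular drop figures. With the rounded FPI inputs the same arithmetic gives values close to, but not exactly, those quoted, presumably because the published figures came from unrounded SEMs. The function names are hypothetical.

```python
def percent_of_range(value, scale_range):
    """Express an error value as a percentage of the measurement range."""
    return 100.0 * value / scale_range


def sem_as_percent(sem, scale_range, multiplier=1.96):
    """Return (SEM %, 95% limit %) relative to the range; the multiplier is an assumption."""
    return (
        percent_of_range(sem, scale_range),
        percent_of_range(multiplier * sem, scale_range),
    )


if __name__ == "__main__":
    # Navicular drop: SEM 2.5 mm against an assumed 15-mm range.
    print([round(p, 1) for p in sem_as_percent(2.5, 15)])
    # FPI: SEM 1.3 points against the 33-point scale (-16 to +16).
    print([round(p, 1) for p in sem_as_percent(1.3, 33)])
```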
In the adolescents’ study, all examiners achieved good intrarater reliability for the FPI and normalized navicular height (ICC ≥ 0.75) (Table 3). Two examiners achieved good intrarater reliability for forefoot-to-rearfoot relationship, one examiner achieved good intrarater reliability for resting calcaneal stance position, and one examiner achieved good intrarater reliability for neutral calcaneal stance position. Interrater reliability was moderate for FPI, normalized navicular height, forefoot-to-rearfoot relationship, and resting calcaneal stance position and poor for navicular drop and neutral calcaneal stance position. The FPI SEM was similar to that found in the children’s study. Again, navicular drop and forefoot-to-rearfoot relationship showed large measurement errors. Rater 4 showed the greatest error and the least intrarater reliability for forefoot-to-rearfoot relationship.
In the adults’ study, all examiners achieved good intrarater reliability for normalized navicular height and forefoot-to-rearfoot relationship (ICC > 0.75) (Table 4). Three examiners achieved good intrarater reliability for the FPI, and two examiners achieved good reliability for navicular drop, but no examiner achieved good intrarater reliability for resting calcaneal stance position or neutral calcaneal stance position. Interrater analyses found that normalized navicular height displayed good reliability between examiners (ICC = 0.76). Interrater reliability was moderate for forefoot-to-rearfoot relationship and the FPI and poor for navicular drop, resting calcaneal stance position, and neutral calcaneal stance position. The FPI SEM was similar to that found in the younger subject studies; navicular drop showed larger measurement error, as did resting calcaneal stance position. Rater 4 showed the greatest measurement error for forefoot-to-rearfoot relationship.
Analysis of the FPI criterion data revealed that Helbing’s sign was the least reliable criterion across all subjects, with Spearman’s ρ interrater reliability of 0.17, 0.16, and 0.33 for the children, adolescent, and adult groups, respectively. Interrater reliability values were generally poor (Table 5). The FPI criterion analysis across age groups revealed that criteria 4, 5, and 6 showed moderate interrater reliability (except criterion 4 in the adolescent group), whereas the other criteria showed poor reliability.
Discussion
The aim of this study was to determine the intrarater and interrater reliability of a variety of the traditional measures of foot position and the newly developed FPI.
Except for young children, it was apparent that normalized navicular height was the most reliable single measure across subjects and between examiners and therefore is useful to include in examination of the older foot. This finding agrees with that of Weiner-Ogilvie and Rome, [10] who also found small intrarater and interrater differences in navicular height measures. However, it is difficult to compare the findings of this study with those of Weiner-Ogilvie and Rome, as their reliability analysis involved mean difference percentage agreement rather than the ICC. Navicular drop, the counterpart measure of normalized navicular height, largely showed poor interrater reliability (ICCs: children, 0.55; adolescents, 0.47; adults, 0.46). The SEM ranged from 1.8 to 3.5 mm, with a mean of approximately 2.5 mm, which represents an error size of 17% of the normal navicular drop range (10 to 15 mm). This finding is in agreement with that of Picciano et al, [3] who also reported poor interrater reliability of navicular drop (ICC = 0.57, SEM = 2.7 mm), but their study involved two inexperienced testers. Mueller et al [2] reported good reliability for navicular drop (ICC = 0.78 to 0.83, SEM = 1.7 to 2.1 mm), but this study involved only one rater and hence cannot be generalized.
The resting calcaneal stance position and the neutral calcaneal stance position were moderately to poorly reliable between examiners and in most cases also least reliable for repeated measures by the same rater. The SEM ranged from 0.9° to 6.2° for resting calcaneal stance position and from 0.2° to 2.1° for neutral calcaneal stance position. Applying these findings to the clinical setting where both measures are within a small range, the error component contributes approximately 20% to these scores. Findings for these traditional measures are in agreement with some previous studies. [6, 8, 24] Although rater 2 showed good reliability for resting calcaneal stance position in the children’s and adolescents’ studies, examination of the raw data showed that this rater used fewer values when rating adult subjects (ie, 0, 5, and 10), which falsely enhanced reliability results. This was an interesting finding of this reliability study, and numerical preference by raters has been reported elsewhere. [25] Unfortunately, it potentially confounds the data for adult subjects and hence is a limitation of this study.
The interrater reliability of the forefoot-to-rearfoot relationship varied markedly across the study populations (ICCs: children, 0.28; adolescents, 0.53; adults, 0.70). The results were compared with those of Somers and Hanson, [24] who found interrater reliability (ICC) of 0.38 for goniometric assessment and 0.81 for visual assessment. Diamond and Delitto [11] examined right and left limbs separately using goniometric assessment and found interrater (ICC) results for the forefoot-to-rearfoot relationship of 0.58 (right) and 0.77 (left). Although the levels of reliability are similar to those found in the older subject groups of the present study, it is not appropriate to compare these study results because homogeneity was not uniform. The SEM of the forefoot-to-rearfoot relationship was approximately 2° across age groups, which is high for the range of the measure but favorable compared with other measures.
Interrater reliability results (Table 6) offer the more generalizable and thus more useful results for the wider podiatric profession, with intrarater results being pertinent only to individual examiners. Less reliable measures display a larger error range and hence need to be interpreted within the size of the expected range of scores. It is important that generally reliable, low-error measures are developed and used for comparison of intervention effects with baseline findings. It is also important that unreliable measures are identified and discarded when a suitable alternative for subject assessment and orthotic prescription becomes available. Unreliable measures for which there is no current alternative must be used with great caution when repeated measures are planned. Error can potentially account for much of the measurement range. This is clearly the case for resting calcaneal stance position and neutral calcaneal stance position in this and other studies. [6, 8] Use of the FPI, by clinicians and researchers alike, should be considered in light of the error estimates and results of comparison with other measures. The validity of this tool is also yet to be verified.
Results from children were very different from those of the other populations. This was not surprising to the three raters, who found the experience of examining young children to be very different from that of the older subjects. The young children moved more, turned to look and talk, and had to be repeatedly repositioned on their templates. Although each rater displayed good intrarater reliability for the FPI, these results cannot be generalized for other raters, as interrater reliability was only moderate. The average intrarater ICCs (z transformed) showed greatest variation in the children’s study, ranging from 0.137 (neutral calcaneal stance position) to 0.808 (FPI). None of the measures tested exhibited adequate reliability across raters in the children, which implies that young children’s feet need to be examined differently than those of adolescents and adults and that rater skills across age groups cannot be assumed.
Results for the adolescent group showed good intrarater reliability for all raters for the FPI, the only study sample to do so. Interrater reliability of the FPI was improved compared with that found in both adults and children. Average intrarater ICCs (z transformed) were more consistent in this age group, ranging from 0.559 (neutral calcaneal stance position) to 0.916 (FPI). Raters 1, 2, and 3 were more practiced in the methodology protocol for the adolescent group, with that group being the last of the three age groups to be examined. Rater 4 was very familiar with the FPI at commencement of the study.
The adult group was the first subject group to be examined, and for raters 1 to 3 this was their first foray with the FPI. It is possible that practice would improve the interrater reliability of the FPI, as it improved in the morphologically similar but later examined adolescent group. The surprising result from this study was the good intrarater and moderate interrater reliability of the forefoot-to-rearfoot relationship measure, as raters were previously least confident about their consistency with this measure. The average intrarater ICCs (z transformed) were similar to those of the adolescent group.
Several notable points arise from the descriptive data. The average FPI value for each study group was within the range attributed to a normal foot posture (+2 to +7). [21] Although all studies involved subjects with supinated and pronated feet, there were fewer extremes of either foot type than could potentially occur in the clinical setting. Hence the reliability of the FPI in extremely supinated or pronated foot types has not been adequately assessed in this study. Not surprisingly, the average navicular height measure increased with age and concurrent increase in foot size and possibly medial arch development. Average navicular drop, resting calcaneal stance position, neutral calcaneal stance position, and forefoot-to-rearfoot relationship measures were relatively static across all age groups, as were the ranges of these measures, except for the adult forefoot-to-rearfoot relationship range. This is perhaps a more surprising finding, as traditional theorists of foot development have described decreasing resting calcaneal stance position eversion and forefoot-to-rearfoot varus with increasing age. Although a finding of this study, it is acknowledged that sample sizes are too small to provide normative trends of developing foot morphology.
In this study, the type (3,1) and (2,k) ICCs were used in their classic forms for reliability studies as described by Shrout and Fleiss. [26] Use of the type (2,k) ICC enabled comparison with the results of other reliability studies. [2, 7, 11, 13, 24] There is dissent within reliability studies regarding use of the ICC. [26-28] Specifically, Bland and Altman [28] report concerns about the ICC’s lack of discrimination between correlation and agreement. Data that correlate do not necessarily agree, and low ICC values do not allow for discernment between poor correlation and poor agreement (clearly, this is not problematic for high ICC values). Pearson’s correlation and the t-test allow for distinction of correlation from agreement but give two separate results of reliability, as opposed to the single ICC value.
The FPI criterion analysis was performed to investigate the reliability of the eight individual components that sum to the FPI total score. Clearly, a consistently unreliable criterion may reduce the overall reliability of the FPI total score, so identification of this criterion would potentially allow for improvement in FPI reliability. This analysis is ongoing. The FPI combines the scores of the eight individual foot measures (criteria) to give a total score of –16 to +16, with negative scores (less than –2) representing abnormally supinated foot types and positive scores (greater than +12) representing abnormally pronated foot types. The normal foot is believed to score 2 to 7. [21] The summative nature of the FPI has limitations. The summed score is potentially ambiguous, as criteria making up the FPI total score actually reflect different characteristics of foot type, which are not all equal. Therefore, the same FPI total score may be obtained for quite different criterion findings and quite different foot types (Table 7). Ambiguity is a common problem with summative scales. [18] Despite this ambiguity, many summative rating systems are in common use, for example, the SF-36 Health Status Survey [29] and the Foot Health Status Questionnaire. [30] Within such scales it is possible to separate sections to examine smaller “domains.” The FPI has this potential, but further analysis of the various domains, for example, rearfoot versus forefoot segments, is required. Until such analysis is complete, ambiguity will remain a limitation of the FPI.
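The ambiguity of a summed score can be illustrated with a short sketch. The two sets of criterion ratings are hypothetical and are not taken from Table 7, the criterion order is arbitrary, and the per-criterion range of –2 to +2 is an assumption inferred from eight criteria summing to a total of –16 to +16.

```python
ALLOWED_RATINGS = (-2, -1, 0, 1, 2)  # assumed per-criterion range


def fpi_total(criteria):
    """Sum eight criterion ratings into an FPI total score."""
    if len(criteria) != 8:
        raise ValueError("the FPI described here has eight criteria")
    if any(rating not in ALLOWED_RATINGS for rating in criteria):
        raise ValueError("each rating is assumed to lie between -2 and +2")
    return sum(criteria)


# Hypothetical foot A: the pronated features sit in the first three criteria.
foot_a = [2, 2, 2, 0, 0, 0, 0, 0]
# Hypothetical foot B: the pronated features sit in the last four criteria.
foot_b = [0, 0, 0, 0, 1, 1, 2, 2]

if __name__ == "__main__":
    # Both feet receive the same total despite different criterion profiles.
    print(fpi_total(foot_a), fpi_total(foot_b))
```

Reporting subtotals for groups of criteria alongside the total would be one way to expose such differences, in line with the "domains" suggested above.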
An additional limitation of the FPI is that it examines static foot posture by observation of the foot but does not indicate joint ranges of motion. Thus, on its own, it cannot differentiate between rigid and flexible flat feet. Consequently, raters would still need to conduct a range-of-motion assessment (using unreliable techniques) to classify foot type more comprehensively.
This study was limited to asymptomatic subjects and therefore does not cover the complete clinical spectrum of foot types and conditions. In clinical practice it would be expected that a larger proportion of patients would present with foot type extremes. Such an increase in the range of foot types (increasing the range of observed values) may enhance the reliability of the measures.
Conclusion
The FPI showed moderate reliability across the three age groups studied. The FPI total score demonstrated better reliability than most other measures. Intrinsically, the FPI, as a summative scale, is potentially ambiguous, which could limit its usefulness as a clinical or research tool. Further analysis of FPI criterion reliability is required.
As a single clinical measure, normalized navicular height was the most reliable, exhibiting good reliability in the adult population. However, reliability was only moderate in young children, and its use in this population must be questioned.
Finally, although the reliability of the FPI and traditional measures (excluding normalized navicular height in the adult group) was in most cases barely acceptable for rigorous protocol, one should not abstain from measuring a phenomenon because of less-than-desirable reliability. If no suitable and reliable alternative method is available, one must use what is available, interpreting results with due caution.