Intra- and Interobserver Reliability Comparison of Clinical Gait Analysis Data between Two Gait Laboratories

Featured Application: Clinical gait analysis (CGA) is an in vivo method used to measure the movement behavior/gait patterns of patients before and after an orthopedic treatment. It monitors joint kinematics and kinetics under dynamic conditions within the musculoskeletal system. This study contributes to a better understanding of the comparability and significance of motion analysis data recorded in different gait laboratories using different technical qualities.

Abstract: Comparing clinical gait analysis (CGA) data between clinical centers is critical for monitoring treatment and rehabilitation progress. However, CGA protocols and system configurations, as well as the choice of marker sets and individual variability during marker attachment, may affect the comparability of data. The aim of this study was to evaluate the reliability of CGA data collected at two gait analysis laboratories. Three healthy subjects underwent a standardized CGA protocol at two separate centers. Kinematic data were captured using the same motion capturing systems (two systems, same manufacturer, but different analysis software and camera configurations). The CGA data were analyzed by the same two observers for both centers. Interobserver reliability was calculated using single-measure intraclass correlation coefficients (ICC). Intraobserver as well as between-laboratory intraobserver reliability were assessed using an average-measure ICC. Interobserver reliability for all joints (ICCtotal = 0.79) was found to be significantly lower (p < 0.001) than intraobserver reliability (ICCtotal = 0.93), but significantly higher (p < 0.001) than between-laboratory intraobserver reliability (ICCtotal = 0.55). Data comparison between both centers revealed significant differences for 39% of the investigated parameters. Different hardware and software configurations impact CGA data and influence between-laboratory comparisons.
Furthermore, lower intra- and interobserver reliability was found for ankle kinematics in comparison to the hip and knee, particularly for interobserver reliability.


Introduction
Three-dimensional clinical gait analysis (CGA) is an important diagnostic tool within the field of movement disorders [1]. Optical marker tracking systems are considered the gold standard for this type of assessment [2,3]. However, discrepancies in CGA data across multiple analyses can be caused by differences in marker sets [4], observer error [5], marker placement [6] and patient variability [7]. Despite the application of standardized protocols (e.g., marker set, measurement pipeline), incompatibilities and discrepancies within CGA data of the same patient cohorts collected by different institutions are often observed, and have caused reservations concerning the applicability of CGA in multicenter studies [3,4,8]. Ferrari et al. [4] compared the trunk, pelvis and lower limb kinematics of five separate measurement protocols (different marker sets), using a single data pool of subjects. These authors reported good correlations between kinematic variables in the sagittal plane (flexion/extension), but poorer correlations for out-of-sagittal-plane rotations, such as knee abduction/adduction. They also reported good correlations among protocols with similar biomechanical models [4]. Ferrari et al. [4] concluded that the comparison of different measurement protocols results in higher data variability than interobserver and interlaboratory comparisons for most gait characteristics. Evidently, model conventions and definitions are crucial for data comparison. Gorton et al. [8] identified the marker placement procedure of examiners as the largest source of error; the use of a standardized protocol for marker placement decreased data variability by up to 20%. Based on clinical experience, the authors of the current study hypothesized that significant discrepancies would exist between CGA data sets collected using different hardware and software infrastructures or configurations.
Due to the observed limitations in the reproducibility of CGA data, efforts have focused on the reduction of measurement error, with the aim of producing data with the highest possible degree of reproducibility. To improve our understanding of the accuracy and reproducibility of CGA data, cross-laboratory studies must be performed using a single cohort of subjects. Previous research has reported on the repeatability of motion capture data using separate trials, sessions and observers under various conditions for a specific CGA setup or laboratory [5-7,9,10]. However, reports of multicenter repeatability are limited and have focused on differences in hardware configurations, marker placement, as well as between trials and days of measurement [3,8,11]. A validation using the same motion capturing technology, but different hardware and software infrastructures/configurations (e.g., capturing software, camera type), among a single cohort of subjects is lacking.
It is critical for clinicians and researchers to have reliable examination tools to accurately and objectively assess the functional status of a joint [12-16]. A high level of intraobserver reliability is imperative to accurately evaluate the longitudinal effects of the rehabilitation process and to identify differences between subjects [17,18]. A central research question is whether the findings of clinical gait analyses conducted by multiple laboratories are consistent and reliable enough for making clinical decisions, or whether the dependence on observers, repeated measurements or measurement and analysis protocols is too large.
The specific aims of this study were to evaluate the intra- and interobserver reliability and the equivalence of CGA data between two gait analysis laboratories (between-laboratory intraobserver reliability), using the same motion capturing technology but different hardware and software infrastructure/configurations, among a single cohort of subjects. The following two hypotheses were specifically investigated:
1) Intraobserver reliability of CGA data for the lower extremity, obtained for the same cohort of subjects and captured at one laboratory, will be good-to-excellent, whereas interobserver reliability will be fair-to-good.
2) CGA data for the lower extremity, obtained for the same cohort of subjects and captured at two separate laboratories by the same observers using different hardware and software systems, will be equivalent.

Subjects
A standardized measurement protocol was performed repeatedly on three adult subjects (one female, two males) who were asymptomatic at the time of testing and did not display any movement disorders. This cohort had a mean (±SD) age of 33.3 (±4.0) years, mean body weight of 77.3 (±16.3) kg, mean body height of 176.7 (±9.1) cm, and a mean body mass index of 24.3 (±2.9) kg·m−2 (Table A1). The study was approved by the ethical committee of Martin-Luther-University Halle-Wittenberg (approval number: 217/08.03.10/10). Written consent from study participants was obtained prior to data collection.

Measurement Set-Up
Gait analyses were performed in two gait laboratories (Figure 1). Both laboratories used an optical infrared camera-based motion capturing system provided by one manufacturer (Vicon Motion System Ltd., Oxford, UK).

Figure 1.
Data processing chart. Clinical gait analysis (CGA) data were collected on three healthy test subjects (n = 20 barefoot trials per person). The data were captured using the same measurement protocols in both the first (GL1) and the second laboratory (GL2). Three reliability tests were performed with the datasets from GL1 and GL2. Interobserver reliability was tested for two observers using the CGA data collected at GL2. Observer 1 used the data processing software from GL1 (Workstation 4.6), which was different from that of Observer 2 (GL2: Nexus 1.3). To assess intraobserver reliability, the same data set was analyzed a second time by Observer 1 after a time interval of 12 months, using the CGA data collected at GL2 and the software from GL1. To test between-laboratory intraobserver reliability, the CGA data collected at both sites (GL1 and GL2) were analyzed by Observer 1 using the software from GL1.
The first laboratory (GL1) was equipped with a six-camera system (460 Vcam cameras, Workstation 4.6 build 142 software). The second laboratory (GL2) used a six-camera system (MXF-20 cameras, Nexus 1.3 software). These systems thus differed in their respective configurations, both in terms of analysis software (Workstation vs. Nexus) and camera type (Vcam vs. MXF). Both laboratories applied the same sampling rate (200 Hz) and the same reflective markers (14 mm diameter). The capture space represented the full dimensions (length × width × height) of the motion capture room: 7 m × 4 m × 3 m for GL1 and 15.5 m × 8.8 m × 4.8 m for GL2. The capture volume was defined as the region within the capture space in which the motion capturing cameras were able to record the motion task of each subject. The GL1 capture volume was 6 m × 2 m × 2 m; the GL2 capture volume was 10.0 m × 8.5 m × 2.9 m. Captured marker data were processed and trajectories were labeled using the PlugInGait model under a standardized protocol at both laboratories. All kinematic data were Woltring filtered using a mean squared error setting of 10. These gait data were time-normalized to 100% of one gait cycle using gait cycle event detection, based on the available force plates (threshold: 20 N). All subsequent measurement conditions were consistent at both laboratories.
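The force-plate event detection (20 N threshold) and the time normalization to 100% of one gait cycle described above can be sketched as follows. This is a minimal illustration under the stated settings, not the laboratories' actual Workstation/Nexus pipeline; the signal and function names are hypothetical:

```python
import numpy as np

def detect_contact(fz, threshold=20.0):
    """Return (initial_contact, toe_off) sample indices from a vertical
    ground-reaction-force signal, using the 20 N force-plate threshold."""
    loaded = fz > threshold                                # True while the foot loads the plate
    initial_contact = int(np.argmax(loaded))               # first sample above threshold
    toe_off = len(fz) - int(np.argmax(loaded[::-1])) - 1   # last sample above threshold
    return initial_contact, toe_off

def normalize_gait_cycle(angle, start, stop, n_points=101):
    """Resample a joint-angle trace between two gait events
    to 0-100% of the gait cycle (101 points)."""
    cycle = angle[start:stop + 1]
    old_t = np.linspace(0.0, 100.0, len(cycle))
    new_t = np.linspace(0.0, 100.0, n_points)
    return np.interp(new_t, old_t, cycle)
```

With a synthetic force signal, `detect_contact` yields the first and last loaded samples, and `normalize_gait_cycle` returns a 101-point curve whose endpoints match the joint angle at those events.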

Measurement Protocol
Calibration of the optical infrared camera-based motion capturing systems at both laboratories was performed according to the manufacturer's guidelines. This calibration process consisted of two main steps: (i) a static calibration to calculate the origin of the capture volume and define the 3D workspace orientation (x, y, z directions) and (ii) a dynamic calibration to calculate the relative position and orientation of each camera within the capture volume. Calibration quality was checked according to the manufacturer's guidelines.
The PlugInGait marker setup for the lower extremity (kinematic model V 2.3) was used in both laboratories, based on the work of Kadaba et al. [19]. Markers were attached to each subject by the same experienced staff member (Observer 1) at both gait analysis laboratories, according to a standardized protocol for anthropometric measurements, landmark identification and marker mounting.
Each of the three subjects performed 20 barefoot gait trials at a self-selected walking speed, resulting in a total of 60 CGA trials per gait laboratory. Individual gait speed was controlled and standardized between data collection sessions. The same two staff members (Observer 1 and Observer 2) performed data processing and analysis at both laboratories, using a standardized protocol for data processing, labeling and gait event detection. For each reliability analysis variation, the observer started the data processing routine from the original raw data. These raw data were labeled, gait events were detected and possible gaps in reconstructed marker trajectories were filled. The kinematic parameters were extracted using the same template and a standardized workflow for both observers. Kinematic data used for the reliability analysis consisted of specific movement parameters in the sagittal and frontal plane for the hip, knee and ankle. Parameters were selected according to the specifications of Benedetti et al. [20] (Figure 2). Data from the right leg of each subject were used for the analyses of reliability. Three different measures of reliability were assessed (Figure 1):
- Interobserver reliability: the reliability of the two observers, using the same data set, was assessed using separate analysis software (Observer 1: GL1 (Workstation); Observer 2: GL2 (Nexus)).
- Intraobserver reliability: the reliability of the same data set analyzed by the same observer was tested. This assessment was performed with a time interval of 12 months between analyses, using the same software (Workstation). The CGA data for this assessment were collected at the GL2 site.
- Between-laboratory intraobserver reliability: to compare the effect of the laboratory environment and instrumentation while excluding observer-dependent influences, CGA data collected at both laboratories were analyzed by a single observer using the same analysis software.
The following reliability variables were assessed. ICCmean was the average of all parameters for the hip, knee and ankle, respectively. ICCtotal (Equation 1) was the average of hip ICCmean, knee ICCmean and ankle ICCmean for each of the three types of reliability (intraobserver, interobserver, between-laboratory intraobserver):

ICCtotal = (hip ICCmean + knee ICCmean + ankle ICCmean) / 3 (1)
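Equation (1) amounts to a two-stage average: first over the parameters of each joint, then over the three joints. A short sketch (the parameter-level ICC values are hypothetical, for illustration only):

```python
import numpy as np

def icc_mean(parameter_iccs):
    """ICCmean: average ICC over all parameters of one joint."""
    return float(np.mean(parameter_iccs))

def icc_total(hip_iccs, knee_iccs, ankle_iccs):
    """ICCtotal (Equation 1): average of the three joint-level ICCmean values."""
    return (icc_mean(hip_iccs) + icc_mean(knee_iccs) + icc_mean(ankle_iccs)) / 3.0

# hypothetical parameter-level ICCs, not the study values
hip, knee, ankle = [0.90, 0.94], [0.88, 0.92], [0.70, 0.78]
print(round(icc_total(hip, knee, ankle), 3))
```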

Statistics
Descriptive statistics (mean, standard deviation) were based on 20 barefoot trials and were calculated for 31 kinematic parameters in the sagittal and frontal plane for the hip, knee and ankle joints. Reliability analyses were divided into seven parts:
1. For interobserver reliability, a single-measure intraclass correlation coefficient (ICC) was calculated. The number of measures (k) was 60 (20 barefoot trials from each of the three subjects).
2. For intraobserver reliability, an average measure ICC was calculated. The number of measures (k) was 60.
3. To assess between-laboratory intraobserver reliability, an average measure ICC was calculated and again referenced to the same ICC value classification [21]. The ICC indicated excellent reliability if the value was above 0.75, fair-to-good reliability between 0.40 and 0.75 and poor reliability when less than 0.40. A two-way mixed-effects model (definition: absolute agreement) was used for all calculations. For all ICC values, 95% confidence intervals were reported.
4. To estimate the experimental errors of a joint angle, the standard error was calculated for intraobserver (σrepeated), interobserver (σobserver) and between-laboratory intraobserver reliability (σsess(lab)), as described by Schwartz et al. [10], together with the magnitude of the interobserver error and its ratio r to the intrasubject error (Equations 2-4).
5. A scatter-plot technique suggested by Bland and Altman [22] was used to assess the interchangeability (equivalence) of CGA data between laboratories. Calculated differences for joint angles were plotted against their average value for each subject. The interchangeability of CGA data was tested using a bounding criterion defined as the mean ± two standard deviations of the measured differences (covering approximately 95% of all measured values).
6. To evaluate the variability within and between subjects, the standard error of measurement (SEM) was calculated in conjunction with the ICCs, using the following equation from Portney and Watkins [23]:

SEM = SD × √(1 − ICC) (5)

ICC values may be influenced by the inter-subject variability of scores, because a large ICC may be reported despite poor trial-to-trial consistency if the inter-subject variability is high [23,24]. The SEM, however, is not affected by inter-subject variability [24].
7. Mean differences for the multicenter comparison were tested using analysis of variance. A one-factor, univariate general linear model (GLM; dependent variable: hip flexion during stance; independent variable: CGA center; covariate: walking speed) was performed. Prior to inferential statistical analyses, all variables were tested for normal distribution (Kolmogorov-Smirnov test).
All statistical analyses were performed using SPSS version 25.0 for Windows (SPSS Inc., Chicago, IL, USA).
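The ICC and SEM computations described in items 1-3 and 6 can also be reproduced outside SPSS. The sketch below implements the two-way absolute-agreement ICC (in the McGraw and Wong single-measure and average-measure forms) and Equation (5); it is a numpy illustration under those standard definitions, not the SPSS procedure used in the study:

```python
import numpy as np

def icc_absolute_agreement(ratings, single=True):
    """Two-way absolute-agreement ICC (McGraw & Wong ICC(A,1) / ICC(A,k))
    for an (n targets x k observers) matrix of scores."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_dev = x.mean(axis=1) - grand   # target (trial) effects
    col_dev = x.mean(axis=0) - grand   # observer effects
    ms_rows = k * np.sum(row_dev ** 2) / (n - 1)
    ms_cols = n * np.sum(col_dev ** 2) / (k - 1)
    ss_err = (np.sum((x - grand) ** 2)
              - k * np.sum(row_dev ** 2)
              - n * np.sum(col_dev ** 2))
    ms_err = ss_err / ((n - 1) * (k - 1))
    if single:  # single-measure ICC, as used here for interobserver reliability
        return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err
                                     + k * (ms_cols - ms_err) / n)
    # average-measure ICC, as used for the intraobserver comparisons
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

def sem(scores, icc):
    """Standard error of measurement (Equation 5): SEM = SD * sqrt(1 - ICC)."""
    return float(np.std(scores, ddof=1) * np.sqrt(1.0 - icc))
```

For two observers in perfect agreement the single-measure ICC is 1.0; adding observer noise lowers the single-measure value below the average-measure value, mirroring why single-measure ICCs are the stricter choice for interobserver comparisons.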

Results
Because of the controlled and standardized CGA data collection, walking speed and stride length did not differ between laboratory measurement sessions (walking speed: 1.40 ± 0.06 m·s−1 vs. 1.40 ± 0.04 m·s−1; stride length: 1.38 ± 0.09 m vs. 1.40 ± 0.10 m).

Between-Laboratory Intraobserver Reliability
All between-laboratory intraobserver reliability variables fulfilled the assumption of normality. Between-laboratory intraobserver reliability for the entire lower limb (hip, knee, ankle) was fair-to-good (ICCtotal = 0.56, Table 2). Excellent ICC values were calculated for only 29% (9/31) of the parameters (Table 1). Considering the lower limit of the 95% CI, no parameter reached an ICC level of 0.75. The calculated standard errors for between-laboratory intraobserver reliability (σsess(lab)) were larger than those for interobserver and intraobserver reliability (Table 3). The largest standard errors were found for the same parameters as for interobserver reliability: hip adduction (% gait cycle) in the stance phase, with a mean σsess(lab) = 5.4%, and ankle dorsiflexion (% gait cycle) in the stance phase, with a mean σsess(lab) = 4.8%. The largest ratio was again observed for knee abduction in the swing phase (σsess(lab) = 4.2°, r = 5.1), whereas its smallest inter-trial error was 0.8°.

Scatter-Plot Technique Suggested by Bland and Altman [22]
The Bland and Altman [22] scatter-plot technique revealed that the largest (worst) bounding criterion was for the frontal plane knee angle (−2.3 ± 16.1°) (Figure 3d). The hip joint angles (Figure 3a,b), as well as the sagittal plane angles of the knee joint (Figure 3c), had the lowest bounding ranges. Ankle angles in general had a small bounding range (Figure 3e,f), with the exception of the maximum plantarflexion angle during the swing phase (−3.6 ± 10.7°).

Figure 3. Bland and Altman plots presenting the computed differences between the clinical gait analysis data from the first laboratory (GL1) and the second laboratory (GL2) for the (a,b) hip joint, (c,d) knee joint and (e,f) ankle joint. Each data point (test subject 1 = circle, test subject 2 = triangle, test subject 3 = square) represents the computed difference between the CGA data of GL1 and GL2 (ordinate) plotted against their mean (abscissa) for the same healthy test subjects. The solid horizontal line represents the mean of all plotted differences, and the two dashed lines represent the mean ± 2 SD (standard deviation) agreement interval, as defined by Bland and Altman [22].
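The bounding criterion underlying these plots is straightforward to compute from paired measurements. A minimal sketch (the joint-angle values in the usage example are hypothetical, not the study data):

```python
import numpy as np

def bland_altman_limits(gl1, gl2):
    """Bland-Altman agreement statistics for paired measurements from two
    laboratories: per-pair means and differences, the mean difference (bias,
    the solid line) and the mean +/- 2 SD bounding criterion (dashed lines)."""
    gl1, gl2 = np.asarray(gl1, dtype=float), np.asarray(gl2, dtype=float)
    diff = gl1 - gl2                    # ordinate of the plot
    pair_mean = (gl1 + gl2) / 2.0       # abscissa of the plot
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))
    return pair_mean, diff, bias, (bias - 2.0 * sd, bias + 2.0 * sd)
```

Plotting `diff` against `pair_mean` with horizontal lines at `bias` and the two returned limits reproduces the layout of Figure 3.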
Based on the ICCtotal, significant differences were found between all three types of reliability (p < 0.001) (Table 2).

Discussion
The ability to exchange and compare clinical gait analysis data between laboratories is valuable for improving the monitoring of clinical treatments and rehabilitation progressions in patients with musculoskeletal disease. Gait analysis data were thus compared in three ways in the present study: first, to determine interobserver reliability; second, to determine intraobserver reliability; and third, to determine the between-laboratory intraobserver reliability of CGA data between two gait laboratories.
In support of our first hypothesis, the total mean reliability was excellent for interobserver (ICCtotal = 0.79, 95% CI: 0.67-0.86) and intraobserver reliability (ICCtotal = 0.93, 95% CI: 0.87-0.96) (Table 2). Within a single laboratory, several parameters showed fair-to-good or poor interobserver reliability, which suggests an observer influence dependent on subjective observer variability and the different software used for analysis (Workstation vs. Nexus) (Table 1). These results support those of previous work by Schwartz et al. [10] (Table 3). We found excellent intraobserver reliability for one observer processing the data using the same software (excluding observer and software error). Nonetheless, ICC values for inter- and intraobserver reliability fell within the excellent classification for the majority of the observed parameters (interobserver: 71% of parameters; intraobserver: 94% of parameters).
The ankle displayed fair-to-good and excellent interobserver reliability, whereas the knee and hip showed exclusively excellent reliability (Table 2). This outcome is in general agreement with the systematic review of McGinley et al. [3], who reported excellent intra- and interobserver reliability for sagittal-plane parameters of the hip and knee joints, and lower reliability for the ankle joint. However, the observed differences within ankle kinematics may in part be attributed to technical improvements in the applied software (Nexus, version 1.3 vs. Workstation, version 4.6 build 142). Future research is needed to investigate this possibility.
For total mean reliability, we observed fair-to-good between-laboratory intraobserver reliability (ICCtotal = 0.56, 95% CI: 0.16-0.74) and excellent interobserver (ICCtotal = 0.79, 95% CI: 0.67-0.86) and intraobserver reliability (ICCtotal = 0.93, 95% CI: 0.87-0.96). Only 26% of the between-laboratory intraobserver parameters revealed excellent ICC values, whereas the remaining 74% fell within the fair-to-good and poor ranges. Significant differences (p ≤ 0.002) were found in 39% of the compared parameters (Table 4). This fair-to-good and poor reliability may be caused by differences in the measurement system hardware configurations and by slight differences in marker placement between data collection sessions. Both laboratories employed electro-optical motion capturing from the same manufacturer, but used different hardware configurations (capturing software, camera type).
Our second hypothesis was that CGA data could be accurately collected at two gait laboratories, making such data interchangeable. This hypothesis could be affirmed for 61% of the parameters, with significant differences (p ≤ 0.002) in the remaining 39% (Table 4). Between-laboratory intraobserver reliability of CGA data captured at two different gait analysis laboratories had total ICC values at the fair-to-good level, which was lower than the excellent inter- and intraobserver reliability values (Table 2). In general, differences were found at each joint (ankle, knee, hip) for angle as well as for time-dependent (% gait cycle) analyses. The Bland-Altman plots indicated a detectable amount of variation for specific parameters, such as the knee in the frontal plane (Figure 3d), when compared between laboratories. These plots also showed a small data distribution for each individual observer when compared between laboratories, and the largest (worst) bounding criterion for the frontal plane knee angle (−2.3 ± 16.1°) (Figure 3d).
The standard errors for the between-laboratory intraobserver analysis (maximum abduction in the stance phase σsess(lab) = 4.2°, maximum adduction in the swing phase σsess(lab) = 3.4°, maximum second abduction in the swing phase σsess(lab) = 3.2°) support the low between-laboratory intraobserver reliability found for the knee (ICCmean = 0.46, Table 2). These values are twice as large as those for interobserver (maximum first abduction in the stance phase σobserver = 2.4°; maximum adduction/abduction in the swing phase σobserver = 1.5°; maximum second abduction in the swing phase σobserver = 1.3°) and intraobserver reliability (maximum first abduction in the stance phase σrepeated = 0.8°; maximum adduction/abduction in the swing phase σrepeated = 1.0°; maximum second abduction in the swing phase σrepeated = 0.7°). The lower between-laboratory intraobserver reliability and the significant differences observed indicate an influence of the applied motion capturing camera technology on the CGA data captured at the two testing sites. In contrast to the intra- and interobserver comparisons, in the between-laboratory comparison the effect of different hardware (cameras, lenses, analog-to-digital (A/D) converters, etc.), as well as marker removal and replacement between data sets, can also contribute to lower reliability. The reduction in reliability suggests that the detection of tracked markers may be affected by the different camera resolutions investigated (Vcam with 0.3 million pixels, 659 × 439 black/white pixel sensor resolution vs. MXF with 2 million pixels, 1600 × 1280 grayscale pixel sensor resolution). We believe the effect of the measurement protocol and observer dependence is probably small in comparison to the technical differences between the camera generations (i.e., sensor technology), because the ICC values of the one-site interobserver reliability were much higher, falling well within the excellent classification. These findings differ somewhat from those reported by Bucknall et al. [11], who observed apparent differences in motion data captured simultaneously using three camera systems (612, MX-13 and MX-F40).
The results of this study support the findings of Gorton et al. [8] and Bucknall et al. [11], which showed a dependence of CGA data quality on system resources as well as on the application of standardized measurement protocols. The fact that CGA data quality can be affected by different camera types, even ones from the same manufacturer, is problematic, not only in the context of multicenter investigations, but also when comparing CGA data within institutions that have recently updated their laboratory equipment. Based on these findings, when multicenter investigations are considered it is important that both laboratories use the same or comparable camera equipment and software.
A limitation of this investigation was the sample size (n = 3). Each subject performed 20 barefoot gait trials using a self-selected walking speed. Subjects had no gait pathologies, resulting in similar gait patterns. This approach included the risk of correlated observations. On the other hand, it was not feasible to conduct our investigation with actual patients (with gait pathologies) and a powered sample, due to time restrictions in subject preparation (e.g., time required for marker application) and transportation limitations between testing sites.

Conclusions
The results of this study showed higher intraobserver reliability than interobserver reliability for CGA data collected in a single gait analysis laboratory. There was weaker intra- and interobserver reliability for ankle kinematics when compared to the hip and knee. Inter- or intraobserver reliability for data collected at a single laboratory was much stronger than for data collected between laboratories. The outcomes of this study indicate that CGA results are probably influenced by testing conditions, such as laboratory equipment and software (capturing software and camera type). Peer-reviewed literature has reported an effect on the repeatability of motion capture data of separate trials, sessions and observers [5-7,9,10], and multicenter repeatability studies have focused on different hardware configurations, marker placement, as well as between trials and days of measurement [3,8,11]. The results of the current study support these previous investigations, and suggest that when multicenter investigations are considered, it is important that both laboratories use the same or comparable camera equipment and software. The reduced between-laboratory intraobserver reliability implies that comparisons of CGA data between centers with differing measurement equipment are generally not recommended.