1. Introduction
Neck pain is the fourth most disabling condition worldwide, with a point prevalence of 5.9% to 38.7% in adults. Not only is it prevalent, but it is also on the rise [1]. Physiological, biomechanical, and psycho-social factors all contribute to the manifestation of neck pain. Because neck pain is a multifactorial condition, it can be challenging to pinpoint the root causes of the pain. Traditionally, patient surveys and questionnaires are used to quantify neck pain [2]. While these tools provide information regarding pain sensitivity and perception, they do not serve as accurate measures of neck pain and, thus, may not reflect the biomechanical functional limitations of the patient. Therefore, a quantitative functional measurement tool is desirable to provide reliable results to assess specific functional status associated with neck disorders.
One way to capture the functional status of the neck is to study the kinematics of its movement. Various measurement tools have been used to quantify neck kinematics. Some studies used more traditional tools such as dynamometers and goniometers to measure body kinematics [3,4,5]. Recent developments in wearable inertial measurement units (IMUs) have provided researchers with more accurate tools that can capture the complex biomechanical kinematics of human motion [6]. However, for such metrics to be meaningful, one must understand the reliability and repeatability of the results produced by the measurements. According to the Six Sigma process, it is a crucial step to analyze and limit the variance of a measurement system, so that variance is attributed to factors being measured by the tool as opposed to measurement system noise or error [7].
Different metrics have been used to assess tool reliability, including intraclass correlations, standard error of measurement, Bland–Altman plots, Cohen’s kappa, and Pearson’s r, yet which method is most appropriate remains debated [8]. Maynard et al. [9] assessed the reliability of gait measures from an optical motion analysis system using the intraclass correlation coefficient (ICC) method and the Bland and Altman test, whereas Lee et al. [10] used Cohen’s kappa to report inter-rater reliability of IMUs when used to measure the motion of different body parts. The advantage of using ICCs is that the reliability quantified through this method enables researchers to assess observer effects [11]. According to McGraw and Wong [12], there are 10 types of ICCs, which are used in different experimental settings. However, most research papers fail to report which type was used for the reported ICC values, making interpretation and comparison of those papers challenging. Thus, it is crucial to state the type of ICC when reporting the reliability of a measurement system.
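To illustrate why the ICC type matters, the following minimal sketch computes two single-measure ICCs from the same table of ratings: a two-way absolute-agreement form and a two-way consistency form. The function name `icc_two_way` and the toy ratings are illustrative assumptions and are not taken from this study; the formulas are the standard two-way ANOVA mean-square expressions, and the sketch does not compute confidence intervals.

```python
def icc_two_way(scores):
    """Single-measure ICCs from a complete subjects-by-raters table.

    scores: list of rows, one row per subject, one column per rater
    (or per session, for intra-rater designs). Hypothetical helper.
    """
    n = len(scores)            # number of subjects
    k = len(scores[0])         # number of raters or sessions
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete n x k table with n >= 2 and k >= 2")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-subject mean square
    msc = ss_cols / (k - 1)                 # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square

    # Absolute agreement penalizes systematic rater offsets; consistency does not.
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    consistency = (msr - mse) / (msr + (k - 1) * mse)
    return {"agreement_single": agreement, "consistency_single": consistency}


# Toy ratings (6 subjects x 4 raters), for illustration only.
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
result = icc_two_way(toy)
print(round(result["agreement_single"], 2), round(result["consistency_single"], 2))
# prints: 0.29 0.71
```

On the same data the two forms differ substantially (about 0.29 versus 0.71) because the raters in the toy table have large systematic offsets, which is why an ICC reported without its type is hard to interpret.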
Previous studies have reported a range of reliability ICCs for IMU-based metrics. Yoon et al. [13] showed the reliability of IMUs in measuring cervical range of motion to be good to high in 33 healthy patients. Carmona-Pérez et al. [14] assessed the reliability and validity of IMUs for the assessment of craniocervical range of motion (ROM) in cerebral palsy patients and showed intra-day reliability of 0.82 to 0.93 when measuring range of motion in all three anatomical planes. Chalimourdas et al. [15] showed the reliability of an IMU-based device designed to measure the cervical range of motion using 36 healthy individuals. The results showed ICC values between 0.54 and 0.9 for a two-way, mixed effect, absolute agreement model. As these examples show, most studies investigate only the range of motion and ignore higher-order measures of motion, such as velocity and acceleration, which contain valuable information about the intrinsic nature of functional status [16].
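As an illustration of what such higher-order measures involve, velocity and acceleration can be estimated from a sampled joint-angle signal by numerical differentiation. The sketch below uses finite differences on an evenly sampled angle trace; it is an assumption for illustration only, not the processing used by the platform in this study, and the helper names, sampling interval, and synthetic signal are hypothetical. Real IMU signals would normally be filtered before differentiation.

```python
def finite_difference(values, dt):
    """First derivative of an evenly sampled signal.

    Central differences in the interior, one-sided differences at the ends.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three samples")
    out = [0.0] * n
    out[0] = (values[1] - values[0]) / dt
    out[-1] = (values[-1] - values[-2]) / dt
    for i in range(1, n - 1):
        out[i] = (values[i + 1] - values[i - 1]) / (2.0 * dt)
    return out


def kinematic_features(angle_deg, dt):
    """Range of motion, peak speed, and peak acceleration magnitude (hypothetical helper)."""
    velocity = finite_difference(angle_deg, dt)        # deg/s
    acceleration = finite_difference(velocity, dt)     # deg/s^2
    return {
        "range_of_motion": max(angle_deg) - min(angle_deg),
        "peak_velocity": max(abs(v) for v in velocity),
        "peak_acceleration": max(abs(a) for a in acceleration),
    }


# Synthetic trace: angle(t) = t^2 degrees, sampled every 0.1 s for 1 s.
dt = 0.1
angle = [(i * dt) ** 2 for i in range(11)]
velocity = finite_difference(angle, dt)
acceleration = finite_difference(velocity, dt)
features = kinematic_features(angle, dt)
```

For this quadratic test signal the interior velocity estimates equal 2t and the interior acceleration estimates equal 2, which gives a quick sanity check of the differencing scheme; the end points are less accurate because they rely on one-sided differences.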
Despite the popularity of IMU-based platforms, there are some drawbacks in using them which make results challenging to interpret. Long-term drift, magnetic interference, and inconsistency are some of the most critical issues involved with IMU-based platforms. However, several methods have been developed to reduce potential issues, including the use of a three-axis magnetometer to measure the magnetic field direction in the proximity of sensors [17].
The goal of this study was to determine the degree to which the cervical spine’s kinematic properties, as measured by an integrated software/IMU-based platform, were reliable. To comprehend the total variation among the raters at various time points, inter- and intra-rater reliability must be assessed. Future research that uses the same or comparable systems will build on this study’s findings.
3. Results
3.1. Demographics
The subjects’ demographic backgrounds are shown in Table 2; 70% (n = 14) of the population were Caucasian, 25% (n = 5) were Asian, and 5% (n = 1) were unspecified. Additionally, the mean values for the participants’ heights and weights were 171.5 ± 9.6 cm and 69.8 ± 11.9 kg, respectively.
3.2. Intra-Rater Reliability
A total of 37 kinematic features during functional motion assessments were extracted and examined for reliability. Detailed descriptions of each measure and its interpretation are presented in Table A1 and Table A2. For each motion task, the intra-rater ICC estimates and 95% confidence intervals are displayed as a function of the cardinal planes. The intra-rater reliability (shown in Table 3) for all motion metrics in the axial plane was excellent, with ICC values of 0.94 (95% CI: 0.90–0.97), 0.94 (95% CI: 0.89–1.0), and 0.92 (95% CI: 0.83–1.0) for axial flexibility, velocity, and acceleration measures, respectively.
In the lateral plane, all flexibility motion metrics also yielded excellent reliability estimates, with mean ICC = 0.94 (95% CI: 0.88–0.97). Furthermore, mean lateral velocity metrics also ranged from good to excellent, ICC = 0.9 (95% CI: 0.85–0.96). In addition, lateral acceleration metrics showed good reliability, with right lateral acceleration showing the highest reliability of ICC = 0.89 (95% CI: 0.81–1.0).
Lastly, we discovered excellent ICC estimates after examining all the intra-rater reliability values in the sagittal anatomical plane, with mean ICC estimates ranging between 0.90–0.95 for flexibility, velocity, and acceleration measures.
For all multi-planar motion metrics of flexibility, velocity, and acceleration measurements, the intra-rater ICC values show good to excellent reliability.
3.3. Inter-Rater Reliability
When compared to intra-rater reliability, the inter-rater reliability results (Table 4) showed a different trend. All motion metrics in the axial plane resulted in good inter-rater reliability, with mean values ranging from 0.79–0.87. Overall, the highest inter-rater reliability for the axial plane was for left axial velocity with ICC = 0.87 (95% CI: 0.77–1.0). With mean ICC estimates ranging from 0.78 to 0.85 in the lateral plane, flexibility and velocity measures yielded good inter-rater reliability estimates, and acceleration metrics showed moderate reliability estimates, with mean ICC estimates ranging from 0.70–0.75.
For the sagittal symmetric tasks, the most reliable metric was sagittal velocity with an ICC = 0.89 (95% CI: 0.82–1.0). The best metric for the sagittal asymmetric tasks was extension velocity (ICC = 0.86, 95% CI: 0.79–1.0). All other metrics for sagittal symmetric and asymmetric tasks produced good reliability.
3.4. Trend of Reliability of Metrics
An interesting trend in the ICC distributions is revealed when the inter-rater and intra-rater data are combined. For instance, in the axial and sagittal planes, velocity metrics exhibited greater ICCs for all the motion assessments relative to both flexibility and acceleration metrics, but flexibility metrics displayed the greatest ICC values in the lateral plane.
Figure 2, Figure 3 and Figure 4 represent ICCs of the inter-rater vs. intra-rater reliability for all of the metrics in the axial, lateral, and sagittal planes, respectively.
These plots illustrate the estimated ICC distributions and show the relative relationship between the various kinematic measures. To support a more comprehensive comparison of the intra-rater and inter-rater mean ICC estimates, the reliability regions for poor, moderate, good, and excellent ratings are shaded in these figures in red, yellow, light green, and dark green, respectively. Circles represent positional measurements, triangles represent velocity measurements, and squares represent acceleration measurements. According to these figures, the mean ICC values for all measures were good for the axial and sagittal planes and moderate to good for the lateral plane. Two key patterns can be identified in these plots. First, the ICC estimates for inter-rater and intra-rater reliability within the sagittal and axial planes of the body were mostly higher than the values for the lateral plane of the body. Second, ICC values for the velocity kinematic metrics were almost always higher compared to the other kinematic metrics in all planes of the body.
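The mapping from an ICC estimate to a reliability rating can be sketched as a simple threshold function. The cut-offs used below (0.50, 0.75, and 0.90) are an assumption based on a commonly cited guideline for interpreting ICCs; the thresholds actually applied in this study are not stated in this section, and the treatment of values that fall exactly on a boundary is likewise a choice made here for illustration. The function name `reliability_band` is hypothetical.

```python
def reliability_band(icc):
    """Map an ICC estimate to a qualitative rating.

    Assumed cut-offs (not confirmed by this section): below 0.50 poor,
    0.50 to below 0.75 moderate, 0.75 to below 0.90 good, 0.90 and above excellent.
    """
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"


# Mean ICC estimates quoted in the Results text above.
examples = {
    "axial flexibility (intra-rater)": 0.94,
    "left axial velocity (inter-rater)": 0.87,
    "sagittal velocity (inter-rater)": 0.89,
}
for name, value in examples.items():
    print(f"{name}: ICC = {value:.2f} -> {reliability_band(value)}")
```

Under these assumed cut-offs the three quoted values are rated excellent, good, and good, respectively, which matches the ratings given in the text. A rating based on the lower bound of the confidence interval rather than the point estimate would be more conservative.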
4. Discussion
This study shed light on the reliability of various cervical kinematic metrics from a wearable IMU-based cervical motion monitoring system. The link between spine kinematics and its use as a functional indicator has been established in the literature [22]. This study further highlights the importance of recording reliable kinematic data to determine the cervical spine status.
Optical motion capture systems are the gold standard for accurate 3D measurement of kinematics but require a large laboratory setup and are expensive to acquire, operate, and maintain. On the other hand, IMUs are available at low cost and, because they are lightweight and flexible, can be used in custom setups and environments. The technique utilized here is highly accurate in measuring spine motion: the IMUs were compared to a high-fidelity vision-based motion capture system and found to be within about 99% of the vision system’s accuracy [20]. Therefore, the reliability values can be attributed to the repeatability of the test, as a result of the variance in raters’ performance and discrepancies over time, rather than to sensor error.
In this study, inter-rater and intra-rater reliability of kinematics measures derived from various tasks showed moderate to excellent reliability. In general, intra-rater reliability ICCs were slightly higher than inter-rater reliability ICCs. It is important to note that the confidence intervals for these ICCs are large, so concluding that intra-rater ICCs are significantly larger than inter-rater ICCs is premature. It is also important to note that the intent of this effort was indeed to “expose” the motion features that have lower ICCs so they can be excluded from use in future algorithms. Many features have high intra- and inter-rater ICC that can be prioritized during future algorithm development while features with low intra- and inter-rater ICC can be avoided.
The main sources of variability truly coming from the raters would primarily be the placement of the harnesses on the subject and any human discretion to re-collect a motion in the case that the subject performed it incorrectly and the software did not flag it. The rest of the protocol is automated, and the subject receives primary instructions from the computer. Given that the software algorithms are intentionally designed to minimize the effect of harness placement (there is a wide range of “correct” harness locations), the slightly inflated variability that is driving this phenomenon is likely the result of the effect of the order being conflated with the effect of rater within a given day.
Another important observation was that velocity-related metrics produced more reliable results than the other metrics. In the axial and sagittal planes, velocity-related metrics were among the most reliable metrics, whereas in the lateral plane, flexibility metrics produced more reliable results. This information can be valuable when developing models for cervical spine status quantification because it identifies the most reliable and useful features.
Our ICCs are comparable to those reported by other studies investigating the cervical spine and, in many cases, are even better. Besides the studies mentioned in the Introduction section, other studies have tried to evaluate the reliability of IMUs to measure cervical motion. Fletcher and Bandy [23] investigated the reliability of measuring cervical active range of motion for 25 individuals with neck pain and 22 individuals without neck pain using a cervical range-of-motion (CROM) device. The results showed ICC(3,1) values from 0.87–0.94 for the subjects without neck pain and 0.88–0.96 for the subjects with neck pain. Anoro-Hervera et al. [24] assessed intra-rater and inter-rater reliability of IMU-measured cervical active range of movements in 20 young asymptomatic adults with two raters. They reported ICC(3,1) values above 0.9 for intra-rater reliability and ICC(3,2) values above 0.75 for inter-rater reliability.
Moreover, using the CROM device, Audette et al. [25] evaluated the test-retest reliability of the range of motion values between two testing days using ICC(3,3). In that study, flexion resulted in the lowest ICC value (ICC = 0.89, 95% CI: 0.73–0.96), and extension resulted in the highest ICC value (ICC = 0.98, 95% CI: 0.95–0.99). Stenneberg et al. [26] calculated inter-rater reliability using ICC(2,1) for two raters conducting the test on symptomatic patients using a smartphone application and a Polhemus Liberty. The study reported flexion/extension to have the lowest reliability (ICC = 0.90, 95% CI: 0.78–0.95) and rotation to have the highest reliability (ICC = 0.96, 95% CI: 0.90–0.98). Reliability measures have also been calculated for the OSI CA 6000 spine motion analyzer, as shown in the study by Petersen et al. [27]. This study evaluated the reliability of the cervical motion measurements of healthy and symptomatic participants. ICC(2,1) was used for intra-rater measures for both subject groups and ICC(2,k) was used for inter-rater measures for healthy subjects. In healthy subjects, the lowest inter-rater reliability measure was reported for extension (ICC = 0.88), and the highest inter-rater reliability measure was reported for right rotation (ICC = 0.94). The lowest intra-rater reliability measure for healthy subjects was seen in flexion (ICC = 0.78), whereas the highest intra-rater reliability for healthy subjects was seen in left-side bend and right rotation (ICC = 0.94). Finally, the lowest and highest intra-rater ICC values for symptomatic participants were seen in flexion (ICC = 0.68) and left-side bend (ICC = 0.96), respectively.
In general, our ICC values indicated that intra-rater reliability is better than inter-rater reliability. This suggests that there was somewhat more agreement between testing days (time points) than there was between raters. Additionally, the participants’ daily variations, rather than the raters or the testing procedure, may be the cause of the measures’ variability. Given that the subjects did not experience any appreciable changes in stiffness or discomfort, it is plausible that uncontrollable additional factors such as sleep, any physical activity prior to participating in the study, nutrition, or psychological and social changes influenced the subjects’ ability to move. It is also likely that a learning effect occurred between sessions, as a deeper analysis indicated that the first test day’s intra- and inter-rater agreement of metrics was worse than that of the same metrics on the following test days.
Several points can be considered limitations of this study. As previously noted, various external factors that may have affected motion evaluations, such as recent intense athletic activities and psychological issues, were not measured. Because motion evaluations given on the same day would probably not exhibit significant differences in these parameters, some of them may be eliminated or reduced by the study design and reliability computations. Furthermore, it was anticipated that repetition of the movements would make participants more familiar with the protocols. Another limitation is that this study included only asymptomatic subjects, which limits the generalization of the results to subjects experiencing neck pain. A further possible drawback is that additional, more complex features beyond the conventional kinematic measurements, which may also serve as reliable metrics, were not investigated in the analysis of the motion signals.
Despite all of the limiting variables, there was relative consistency between and among raters, as shown by the results of good to excellent ICC values for measures with intra-rater reliability and moderate to excellent values for measures with inter-rater reliability.