The Reliability of the Microsoft Kinect and Ambulatory Sensor-Based Motion Tracking Devices to Measure Shoulder Range-of-Motion: A Systematic Review and Meta-Analysis

Advancements in motion sensing technology can potentially allow clinicians to make more accurate range-of-motion (ROM) measurements and informed decisions regarding patient management. The aim of this study was to systematically review and appraise the literature on the reliability of the Kinect, inertial sensors, smartphone applications and digital inclinometers/goniometers to measure shoulder ROM. Eleven databases were screened (MEDLINE, EMBASE, EMCARE, CINAHL, SPORTSDiscus, Compendex, IEEE Xplore, Web of Science, Proquest Science and Technology, Scopus, and PubMed). The methodological quality of the studies was assessed using the Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) checklist. Reliability assessment used intra-class correlation coefficients (ICCs) and the criteria from Swinkels et al. (2005). Thirty-two studies were included. A total of 24 studies scored “adequate” and 2 scored “very good” for the reliability standards. Only one study scored “very good” and just over half of the studies (18/32) scored “adequate” for the measurement error standards. Good intra-rater reliability (ICC > 0.85) and inter-rater reliability (ICC > 0.80) were demonstrated with the Kinect, smartphone applications and digital inclinometers. Overall, the Kinect and ambulatory sensor-based human motion tracking devices demonstrate moderate–good levels of intra- and inter-rater reliability to measure shoulder ROM. Future reliability studies should focus on improving study design with larger sample sizes and recommended time intervals between repeated measurements.


Introduction
The clinical examination of individuals with shoulder pathology routinely involves the measurement of range-of-motion (ROM) to diagnose, evaluate treatment, and assess disease progression [1][2][3]. The shoulder complex involves the coordination of the acromioclavicular, glenohumeral and scapulothoracic joints, to allow motion in three biomechanical planes, specifically the sagittal, coronal, and axial planes [4]. Forward flexion and elevation occur in the sagittal plane; abduction and adduction occur in the coronal plane; and internal and external rotation occur along the long axis of the humerus [5].
The shoulder joint's complex multiplanar motion presents a challenge for clinicians to accurately measure ROM and upper limb kinematics [6,7]. Prior attempts to implement a global coordinate system to describe shoulder movement and define arm positions in space [8] have failed to gain clinical consensus due to practical difficulties. The biomechanical complexity of the shoulder is demonstrated by the synergy of movements necessary for a person to perform activities of daily living. Activities such as reaching for a high
1. What is the intra- and inter-rater reliability of using the Microsoft Kinect, inertial sensors, smartphone applications, and digital inclinometers to calculate a joint angle in the shoulder?
2. What are the types of inertial sensors, smartphone applications, and digital inclinometers currently used to calculate a joint angle in the shoulder?
3. What clinical populations are utilising motion-tracking technology to calculate the joint angle in the shoulder?
4. Which anatomical landmarks are used to assist the calculation of joint angle in the shoulder?

Materials and Methods
The protocol for this review was devised in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement guidelines [84] and was published with PROSPERO on 8 December 2017 (CRD 42017081870).
The search strategy was developed and refined from previous systematic reviews investigating reliability [85,86]. A database search of MEDLINE (via OvidSP), EMBASE (via OvidSP), EMCARE (via Elsevier), CINAHL (via Ebsco), SPORTSDiscus (via Ebsco), Compendex (via Engineering Village), IEEE Xplore (via IEEE), Web of Science (via Thomson Reuters), Proquest Science and Technology (via Proquest), Scopus (via Elsevier), and PubMed was initially performed on 30 January 2020 by two independent reviewers (PB, DB). These databases were searched from their earliest records to 2020. An updated search was completed on 17 December 2020. Details of the search strategy are found in Supplementary S1. The reference lists of all included studies were screened manually for additional papers that met the a priori inclusion criteria.

Inclusion and Exclusion Criteria
Studies were included if they met the following criteria: published in peer-reviewed journals; measured human participants of all ages; used the Microsoft Kinect, inertial sensors, smartphone applications, or digital inclinometers to measure ROM of the shoulder joint and assessed the intra- and/or inter-rater reliability of these devices; published in English and had full text available. Case studies, abstract-only publications and "grey" literature were not included. Studies investigating only validity, scapular or functional shoulder movements were excluded, as the aim of the review was to examine the reliability of specific shoulder joint movements commonly measured in clinical practice.
The titles and abstracts of studies were retrieved using the search strategy (Supplementary S1) and screened independently by two review authors (PB, DB). Full text versions that met the selection criteria were uploaded to an online systematic review program (Covidence) for independent review by both reviewers (PB, DB). Any disagreements on eligibility were resolved first by discussion between the reviewers and, if necessary, by a third reviewer (WRW).

Data Extraction
A standardised, pre-piloted form was used to extract data from the included studies for assessment of study quality and evidence synthesis. The following information was extracted for each study: bibliometric (author, title, year of publication, funding sources); study methods (study design, country, setting, description and number of raters, type of shoulder joint movements, type of movement (active ROM (aROM) or passive ROM (pROM)), number of sessions, session interval, type and description of technology); participants (recruitment source, number of dropouts, sample size, age, gender, inclusion criteria); anatomical landmarks; statistical methods (type of reliability); and outcomes (intraclass correlation coefficient (ICC) values).

Evaluation of Reliability Results
Reliability was assessed using ICCs; an ICC value approaching 1 was indicative of higher reliability. The level of intra- and inter-rater reliability was determined by the criteria identified by Swinkels et al. [87]. Intra-rater reliability was considered good with an ICC > 0.85, moderate with ICCs of 0.65-0.85, and poor with an ICC < 0.65. Inter-rater reliability was considered good with an ICC > 0.80, moderate with ICCs of 0.60-0.80, and poor with an ICC < 0.60.
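The cut-offs above can be expressed as a small lookup. The following is a minimal sketch only; the function name `classify_icc` and the `kind` labels are illustrative assumptions, not part of the review's methods.

```python
def classify_icc(icc: float, kind: str) -> str:
    """Label an ICC as 'good', 'moderate' or 'poor' using the cut-offs quoted in the text.

    kind is 'intra' (intra-rater) or 'inter' (inter-rater); both labels are
    hypothetical names chosen for this sketch.
    """
    # (good if strictly above, poor if strictly below), per reliability type
    thresholds = {"intra": (0.85, 0.65), "inter": (0.80, 0.60)}
    if kind not in thresholds:
        raise ValueError("kind must be 'intra' or 'inter'")
    good_above, poor_below = thresholds[kind]
    if icc > good_above:
        return "good"
    if icc < poor_below:
        return "poor"
    # Boundary values (e.g. exactly 0.85 for intra-rater) fall in the moderate band
    return "moderate"
```

Under these rules an ICC of 0.82 would be labelled moderate for intra-rater reliability but good for inter-rater reliability, which is why the type of reliability has to be passed alongside the value.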

Evaluation of the Methodological Quality of the Studies
The two review authors independently assessed the methodological quality of each included paper using the latest (2020) Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) Risk of Bias tool [88]. The studies were rated against a specific set of criteria, with nine items assessing reliability standards and eight items assessing measurement error standards. To satisfy item seven of the measurement error standards, the study had to report absolute reliability statistics (standard error of measurement (SEM), smallest detectable change (SDC) or Limits of Agreement (LOA)). Each item was graded on a four-point scale as either very good, adequate, doubtful, or inadequate. The worst-score-count method was applied in accordance with the COSMIN protocol; the overall score was determined by the lowest score awarded for the measurement property, as used in previous studies [89,90].
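The worst-score-count rule described above can be sketched in a few lines. This is an illustration under stated assumptions: the rating labels follow the four-point scale in the text, while the names `RATING_ORDER` and `worst_score` are hypothetical.

```python
# Four-point scale from the text, ordered from worst to best
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def worst_score(item_ratings):
    """Return the overall rating for a measurement property: the lowest item rating."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # list.index raises ValueError for labels outside the four-point scale
    return min(item_ratings, key=RATING_ORDER.index)
```

For example, a study rated very good on every item except one adequate item would receive an overall rating of adequate.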

Data Analysis
Meta-analyses of relative intra- and inter-rater reliability were performed for studies with outcome measures that reported comparable data. Pooled analysis was completed for maximal aROM and pROM. The right-hand dominant value for the healthy, asymptomatic population was included for analysis. Studies with multiple reliability values were pooled and one overall mean result was reported. If a single study reported values for more than one rater, the mean value was reported. Reflecting clinical practice, any reliability values taken in supine position were included in the pROM analysis, and the standing or sitting positions were included in the aROM analysis.

Flow of Studies
A flowchart of the different stages of the article selection process is outlined in Figure 1. From the 2006 studies identified, 32 studies were found to meet the criteria for inclusion. In total, nine studies reported reliability for the Microsoft Kinect; six studies for wearable inertial sensors; seven studies for smartphone/mobile applications; and ten studies for digital inclinometers or goniometers.

Description of Studies
The characteristics of the included studies are summarised in Table 1. A total of 1117 participants were included in this review, with a mean age ranging from 17.0 to 56.1 years of age. The mean sample size was 35 participants with a considerable range (minimum, 1; maximum 155) and variance (SD, 32.1). Six studies recruited more than 50 participants and five studies recruited fewer than ten participants. In 13 of the studies, there was a higher percentage of women compared to men. Most studies (n = 26) recruited participants who were healthy and asymptomatic. Participants with shoulder pain or pathology were reported in six studies.
A physical therapist (PT) was the most reported type of rater (n = 12 studies). In six studies the rater was a medical practitioner (MP), and in two studies a PT student was the sole primary rater. Thirteen studies did not report the profession of the rater.
The shoulder movements assessed across all studies included flexion, extension, abduction, external rotation, internal rotation, and scaption. A total of 24 studies only assessed aROM; eight studies assessed pROM, and two assessed both. The most common measuring position was standing (n = 10 studies), followed by seated (n = 6 studies) and supine (n = 3 studies). There were twelve studies that used a combination of supine and standing, side-lying, prone or seated positions. Only one study did not report the position used.
The majority of studies (n = 25) reported two sessions; five studies had one session, and two studies involved three sessions. The time interval between assessments varied considerably from 10 s to 7 days. The most common consecutive measurements were on the same day (n = 13) followed by 7 days (n = 5).

Intra- and Inter-Rater Reliability
Results for intra- and inter-rater reliability are shown in Table 2. The last column of Table 2 indicates the level of reliability, grouped by type of device, and includes the shoulder movement assessed.

Table 2. Intra-rater and inter-rater reliability (95% CI) for measurement of shoulder range of motion by device and movement direction.

Inertial Sensors
One study assessed intra-rater reliability [114], three studies assessed inter-rater reliability [96,117,118] and two studies assessed both [100,102]. Three studies reported moderate to good intra-rater reliability using one, two or four wearable inertial sensors [100,102,114]. Inter-rater reliability was good or moderate in four studies for shoulder flexion, extension, abduction, external and internal rotation [96,102,117,118]. A wider range of poor (ICC < 0.60) to good inter-rater reliability was reported in two studies for shoulder abduction, external and internal rotation [100,102].

Smartphone/Mobile Applications
Five of the seven studies [95,110,113,116,120] assessed both intra-rater and inter-rater reliability. All shoulder movements across most of the studies demonstrated moderate or good levels of intra- and inter-rater reliability. Only one study reported a wider range of reliability values, between poor and good, for flexion and scaption [116].

Digital Inclinometer/Goniometer
Two studies assessed intra-rater reliability [104,121], one study assessed inter-rater reliability [103], and seven studies assessed both [91,97,98,108,109,115,119]. Intra-rater reliability was predominantly moderate to good for all shoulder movements (n = 7). Two studies reported poor to moderate intra-rater reliability for external and internal rotation [91,104]. Poor inter-rater reliability was reported in five studies [91,103,108,115,119]. Only two studies reported good intra- and inter-rater reliability for all shoulder movements [97,109].

Methodological Evaluation of the Measurement Properties
Of the thirty-two included studies, only two [109,110] scored very good on all items of the COSMIN reliability standards checklist. A total of 24 studies scored adequate, five were rated doubtful and one was rated inadequate. Table 3 lists the COSMIN standards of reliability checklist and all subsequent scores.
Using the COSMIN criteria, only one study [109] was found to have a very good score on all items for the measurement error standards. A total of 18 studies scored adequate, with two rated doubtful and 11 rated inadequate. Table 4 lists all items of the COSMIN standards on measurement error checklist and the subsequent paper scores.

Synthesis of Results (Meta-Analysis)
ICC values were included from studies with n > 1 participant included in intra- and inter-rater reliability analysis. The ICC values for outcome measures (aROM or pROM for abduction, flexion, internal rotation, external rotation) were individually assessed based on motion and grouped by method (K, SP, DG, DI and IS) to produce a pooled correlation with a 95% CI (Figures 2-4).
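The text does not state how the pooled correlation and its 95% CI were computed. As one common possibility, and purely as an assumption, the sketch below pools correlation-type coefficients with a fixed-effect Fisher z-transformation weighted by n - 3. Treating ICCs as ordinary correlations in this way is itself an approximation, and `pool_correlations` is a hypothetical name.

```python
import math


def pool_correlations(studies):
    """Pool correlation-type coefficients via Fisher's z (fixed-effect sketch).

    studies: iterable of (coefficient, sample_size) pairs.
    Returns (pooled, ci_low, ci_high) on the original correlation scale.
    Assumption: this is NOT confirmed as the method used in the review.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for r, n in studies:
        if n <= 3:
            raise ValueError("each study needs n > 3 for the n - 3 weight")
        if not -1.0 < r < 1.0:
            raise ValueError("coefficient must lie strictly between -1 and 1")
        weight = n - 3  # inverse of the variance of Fisher's z
        weighted_sum += weight * math.atanh(r)
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no studies supplied")
    z_pooled = weighted_sum / total_weight
    se = 1.0 / math.sqrt(total_weight)
    # 1.96 is the two-sided 95% normal quantile
    low, high = z_pooled - 1.96 * se, z_pooled + 1.96 * se
    return math.tanh(z_pooled), math.tanh(low), math.tanh(high)
```

A random-effects model would usually be preferred when protocols, positions and raters differ as much as they do across the included studies; the fixed-effect form is shown only because it is the shortest to read.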

Anatomical Landmarks
Twenty-three studies identified the anatomical landmarks for each device; these are summarised in Table 5. A total of six studies reported using a vector from the shoulder joint to the elbow for the Microsoft Kinect [92][93][94][105].
Five studies identified the anatomical landmarks for inertial sensor placement [100,102,114,117,118]. All studies used a sensor located on the upper arm that was either unspecified (n = 2), placed on the middle third of the humerus (n = 3), or attached 10 cm distal to the lateral epicondyle (n = 1). Two studies placed a sensor on the flat part of the sternum [100,102]. Only two studies reported using a lower arm sensor on the wrist [102,118].
Anatomical positions for smartphone device placement were described in five studies [95,110,111,113,120]. The most common attachment was on the humerus (n = 3) followed by positions at the wrist (n = 2).
Seven studies reported anatomical landmarks for digital inclinometers [98,104,108,109,115,119,121]. Locations were predominately determined by the type of shoulder movement performed, orientation, and assessment position.

Discussion
Thirty-two studies investigating four different types of devices were included in this review. A thorough search of relevant literature found no previous systematic review of intra-rater and inter-rater reliability of the Microsoft Kinect and ambulatory sensor-based motion tracking devices to measure shoulder ROM.
Good intra-rater reliability for multiple types of shoulder movement was demonstrated with the Kinect [105,107], smartphone applications [95], and digital inclinometers [97]. The Kinect consistently demonstrated higher intra-rater ICC values over other devices for all shoulder movements. Only one study reported poor intra-rater reliability for measuring shoulder extension with the Kinect [99]. Overall, inertial sensors, smartphones, and digital inclinometers demonstrated moderate to good intra-rater reliability across all shoulder movements.
Good inter-rater reliability for more than one type of shoulder movement was demonstrated with the Kinect [101,112], smartphone applications [95,111], and a digital inclinometer/goniometer [97,98]. Inertial sensors predominantly exhibited moderate to good inter-rater reliability across all types of shoulder movements. Broader ranges of inter-rater reliability (from poor to moderate) were more commonly reported with digital goniometers.

Quality of Evidence
All included studies and measurement properties were assessed for their methodological quality using the COSMIN tool. The methodological quality ranged from doubtful to very good for reliability standards. The strict COSMIN criteria of using the worst-score counts to denote the overall score resulted in only two very good studies [109,110], which reported moderate or good reliability for using a digital inclinometer and a smartphone device. An adequate rating was scored by five studies for the Kinect, six studies for inertial sensors, five studies for smartphone applications, and eight studies for digital inclinometers/goniometers.
Five studies missed achieving an overall very good rating due to receiving only an adequate score for the time interval between measurements (COSMIN item two). The authors acknowledge an appropriate time interval depends on the stability of the construct (COSMIN item one), and the target population [88]. The time interval must be adequately distanced to avoid recall bias, yet within a compact enough window to distinguish genuine differences in measurements from clinical change [123][124][125]. Studies had a time interval ranging from the same day (22/32) to 7 days (4/32) between two repeated measurements. Ideal time intervals of 2-7 days have been recommended to minimise the risk of a learning effect, random error, or other modifying factors that can affect the movement pattern [126,127].
Small sample sizes contributed to five studies scoring doubtful or inadequate, in accordance with COSMIN item six. An insufficient sample size may not detect true differences and reduces the power of the study to draw conclusions [128]. Of the 32 included studies, a power analysis for sample size calculation was reported in only four (12.5%) studies [97,98,100,104]. The latest COSMIN checklist has removed the standards for adequate sample sizes, as the authors suggested that several small high-quality studies can together provide good evidence for the measurement property [129]. The guidelines recommend a more nuanced approach that considers several factors including the type of ICC model. Studies with small sample sizes were considered acceptable if the authors justified the reasons outlining its adequacy [129]. Therefore, for methodological quality, reviewers scored sample sizes of 1 inadequate, <10 doubtful, <30 adequate and ≥30 very good. This criterion was based on literature citing a rule of thumb of recruiting 19-30 participants when conducting a reliability study [130][131][132].
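The reviewers' sample-size rule stated above maps directly onto a short function. The sketch below only restates those thresholds; the name `score_sample_size` is a hypothetical label.

```python
def score_sample_size(n: int) -> str:
    """Rate a study's sample size using the reviewers' thresholds quoted in the text."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    if n == 1:
        return "inadequate"
    if n < 10:
        return "doubtful"
    if n < 30:
        return "adequate"
    return "very good"  # n >= 30
```

Because the overall COSMIN rating takes the worst item score, a study with fewer than ten participants cannot be rated above doubtful under this rule, whatever its other item scores.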
With respect to measurement error assessment, just over one-half of the studies (18/32) scored adequate, and one scored very good for methodological quality. Eleven studies were rated inadequate, as they failed to calculate SEM, SDC, LoA or CV values (COSMIN item seven). Two studies [92,107] were rated doubtful due to minor methodological flaws (COSMIN item six); notably, this strict item offered reviewers no adequate option.
Reliability and measurement error are inextricably linked, and a highly reliable measurement contains little measurement error. A clinician can confidently verify real changes in patient status if the measured change from the last measurement is larger than the error associated with the measurement [133]. The minimal detectable clinical difference at a 90% confidence level (MDC90) is the minimal value to determine whether a change has occurred [72]. MDC values are open to interpretation and are based on clinical judgement.
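The link between reliability and measurement error can be made concrete with two widely used formulas, SEM = SD × √(1 − ICC) and MDC90 = 1.645 × √2 × SEM. These are standard textbook expressions and are given here as an assumption; the included studies may have derived their SEM and MDC values differently. The function names are hypothetical.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD * sqrt(1 - ICC); sd is the between-subject standard deviation in degrees."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1.0 - icc)


def mdc90(sem: float) -> float:
    """MDC at 90% confidence: 1.645 (normal quantile) * sqrt(2) (two measurements) * SEM."""
    return 1.645 * math.sqrt(2.0) * sem
```

With an illustrative standard deviation of 10° and an ICC of 0.91, the SEM is about 3° and the MDC90 about 7°, so a measured change smaller than roughly 7° could not be distinguished from measurement error under these assumptions.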
For shoulder ROM measurement, differences between observers which exceed 10° are deemed unacceptable for clinical purposes [103].
The Kinect and inertial sensors demonstrated low SEM and MDC values for measuring most types of shoulder movements [96,99,102,106]. Similarly, for the Kinect, low CV values (1.6%, 5.9%) were reported for shoulder abduction [93]. Smartphones had moderate SEM and MDC values, with better (smaller) errors demonstrated for intra-rater analysis [118], abduction and forward flexion [122], and higher target angles [116]. One study comparing smartphone measurements with universal goniometry used Bland-Altman plots, which indicated narrow LoA and excellent agreement, particularly for glenohumeral abduction [111].

Clinical Implications
The Microsoft Kinect is an affordable depth imaging technology that can conveniently and reliably measure shoulder aROM. As a low-cost markerless system, the Kinect can provide clinicians with fast, real-time objective data to quantify shoulder kinematics. The Kinect's visual feedback can aid patient motivation by supporting the monitoring of treatment and disease progression. The massive amounts of kinematic data generated potentially allow clinicians to analyse shoulder motion paths and correlate specific movement patterns to shoulder pathology [94]. Moreover, higher clinical efficiency arises from relying less on time- and labour-intensive patient-reported outcome measures. The portability of the Microsoft Kinect compared with expensive motion capture systems permits its practical use in private clinics, rehabilitation centres, and home settings [107].
All studies were limited to motion performed along the anatomical planes. The simplicity of calculating the angles between two corresponding vectors does not take into account movements that occur outside the plane. In contrast to goniometric measurements, Lee et al. [60] found subjects could abduct their shoulders to a greater degree in front of the Kinect because their movements were not controlled in a given anatomical plane by an examiner. The authors performed a supplementary experiment that compared goniometric and Kinect shoulder measurements in rapid succession within three cardinal planes. Results demonstrated a significant decrease in 95% limit of agreement between both methods in all directions. It was concluded that the variability was due to the unrestricted motion of the Kinect.
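The two-vector angle calculation mentioned above can be sketched as follows. This is a simplified illustration, not the method of any included study: the coordinate convention (y pointing up), the downward trunk reference vector, and the function names are all assumptions.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two 3D vectors, from the normalised dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def shoulder_elevation(shoulder, elbow, trunk_down=(0.0, -1.0, 0.0)):
    """Elevation of the upper arm relative to a downward reference vector.

    shoulder and elbow are (x, y, z) joint positions; trunk_down is an
    assumed reference direction with y as the vertical axis.
    """
    upper_arm = tuple(e - s for s, e in zip(shoulder, elbow))
    return angle_between(upper_arm, trunk_down)
```

In this sketch an arm raised sideways (along x) and an arm raised forwards (along z) both return 90°, which illustrates the limitation noted above: a single angle between two vectors carries no information about the plane in which the movement occurred.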
With respect to reliability, one study reported lower repeatability with the Kinect in the frontal and transverse planes compared to the sagittal plane [94]. Another study reported large discrepancies for precise shoulder angle measurements with the Kinect [106].
Discrepancies between standing, sitting, and lying positions can also be a source of difference for shoulder ROM measurements [134,135]. One study [60] reported discrepancies between goniometric shoulder ROM measurements with seated subjects and Kinect ROM measurements for standing subjects. The authors attributed this result to the limitation of the Kinect's skeletal tracking, which is optimised for standing rather than sitting. Moreover, better accuracy for the Kinect has been reported for standing postures [136]. Therefore, adequate patient positioning and protocol standardisations are essential to reduce measurement error [105]. Suggested examples include placing coloured footprints on the floor and fixating the Kinect sensor bar [105].
Wearability and usability are two aspects to consider for implementing sensor-based human motion tracking devices in clinical practice. For the included studies, IMUs were most often positioned on the upper arm with additional placements on the sternum and wrist. Methods of fixation included double-sided adhesive tape with an elastic cohesive [100], an elastic belt [114], and Velcro straps [118]. Smartphone devices were attached by commercial armbands [110,111,116,120] or were hand-held by the examiners [95,113,122]. Most notably, no studies reported any calibration issues, and only one study [111] reported attachment difficulties.

Limitations
There were some limitations in this review. First, no additional search was performed for grey literature, and only studies written in English were included. Although authors identified an additional six reliability studies, they were excluded because they did not assess ICCs.
Second, the authors acknowledge the limitations of the revised COSMIN methodological quality tool, as it was primarily developed to assess risk of bias and not study design. Although more user-friendly than the original version, the omission of a sample size criterion leaves open a wider interpretation as to what constitutes an adequate sample size. Furthermore, no standards exist regarding the types of patients, examiners (well-trained or otherwise), and testing procedures. Future studies can apply other tools such as the modified GRADE (Grading of Recommendations Assessment, Development, and Evaluation) approach to address these issues [137]. Additionally, because the revised COSMIN guidelines are a relatively new update, caution should be exercised when interpreting and comparing these results with prior studies that used the original COSMIN checklist.
Third, our meta-analysis was limited by the heterogeneity of the studies, given the variance in sample sizes, protocols, shoulder positions, and number of raters. Several studies did not report the 95% confidence intervals for ICCs. Furthermore, the calculation methods for ROM angles with the Kinect represent a potential source of difference across studies. Therefore, the general conclusions should be interpreted with caution.
Lastly, for reasons mentioned earlier, the authors did not examine validity, the degree to which a tool measures what it claims to measure. However, given the potential variety and the lack of any agreed-upon "gold standard" tool identified in the literature, a separate review is warranted to address validity. Reliability should always be interpreted with validity in mind to provide a complete assessment of the clinical appropriateness of a measuring tool.

Future Directions
Future reliability studies should focus on improving study design, with larger sample sizes (>80 participants) [138] and set recommended time intervals (2-7 days) between repeated measurements to increase confidence with results. Moreover, further investigations should report on absolute measures of reliability or measurement error to improve the overall risk of bias.

Conclusions
The primary result of our systematic review is that the Kinect and ambulatory sensor-based human motion tracking devices demonstrate moderate to good levels of intra- and inter-rater reliability to measure shoulder ROM. The assessment of reliability is an initial step in recommending a measuring tool for clinical use. Future research including the Kinect and other devices should investigate validity in well-designed, high-quality studies.