The Intra- and Inter-Rater Reliability of a Variety of Testing Methods to Measure Shoulder Range of Motion, Hand-behind-Back and External Rotation Strength in Healthy Participants

This study determined the intra- and inter-rater reliability of various shoulder testing methods to measure flexion range of motion (ROM), hand-behind-back (HBB), and external rotation (ER) strength. Twenty-four healthy adults (mean age of 31.2 and standard deviation (SD) of 10.9 years) without shoulder or neck pathology were assessed by two examiners using standardised testing protocols to measure shoulder flexion with still photography, HBB with tape measure, and isometric ER strength in two abduction positions with a hand-held dynamometer (HHD) and novel stabilisation device. Intraclass correlation coefficient (ICC) established relative reliability. Standard error of measurement (SEM) and minimum detectable change (MDC) established absolute reliability. Differences between raters were visualised with Bland–Altman plots. A paired t-test assessed for differences between dominant and non-dominant sides. Still photography demonstrated good intra- and inter-rater reliability (ICCs 0.75–0.86). HBB with tape measure demonstrated excellent inter- and intra-rater reliability (ICCs 0.94–0.98). Isometric ER strength with HHD and a stabilisation device demonstrated excellent intra-rater and inter-rater reliability in 30° and 45° abduction (ICCs 0.96–0.98). HBB and isometric ER at 45° abduction differed significantly between dominant and non-dominant sides. Standardised shoulder ROM and strength tests provide good to excellent reliability. HBB with tape measure and isometric strength testing with HHD stabilisation are clinically acceptable.


Introduction
Reliable measurements of shoulder joint range of motion (ROM) and strength are essential during a clinical examination to diagnose shoulder pathology, evaluate treatment efficacy and quantify changes in joint mobility or muscle force [1][2][3]. Clinicians must be convinced that a change in ROM or strength is due to a genuine change in the patient's status rather than an error in measuring instruments or methods, especially in patients with chronic health conditions who have repeated measures taken over time [4].
Methods for measuring shoulder joint ROM include goniometry [5,6], visual estimation [7,8], inclinometry [9,10], tape measure [11,12], still photography [13,14], inertial sensors [15,16], smartphone devices [17,18] and markerless 3D motion-tracking systems [19,20]. Goniometry is widely used in clinical settings to quantify ROM since it is affordable and portable. However, using two hands in goniometry makes it difficult to stabilise the joint properly and increases the risk of measurement error [21]. Furthermore, the time taken to accurately position the goniometer against the anatomical landmarks of the shoulder can exacerbate the patient's symptoms.
A priori sample size calculation was based on the methods of Walter et al. [45]. Assuming a significance level (α) of 0.05, a probability of type II error (β) of 0.2, a minimal acceptable level of reliability (ρ0) of 0.7, and an expected reliability (ρ1) of 0.9, a sample size of 18 participants would be required.
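The sample size figure can be approximately reproduced with the closed-form approximation usually attributed to Walter et al. The sketch below is illustrative only: the function name `walter_sample_size` is hypothetical, the test is assumed to be one-sided, and the number of replicate measurements per participant (`n_obs = 2`) is an assumption, since the text does not state the value used.

```python
import math
from statistics import NormalDist


def walter_sample_size(rho0, rho1, n_obs, alpha=0.05, beta=0.2):
    """Approximate number of participants for an ICC reliability study.

    rho0:  minimal acceptable reliability
    rho1:  expected reliability
    n_obs: replicate measurements per participant (assumed, not from the text)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided significance level
    z_beta = NormalDist().inv_cdf(1 - beta)    # power = 1 - beta
    theta0 = rho0 / (1 - rho0)
    theta1 = rho1 / (1 - rho1)
    c0 = (1 + n_obs * theta0) / (1 + n_obs * theta1)
    numerator = 2 * (z_alpha + z_beta) ** 2 * n_obs
    denominator = math.log(c0) ** 2 * (n_obs - 1)
    return 1 + numerator / denominator


subjects = walter_sample_size(rho0=0.7, rho1=0.9, n_obs=2)
required = math.ceil(subjects)  # round up to whole participants
print(required)
```

Under these assumptions the approximation rounds up to the 18 participants quoted above; a different number of replicates would change the result.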

Raters
Rater A and Rater B were registered physiotherapists at Prince of Wales Hospital with 10 years and 2 years of rotating hospital clinical experience, respectively. Both raters were blinded to the muscle strength data, and a third independent unblinded rater recorded all measurements. Rater A was male (height, 1.69 m; weight, 72 kg), and Rater B was female (height, 1.68 m; weight, 56 kg).

Instruments
Shoulder flexion ROM was measured using a tablet device (Samsung Tab S, Samsung, Seoul, South Korea) with an 8-megapixel camera and 8.4" Super AMOLED touchscreen with a 2560 × 1600 pixel resolution. The degrees were obtained using a small (15 cm), two-armed plastic transparent Baseline goniometer (Gymna International, Bilzen, Belgium), with a 360° head and an angle scale in 1° increments. HBB measurements were taken using a tape measure.
Isometric muscle strength of the shoulder external rotators was measured using the MicroFET 2 HHD (Hoggan Scientific, Salt Lake City, UT, USA). The MicroFET 2 HHD is a portable, battery-operated load cell system that records strength between 0.8 and 100 pounds. A digital screen displayed the peak force at low or high thresholds, with a low threshold selected for better sensitivity. A hospital engineer designed a novel stabilisation device (Figure 1) to enhance HHD measurement accuracy. A slot on the side of the device was built to fit the HHD dimensions, and a rotatory wheel allowed the height to be adjusted for each patient. Like Kolber et al. [43], investigators used a plastic hinged-arm apparatus to maintain 30° and 45° of abduction in the scapular plane.

Procedures
Measurements by Raters A and B were taken independently on separate days, with repeat testing occurring after 14 days. The raters had no contact during the assessments, and any residual markings on the patient's skin were removed at the end of the testing. All measurements were taken bilaterally, and identical procedures were used for sessions 1 and 2. The measurement results were recorded on paper and stored in an Excel spreadsheet by a third independent rater.

All participants completed a simple, standardised warm-up procedure of general shoulder movements prior to assessment. The procedure started with shoulder forward flexion, using the same protocol as Ginn et al. [12]. Reference points were identified using the following landmarks: the lateral epicondyle of the humerus, the point 6 cm below the postero-lateral acromion process and the end of rib 12 (Figure 2). To ensure consistency and control for potential errors due to positioning, participants stood at a position marked by tape on the floor, 1.5 metres away from the camera. The backdrop was a plain white wall with no decorations. The participant was instructed to move the upper extremity (thumb pointing upwards) to the end of active shoulder flexion within comfort levels. The assessor used the tablet device to photograph each end-of-range position from a perspective aligned with the joint axis of motion. The digital photograph was taken of the patient in profile, with the camera lens positioned parallel to the sagittal plane at eye level. All images were saved in JPEG format (1836 × 3264 pixels). A standardised, full-print A4 size (21 cm × 29.7 cm) colour photo was printed for each participant. Using a plastic goniometer and the printed photograph, the assessor manually measured the resultant joint angle in degrees.
HBB was measured using a tape measure and non-permanent marker with the same testing protocol as van den Dolder et al. [29]. The raters identified each PSIS on the participant and connected both by drawing a horizontal line. Participants were instructed to "take your arm as far as you can go," and the distance between the level of the PSIS and thumb tip was measured in centimetres (Figure 3). The difference in ROM between the dominant and non-dominant sides was recorded, with a lower number indicating a poorer ROM. If a participant could not reach the PSIS level, the distance from that point was recorded as a negative value.
Isometric ER muscle strength was assessed with a HHD and stabilisation device using the same testing protocol as Kolber et al. [43]. The dynamometer and stabilisation device were set against a wall at an appropriate height for each participant. This height was recorded and standardised for all follow-up assessments.
Participants were seated in an armless chair with their trunk supported and feet flat on the floor. A hinged arm apparatus was positioned beneath the participant's axillary area, and separate foam wedges were inserted to keep the humerus at 30° and 45° in the scapular plane. Velcro straps attached to the arm apparatus were fastened around the trunk and humerus to prevent unwanted active abduction during testing (Figure 4).
To confirm the correct technique was performed, participants made one isometric contraction (i.e., push against resistance) against the assessor's hand. Participants positioned the arm in 0° rotation and 90° elbow flexion with the wrist in neutral. In each trial, participants pushed on the dynamometer's circular padded contact surface using the distal aspect of their radius/ulna. Participants were instructed to apply their "maximum and best" effort to the device for six seconds.
Raters gave no verbal encouragement, and all instructions were standardised. Strength was measured three times, with a fourth measurement taken if the third trial exceeded the second. A 10-second rest interval was provided between trials, and the maximum force output from the trials was used in the analysis.

Statistical Analysis
Descriptive data, including means and standard deviations (SD), were calculated for all measurements, with ROM values in degrees and centimetres and isometric strength data in kilograms.
Reliability was determined using ICCs with corresponding 95% confidence intervals (CI) [46]. For intra-rater reliability, an ICC(2,1) two-way random model for absolute agreement was used because a single rater was the only rater of interest. Inter-rater reliability used an ICC(2,2) two-way mixed model to estimate the absolute agreement between two independent assessors [47].
Reliability was assessed using the criteria by Portney and Watkins [47], where ICC values above 0.90 were considered excellent reliability, 0.75 to 0.90 were good, and less than 0.75 were moderate to poor.
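As an illustration of how such coefficients are obtained, the sketch below computes single-measure and average-measure absolute-agreement ICCs from a two-way ANOVA decomposition (the form usually written ICC(2,1) and ICC(2,k)) and applies the thresholds quoted above. The function names and the ratings matrix are hypothetical, and the statistical software and exact estimator used in the study are not stated, so this is a sketch rather than the authors' procedure.

```python
def icc_two_way_random(ratings):
    """Return (single-measure, average-measure) absolute-agreement ICCs.

    ratings: list of rows, one row per participant and one column per rater
    (or per session when one rater's repeatability is of interest).
    """
    n = len(ratings)      # participants
    k = len(ratings[0])   # raters or sessions
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


def classify_icc(icc):
    """Apply the thresholds quoted in the text."""
    if icc > 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    return "moderate to poor"


# Illustrative ratings only (6 participants x 4 raters); not study data.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc_two_way_random(example)
print(round(single, 2), round(average, 2), classify_icc(single))
```

The average-measure value is always at least as high as the single-measure value for the same data, which is one reason the two forms should be reported separately.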
Since ICCs can be impacted by intersubject variability, absolute measures of reliability were determined [48]. The standard error of measurement (SEM) is the amount considered as measurement error, with a lower SEM indicating high reliability. The SEM was calculated with the equation: SEM = SD × √(1 − ICC) [47]. By determining the SEM, the minimal detectable change (MDC) was calculated with the equation: MDC = SEM × 1.65 × √2. The MDC is the smallest change needed to be confident that a change between two tests is a "true" change and not due to measurement error [47]. Bland-Altman plots [49] were utilised to visualise inter-rater reliability with 95% limits of agreement (LoA) calculated with the equation: 95% LoA = mean difference ± 2 SD.
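The three formulas above can be sketched in a few lines. All input values below (the SD of 10.0, the ICC of 0.96 and the two short lists of rater scores) are hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise illustrative.

```python
import math
from statistics import mean, stdev


def sem(sd, icc):
    """Standard error of measurement: SD x sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)


def mdc(sem_value, z=1.65):
    """Minimal detectable change: SEM x z x sqrt(2), with z = 1.65 as in the text."""
    return sem_value * z * math.sqrt(2)


def limits_of_agreement(rater_a, rater_b):
    """Mean difference (bias) and limits of agreement as bias +/- 2 SD."""
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    bias = mean(diffs)
    spread = stdev(diffs)  # sample SD of the paired differences
    return bias, bias - 2 * spread, bias + 2 * spread


# Hypothetical inputs, not study data.
sem_value = sem(sd=10.0, icc=0.96)
mdc_value = mdc(sem_value)
bias, lower, upper = limits_of_agreement([10, 12, 14, 16], [9, 13, 13, 15])
print(round(sem_value, 2), round(mdc_value, 2), bias, lower, upper)
```

With these inputs the SEM is 2.0 and the MDC is about 4.67 in the same units as the SD, so a retest difference smaller than that could not be distinguished from measurement error.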
A paired t-test was used to assess differences between the dominant and non-dominant sides for Raters A and B. Peak measurement comparisons between participants' dominant and non-dominant arms for shoulder flexion ROM, HBB and ER strength were also reported for both raters.
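A minimal sketch of the paired comparison follows, computing the t statistic from the within-participant differences. The strength values are hypothetical, and the function name `paired_t` is illustrative; converting the statistic to a p-value needs the t distribution, which a statistics package such as SciPy (`scipy.stats.ttest_rel`) provides.

```python
import math
from statistics import mean, stdev


def paired_t(dominant, non_dominant):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [d - n for d, n in zip(dominant, non_dominant)]
    count = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(count)
    return mean(diffs) / standard_error, count - 1


# Hypothetical ER strength values in kg for five participants; not study data.
dominant = [12.0, 10.5, 14.0, 9.5, 11.0]
non_dominant = [11.0, 10.0, 13.5, 9.5, 10.0]
t_stat, df = paired_t(dominant, non_dominant)
print(round(t_stat, 2), df)
```

Because each participant serves as their own control, the test depends only on the differences between sides, not on how much strength varies between participants.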

Intra-Rater Reliability
Results of intra-rater reliability with ICC, SEM and MDC95 values are presented in Table 2. For each rater, still photography demonstrated good intra-rater reliability (ICCs 0.76, 0.86) for ROM assessment of active shoulder forward flexion and excellent intra-rater reliability (ICCs 0.94, 0.96) for HBB tape measurements. Isometric ER strength measurements with HHD and the stabilisation device demonstrated excellent intra-rater reliability in 30° abduction (ICCs 0.97, 0.96) and 45° abduction (ICCs 0.97, 0.98). SEM and MDC95 values were relatively low across all measurement methods, indicating good absolute reliability. Table 3 shows the mean and SDs for both raters over both sessions.

Inter-Rater Reliability
Results of inter-rater reliability with ICC, SEM and MDC95 values are presented in Table 4. Inter-rater reliability was good for shoulder flexion assessment with photography (ICC = 0.75) and excellent for HBB with tape measure (ICC = 0.91). Similarly, inter-rater reliability was excellent for isometric ER strength measures in 30° abduction (ICC = 0.97) and 45° abduction (ICC = 0.98). Bland-Altman plots for ROM (Figures 5 and 6) and strength (Figures 7 and 8) were used to evaluate the level of agreement and bias between the mean differences for both raters. The mean shoulder forward flexion differences between raters were 2.6° and 2.7° for session one and 5.8° and 3.4° for session two (Figure 5). The 95% LoA ranged from −5.7° to 11.0° and −5.7° to 11.1° for session one, and −0.4° to 11.9° and −5.5° to 12.3° for session two (Figure 5).
The mean HBB differences between raters were −1.6 cm and −2.1 cm in session one and −2.3 cm and −1.6 cm in session two. All ROM values were distributed sparsely and not strongly concentrated along the horizontal axis (Figure 6).
At 30° of abduction, the mean isometric ER strength differences between raters were −0.5 kg (LoA: −3.0 kg to 2.0 kg) and −0.3 kg (LoA: −3.5 kg to 2.9 kg) for session one, and −0.3 kg (LoA: −2.9 kg to 2.4 kg) and −0.1 kg (LoA: −2.7 kg to 2.4 kg) for session two (Figure 7). At 45° of abduction, the mean isometric ER strength differences between raters were 0.3 kg (LoA: −2.5 kg to 3.0 kg) and −0.1 kg (LoA: −1.9 kg to 1.8 kg) for session one, and 0.4 kg (LoA: −1.8 kg to 2.5 kg) and −0.1 kg (LoA: −2.7 kg to 2.4 kg) for session two (Figure 8).
Table 5 compares the shoulder measurements of Raters A and B based on hand dominance. There were significant differences between dominant and non-dominant sides for HBB (−2.2 cm, 2.7 cm, p < 0.001) and isometric ER strength at 45° abduction (0.7 kg, p < 0.001; 0.5 kg, p < 0.011) for both rater assessments. For shoulder flexion ROM assessments, no statistically significant differences between dominant and non-dominant arms were identified.
The peak measurement comparisons for the dominant and non-dominant arms are presented in Table 6. For each session, both raters observed that the dominant arm produced the highest peak forward flexion, HBB and isometric ER strength at 30° and 45° abduction. Non-dominant arm peak measurements were higher for forward flexion ROM and isometric ER strength than HBB. For both raters in any session, only a limited number (<10%) of participants generated peak measurements equally for both arms for any movement. When comparing isometric testing positions, raters observed that the dominant arm was predominantly strongest for ER at 30° and 45° abduction.

Discussion
In this study, we used established testing protocols to determine good reliability for photographic evaluation of shoulder flexion ROM and excellent reliability for HBB assessment with a tape measure. Similarly, HHD with a stabilising device produced excellent intra- and inter-rater reliability when evaluating isometric strength, irrespective of abduction position.
When compared to other types of measurement, the still photography method had the lowest ICC value (<0.90) for both raters when quantifying shoulder flexion. Prior studies using identical testing protocols for still photography reported ICC values of 0.88 [12] and 0.92 to 0.98 [13]. In our study, still photography with a tablet device resulted in good ICC values of 0.86 and 0.75 with wide 95% CIs. In agreement with our findings, Hayes et al. [21] employed the same photographic testing protocol and reported an ICC of 0.75 for shoulder flexion ROM. The Bland-Altman plots for still photography demonstrated low mean differences but wide 95% LoA above the clinically significant threshold of 5° [50].
HBB measurement using the modified PSIS technique with tape measure had excellent intra-rater reliability (ICCs 0.96, 0.94) and inter-rater reliability (ICC = 0.91). Our findings support those of van den Dolder et al. [29], who reported intra-rater reliability ICC values of 0.95 and inter-rater reliability ICC values of 0.96. Similarly, we found small SEM and MDC values. For the HBB protocol, we found that raters with 2 or 10 years of clinical experience were just as reliable as examiners with 15 years of musculoskeletal experience, as used by van den Dolder et al. [29]. Bland-Altman plots found no uniformity or systematic measurement differences.
Isometric ER strength measurements demonstrated excellent intra- and inter-rater reliability for the two abduction positions tested. Using a hinged arm apparatus that allows testing in varying degrees of abduction with a foam wedge insertion, we modified the prior method by Kolber et al. [43]. The original testing protocol placed the patient at 30° abduction, with the authors citing reduced capsular stress, appropriate muscle length tension, and preventing hypovascular adducted positions as reasons [51][52][53]. In accordance with findings by Edouard et al. [54], we also tested ER in the seated position at 45° abduction in the scapular plane, as this position was reported to be most reliable. Our study established that both positions produced excellent intra- and inter-rater reliability, with ICC values ranging from 0.96 to 0.98. Only Rater B produced a higher reliability for ER strength at 45° of abduction (ICC = 0.98). A systematic review by Schrama et al. [55] evaluated intra-rater reliability for manual HHD and reported ICCs ranging from 0.77 to 0.98 for ER strength measures. In contrast to our study, Hayes et al. [32] observed excellent intra-rater reliability (ICC 0.92) and good inter-rater reliability (ICC 0.82) with manual HHD to quantify isometric ER strength in symptomatic patients.
Although it was not the focus of our study, both raters identified statistically significant differences between dominant and non-dominant arms for HBB and isometric ER strength at 45° abduction. Peak measurement values were greater in the dominant arm for all measurements for both raters. The non-dominant arm was most common for peak forward flexion.
Despite evidence that still photography produces good reliability and less potential for exacerbating shoulder symptoms, the technique is more time-consuming than standard goniometry. Marking bony landmarks, precisely positioning the patient for the photograph, and printing the photo to manually calculate the angle all take a considerable amount of time. While still photography is useful for research purposes, it is impractical in a clinical setting.
HHD with a stabilisation device demonstrates high reliability for ER and is a cheaper and more practical alternative to isokinetic testing in a laboratory setting. The investigators collaborated with a hospital engineer to create a portable, height-adjustable stabilising device that could accommodate each participant's sitting height. The Velcro straps on the hinged arm device provided enough elasticity to conform around any body shape or size to provide additional stability. When performing isometric strength testing with HHD in clinical practice, it is critical to ensure that the patient makes no compensatory movements, as this can significantly affect measurement accuracy. Furthermore, using a stabilising device eliminates the variable of tester strength, which can also affect reliability. In contrast to Kolber et al. [43], our device provided hands-free stabilisation, enabling the therapist to closely monitor technique and apply additional stability to the distal end of the humerus if required. Although we used a healthy cohort, any level of pain or discomfort should be considered when positioning patients with shoulder pathology in various degrees of abduction.
The reliability assessment for shoulder testing protocols for a single rater and among raters is one of the study's strengths. We blinded raters to isometric ER strength measurements by using a third independent rater to collect and analyse results. Furthermore, we evaluated the reliability of isometric strength testing at 30° and 45° of abduction.
This study had some limitations. First, we used a relatively young and asymptomatic cohort with no known shoulder or neck pathology. A healthy population was chosen to avoid the inclusion of any confounding variables that may have impacted measurement outcomes. However, future research should compare the reliability measures of healthy cohorts to those of cohorts with shoulder pathology. Second, we did not compare active and passive ROM because our aim was to evaluate the reliability and measurement error of active shoulder ROM protocols. Third, because we only utilised one patient position for each type of movement and measurement tool, we cannot make any inferences as to whether different patient positions impact reliability. Fourth, to potentially improve measurement accuracy with digital photography, we could have standardised the angle and height of the tablet device by utilising a tripod stand. Lastly, for HBB measurements we used a PSIS protocol and did not compare it with the vertebral T1 level method. We adopted van den Dolder and Roberts' [28] modified PSIS level approach because it was easier to palpate and anatomically identify than the vertebral body [62].

Conclusions
This study confirmed the reliability of standardised shoulder ROM and strength testing protocols. According to the criteria by Portney and Watkins [47], ICC values above 0.90 are considered acceptable for clinical purposes. HBB with tape measure and isometric strength testing with HHD and a stabilisation device both meet this clinical criterion. Future studies comparing different protocols, movements, and patient positions for the same type of measurement can provide clinicians with further confidence in applying specific testing protocols for shoulder assessment.