A Deep Learning Algorithm for Radiographic Measurements of the Hip in Adults—A Reliability and Agreement Study

Hip dysplasia (HD) is a frequent cause of hip pain in skeletally mature patients and may lead to osteoarthritis (OA). An accurate and early diagnosis may postpone, reduce or even prevent the onset of OA and ultimately hip arthroplasty at a young age. The overall aim of this study was to assess the reliability of an algorithm designed to read pelvic anterior-posterior (AP) radiographs and to estimate the agreement between the algorithm and human readers for measuring (i) the lateral center edge angle of Wiberg (LCEA) and (ii) the acetabular index angle (AIA). The algorithm was based on deep-learning models developed using a modified U-net architecture and ResNet 34. The newly developed algorithm was found to be highly reliable when identifying the anatomical landmarks used for measuring LCEA and AIA in pelvic radiographs, thus offering highly consistent measurement outputs. The study showed that manual identification of the same landmarks by five specialist readers was subject to variance, and the level of agreement between the algorithm and human readers was consequently poor, with mean measured differences from 0.37 to 9.56° for right LCEA measurements. The algorithm displayed the highest agreement with the senior orthopedic surgeon. With further development, the algorithm may be a good alternative to humans when screening for HD.


Introduction
Joint deformity as seen in hip dysplasia is a common cause of hip pain in young skeletally mature patients and may lead to osteoarthritis (OA) [1]. Traditionally, the first-line modality for diagnosing dysplasia of the hip is radiography, where measurements taken from a standardized anterior-posterior (AP) pelvic radiograph are used to evaluate the anatomical configuration of the pelvis. The lateral center edge angle of Wiberg (LCEA) describes femoral head coverage by the acetabulum and the acetabular index angle (AIA) quantifies the inclination of the acetabular roof [2].
An accurate and early diagnosis may postpone, reduce or even prevent the onset of OA and ultimately hip arthroplasty at a young age [3]. It has previously been reported, though, that patients with non-specific hip pain may be left with symptoms for years before the correct diagnosis of hip dysplasia is made, perhaps because the anatomical deformities indicative of hip dysplasia are not routinely reported in all departments of radiology. Those studies found that the correct diagnosis of hip dysplasia was delayed by up to several years (range: 0-204 months) and was preceded by more than three contacts (range: 0-11) with the healthcare system [4,5]. The anatomical deformities associated with hip dysplasia may be diagnosed earlier by using artificial intelligence models and algorithms, possibly as a screening tool, which may also have the potential to reduce reader variability and improve workflow.
The process of developing and testing an algorithm for measuring LCEA and AIA has recently been published and results indicated that an automatic measurement model is feasible [6]. Moreover, Fraiwan and colleagues showed the potential of deep transfer learning for detecting developmental dysplasia of the hip in pelvic radiographs of infants [7]. It has also been suggested that AI is useful for detection and classification of hip dysplasia using ultrasound images [8]. Clinical tests of algorithms for measuring hip parameters such as LCEA and AIA in skeletally mature patients are to the best of the authors' knowledge limited.
The overall purpose of this study was, in a clinical setting, to assess the performance of an algorithm designed to read pelvic AP radiographs of skeletally mature patients. We aimed to assess reliability of the algorithm and agreement between the algorithm and, respectively, orthopedic surgeons, radiologists and a reporting radiographer for measuring LCEA and AIA.

Study Design
In this retrospective study, we used an algorithm trained to identify several specific segments related to hip dysplasia. The algorithm was applied to 78 pelvic radiographs that were consecutively collected from one center. Moreover, two orthopedic surgeons, two radiologists and one reporting radiographer evaluated all images with regard to LCEA, AIA, and the width of both obturator foramina. The study was approved by the Danish National Committee on Health Research Ethics (Project-ID: 2103745) and registered with the regional health authorities (project-ID: 21/22036). The analyses were carried out in accordance with current Guidelines for Reporting Reliability and Agreement Studies [9,10].

Study Population
Anterior-posterior pelvic radiographs of adults referred for diagnostic workup of non-traumatic hip pain at Odense University Hospital were retrospectively identified and collected consecutively until the desired sample of 78 was achieved (Section 2.6. Statistical analyses). Inclusion criteria were weight-bearing pelvic radiographs of adults (≥18 years). All weight-bearing pelvic radiographs are taken with the legs internally rotated 15°. Exclusion criteria were the presence of arthroplasty or other types of surgical hardware, signs of congenital abnormalities, and surgical or fracture sequelae. Radiographs that did not include the entirety of the bony pelvis and the proximal femurs were excluded, as were radiographs that did not show the exposure value. Stratified enrolment by sex and age was applied, such that a similar number of males and females above and below the age of 50 years was present in the sample. Ninety-eight pelvic radiographs were screened consecutively, and radiographs were included according to the inclusion and exclusion criteria until the desired sample of 78 was obtained (Figure 1). The 78 pelvic radiographs were analyzed by the algorithm and read by all human readers.

Figure 1.
Flow-chart describing the screening process according to inclusion/exclusion criteria. In total, 98 pelvic radiographs were screened until the desired sample of 78 radiographs was achieved.

Anatomic Definitions
All measurements were made in relation to a horizontal reference line adjoining the most inferior points of the ischial tuberosities. The LCEA was defined as the angle between two lines both drawn from the center of the femoral head (CFH): a line perpendicular to the reference line and a line from the CFH to the lateral sourcil of the acetabulum. The AIA was defined as the angle between a horizontal line from the medial sourcil (medial aspect of the sclerotic acetabular roof) parallel to the reference line and a line connecting the medial and lateral sourcils [11]. Moreover, the foramen obturator index (FOI), an indicator of pelvic rotation, was calculated as the ratio between the widths of the two foramina. The width of each foramen was measured at its widest point, parallel to the reference line (Figure 2) [12].
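The definitions above can be sketched in code. This is a minimal illustration, not the implementation used in the study. It assumes that landmarks are available as (x, y) points in a coordinate system where y increases cranially, and it returns unsigned angles only. The function names (`lcea_deg`, `aia_deg`, `foi`) are hypothetical.

```python
import math


def _angle_deg(v1, v2):
    """Unsigned angle in degrees between two 2D vectors."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(v1[0], v1[1]) * math.hypot(v2[0], v2[1])
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def _reference_direction(tuber_a, tuber_b):
    """Direction of the line through the two ischial tuberosity points,
    oriented so that its x component is non-negative."""
    dx, dy = tuber_b[0] - tuber_a[0], tuber_b[1] - tuber_a[1]
    return (dx, dy) if dx >= 0 else (-dx, -dy)


def lcea_deg(cfh, lateral_sourcil, tuber_a, tuber_b):
    """Angle at the CFH between the perpendicular to the reference line
    and the line from the CFH to the lateral sourcil."""
    rx, ry = _reference_direction(tuber_a, tuber_b)
    cranial = (-ry, rx)  # perpendicular to the reference line, pointing cranially
    to_sourcil = (lateral_sourcil[0] - cfh[0], lateral_sourcil[1] - cfh[1])
    return _angle_deg(cranial, to_sourcil)


def aia_deg(medial_sourcil, lateral_sourcil, tuber_a, tuber_b):
    """Acute angle between a line parallel to the reference line and the
    line connecting the medial and lateral sourcils."""
    ref = _reference_direction(tuber_a, tuber_b)
    roof = (lateral_sourcil[0] - medial_sourcil[0],
            lateral_sourcil[1] - medial_sourcil[1])
    angle = _angle_deg(ref, roof)
    return min(angle, 180.0 - angle)


def foi(width_right, width_left):
    """Foramen obturator index: ratio of the two maximum foramen widths."""
    return width_right / width_left


# Illustrative landmark coordinates (made up, not patient data).
tuber_a, tuber_b = (-60.0, 0.0), (60.0, 0.0)
print(round(lcea_deg((90.0, 80.0), (110.0, 114.0), tuber_a, tuber_b), 1))
print(round(aia_deg((80.0, 110.0), (110.0, 114.0), tuber_a, tuber_b), 1))
```

Because the angles are unsigned, this sketch does not distinguish a negative LCEA (lateral sourcil medial to the vertical line through the CFH) from a positive one. A signed version would also need the side of the hip.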


Algorithm Development and Training
Automatic measurements were extracted from the pelvic radiographs using a newly developed algorithm (RBhip™, Radiobotics, Copenhagen, Denmark). The algorithm was developed using deep-learning and computer vision and trained on more than 2900 pelvic radiographs. The pelvis, including the acetabulum, and the femoral head and neck were independently segmented using a segmentation model with a modified U-Net architecture trained using augmentation with ResNet34 as a backbone [13].
The horizontal reference was established as a line through the most inferior points on the ischial tuberosities. For the LCEA, the circle best encompassing the femoral head was found through a parameterization of a circle fitted to points along the femoral head contour using least squares. The sourcils in both hips were independently segmented to provide the landmarks of the lateral and medial extent of the weight-bearing area. The sourcil segmentation model was trained primarily using annotations from an orthopedic surgeon and optimized to find the lateral extent of the acetabular roof. This point does not necessarily coincide with the lateral acetabular rim [14,15]. For the FOI, the width of each foramen was calculated at 13 equidistant lines parallel to the horizontal reference line. The FOI was established as the maximum width of the right foramen relative to the maximum width of the left foramen. The algorithm flowchart is depicted in Figure 3.
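The two geometric steps described above can be illustrated as follows. This is a sketch under stated assumptions and not the RBhip implementation. The circle fit uses the algebraic (Kåsa) least-squares formulation, which is one common choice, as the text does not name the exact parameterization. The foramen mask is assumed to be a binary grid whose rows are parallel to the reference line, with the 13 lines spread evenly over the vertical extent of the foramen. All function names are hypothetical.

```python
import math


def _solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / a[r][r]
    return x


def fit_circle(points):
    """Algebraic least-squares circle fit.

    Solves x^2 + y^2 = 2*cx*x + 2*cy*y + c for (cx, cy, c) via the normal
    equations, where c = r^2 - cx^2 - cy^2. Returns (cx, cy, r).
    """
    rows = [(2.0 * x, 2.0 * y, 1.0) for x, y in points]
    rhs = [x * x + y * y for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    cx, cy, c = _solve3(ata, atb)
    return cx, cy, math.sqrt(c + cx * cx + cy * cy)


def max_foramen_width(mask, n_lines=13):
    """Maximum width (in pixels) of a foramen mask over n_lines equidistant rows.

    mask is a list of rows of 0/1 values; rows are assumed to be parallel
    to the horizontal reference line.
    """
    occupied = [i for i, row in enumerate(mask) if any(row)]
    top, bottom = occupied[0], occupied[-1]
    widths = []
    for k in range(n_lines):
        r = round(top + k * (bottom - top) / (n_lines - 1))
        cols = [j for j, value in enumerate(mask[r]) if value]
        if cols:
            widths.append(cols[-1] - cols[0] + 1)
    return max(widths)


# Points on a partial arc, similar to a visible femoral head contour.
arc = [(3.0 + 5.0 * math.cos(t / 10.0), -2.0 + 5.0 * math.sin(t / 10.0))
       for t in range(2, 29)]
cx, cy, radius = fit_circle(arc)
print(round(cx, 3), round(cy, 3), round(radius, 3))
```

With masks for both sides, the FOI would then be `max_foramen_width(right_mask) / max_foramen_width(left_mask)`, following the right-over-left convention stated above.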

Human Readers
Five readers (two senior and three junior), all accustomed to reading and measuring angles and distances related to hip deformities, made all the measurements blinded to each other's results. No clinical information was available to the readers. The senior readers were a musculoskeletal (MSK) radiologist (LB) and a consultant hip surgeon (CV), with 21 and 8 years of experience, respectively. Moreover, a junior MSK radiologist (JR), a junior hip surgeon (MHH) and a reporting radiographer (LBO) (respectively, 3, 5 and 12 years of experience) made all the measurements. To minimize systematic bias, a protocol with definitions of measurements was distributed to each participant prior to the measurement session (Figure 2). In keeping with daily clinical practice, the readers made all measurements digitally in a picture archiving and communication system (GE Healthcare, IL, USA) and recorded the measurements in a database, REDCap (Research Electronic Data Capture). Five of the radiographs were reported three times by all human readers. Additionally, the exposure index was collected as an indirect indication of digital image quality, i.e., noise [16].

Algorithm
The algorithm ran as software as a service within the hospital firewall. Images were processed by forwarding the AP radiographs directly from the Picture Archiving and Communication System to a secure Digital Imaging and Communications in Medicine destination. The results were returned as a JavaScript Object Notation file and uploaded electronically to the REDCap database. To assess the consistency of the algorithm, all radiographs were read twice, approximately two weeks apart.

Statistical Analyses
Mean, standard deviation (SD), and range were calculated for all measurements, with scatterplots visualizing bivariate associations. Differences between the first and second read by the algorithm were presented descriptively by mean, SD, min, max, first (Q1), and third quartile (Q3). Agreement between the algorithm and individual human readers was estimated and illustrated by Bland-Altman (BA) plots with limits of agreement (LoA), bias, and respective 95% confidence intervals (CI). Assuming normality of data, the LoA are estimates of the range within which 95% of all differences between algorithm and human readers will fall. The bias is the mean measured difference between algorithm and human reader [17,18].
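As an illustration of the Bland-Altman quantities described above, the following sketch computes the bias, the 95% limits of agreement and approximate confidence intervals from paired measurements. It is not the Stata code used in the study. It assumes differences are taken as human minus algorithm, uses the normal quantile 1.96 rather than a t quantile for the intervals, and uses the common approximation SE(LoA) ≈ SD·√(3/n). The function name and the example values are hypothetical.

```python
import math
import statistics


def bland_altman(human, algorithm, z=1.96):
    """Bias, limits of agreement and approximate 95% CIs for paired readings."""
    diffs = [h - a for h, a in zip(human, algorithm)]
    n = len(diffs)
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the differences
    lower, upper = bias - z * sd, bias + z * sd
    se_bias = sd / math.sqrt(n)
    se_loa = sd * math.sqrt(3.0 / n)  # approximate standard error of each limit
    return {
        "bias": bias,
        "sd": sd,
        "loa": (lower, upper),
        "bias_ci": (bias - z * se_bias, bias + z * se_bias),
        "lower_loa_ci": (lower - z * se_loa, lower + z * se_loa),
        "upper_loa_ci": (upper - z * se_loa, upper + z * se_loa),
    }


# Made-up angles in degrees, for illustration only.
human = [30.0, 32.0, 28.0, 35.0, 31.0]
algorithm = [28.0, 29.0, 27.0, 30.0, 29.0]
result = bland_altman(human, algorithm)
print(round(result["bias"], 2), [round(v, 2) for v in result["loa"]])
```

A bias whose confidence interval excludes zero corresponds to the "statistically different from 0" statements in the Results.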
Linear mixed effect models were used to assess factors influencing variance in human data by estimating the repeatability coefficient (RC). The RC is a limit below which an estimated 95% of differences between two measurements are expected to fall. Age, sex, FOI, and noise were treated as fixed factors. Patient, reader, and repeated measurements were considered random effects. We derived RCs for (a) repeatability, the closeness of repeated measurements of the same patient made under similar conditions by the same reader (intrarater variability analysis), and (b) reproducibility, the closeness of measurements of the same patient made by readers of varying experience in the same measurement setup (interrater variability analysis). The RCs were calculated as 2.77 times the estimated within-subject SD as derived from the mixed effect model [19].
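The RC calculation can be sketched as below. The factor 2.77 is approximately 1.96 × √2, the 95% limit for the difference between two measurements that share the same within-subject SD. The variance values in the usage example (17.80 and 11.88) are the ones quoted for the right LCEA in the Results. The function name is hypothetical, and the mixed-model fit that produces the variance components is not shown.

```python
import math


def repeatability_coefficient(*variance_components, factor=2.77):
    """RC = factor * sqrt(sum of the variance components that differ
    between the two measurements being compared)."""
    return factor * math.sqrt(sum(variance_components))


# Same patient, same reader: only the within-reader variance term applies.
rc_same_reader = repeatability_coefficient(17.80)
# Same patient, different reader: the reader variance is added.
rc_different_reader = repeatability_coefficient(17.80, 11.88)

print(round(rc_same_reader, 2), round(rc_different_reader, 2))
print(round(1.96 * math.sqrt(2.0), 2))  # origin of the 2.77 factor
```

Between-patient variance does not enter either coefficient, because both measurements are made on the same patient.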
Sample size was calculated based on the procedure proposed by Lu et al. (2016) for sample size assessment of the Bland-Altman method [17,20]. Assuming an SD of 2.1° and a clinically acceptable agreement limit of 5° (LCEA), a sample size of 176 was required to show a similar agreement between algorithm and human readers, with a power of 80% at a significance level of 5%. Since measurements were carried out on both hips and repeated three times on five patients (2 × 5 additional measurements), the required number of patients was 78.
p-values < 0.05 were considered statistically significant. Stata version 16 (StataCorp. 2019, College Station, TX, USA) was used for all statistical analyses.
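One reading of this arithmetic that reproduces the reported figures is sketched below. The breakdown into single and repeated reads is an assumption, since the text does not spell it out, and the variable names are illustrative.

```python
patients = 78
hips_per_patient = 2
repeat_patients = 5
extra_reads = 2  # three reads in total, i.e., two beyond the first

# One measurement per hip for every patient.
single_reads = patients * hips_per_patient
# Additional measurements from the five radiographs that were read three times.
repeated_reads = repeat_patients * hips_per_patient * extra_reads

total_measurements = single_reads + repeated_reads
print(single_reads, repeated_reads, total_measurements)
```

Under this reading, 78 patients yield the 176 measurements required by the sample size calculation.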

Results
The algorithm was not able to read seven of the 78 included images. Therefore, 71 radiographs were read by the algorithm, resulting in a sample for agreement analyses with an average age of 50.1 years (range: 18 to 91), consisting of 36 females and 35 males. All 78 radiographs were reported by the human readers and included in repeatability and reproducibility estimates.
The algorithm proved highly consistent when double reading all measurements, displaying variances between first and second read that were identical or within the range of machine precision (Table 1). Values for all parameters showed a tendency to be higher when measured by humans, particularly the LCEA measurements. The mean LCEA (right hip) ranged from 25.8 to 35.0° across human readers versus 25.4° when measured by the algorithm. Corresponding values for AIA (right hip) ranged from 4.1 to 6.7° for humans versus 4.7° when measured by the algorithm (Table 2). Scatterplots visually depict human measurements over algorithm measurements (Figure 4). Using the BA LoA analyses, the bias estimate between human readers and algorithm for bilateral LCEA was statistically different from 0, apart from the right LCEA measurements made by the experienced orthopedic surgeon. Mean measured difference between human readers and the algorithm for the right LCEA ranged from 0.37° (95% CI: −0.61 to 1.36) to 9.56° (95% CI: 8.14 to 10.97), for the experienced orthopedic surgeon and the experienced radiologist, respectively. The corresponding values for left LCEA were 3.56° (95% CI: 2.41 to 4.74) and 10.01° (95% CI: 8.37 to 11.82). Bias for AIA measurements displayed a tendency to center around the zero line for all readers, ranging from −0.17° (left) to 2.06° (right) as measured by the junior radiologist and the reporting radiographer, respectively (Figures 5 and 6). Bland-Altman inter-observer agreement between the algorithm and individual human readers, including the LoA and 95% CI, is presented for LCEA and AIA in Tables 3 and 4.
For the mixed effects model, measurements from all 78 cases and the five human readers were included. For LCEA right, the mixed effect model revealed that patient, reader, and repeated measurement variance were 39.46, 11.88, and 7.44, respectively, indicating the between-patient variance to be the prevailing contributor to variance in data. The RC for a repeated measurement of the same patient by the same human reader was 11.69° (2.77 × √17.80), and the RC increased to 15.09° (2.77 × √(17.80 + 11.88)) when the same measurement was made on the same patient but by a different human reader. As expected, the between-patient variance was dominant across all measurements. The RCs (same patient, same reader) for AIA were 8.28 and 8.22° for right and left hip, respectively. Corresponding RCs for same patient, different reader increased slightly to 8.85 and 8.52°. In Table 5, the estimated variances derived from the mixed effect model including 95% CI are shown. In Table 6, the estimated RCs for between and within reader measurements are presented.

Figure 4.
Scatter plots of human measurements over algorithm measurements. LCEA: lateral center edge angle; AIA: acetabular index angle. The scatter plots in the top row (LCEA) depict a tendency for human readers to measure the LCEA at a higher value than the algorithm, although the senior orthopedic surgeon agrees the most with the algorithm (green dots seen in close proximity to the red identity line).

Table 3.
Lateral center edge angle. Bland-Altman limits of agreement and bias (mean and SD). Agreement between algorithm and individual human readers (n = 71).

Discussion
To the best of our knowledge, this is the first study presenting an algorithm that assesses hip dysplasia in adults in a clinical setting. Radiographic evaluation of the pelvis is commonly the first-line approach in patients suspected of having hip dysplasia, where a set of measurements is used to describe the anatomy of the pelvis. Several of those measurements are, however, associated with reader variability [11].
We evaluated an algorithm that provided highly consistent measurements of LCEA and AIA. Agreement between the algorithm and human readers was affected by subjective variability among the readers. Our data showed that, particularly for LCEA measurements, the human readers appeared to obtain higher values than the algorithm. If this holds true, the algorithm will identify more people with potential dysplasia based on the LCEA than the human readers. This finding is consistent with previous studies reporting that hip dysplasia may be underdiagnosed radiologically and that an accurate diagnosis of hip dysplasia can be delayed, at times, for several years and after repeated contacts with the healthcare system [4,5,21]. Perhaps reader variance combined with a systematic overestimation of particularly the LCEA may, in part, explain why hip dysplasia is under-diagnosed. As a technical note, the algorithm, including identification of the extent of the lateral sourcil, was trained based on annotations made primarily by an orthopedic surgeon, which could explain the closer agreement between the senior surgeon and the algorithm found in the current study. This, however, seems to be a suitable fit considering it is the surgeon who ultimately decides whether an operation is required.
An inherent limitation in the current study is the lack of a ground truth against which the algorithm and human measurements can be compared. Establishing a ground truth is, however, a common challenge in radiographic measurements. The proposed algorithm in the current study was consistent, but its accuracy remains to be proven. However, the bias between the algorithm and the five human readers was noted to be similar to the between-reader variability estimated by the mixed effects model. High measurement variability between human readers has previously been reported with regard to radiographic measurements used to assess hip dysplasia. Although the human readers in the current study were presented with a protocol defining the measurements, the repeatability coefficient was high from a clinical perspective, i.e., ranging from 12 to 15° for LCEA measurements. Perhaps a consensus meeting with all human readers, prior to the measuring sessions, could have clarified the definition of landmarks further; particularly, the extent of the lateral sourcil may have affected inter-observer variance. Moreover, the human readers electronically drew and manually positioned a best-fit circle for the femoral head in the PACS, which is also an inherent limitation that may have contributed to inconsistency in LCEA measurements. Although the BA LoA for the right LCEA was narrow for reader 3, no other systematic differences in measurement variability were uncovered. Hence, educational background or experience cannot explain the variation in data.
In conclusion, the newly developed algorithm based on deep learning offered consistent measurement outputs for LCEA and AIA when reading pelvic radiographs. Manual identification of the same landmarks by five human readers was subject to variance, and the level of agreement between the algorithm and human readers was consequently poor, although there was a tendency for the senior orthopedic surgeon to agree the most with the algorithm, particularly for right LCEA measurements.

Clinical Implication
We consider this algorithm a promising future tool with the potential to improve measurement consistency. Potentially, with further development, this algorithm can act as a screening tool within departments of radiology and orthopedic surgery, minimizing delayed diagnosis by identifying abnormal findings on pelvic radiographs suggestive of dysplasia. Considering the rapidly growing volume of medical imaging, tools to assist radiologists, physicians and reporting radiographers are needed, and integrating locally validated algorithms into clinical practice is essential.