Automated Artificial Intelligence-Based Assessment of Lower Limb Alignment Validated on Weight-Bearing Pre- and Postoperative Full-Leg Radiographs

The assessment of knee alignment using standing weight-bearing full-leg radiographs (FLR) is a standardized method. Determining the load-bearing axis of the leg requires time-consuming manual measurements. The aim of this study is to develop and validate a novel algorithm based on artificial intelligence (AI) for the automated assessment of lower limb alignment. In the first stage, a customized mask-RCNN model was trained to automatically detect and segment anatomical structures and implants in FLR. In the second stage, four region-specific neural network models (adaptations of UNet) were trained to automatically place anatomical landmarks. In the final stage, this information was used to automatically determine five key lower limb alignment angles. For the validation dataset, weight-bearing, antero-posterior FLR were captured preoperatively and 3 months postoperatively. Preoperative images were measured by the operating orthopedic surgeon and an independent physician. Postoperative images were measured by the second rater only. The final validation dataset consisted of 95 preoperative and 105 postoperative FLR. The detection rate for the different angles ranged between 92.4% and 98.9%. Human vs. human inter-rater (ICCs: 0.85–0.99) and intra-rater (ICCs: 0.95–1.0) reliability analyses achieved significant agreement. The ICC-values of the human vs. AI inter-rater reliability analysis ranged between 0.8 and 1.0 preoperatively and between 0.83 and 0.99 postoperatively (all p < 0.001). An independent, external validation of the proposed algorithm on pre- and postoperative FLR, with excellent reliability compared to human measurements, was demonstrated. Hence, the algorithm might allow for the objective and time-saving analysis of large datasets and support physicians in daily routine.


Introduction
Malalignment of the leg axis can be congenital or acquired. Physiological bone morphology of the knee changes over a lifetime, for example, due to physical activity in youth, sports trauma, or osteoarthritis in old age [1][2][3][4]. The assessment of knee alignment using standing full-leg radiographs (FLR) is an established and standardized method. This "vital imaging modality" of weight-bearing radiographs allows for the determination of the load-bearing axis of the leg [5]. The load-bearing axis represents indispensable information for diagnostics and individual therapy planning. The following five alignment angles were assessed:
- mFAmTA: angle between the mechanical axis of the femur and the mechanical axis of the tibia, also known as the hip-knee-ankle angle.
- FSAmTA: angle between the anatomical femur shaft axis and the mechanical axis of the tibia.
- mMPTA: angle between the mechanical axis of the tibia and the tibial plateau knee joint line, measured on the medial side.
- mLDFA: angle between the mechanical axis of the femur and the femoral condyles knee joint line, measured on the lateral side.
- mLDTA: angle between the mechanical axis of the tibia and the tibial plafond, measured on the lateral side.

All angles were measured in degrees. The validation dataset contained both bilateral and single-leg X-rays.

Manual Measurement Procedure
Manual measurements were performed using the software suite MediCAD® 2D classic (mediCAD Hectec GmbH, Altdorf/Landshut, Germany). The software allows for the adjustment of brightness and contrast, as well as magnification. To assess inter-rater reliability, all preoperative images were measured by the operating orthopedic surgeon (rater 1) and by a second physician trained and experienced in radiographic measurements (rater 2). Postoperative images were measured only by rater 2. Rater 2 conducted all pre- and postoperative measurements twice, at two different time points, to determine intra-rater reliability (rater 2a and 2b).

Automated AI-Workflow and Training Procedure
A fully automatic AI-based algorithm was developed for the determination of the lower limb alignment angles. The algorithm works in multiple stages, and the workflow is illustrated in Figure 1. It consists of five deep convolutional neural networks (CNNs) and is triggered by the input of a full-leg X-ray. After passing through all stages, the measured angles are visualized. For bilateral images, all parameters are computed for both legs separately.
Figure 1. Fully automatic AI algorithm for the determination of lower limb alignment angles. Based on the predicted segmentation masks around anatomical structures and implants, crops of the proximal femur, knee area, and ankle joint were created. Landmark detection models predicted landmarks on these smaller crops. The landmarks were projected back on the original image, and the final parameters were computed and visualized.

The training datasets are described per model below. The images used for training the models were completely independent of the validation images.

Preprocessing
Each DICOM image was anonymized and normalized with the use of the window width and window center information from the DICOM tags.
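The windowing step can be sketched as follows. In practice, the window center and width would be read from the DICOM tags (0028,1050) and (0028,1051), e.g., via pydicom; the values below are hypothetical, and the exact normalization used by the authors is not specified.

```python
import numpy as np

def apply_window(pixels: np.ndarray, center: float, width: float) -> np.ndarray:
    """Map raw pixel values to [0, 1] using the DICOM window center/width.

    Values below (center - width/2) clip to 0, values above
    (center + width/2) clip to 1; the window maps linearly in between.
    """
    low = center - width / 2.0
    high = center + width / 2.0
    normalized = (pixels.astype(np.float64) - low) / (high - low)
    return np.clip(normalized, 0.0, 1.0)

# Hypothetical example: a 16-bit radiograph windowed with center=2048, width=4096.
raw = np.array([[0, 2048, 4096]], dtype=np.uint16)
print(apply_window(raw, center=2048, width=4096))  # [[0.  0.5 1. ]]
```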

Segmentation of Anatomical Structures and Implants
After preprocessing, a segmentation model was used to automatically detect and segment all anatomical structures and implants in the image. Polygon masks and rectangular boxes around the bones and implants were predicted (see Figure 1).

Training of the Segmentation Model
To train the segmentation model, a dataset of training images was manually annotated with bounding polygon masks by trained medical staff. The regions of interest included the femur, fibula, tibia, talus, and implants. The training dataset consisted of 202 unilateral antero-posterior FLR (192 preoperative, 10 postoperative) from two independent clinical sites. Horizontal flipping was applied to half of the training set for the purpose of data augmentation, resulting in a trained model that is insensitive to the leg side. The instance segmentation model was based on the mask-RCNN architecture, implemented in the Pytorch framework [21,22]. Pre-trained weights from https://pytorch.org/serve/model_zoo.html (accessed on 23 June 2021) were used to initialize the model. Training ran for 200 epochs on an NVIDIA GeForce GTX 1080 GPU, with a learning rate of 0.002.
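The horizontal-flip augmentation requires mirroring both the image and its annotations. A generic NumPy sketch (not the authors' code) for flipping an image and its [x1, y1, x2, y2] bounding boxes; polygon masks would be mirrored analogously with x' = W − x:

```python
import numpy as np

def hflip_image_and_boxes(image: np.ndarray, boxes: np.ndarray):
    """Horizontally flip an image (H, W) and its [x1, y1, x2, y2] boxes.

    After mirroring, the left/right box edges swap roles:
    new_x1 = W - x2, new_x2 = W - x1; y-coordinates are unchanged.
    """
    h, w = image.shape[:2]
    flipped_image = image[:, ::-1].copy()
    flipped_boxes = boxes.astype(np.float64).copy()
    flipped_boxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    return flipped_image, flipped_boxes

image = np.arange(12).reshape(3, 4)          # toy 3x4 "radiograph"
boxes = np.array([[1.0, 0.0, 3.0, 2.0]])     # one box in pixel coordinates
flipped_image, flipped_boxes = hflip_image_and_boxes(image, boxes)
print(flipped_boxes)  # [[1. 0. 3. 2.]] -- a horizontally centered box stays in place
```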

Landmark Placement
In the next step, the bounding boxes determined by the segmentation model were used to generate smaller crops for the landmark detection models. The leg side was determined based on the relative position of fibula and tibia. The images were flipped accordingly to train the landmark placement models on the right leg crops.
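A minimal sketch of such a side check, assuming a standard antero-posterior viewing orientation; the actual implementation details are not given in the paper, and the function name and box format are illustrative:

```python
def leg_side(fibula_box, tibia_box) -> str:
    """Infer the leg side from the relative horizontal position of the
    fibula and tibia bounding boxes ([x1, y1, x2, y2] in pixel coordinates).

    On a standard antero-posterior view (patient's right on the image's
    left), the fibula lies lateral to the tibia: to the left of it for a
    right leg, to the right of it for a left leg. This convention is an
    assumption; a different viewing orientation would invert it.
    """
    fibula_cx = (fibula_box[0] + fibula_box[2]) / 2.0
    tibia_cx = (tibia_box[0] + tibia_box[2]) / 2.0
    return "right" if fibula_cx < tibia_cx else "left"

print(leg_side([100, 0, 160, 400], [150, 0, 260, 400]))  # right
```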

Training of the Landmark Placement Models
In total, four different landmark detection models were trained: one for the proximal femur, including the greater and lesser trochanter, as well as the femoral head (9 landmarks), one for a TKA implant (15 landmarks), one for the preoperative knee joint (20 landmarks), and one for the talus (2 landmarks). Depending on the detection of a TKA implant, the appropriate model for the knee area was triggered. Contrast enhancement was applied to the crops to increase visibility of structures.
To train the proximal femur landmark detection model, a crop was made by taking the upper quarter of the femur bounding box. Landmarks were placed on the femoral head, femoral neck, and the greater and lesser trochanter by trained medical staff. The training set consisted of 326 images (320 preoperative, 6 with hip implants) from two clinical sites.
The TKA landmark detection model was trained on crops based on the bounding box around the femoral and tibial parts of the TKA implant. Five landmarks were placed on the femoral part, and ten landmarks were placed on the tibial part of the implant on pre-defined locations optimal for determination of parameters. Trained medical staff manually placed these landmarks on a total of 190 images with TKA implants from two clinical sites.
The crops for the knee landmark detection models were generated by merging the lower eighth of the femur and the upper eighth of the tibia bounding boxes detected by the segmentation model. In total, 20 landmarks were placed in pre-defined positions on the femoral condyles and tibial plateau. The training set consisted of 319 images (288 preoperative, 30 images with unicondylar implants) from two clinical sites.
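The cropping rule above can be sketched as simple bounding-box arithmetic (boxes as [x1, y1, x2, y2] with y increasing downward; any padding or rounding used in practice is omitted):

```python
def knee_crop_box(femur_box, tibia_box):
    """Build the knee crop by merging the lower eighth of the femur box
    with the upper eighth of the tibia box. A sketch of the cropping rule
    described in the text, not the authors' exact implementation.
    """
    fx1, fy1, fx2, fy2 = femur_box
    tx1, ty1, tx2, ty2 = tibia_box
    femur_lower = [fx1, fy2 - (fy2 - fy1) / 8.0, fx2, fy2]
    tibia_upper = [tx1, ty1, tx2, ty1 + (ty2 - ty1) / 8.0]
    # Union of the two partial boxes.
    return [
        min(femur_lower[0], tibia_upper[0]),
        min(femur_lower[1], tibia_upper[1]),
        max(femur_lower[2], tibia_upper[2]),
        max(femur_lower[3], tibia_upper[3]),
    ]

print(knee_crop_box([100, 0, 300, 800], [110, 780, 290, 1580]))
# [100, 700.0, 300, 880.0]
```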
For the training of the talus landmark detection model, the predicted talus bounding boxes were directly used to create the crops. Two landmarks were placed on the superior articular surface of the talus. In total, 320 images from two sites were used for training.
Manual landmark placement by the trained medical staff was conducted in a custom graphical user interface (GUI) developed in Python. The landmarks were continuously reviewed for quality and consistency. During training of each model, data augmentation was applied by rotating and/or scaling the crops by random factors. The crops were then downscaled to 256 × 256 pixels. The network was adapted from UNet and implemented in TensorFlow [23,24]. Training of all models ran on an NVIDIA GeForce GTX 1080 GPU, with a constant learning rate of 0.001 for 100 epochs each.
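The paper does not specify how the UNet outputs encode landmarks; a common choice for UNet-based landmark detection is one heatmap per landmark, with the coordinate taken at the heatmap argmax and rescaled from the 256 × 256 network resolution back to crop pixels. A minimal sketch under that assumption:

```python
import numpy as np

def heatmap_to_landmark(heatmap: np.ndarray, crop_w: int, crop_h: int):
    """Take the argmax of a single-landmark heatmap (predicted on the
    256 x 256 downscaled crop) and rescale it to crop pixel coordinates."""
    net_h, net_w = heatmap.shape
    flat_idx = int(np.argmax(heatmap))
    y, x = divmod(flat_idx, net_w)  # row-major flat index -> (row, col)
    return x * crop_w / net_w, y * crop_h / net_h

heatmap = np.zeros((256, 256))
heatmap[64, 128] = 1.0                   # peak at (x=128, y=64) in network space
print(heatmap_to_landmark(heatmap, crop_w=512, crop_h=1024))  # (256.0, 256.0)
```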

Projection and Parameter Computation
In the final two stages, the coordinates of the predicted landmarks on the crops were projected back to the original image. The segmentation masks and projected landmarks were then used to compute the five lower limb alignment parameters. In case of prediction failures (e.g., the talus was not detected by the segmentation model), the affected parameters were not computed. The remaining parameters, however, were still determined, resulting in individual detection rates per parameter.
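As an illustration of these final two stages, the sketch below (hypothetical coordinates and function names, not the authors' code) projects a crop landmark back to the full image and computes the hip-knee-ankle angle (mFAmTA) from the hip, knee, and ankle centers:

```python
import numpy as np

def project_to_original(point, crop_origin):
    """Shift a landmark from crop coordinates back to the full image."""
    return point[0] + crop_origin[0], point[1] + crop_origin[1]

def hka_angle(hip, knee, ankle) -> float:
    """Hip-knee-ankle angle (mFAmTA): angle between the femoral mechanical
    axis (hip center -> knee center) and the tibial mechanical axis
    (knee center -> ankle center), in degrees. 180 deg = straight leg;
    the sign convention for varus/valgus is omitted here.
    """
    femur_axis = np.subtract(hip, knee)
    tibia_axis = np.subtract(ankle, knee)
    cos_a = np.dot(femur_axis, tibia_axis) / (
        np.linalg.norm(femur_axis) * np.linalg.norm(tibia_axis)
    )
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

hip = project_to_original((50.0, 50.0), crop_origin=(0.0, 0.0))
knee, ankle = (50.0, 1050.0), (50.0, 2050.0)
print(hka_angle(hip, knee, ankle))  # 180.0 -- perfectly straight axis
```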

Statistical Analysis
For preoperative images, intra-rater (R2a vs. R2b) and inter-rater analyses (R1 vs. R2a, R1 vs. R2b, and each rater against the algorithm, R-AI) were conducted. For postoperative images, only the intra-rater analysis and the comparison against the algorithm were performed. Agreement within and between raters was quantified by mean differences (with 95% confidence interval (CI) and standard deviation (SD)), root mean square error (RMSE), Pearson's correlation coefficient r, and single-measure intra-class correlation coefficients (ICC) for absolute agreement. The Python 3 programming language was used for all statistical computations.
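The single-measure, absolute-agreement ICC corresponds to ICC(2,1) in the Shrout and Fleiss taxonomy (two-way random effects). The paper does not state which Python package was used; a minimal NumPy implementation of the standard formula might look like:

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """Single-measure, absolute-agreement ICC(2,1) for an
    (n subjects x k raters) matrix, per the two-way random-effects ANOVA."""
    n, k = ratings.shape
    grand = ratings.mean()
    subj_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)
    ssr = k * np.sum((subj_means - grand) ** 2)   # between-subject sum of squares
    ssc = n * np.sum((rater_means - grand) ** 2)  # between-rater sum of squares
    sst = np.sum((ratings - grand) ** 2)
    sse = sst - ssr - ssc                         # residual sum of squares
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

# Hypothetical example: two raters measuring mFAmTA (degrees) on four legs,
# with rater 2 showing a constant 1-degree bias.
rater1 = np.array([178.0, 182.5, 175.0, 180.0])
rater2 = rater1 + 1.0
print(round(icc_2_1(np.column_stack([rater1, rater2])), 3))  # 0.953
```

Because ICC(2,1) measures absolute agreement, the constant bias between the hypothetical raters pulls the coefficient below 1.0 even though their ratings are perfectly correlated.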

Patient Sample
The initial cohort consisted of 119 patients, with an age span of 44 to 85 years (mean 66 ± 9.3) and a gender distribution of 39% male and 61% female. After application of the exclusion criteria, involving the availability of ground truth values for all parameters, the final dataset consisted of 95 preoperative and 105 postoperative weight-bearing, antero-posterior FLR. Some patients were undergoing a second surgery and had already received a TKA on one knee at a previous timepoint. In such cases, and if bilateral images were present, parameters were measured on both TKA implants in the postoperative image, but only on the native knee in the preoperative image. Therefore, the number of postoperative FLR was higher than that of preoperative FLR.

Intra-Rater Reliability Analysis
Table 1 displays the results of the intra-rater reliability analysis, based on the repeated measurements of all pre- and postoperative images by rater 2. The intra-rater agreement, as defined by the ICC-value, was highest for mFAmTA (ICC preop = ICC postop = 1.0) and lowest for mLDFA (ICC preop = 0.95, ICC postop = 0.97). All comparisons reached significant agreement (all p < 0.001). The RMSE was smallest for mFAmTA (0.3° for both pre- and postoperative images) and largest for mLDTA (1.2° for preoperative, 1.3° for postoperative images).

Human vs. Human Inter-Rater Reliability Analysis
The preoperative images were measured by two raters to allow for an inter-rater reliability analysis. The results are displayed in Table 2. The agreement between raters was highest for the two parameters mFAmTA and FSAmTA (ICC preop = 0.99 for both R1 vs. R2a and R1 vs. R2b). Again, the agreement was significant (p < 0.001) for all comparisons. The agreement was lowest for mLDTA.

AI Detection Rates
For preoperative images, mMPTA, mLDTA, and FSAmTA could be computed in 98.9% of cases, while mLDFA and mFAmTA could be determined in 92.6% of cases. For postoperative images, mMPTA and mLDTA were computed in 97.1% of cases. The detection rates for postoperative FSAmTA, mLDFA, and mFAmTA were 95.2%, 94.3%, and 92.4%, respectively.

Critical Comparison with Literature
Previous scientific publications in the field of AI for the automated assessment of lower limb alignment utilized different software architectures and approaches. The "YOLOv4 And Resnet Landmark regression Algorithm" (YARLA) by Tack et al. represented a first step toward a fully automated assessment of knee alignment [13]. The accuracy of hip-knee-ankle angle computations was assessed on 2943 radiographs by comparing the results of two independent, publicly accessible image analysis studies. The average deviation between manually placed landmarks and automatically detected ones was less than 2.0 ± 1.5 mm for all structures. The average mismatch between hip-knee-ankle angle determinations was 0.09 ± 0.63°, which is in the same error range as the algorithm presented here (mean error R2a vs. AI: preoperative: 0.0°, postoperative: 0.15°). Compared to our algorithm, the method proposed by Tack et al. demonstrated a slightly higher detection rate of 99.85% for the hip-knee-ankle angle, which may be due to their larger training set of 900 images. Nevertheless, our approach achieved remarkable detection rates with approximately one third of the amount of training data. Due to differences in the validation images, directly comparing the performance of the two methods is challenging; however, the reported ICCs for the hip-knee-ankle angle were similar (0.98 by Tack et al. vs. 0.99 in this study). Our proposed method can determine four additional parameters apart from the hip-knee-ankle angle, which was the only parameter considered by Tack et al. Furthermore, in the presented study, there was a clear distinction between pre- and postoperative analyses, and the similarity of their statistical evaluations indicates the robustness and generalizability of our algorithm.
Notably, the results for FSAmTA and the hip-knee-ankle angle (mFAmTA) were generally better than for angles between the knee joint lines and the long axes of the bones. This applies to both the comparisons between two sets of human measurements and comparisons between human and AI measurements (see Tables 1-3). However, all ICC-values are still considered excellent, indicating that an accurate automatic determination is possible, even for these more challenging parameters.

Table 3. Inter-rater reliability between the AI method and manual measurements; preoperative n = 94 for mMPTA, mLDTA, FSAmTA and n = 88 for mLDFA, mFAmTA; postoperative n = 102 for mMPTA, mLDTA, n = 100 for FSAmTA, n = 99 for mLDFA, and n = 97 for mFAmTA.

Simon et al. trained an algorithm on over 15,000 radiographs to measure various clinical angles and lengths from standing long-leg radiographs [14]. AI and expert measurements were performed independently. A total of 295 long-leg radiographs from 284 patients were analyzed. The AI model produced outputs on 98.0% of the images. Comparing AI vs. the mean observer revealed mean absolute deviations between 0.39° and 2.19° for angles and 1.45–5.00 mm for lengths. Similar to the results presented here, their algorithm demonstrated excellent reliability for all lengths and angles (ICC ≥ 0.87). Again, no differentiation between preoperative and postoperative measurements was reported, so a scientific comparison was not possible within the scope of this study. The authors describe an algorithm with reproducible, accurate measurements and time savings; on average, their algorithm was 130 s faster than the clinicians [14]. Compared to our novel algorithm, theirs exhibited similar overall reliability and accuracy. The calculation of time savings was not part of our study, as, in our opinion, such a calculation is highly subjective and depends on computing power.
To provide a preliminary value: using the hospital's medical informatics infrastructure, our new algorithm can complete a fully automated assessment of lower limb alignment in less than 60 s.

What all of the above-mentioned solutions have in common is a lack of transparency into the black-box mechanism of the machine learning algorithm. We are convinced that the serial combination of five consecutive algorithms, with intermediate visualization of the predicted segmentations and landmarks, is suitable to disclose the localization and the exact reason for potential errors. Our final software solution will include GUI controls for the visualization of the single steps. This is a first step toward a warning mechanism for borderline cases.

General Limitations of Method
The basic procedure of measuring lower limb alignment remains unchanged in the presented method. The use of AI may result in time savings, fatigue-free controls, objectified examination procedures, and easy investigation of large volumes of images. However, fundamental procedural deficiencies persist as a result of the inherent technique of weight-bearing FLR. Jud et al. measured the hip-knee-ankle angle in simulated antero-posterior FLR. Deviations caused by rotation of up to 30°, flexion of up to 30°, or varus/valgus of up to 9° did not vary more than 3° from median values. They concluded that deviations in hip-knee-ankle angle measurements are comparable in patients with different coronal alignments [6]. However, Brouwer et al. pointed out several pitfalls in determining knee alignment in a cadaver study: e.g., simultaneous flexion of the knee and rotation of the leg induced large changes in projected angles, whereas flexion of the knee without rotation of the lower extremity had little effect on the angles projected on full-length antero-posterior radiographs, as did rotation of the lower extremity without flexion of the knee [25]. Zahn et al. showed that the postoperative mechanical axis correlates with limb loading. They used two digital scales to separately capture the load on each limb during X-ray imaging. The mechanical axis changed from a valgus alignment of −1° ± 2° ten days after surgery to a varus alignment of +1° ± 2° three months after surgery. The alterations were much more pronounced in patients with postoperative incomplete extension [26]. In alignment with Zahn et al. and the authors mentioned above, an imaging interval of three months after surgery appears to be sufficient. The algorithm itself does not address the specific structural measurement inaccuracy of the technique, and future warning mechanisms regarding malrotation and contractures may be useful.

Limitations of the Study
The size of the validation dataset was limited. However, our database grows automatically with time, allowing for the continuous improvement of the algorithm. Furthermore, the automated assessment of lower limb alignment was only evaluated on native knees and patients with TKA, even though the field of clinical application is larger, due to the variety of available implants. Therefore, we plan to include other implant types, such as plates or unicompartmental knee arthroplasty, in the validation dataset. The postoperative images were only measured by rater 2, preventing an inter-rater reliability analysis between the two human raters. However, based on the excellent inter-rater reliability for preoperative images, a comparable agreement between the two raters would be expected for postoperative images, too. Future studies including multiple raters may confirm this assumption.

Outlook and Connecting Factors
There are several possible extensions to our algorithm. Measuring alignment angles on different types of prostheses, such as unicompartmental knee arthroplasty, could be the next step. Other parameters, such as the joint line orientation angle, defined as the angle between the knee joint line and the floor, could be assessed; this parameter has been associated with worse postoperative outcomes in unicompartmental knee arthroplasty [27]. Leg length discrepancies are not commonly associated with TKA, but large changes in leg length are common after hinged TKA: Labott et al. reported an absolute mean and median change in leg length of 20 mm and 13 mm, respectively [28]. Those measurements could be integrated into the existing algorithm.
In the future, a holistic set of algorithms is conceivable by combining the achievements of several research teams. For example, a supplementary automated "Sarcopenia Screening", as well as an additional algorithm for the identification of specific orthopedic implant models from imaging, could be combined [29,30]. Another approach could be the integration of clinical factors, such as lab results or quality of life questionnaires. Bonakdari et al., for example, developed a comprehensive machine learning model that bridges major osteoarthritis risk factors, serum levels of adipokines, and related inflammatory factors to predict the risk of disease [31].
Recent studies point toward a more differentiated consideration of the knee joint's physiology. Subtypes with different physiological characteristics, such as constitutional varus knees, have been described [32]. The focus of TKA has shifted from "bone surgery" to "soft tissue surgery", and new concepts of kinematic alignment are paving the way for individual treatment options [33]. In the future, an accurate and fully automated algorithm for the assessment of knee alignment can replace arduous routine tasks, leaving more time for careful diagnosis and the selection of appropriate treatment options. For instance, the algorithm could be triggered automatically after an X-ray is taken in the hospital; the clinician could then review and either correct or immediately approve the automatic measurements.

Conclusions
The overall findings of the study provide a validated, transparent, and reliable AI-based algorithm for the automated assessment of lower limb alignment on weight-bearing FLR, with a clear distinction between pre- and postoperative analysis. The AI algorithm is able to determine five key lower limb alignment angles with excellent reliability and accuracy. Based on these results, we propose the following two main use cases. First, the algorithm may be used to independently analyze large datasets for research purposes. Second, it could be used under supervision to support clinicians in time-consuming manual routine measurements.

Funding: This work was partially funded as a DLR project by the German Federal Ministry for Economic Affairs and Climate Action (https://www.bmwi.de (accessed on 31 October 2022); ID: 01MK20003A and 01MK20003G) and resulted from a project within the AIQNET initiative (https://aiqnet.eu (accessed on 31 October 2022)). However, the sponsors had no role in the study design, data collection and analysis, or preparation of the manuscript. PG, MD, and CS are employees of RAYLYTIC GmbH (https://www.raylytic.com (accessed on 31 October 2022)). We acknowledge support by the Open Access Publishing Fund of the University of Tübingen.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of University of Tübingen (ID: 197/2021BO2, approval date 25 May 2021).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Not applicable.