Clinical Score for Predicting the Risk of Poor Ambulation at Discharge in Fragility Femoral Neck Fracture Patients: A Development Study

Surgical treatment in patients with fragility femoral neck fractures often leads to a longer length of hospital stay (LOS) and higher costs. Intensive rehabilitation is one of the choices to reduce LOS, but patient selection criteria are controversial. We intended to develop a clinical score to predict the risk of poor ambulation at discharge. This study was based on a retrospective cohort of patients diagnosed with fragility femoral neck fractures surgically managed from January 2010 to December 2019 at Chiang Mai University (CMU) Hospital. Pre-, intra-, and post-operative factors that affect rehabilitation training were candidate predictors. All patients were categorized into able or unable groups based on their ability to bear self-weight at discharge. Logistic regression was used for score derivation. Five hundred and nine patients were included in this study. Male sex, end-stage renal disease (ESRD), cerebrovascular disease, psychiatric disorders, pre-fracture ambulation with gait aids, concomitant fracture, post-operative intensive care unit (ICU) admission or ventilator use, and urinary catheter use on the second day post-operation were identified as the prognostic factors. The score showed an AuROC of 0.84 with good calibration. The score can be used for risk stratification on the second day post-operation. External validation is encouraged before clinical implementation.


Introduction
Currently, surgical management is the primary treatment for clinically stable fragility hip fractures, generally defined as any fracture in an adult over 50 that occurs by a low-energy mechanism of injury [1,2]. However, surgical management often leads to a longer length of hospital stay (LOS) [3][4][5], which is significantly associated with higher morbidity, mortality, and healthcare costs [6,7]. Several strategies have been studied to decrease the LOS, but no clear recommendation has been established. Early surgery and rehabilitation have been proposed to reduce the overall LOS among geriatric hip fracture patients [2,4,8,9,10]. However, certain barriers exist in practice and hinder the effective implementation of early in-hospital rehabilitation, such as delayed surgery, insufficient mobilization and functional independence training, a lack of coordinated discharge planning or referral, and family or caregiver burdens [11].
Stratifying patients according to their risk of poor ambulation at discharge may lead to better prioritization and reduce the aforementioned barriers. The intensive rehabilitation program should only focus on patients who are unlikely to achieve adequate ambulation within their hospital stays. Our previous study in patients with fragility hip fractures reported prognostic factors that were predictive of the inability to self-ambulate at discharge, such as patients with significant comorbidities, impaired baseline ambulatory status, associated fractures, and pressure ulcers [12]. Recognizing these factors during patient evaluation could potentially help clinicians to identify patients at risk of poor ambulatory status at discharge; however, this approach does not consider the multivariable nature of clinical prediction and does not provide estimations of the individual probability of poor ambulation.
A multivariable prediction score is an attractive clinical tool that simultaneously considers multiple features within each patient and provides absolute predictive values of outcomes. To date, only a few clinical scores for predicting in-hospital ambulation status among geriatric hip fracture patients are available, and often they are not specific for each subtype of hip fracture, resulting in low predictive performance [13]. Moreover, some of the previous scores did not include factors that were associated with the functional outcomes of the patients, such as concomitant fracture, post-operative complications, and post-operative intensive care unit (ICU) admissions [14][15][16][17][18][19], but incorporated certain factors that were not routinely collected in practice, such as the mental status examination (MSE), and, thus, were irrelevant and impractical [20]. In this study, we intended to develop a simplified clinical score to predict the risk of poor ambulation or being unable to bear self-weight at discharge in patients with fragility femoral neck fractures. The information gained from the score might be useful for clinicians to predict early post-operative functional outcomes and provide optimal, individualized rehabilitation plans for each patient.

Study Design
A clinical risk score was developed based on a retrospective observational cohort of patients diagnosed with fragility femoral neck fractures surgically managed with fixation or arthroplasty from January 2010 to December 2019 at Chiang Mai University (CMU) Hospital, a tertiary care center and medical school. The study protocol was approved by the Research Ethics Committee, Faculty of Medicine, Chiang Mai University (certificate of approval No. EXEMPTION 7375/2020). Patient consent was waived due to the retrospective nature of data collection.

Study Patients
Adult patients with fragility femoral neck fractures who were operated on at our institution during the study period were included. A fragility femoral neck fracture was defined as any femoral neck fracture in an adult over 50 that occurred by a low-energy mechanism of injury (e.g., falling from a standing height). The exclusion criteria were patients who were referred to or from other hospitals and patients with poor baseline ambulatory status (ambulation with wheelchairs or non-ambulatory status, e.g., bedridden).

Candidate Predictors and Data Collection
We collected data on evidence-proven pre-, intra-, and post-operative factors that affect rehabilitation training and ambulation status at discharge. These factors were used as candidate predictors during statistical modeling and score derivation. The pre-operative factors were gender [14,15], age [14,21], body mass index (BMI) [22], pre-injury ambulation status [13,21,23,24], serum albumin level [25,26], concomitant fracture [27], second hip fracture [28], operative technique (arthroplasty or fixation surgery) [29], and any comorbidity that might affect patients' rehabilitation and ambulation [30,31]. The intra-operative factors were the amount of time from admission to surgery, total anesthesia time, and volume of intra-operative blood loss [32]. Lastly, the post-operative factors included post-operative intensive care unit (ICU) admission or ventilator use, post-operative sedative drug use, pain score on the initial rehabilitation day [33], post-operative blood transfusion [34], and urinary catheter use on the second day post-operation [35].

Study Endpoint
The endpoint to be predicted was the ambulatory status at discharge for each patient. All patients were categorized based on their ability to bear self-weight at discharge as either able or unable. Patients who could not bear self-weight were defined as those who could only ambulate in a wheelchair or those who could not ambulate or were bedridden. In the other group, patients who were able to bear self-weight were defined as patients who had independent ambulation or patients who could ambulate with gait aids (crutch, cane, or walker).

Sample Size Estimation
We estimated the sample size required for developing the multivariable clinical risk score using methods suggested by Riley et al. [36]. Assuming that the number of significant candidate predictors from the univariable analysis was 15 and estimating the concordance statistics to be 0.85 and the average incidence of the outcome to be 20% based on our previous report [12], a minimum sample size of 460 with 92 events was needed. Achieving this target sample size would guarantee a 0.05 acceptable difference in apparent and adjusted R-squared, a 0.05 margin of error in the estimation of intercept, and an optimal number of events per predictor.
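The arithmetic behind two of the quantities quoted above can be sketched as follows. This is only an illustration under stated assumptions: it reproduces the margin-of-error criterion for the overall outcome proportion, n = (1.96/δ)² φ(1 − φ), and the expected number of events at the target sample size. It does not reproduce the shrinkage and R-squared criteria, which depend on an anticipated Cox-Snell R-squared and which presumably determined the reported minimum of 460. The function names are hypothetical.

```python
import math


def n_for_intercept_margin(prevalence, margin, z=1.96):
    """Smallest n that estimates the overall outcome proportion within +/- margin.

    Implements n = (z / margin)^2 * prevalence * (1 - prevalence), rounded up.
    """
    return math.ceil((z / margin) ** 2 * prevalence * (1 - prevalence))


def expected_events(n, prevalence):
    """Expected number of outcome events in a sample of size n."""
    return round(n * prevalence)


# Inputs taken from the text: 20% outcome incidence, 0.05 margin of error,
# 15 candidate predictors, and a target sample size of 460.
n_margin = n_for_intercept_margin(prevalence=0.20, margin=0.05)
events = expected_events(n=460, prevalence=0.20)
events_per_candidate = events / 15

print(n_margin)                        # sample size for the margin criterion alone
print(events)                          # expected events at n = 460
print(round(events_per_candidate, 1))  # events per candidate predictor
```

Under these inputs the margin-of-error criterion alone is satisfied well below 460 patients, which is consistent with one of the other criteria being the binding one.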

Statistical Analysis
All data analyses were performed using Stata 17 (StataCorp, College Station, TX, USA). The level of statistical significance was considered to be a p-value less than 0.05. Frequency and percentage were used to describe categorical variables. Mean and standard deviation (SD) or median and interquartile range (IQR) were used to describe numerical variables based on the underlying distribution. A comparison of categorical variables between groups was conducted with Fisher's exact test. Student's t-test or the Mann-Whitney U test was used to compare numerical variables. The area under the receiver operating characteristic (AuROC) curve for each candidate predictor was also presented to reflect the discriminative performance [37].

Score Development
The association between candidate predictors and the outcome was explored using univariable binary logistic regression. Missing data were explored. If any predictor showed more than 50% missing data, that predictor was not included in the multivariable analysis. Multiple imputation (MI) with the chained equation method (MICE) was used to handle the missing data and improve statistical efficiency. Gender, age, pre-fracture ambulation status, underlying diseases, operative technique, and ambulation status at discharge were pre-selected as independent variables to predict the missing values using linear regression.
Multivariable binary logistic regression with a stepwise backward elimination method was used to identify the final set of predictors. Predictors with a p-value of 0.05 or greater in the multivariable model were removed in a stepwise manner. After the final model was derived, the weighted score of each remaining predictor was calculated by dividing each logit coefficient by the lowest one and rounded up to the nearest integer.
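The coefficient-to-score transformation described above can be sketched as follows. This is a minimal illustration, not the authors' Stata code. The predictor names and all coefficients except the lowest one (0.96, as reported in the results) are hypothetical placeholders. The rounding step is exposed as a parameter because it is an assumption: the text states rounding to an integer, while the reported scores include half points, so a step of 0.5 is shown as an alternative.

```python
import math


def round_to_step(value, step):
    """Round value to the nearest multiple of step (halves round upward)."""
    return math.floor(value / step + 0.5) * step


def weighted_scores(coefficients, step=1.0):
    """Divide each logit coefficient by the smallest one, then round.

    coefficients: mapping of predictor name -> logit coefficient.
    step: rounding granularity (assumption; 1.0 = integer points).
    """
    base = min(coefficients.values())
    return {name: round_to_step(beta / base, step)
            for name, beta in coefficients.items()}


# Hypothetical coefficients for illustration only; 0.96 is the reported minimum.
coefs = {"predictor_a": 0.96, "predictor_b": 1.45, "predictor_c": 2.10}

integer_points = weighted_scores(coefs)          # step of 1 point
half_points = weighted_scores(coefs, step=0.5)   # step of 0.5 point
print(integer_points)
print(half_points)
```

A patient's total score is then the sum of the points for every predictor that is present.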

Model Performance and Internal Validation
The model performance was assessed in terms of discrimination and calibration. The discriminative ability was represented with an AuROC. The score calibration was illustrated with the score calibration plot. Internal validation of this model was carried out via bootstrap re-sampling with 1000 replicates. The test AuROC, calibration slope, and the bootstrap shrinkage factor were reported.
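As an aid to interpreting the discrimination measure, the AuROC of a score equals the probability that a randomly chosen patient with the outcome has a higher score than a randomly chosen patient without it, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is an illustration only: the analysis in this study was run in Stata, the function name is hypothetical, and the toy data are invented.

```python
def auroc(scores, outcomes):
    """Concordance (AuROC) of a score for a binary outcome.

    scores: numeric risk scores, one per patient.
    outcomes: 1 if the patient had the outcome, 0 otherwise.
    """
    with_outcome = [s for s, y in zip(scores, outcomes) if y == 1]
    without_outcome = [s for s, y in zip(scores, outcomes) if y == 0]
    if not with_outcome or not without_outcome:
        raise ValueError("both outcome classes must be present")

    concordant = 0.0
    for s_pos in with_outcome:
        for s_neg in without_outcome:
            if s_pos > s_neg:
                concordant += 1.0   # higher score in the patient with the outcome
            elif s_pos == s_neg:
                concordant += 0.5   # tied scores count as half
    return concordant / (len(with_outcome) * len(without_outcome))


# Invented toy data: six patients, three with the outcome.
toy_scores = [0, 1, 1.5, 2, 4, 5.5]
toy_outcomes = [0, 0, 1, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_outcomes)
print(round(toy_auroc, 3))
```

In a bootstrap internal validation, the same statistic would be recomputed for a model refitted in each resample, and the average difference between resample and original-sample performance gives the optimism.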

Categorizing Risk Groups
For clinical applicability, the newly derived risk score was categorized into three groups: low, moderate, and high risk of inability to bear self-weight at discharge. The cut-point was determined by considering both clinical and statistical aspects of the clinical decision in this context. The lower and the higher cut-off points were chosen based on the sensitivity, specificity, and likelihood ratio (LR) of each possible score cut-off point. The low-risk group had high sensitivity and the upper bound of the 95% CI of an LR less than 1, whereas the high-risk group had high specificity and the lower bound of the 95% CI of an LR higher than 5. The moderate risk group had an LR ranging from 1 to 10 [38].
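The likelihood ratio of a risk category compares how often that category occurs among patients with the outcome and among patients without it. The sketch below shows this stratum-specific calculation. All counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts of this cohort, and the function name is an assumption.

```python
def stratum_likelihood_ratio(events_in_stratum, total_events,
                             nonevents_in_stratum, total_nonevents):
    """LR of a category = P(category | outcome) / P(category | no outcome)."""
    p_given_outcome = events_in_stratum / total_events
    p_given_no_outcome = nonevents_in_stratum / total_nonevents
    return p_given_outcome / p_given_no_outcome


# Hypothetical counts: 50 patients with the outcome, 200 without.
events = {"low": 5, "moderate": 25, "high": 20}
nonevents = {"low": 100, "moderate": 90, "high": 10}

total_events = sum(events.values())
total_nonevents = sum(nonevents.values())

lr = {
    group: stratum_likelihood_ratio(events[group], total_events,
                                    nonevents[group], total_nonevents)
    for group in events
}
for group, value in lr.items():
    print(group, round(value, 2))
```

An LR well below 1 lowers the probability of the outcome relative to the pre-test probability, while a large LR raises it, which is the rationale for using LR bounds to place the cut-off points.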

Study Patients
Five hundred and seventy-four patients with fragility femoral neck fractures were eligible for inclusion. Sixty-five patients were excluded: fifty-four were injured due to high-energy trauma, six were bedridden at baseline, four used wheelchairs at baseline, and one patient was referred to another hospital. Finally, 509 patients were included for statistical analysis and score derivation (Figure 1). Of these, 99 (19.4%) patients were unable to bear self-weight at discharge. In this cohort, most of the patients (408, 80.2%) underwent surgical arthroplasty (396 (77.8%) hemiarthroplasty and 12 (2.4%) total hip arthroplasty), while only 101 (19.8%) received surgical fixation (73 (14.3%) multiple screw fixation and 28 (5.5%) dynamic hip screw fixation).
Table 1 describes and compares the differences in baseline characteristics and pre-, intra-, and post-operative factors between patients who were unable and able to bear self-weight at discharge. There were significant differences between the two groups in several predictors (Table 1). The AuROC analysis identified four prognostic factors that had an AuROC ≥ 0.60, which were gender, pre-fracture ambulation status, hypoalbuminemia, and urinary catheter use on the second day post-operation. Table 2 shows the results of univariable and multivariable logistic regression. A total of twelve candidate predictors with statistical significance according to the univariable analysis were included in the multivariable analysis (Table 2).
Table 1. Baseline characteristics and pre-, intra-, and post-operative factors of patients with fragility femoral neck fractures included in this cohort by their ability to bear self-weight at discharge.
Table 2. Univariable and multivariable binary logistic regression presenting univariable odds ratio (uOR) and multivariable odds ratio (mOR). a Included patients with drug abuse; b COPD = chronic obstructive pulmonary disease; c eye diseases included blindness, cataracts, and glaucoma; d fixation included multiple screw fixation and dynamic hip screw fixation (DHS).

Multivariable Modeling and Score Derivation
After stepwise backward elimination, eight predictors remained significant and were used for score derivation: male gender, ESRD, cerebrovascular disease, psychiatric disorders, pre-fracture ambulation with gait aids, concomitant fracture, post-operative ICU admission or ventilator use, and urinary catheter use on the second day post-operation (Table 3). The logit coefficient of each predictor was transformed into the weighted score by dividing each logit coefficient by the lowest one (0.96) and rounded up to the nearest integer, as shown in Table 3. The possible total score ranged from 0 to 14.5 points. For our data, the scores ranged from a minimum of 0 to a maximum of 8.5. The higher the score, the higher the probability of inability to bear self-weight at discharge. The score was significantly different between the two groups (median 3 (IQR 2.5, 5) vs. median 1 (IQR 0, 2), p < 0.001).

Model Performance and Internal Validation
Our clinical risk score showed excellent discriminative performance with an apparent AuROC of 0.84 (95% CI 0.79, 0.89) (Figure 2A). The score also showed good agreement between the predicted probability of the inability to bear self-weight at discharge and the observed proportion of patients with the outcomes according to the score calibration plot (Figure 2B).

Figure 2. Apparent validation of score performance in terms of discrimination and calibration. (A) Parametric receiver operating characteristic (ROC) curve representing the ability of the score to discriminate between patients who were able and unable to bear self-weight at discharge; (B) calibration plot visualizing the agreement between the predicted risk and the observed proportion of the outcome across the range of the newly derived score.

Risk Categories and Score Accuracy
Two cut-off points were identified: a lower cut-off point at a score > 1 and a higher cut-off point at a score ≥ 4. At the lower cut-off point, the sensitivity was 92.9% and the specificity was 36.6%, whereas at the higher cut-off point, the sensitivity was 39.4% and the specificity was 96.6%. Patients with risk scores of 0 to 1 were classified as low-risk, those with risk scores of 1.5 to 3.5 were classified as moderate-risk, and those with a risk score higher than or equal to 4 were classified as high-risk. The positive predictive values and the LR of each risk category are shown in Table 4.
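The categorization rule and the accuracy figures above can be illustrated with a short sketch. The thresholds follow the text (0 to 1 low, 1.5 to 3.5 moderate, 4 or more high). The 2x2 counts are an assumption: they are one set of counts that is consistent with the reported percentages and with the cohort totals (99 patients unable and 410 able to bear self-weight), back-calculated here for illustration rather than taken from the study tables. Function names are hypothetical.

```python
def risk_group(score):
    """Map a total score to the three risk categories described in the text."""
    if score < 0:
        raise ValueError("score cannot be negative")
    if score <= 1:
        return "low"
    if score < 4:
        return "moderate"
    return "high"


def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity and specificity from the cells of a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)


# Back-calculated counts (assumption), outcome = unable to bear self-weight.
# Lower cut-off: a score > 1 counts as test-positive.
sens_low, spec_low = sensitivity_specificity(tp=92, fn=7, tn=150, fp=260)
# Higher cut-off: a score >= 4 counts as test-positive.
sens_high, spec_high = sensitivity_specificity(tp=39, fn=60, tn=396, fp=14)

print(risk_group(1), risk_group(1.5), risk_group(4))
print(round(sens_low * 100, 1), round(spec_low * 100, 1))
print(round(sens_high * 100, 1), round(spec_high * 100, 1))
```

The lower cut-off trades specificity for sensitivity so that few patients who will be unable to bear weight are labelled low-risk, while the higher cut-off does the opposite so that a high-risk label is rarely a false alarm.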

Discussion
This study developed and internally validated a simplified and fracture-specific clinical risk score for predicting the inability to bear self-weight at discharge in patients with fragility femoral neck fractures. The newly developed score consists of eight clinically relevant and routinely available predictors: male gender, ESRD, cerebrovascular disease, psychiatric disorders, pre-fracture ambulation with gait aids, concomitant fracture, post-operative ICU admission or ventilator use, and urinary catheter use on the second day post-operation. The score showed excellent discriminative ability and good calibration. Based on the internal validation, the degree of optimism, or overfitting, was low.
Currently, clinicians often rely on standard physical performance scores or individualized multivariable risk scoring systems to predict post-operative ambulation status in patients with fragility hip fractures [18]. The standard physical performance scores commonly used in practice are the Barthel-20 [39], the Barthel-100 [40], the Cumulated Ambulatory Score (CAS) [19], and the New Mobility Score (NMS) [15]. However, these scores were not specifically designed to predict ambulatory status in geriatric patients with hip fractures, which explains why their discriminative performances in validation studies were often low, with an AuROC ranging from 0.64 to 0.66 [13]. In contrast, individualized risk scoring systems have continuously emerged as a more attractive and more specific method that can improve predictive performance [17][18][19]. A few scoring systems have been developed for the prediction of post-operative functional outcomes and functional statuses, such as the simple scoring system by Hagino T. et al. [17], the six risk scores by Burgos E. et al. [23], the prognostic model for predicting the recovery of walking independence of elderly patients after hip-fracture surgery by Bellelli G. et al. [41], and the predictive model of gait recovery at one month after hip fracture [16]. Table 5 provides specific details of the study domain, predictors used, and the definition of the clinical endpoint for each scoring system. The American Society of Anesthesiologists' (ASA) Physical Status Classification, the Charlson Comorbidity Index (CCI), and the Mental State Examination (MSE) were commonly incorporated within these scoring methods, as mentioned earlier, because they represent the comorbidity burden of the patients. However, these parameters have some important weaknesses. Firstly, the ASA classification shows only fair interrater reliability, which may result in an inconsistent assessment [42].
Secondly, several comorbidities that might significantly affect the rehabilitation outcomes, such as psychological disorders, are not included in the CCI [43]. Thirdly, the MSE is a standard tool for evaluating the neurocognitive status of geriatric patients [44] and may be able to identify patients who would not respond well to rehabilitation. However, due to its low sensitivity, the MSE might not be appropriate for screening [20]. Moreover, most hospitals in Thailand do not perform MSE screening in their routine outpatient and inpatient practice. Finally, these scoring systems do not include some well-known predictive factors that serve as significant barriers to post-operative rehabilitation (e.g., concomitant fracture, post-operative ICU admission or ventilator use, and post-operative complication) [14][15][16][17][18][19]. Surgical technique might be another promising prognostic factor that could have good predictive ability and should be incorporated into clinical scoring systems. Hip arthroplasty with the cementing technique has been proven to result in less post-operative pain and earlier post-operative ambulation due to immediate implant stability [29]. In contrast, patients who are treated by surgical fixation have to ambulate with protective weight bearing during the early phase because early full weight bearing may lead to several complications, such as excessive hip screw sliding, hip screw breakdown, and the hip screw cutting out from the femoral head [45].
In this study, we developed a simplified and practical clinical risk scoring system for patients with fragility femoral neck fractures using readily available clinical data. The specificity of the domain and the inclusion of clinically relevant predictors and predictors that affect post-operative rehabilitation outcomes led to the high predictive ability of the score. The point of prediction of the score was defined on the second day post-operation, which is generally an appropriate time for patient evaluation, as common post-operative events, including the removal of urinary catheters, post-operative complications, and ICU admissions, usually occur within 48 h after surgery [46]. For clinical applicability, the score was categorized into three risk groups: low-risk (0-1 points), moderate-risk (1.5-3.5 points), and high-risk (≥4 points).
Intensive rehabilitation aims to minimize impairments and improve functional independence as much as possible [47]. However, the definitive criteria of patients who are candidates for intensive rehabilitation are controversial and remain unclear. In Thailand, the indication for intensive rehabilitation is left either to the discretion of the attending physicians or the local policy and practice guidelines of each hospital [4,9,10]. This ambiguity leads to improper rehabilitation consultation and management, prolonged LOS, and higher overall healthcare costs. Our newly developed clinical risk score confers potential clinical benefits in helping clinicians to properly stratify their patients into three separate risk groups according to their risk of inability to bear self-weight at discharge. We recommend that intensive rehabilitation programs be offered only to patients in the high-risk group, whereas early or usual in-ward rehabilitation can be offered to both the low- and moderate-risk groups (Figure 3). Unpredictable external factors could, however, affect the hospital's capacity for the effective delivery of the rehabilitation program. The COVID-19 pandemic is a good example of a critical situation that could tremendously limit the capacity of hospital services [48]. In a limited resource situation, patients with hip fractures should be discharged early from the hospital [49], as these patients are often accompanied by high-clinical-risk features for COVID-19 pneumonia and mortality, such as older age and multiple comorbidities [50,51]. Thus, patients predicted to be moderate- to high-risk should be candidates for early discharge and home-based rehabilitation (Figure 3). In addition, effective planning and the training of caregivers for home-based rehabilitation showed no difference in short-term outcomes compared to intensive in-patient rehabilitation [52].
Undoubtedly, the implementation of our score should be further modified to fit the organizational context of each hospital.
Our study has important strengths and limitations. In terms of strengths, firstly, the development data included a homogeneous population domain that directly answers our clinical question. Secondly, all predictors incorporated within the model were evidence-based, clinically relevant, and routinely available. Thirdly, the sample size and the number of events in this cohort were adequate for multivariable score development. The most important limitation of our study was the data collection design, which was retrospective. Thus, the quality of data might not be perfect, and missing data for some important predictors may have affected the model's statistical power and caused biases. In this study, missing data were properly handled using standard multiple imputation methods to maximize the amount of information used during statistical modeling and preserve statistical power. Secondly, the point of prediction on the second day post-operation may be too delayed to guide clinical and rehabilitation management in some circumstances.
Ideally, predictions using only pre- and intra-operative factors might be more feasible in actual practice. Thirdly, the number of predictors included was high and may affect clinical applications. Additional efforts might be taken to further simplify the score by reducing the number of predictors or improve the practicality by embedding the score within a web or mobile application. Fourthly, our patient cohort only included Thai patients who visited a single tertiary care center and may only represent Thailand's specific clinical context. An external validation study is therefore required before clinical implementation in different settings. Finally, the predicted outcome of our scoring system was dichotomized as either able or unable to ambulate at discharge, as we believed it would be sufficient for patient and family counselling in the acute phase. During this phase, the patients and their family members, or caregivers, are often anxious about the prognosis after treatment and require effective risk communication and short- and long-term treatment plans from clinicians [53]. Moreover, the post-operative status could be altered during home-program ambulation training. It was revealed that about 1/3 of the patients could return to pre-fracture mobility or functional independence, and about 1/2 to 2/3 of the patients could achieve activities of daily living (ADL) without difficulty [54]. Further research could develop a prediction model to estimate the probability of each ambulatory level, such as partial weight bearing, full weight bearing, non-weight bearing, or unable to ambulate, which might provide more useful information for shared decision making. However, a larger sample size and a well-balanced proportion of each ambulatory status are needed to yield optimal estimates.

Conclusions
A simplified clinical risk score was developed to predict the probability of inability to bear self-weight at discharge in patients with fragility femoral neck fractures. The score incorporated eight clinical predictors and can be used for prediction and risk stratification on the second day post-operation. Recommendations were provided for group-specific rehabilitation management protocols during admission, with the primary intention of optimizing patient functional outcomes and reducing the overall LOS and healthcare cost. Although the score showed excellent discriminative ability and good calibration within the development dataset, validating the score in other external datasets is highly encouraged before clinical implementation.