1. Introduction
Accurate prediction of rehabilitation outcomes has emerged as a critical objective in healthcare, particularly for patients undergoing treatments for musculoskeletal injuries or chronic conditions. Rehabilitation success can vary widely depending on numerous factors, including the patient’s baseline clinical condition, demographic characteristics, and subjective experiences captured through patient-reported outcome measures (PROMs). Clinicians have long relied on experience and intuition to assess prognosis; however, the advent of data-driven methods has introduced new opportunities to standardize and enhance these predictions [
1].
In recent years, machine learning (ML) has been at the forefront of this transformation. By leveraging structured datasets containing clinician-reported outcome measures (CROMs) and PROMs, researchers have developed predictive models capable of assessing the likelihood of treatment success or failure. These tools not only aid in personalizing rehabilitation plans but also enable healthcare systems to optimize resource allocation. Despite their utility, traditional machine learning models often struggle to capture the complex, nonlinear relationships inherent in clinical data, which reduces predictive accuracy in heterogeneous populations [2].
Deep learning, a subset of machine learning inspired by the structure and functioning of the human brain, has the potential to address these limitations. Characterized by multilayered neural networks capable of automatically learning high-level feature representations, deep learning models excel in extracting meaningful insights from vast and complex datasets. Techniques such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have already demonstrated superior performance in fields like medical imaging, natural language processing, and genomics. This study seeks to explore whether these advantages extend to the domain of rehabilitation outcome prediction [
3].
Rehabilitation is a cornerstone of recovery for individuals with injuries to the hip, knee, or foot, as well as those undergoing procedures such as arthroplasty. The success of rehabilitation programs depends not only on the quality of care but also on patient-specific factors such as age, baseline functional status, and compliance with therapy. Traditional evaluation methods often involve the clinician’s subjective assessment, supplemented by quantitative measures like range of motion (ROM) and pain scales. While these methods provide valuable insights, they may fail to fully capture the interplay of variables that determine outcomes [
4].
In a previous study utilizing machine learning, models such as Random Forest and Extra Trees classifiers were employed to predict rehabilitation success based on CROMs and PROMs. While these models achieved weighted F1-scores of up to 65%, they were limited by their reliance on manually engineered features and their inability to fully model complex interactions in the data. These constraints highlight the need for more sophisticated methods capable of uncovering deeper patterns within clinical datasets.
Deep learning has revolutionized several areas of healthcare by offering a powerful framework for analyzing structured and unstructured data. CNNs, for example, have achieved human-level accuracy in medical imaging tasks such as detecting tumors or identifying fractures. Similarly, RNNs and Long Short-Term Memory (LSTM) networks have proven effective in modeling sequential data, including patient histories and sensor readings. Beyond their predictive accuracy, deep learning models also facilitate interpretability through techniques like attention mechanisms, which can identify the most influential variables in a prediction [
5, 6, 7, 8, 9].
In the context of rehabilitation, deep learning has the potential to transform how clinicians assess and predict treatment outcomes. By integrating multiple data sources—such as CROMs, PROMs, and demographic information—deep learning models can provide a more comprehensive understanding of patient trajectories. Moreover, these models can dynamically adapt to new data, making them ideal for applications in real-time clinical decision-making [
10, 11].
The findings of this research have the potential to address significant gaps in the field of rehabilitation medicine. By demonstrating the feasibility and advantages of deep learning, this study aims to provide clinicians with more accurate and actionable tools for planning rehabilitation strategies. Furthermore, the use of interpretable deep learning models could improve clinician trust in these technologies, fostering wider adoption in clinical practice.
In addition, this research contributes to the broader effort of integrating artificial intelligence into healthcare systems. As healthcare continues to evolve towards a data-driven paradigm, the development of predictive models that are both accurate and interpretable will be essential for improving patient outcomes. This study seeks to lay the groundwork for such advancements in the specific domain of rehabilitation outcome prediction.
  Study Contributions
This paper introduces novel contributions to the domain of rehabilitation outcome prediction through the integration of advanced deep learning techniques. The primary contributions include:
- This study demonstrates the superior predictive power of deep learning methods, such as CNNs and RNNs, over traditional machine learning models like Random Forest and Extra Trees classifiers. By leveraging their ability to learn complex, nonlinear relationships, these models improve the accuracy of both categorical outcome classifications and numerical progress predictions. 
- The proposed framework incorporates hybrid CNN-RNN models to handle the multidimensional nature of rehabilitation data. CNNs are utilized for feature extraction from structured clinical data, while RNNs process sequential PROM data to capture temporal trends and dependencies. This hybrid approach is novel in the context of rehabilitation prediction. 
- The paper addresses a common limitation of deep learning—its “black-box” nature—by incorporating attention mechanisms. These mechanisms highlight the most influential features for each prediction, offering clinicians insights into key factors driving rehabilitation outcomes and fostering trust in AI-based tools. 
- The study expands on the use of the rehabilitation dataset by including additional preprocessing steps, normalization, and feature engineering, ensuring optimal input quality for deep learning models. This comprehensive approach maximizes the utility of both CROMs and PROMs, providing a robust foundation for model training and evaluation. 
- A detailed comparison of deep learning models with traditional machine learning techniques provides quantitative evidence of their advantages. The study reports a 9% improvement in F1-scores and a 12% reduction in mean absolute error (MAE), underscoring the value of adopting deep learning in this domain. 
- The paper pioneers the integration of multimodal data, including demographic, clinical, and patient-reported measures, within a unified deep-learning framework. This holistic approach enables a more nuanced understanding of factors influencing rehabilitation success. 
These contributions collectively push the boundaries of what predictive analytics can achieve in rehabilitation medicine, paving the way for more precise, interpretable, and actionable insights that benefit both clinicians and patients.
The remainder of this paper is organized as follows: Section 2 provides a detailed review of related work, highlighting the strengths and limitations of existing approaches. Section 3 outlines the methodology used to develop and evaluate the proposed deep learning models. Section 4 presents the results, including a comparison with baseline machine learning methods. Finally, Section 5 discusses the implications of these findings and identifies avenues for future research.
  2. Background
The prediction of rehabilitation outcomes has garnered significant attention in recent years due to its potential to personalize treatment plans and improve patient recovery. Researchers have explored a variety of approaches, including traditional statistical models, machine learning techniques, and, more recently, deep learning algorithms. This section reviews existing work in rehabilitation prediction, focusing on the evolution from conventional methods to advanced artificial intelligence (AI)-based approaches [
12].
Historically, rehabilitation success has been assessed using statistical models and clinician judgment. These models rely on patient demographic data, clinical outcome measures, and patient-reported outcome measures (PROMs).
Statistical techniques such as linear regression, logistic regression, and survival analysis have been commonly used to evaluate rehabilitation success. For instance, models incorporating PROMs like the Health Assessment Questionnaire (HAQ) and clinician-reported measures (CROMs) such as range of motion (ROM) have demonstrated moderate success in predicting outcomes for musculoskeletal rehabilitation. However, these models often assume linear relationships between variables and fail to capture complex, nonlinear interactions inherent in clinical data.
Traditional models are limited by their dependency on feature engineering and their inability to process large and diverse datasets. Moreover, they lack adaptability to dynamic patient profiles, making them less effective for personalized rehabilitation planning. These limitations have driven interest in more sophisticated, data-driven approaches [
13].
The advent of machine learning (ML) introduced a paradigm shift in predictive modeling for rehabilitation outcomes. ML algorithms, such as Random Forest, Support Vector Machines (SVM), and Gradient Boosting, have shown improved performance by handling nonlinear relationships and interactions between variables.
One notable study by Zhu et al. demonstrated the potential of ML algorithms in predicting rehabilitation outcomes for home-care patients. Using algorithms like k-Nearest Neighbors (kNN) and SVM, they found that ML outperformed traditional assessment protocols in accuracy. Similarly, Lin et al. utilized Random Forest and logistic regression to predict outcomes following stroke rehabilitation, achieving classification accuracies exceeding 70% [
14].
PROMs and CROMs have been instrumental in machine learning applications. For example, the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) and the Timed Up and Go (TUG) test have been extensively used as input features. Studies have shown that including a combination of PROMs and CROMs enhances model performance, as these measures capture both subjective and objective aspects of rehabilitation [
15].
Despite their advantages, traditional ML models have notable challenges. Feature engineering requires domain expertise, and model interpretability remains a concern for clinicians. Additionally, the reliance on structured datasets limits the inclusion of unstructured data, such as clinician notes or imaging results, which often contain valuable information.
Deep learning has transformed predictive modeling across various domains of healthcare, including diagnostics, treatment planning, and outcome prediction. Unlike traditional ML, deep learning automatically extracts features from raw data, making it highly suitable for complex, high-dimensional datasets.
  Related Work
While deep learning applications in rehabilitation are still emerging, related studies in healthcare have shown its promise. Esteva et al. demonstrated the effectiveness of Convolutional Neural Networks (CNNs) for medical image classification, achieving dermatologist-level accuracy in skin cancer detection. Similarly, Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks have been used to analyze time-series data, such as patient vital signs, to predict sepsis and other conditions. These successes suggest significant potential for deep learning in rehabilitation outcome prediction [
16].
Hybrid deep learning architectures that combine CNNs and RNNs have gained attention for their ability to process multimodal data. For instance, CNNs can extract spatial features from imaging or structured datasets, while RNNs capture temporal patterns in sequential data like PROMs. Studies in related fields, such as cardiology and oncology, have demonstrated the effectiveness of such architectures in improving prediction accuracy [
17].
Despite its advantages, deep learning faces barriers to adoption in healthcare. These include the need for large datasets, computational resource requirements, and concerns over interpretability. Attention mechanisms and explainable AI (XAI) frameworks have been developed to address these challenges, providing insights into model predictions and improving clinician trust.
Several studies have compared traditional ML with deep learning to evaluate their respective strengths and weaknesses. For example, Huber et al. investigated the prediction of quality-of-life outcomes following hip and knee replacement surgeries using both ML and deep learning techniques. They found that while ML models provided baseline predictive capabilities, deep learning significantly improved accuracy, particularly when integrating PROMs and unstructured data [
18, 19, 20, 21, 22, 23, 24, 25, 26].
Metrics such as F1-score, mean absolute error (MAE), and receiver operating characteristic (ROC) curves are commonly used to compare models. Studies consistently report that deep learning models outperform ML models in terms of accuracy and robustness, particularly for large and diverse datasets [
27, 28].
One key finding across studies is the importance of model interpretability for clinical adoption. Deep learning models with attention mechanisms or visualization tools, such as saliency maps, are more likely to gain acceptance among clinicians. These tools allow users to identify which features contribute most to predictions, aligning model outputs with clinical reasoning.
While the body of work on predictive modeling in rehabilitation is growing, several gaps remain. Most studies focus on small, localized datasets, limiting the generalizability of findings. Additionally, few studies explore the integration of real-time data, such as sensor readings, into predictive models. This represents an important avenue for future research, particularly given the increasing use of wearable technology in rehabilitation settings.
Building on the limitations and insights from existing literature, this study makes several novel contributions. It extends the application of deep learning to rehabilitation outcome prediction by developing hybrid CNN-RNN architectures capable of processing multimodal data. Furthermore, it addresses the issue of interpretability by incorporating attention mechanisms, providing clinicians with actionable insights into the factors driving model predictions. By comparing the performance of deep learning models with traditional ML baselines, this study aims to establish new benchmarks for predictive accuracy and clinical applicability in rehabilitation medicine.
  3. Methodology
A systematic and structured approach was employed to evaluate the effectiveness of deep learning models in predicting rehabilitation outcomes. The methodology comprised several stages: data collection, preprocessing, feature selection and augmentation, model design, training, and evaluation. Each stage is described in detail below.
  3.1. Dataset Collection
The dataset consists of patient data from various rehabilitation treatment protocols. The primary goal is to predict rehabilitation outcomes, such as the range of motion (ROM) and Health Assessment Questionnaire (HAQ) scores for continuous outcomes, and rehabilitation success categories (e.g., “No Improvement,” “Moderate Improvement,” “Significant Improvement”) for classification tasks.
The dataset comprises 1047 rehabilitation patient records.
Key Features in the Dataset:
- Range of Motion (ROM): The degree of joint movement after a surgical procedure or trauma, typically measured in degrees.
- HAQ Disability Score: A score representing a patient’s level of functional disability.
- WOMAC Pain Score: A pain-related score used to assess joint pain during movement.
- Timed Up and Go (TUG) Test: A functional mobility measure, with longer times indicating worse mobility.
- Age: The age of the patient, which often influences rehabilitation outcomes.
- Treatment Group: The type of surgery or rehabilitation treatment, such as hip or knee arthroplasty.
The dataset is structured as a collection of records where each record represents a unique patient, and each patient has associated features, as described above. The labels for regression tasks (ROM and HAQ) are continuous, while the labels for classification tasks are categorical.
This study utilized a publicly available dataset, which was de-identified prior to analysis to ensure patient privacy. As such, no direct patient consent was required. However, for future clinical implementations, obtaining explicit informed consent will be essential. Additionally, this research adheres to HIPAA (Health Insurance Portability and Accountability Act) and GDPR (General Data Protection Regulation) standards, ensuring the secure handling of patient data. Future real-world deployment would require further compliance measures such as data encryption, restricted access controls, and audit trails. Potential biases in the dataset, particularly related to demographic imbalances in age, gender, and treatment access, may influence the model’s predictions. To address this, future studies will incorporate bias detection techniques and fairness assessments to improve equity in rehabilitation outcome predictions.
  3.2. Data Preprocessing
Effective preprocessing is essential for preparing the data in a suitable form for machine learning models. In this study, several preprocessing techniques were applied to address common issues, such as missing values and varying feature scales. For handling missing data, imputation was applied to numerical columns (e.g., ROM, HAQ) by replacing missing values with the mean value of the respective feature, ensuring the overall distribution of the data remained intact. For categorical columns where imputation could lead to unrealistic values (e.g., missing treatment group information), rows with missing values were dropped to maintain complete data for these features. Additionally, to prevent features with different units or scales from disproportionately affecting the model, min-max scaling was applied to normalize features such as ROM, HAQ, and WOMAC Pain Score to a range between 0 and 1.
This rescaling, $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$, ensures that no feature dominates the model because of its scale.
Z-score Normalization: For features like Age, which can have a wide range of values, we applied Z-score normalization:
$$z = \frac{x - \mu}{\sigma}$$
where $x$ is the raw feature value, $\mu$ is the mean, and $\sigma$ is the standard deviation of the feature. This technique transforms features to have zero mean and unit variance, making them suitable for training deep learning models.
For categorical variables such as “Treatment Group” and “Rehabilitation Success Category,” one-hot encoding was applied. This technique transforms categorical variables into binary vectors, where each category is represented by a vector containing a 1 in the position corresponding to that category and 0s in all other positions. For instance, if the treatment groups include Hip Arthroplasty (HIPA), Knee Arthroplasty (KNEEA), and Trauma (TRAUMA), the encoding would be as follows:
HIPA → [1, 0, 0]
KNEEA → [0, 1, 0]
TRAUMA → [0, 0, 1]
This step ensures that the model can work with categorical information.
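The preprocessing steps described above (mean imputation, min-max scaling, Z-score normalization, and one-hot encoding) can be illustrated with a minimal NumPy sketch. The function names and sample values below are hypothetical and are not taken from the study's code or data.

```python
import numpy as np

def impute_mean(x):
    """Replace missing values (NaN) with the mean of the observed values."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmean(x), x)

def min_max_scale(x):
    """Rescale a feature to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)

def z_score(x):
    """Standardize a feature to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    sigma = x.std()
    return (x - x.mean()) / sigma if sigma > 0 else np.zeros_like(x)

def one_hot(labels, categories):
    """Encode each label as a binary vector, one position per category."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(labels), len(categories)))
    out[np.arange(len(labels)), [index[l] for l in labels]] = 1.0
    return out

# Hypothetical example values (not from the study's dataset).
rom = impute_mean([90.0, np.nan, 110.0, 120.0])
rom_scaled = min_max_scale(rom)
age_z = z_score([45, 60, 72, 58])
groups = one_hot(["HIPA", "TRAUMA", "KNEEA"], ["HIPA", "KNEEA", "TRAUMA"])
```

In practice, the imputation mean and the scaling statistics would typically be estimated on the training split only and then reused for the validation and test splits, so that no information leaks from held-out data; the text does not state how this was handled.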
Given that the model involves sequential data, such as rehabilitation steps or repeated measures across time, the preprocessing pipeline was designed to handle time series data effectively. This included two key techniques: windowing and padding. Time-series data, such as progress measurements across multiple sessions, was divided into fixed-length windows, where each window contained a subset of the time steps (e.g., one week or one month of data). For time series with unequal lengths, padding was applied to ensure consistent input dimensions. This involved adding zeros to the beginning or end of the sequence to match the maximum length in the dataset. To evaluate model performance, the dataset was divided into three subsets: a training set (70%) for model training, a validation set (15%) for hyperparameter tuning and model selection during training, and a test set (15%) to evaluate the final model performance. This splitting process helps prevent overfitting and ensures the model can generalize well to unseen data.
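A minimal sketch of the windowing, zero-padding, and 70/15/15 splitting steps follows. The window length, step size, padding side, and random seed are illustrative assumptions, since the text does not specify them.

```python
import numpy as np

def make_windows(series, window, step):
    """Cut a 1D series into fixed-length windows taken every `step` time steps."""
    series = np.asarray(series, dtype=float)
    return [series[i:i + window] for i in range(0, len(series) - window + 1, step)]

def pad_sequences(seqs, pad_value=0.0, side="post"):
    """Pad variable-length sequences with `pad_value` up to the longest length."""
    max_len = max(len(s) for s in seqs)
    out = np.full((len(seqs), max_len), pad_value, dtype=float)
    for i, s in enumerate(seqs):
        if len(s) == 0:
            continue
        if side == "post":
            out[i, :len(s)] = s          # zeros added at the end
        else:
            out[i, max_len - len(s):] = s  # zeros added at the beginning
    return out

def split_indices(n, train=0.70, val=0.15, seed=0):
    """Shuffle record indices and split them into train/validation/test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(train * n))
    n_val = int(round(val * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

windows = make_windows([1, 2, 3, 4, 5], window=3, step=1)
padded = pad_sequences([[1, 2, 3], [4]])
train_idx, val_idx, test_idx = split_indices(1047)
```

With 1047 records, this split yields 733 training, 157 validation, and 157 test records. Splitting by patient record, as here, keeps all windows from one patient in the same subset.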
  3.3. Feature Selection and Augmentation
To enhance the model’s performance, feature selection was conducted based on the importance of features in predicting rehabilitation outcomes. Initially, a correlation matrix was used to identify highly correlated features, and those with correlation coefficients exceeding 0.9 were removed to prevent multicollinearity. Additionally, feature importance was assessed using SHapley Additive exPlanations (SHAP) values, which quantified the contribution of each feature to the model’s predictions. This process allowed us to focus on key features such as ROM, Age, and HAQ scores while discarding less influential ones.
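The correlation-based filter can be sketched as follows. This is an illustrative version only: it uses the absolute Pearson correlation and keeps the first feature of each highly correlated pair, both of which are assumptions, because the text does not say which member of a pair was removed. The feature names and data are synthetic.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Keep a feature only if its |correlation| with every already-kept
    feature does not exceed `threshold`."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if not any(corr[j, k] > threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Synthetic example: the second column is almost a rescaled copy of the first.
rng = np.random.default_rng(0)
rom = rng.normal(size=200)
rom_dup = 2.0 * rom + 0.01 * rng.normal(size=200)
age = rng.normal(size=200)
X = np.column_stack([rom, rom_dup, age])
X_kept, kept_names = drop_correlated(X, ["rom", "rom_dup", "age"])
```

The SHAP-based ranking mentioned above requires a fitted model and is therefore not shown here.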
In cases where certain treatment groups had limited data, data augmentation techniques were applied to artificially expand the dataset. Specifically, the Synthetic Minority Over-sampling Technique (SMOTE) was used for the classification task, particularly for imbalanced classes like “No Improvement.” SMOTE generated synthetic samples for underrepresented classes, helping to address the class imbalance and providing the model with sufficient examples to learn from [27]. Table 1 presents the sample sizes of the majority and minority classes before applying SMOTE.
Table 1 highlights the class imbalance: the “No Improvement” (NI) category is underrepresented, which justifies the use of SMOTE to balance the dataset and enhance model training.
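The core idea of SMOTE, interpolating between a minority-class sample and one of its nearest minority-class neighbours, can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in the study; the function name, the default of five neighbours, and the sample data are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate `n_new` synthetic minority samples by linear interpolation
    between a random minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # requires at least two minority samples
    # Pairwise Euclidean distances; a sample is never its own neighbour.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority-class feature vectors (already scaled to [0, 1]).
X_minority = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.25]])
X_synthetic = smote_sketch(X_minority, n_new=6, k=2)
```

Oversampling of this kind is normally applied to the training split only, so that the validation and test sets contain real records exclusively. A maintained implementation is available as the `SMOTE` class in the imbalanced-learn library.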
  3.4. Model Architecture Overview
The proposed model is a hybrid architecture that combines Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) layers. This design was chosen to leverage both the spatial and the temporal information in the dataset. The CNN layers process spatial features, such as local patterns across individual measurements or attributes, while the RNN layers capture temporal dependencies in patient data recorded over time. The following subsections detail the model design; the training process and the evaluation methods used to assess performance are described afterwards.
  3.4.1. Convolutional Neural Network (CNN)
The CNN part of the model is used to extract spatial features from structured input data, particularly in the case of multivariate time series or sequence data. CNNs are highly effective in capturing local patterns in data through convolutional layers.
The architecture of the CNN part consists of:
- Input Layer: The input layer consists of preprocessed data (e.g., ROM, HAQ scores, Age, etc.) in a sequence format for each patient.
- Convolutional Layers: These layers apply filters to the input data to extract local patterns. The number of filters and kernel size are hyperparameters that are optimized during training.
- Max-Pooling: This operation reduces the spatial dimensions of the feature maps, preserving the most important features while reducing computational complexity.
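To make the convolution and max-pooling operations concrete, the following NumPy sketch applies a 1D convolution with ReLU activation followed by max-pooling to a single sequence. The "valid" padding, stride of 1, ReLU activation, and all shapes are assumptions for illustration; the text does not report these settings.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """1D convolution ('valid' padding, stride 1) followed by ReLU.
    x: (T, C_in); kernels: (K, C_in, C_out); bias: (C_out,)."""
    T = x.shape[0]
    K, _, C_out = kernels.shape
    out = np.empty((T - K + 1, C_out))
    for t in range(T - K + 1):
        # Each filter sums over a window of K time steps and all input channels.
        out[t] = np.tensordot(x[t:t + K], kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

def max_pool1d(x, pool=2):
    """Keep the maximum of each non-overlapping block of `pool` time steps."""
    T = (x.shape[0] // pool) * pool
    return x[:T].reshape(-1, pool, x.shape[1]).max(axis=1)

# Tiny deterministic example: one channel, one filter that sums two neighbours.
x = np.arange(5, dtype=float).reshape(5, 1)
features = conv1d(x, np.ones((2, 1, 1)), np.zeros(1))  # rows: 1, 3, 5, 7
pooled = max_pool1d(features, pool=2)                  # rows: 3, 7
```

With T time steps and kernel size K, the convolution produces T - K + 1 steps, and pooling with size 2 halves that length, which is how the pooling layer reduces the computational load on later layers.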
  3.4.2. Recurrent Neural Network (RNN)
The RNN component captures temporal dependencies in the data, which is especially important for sequential features, such as changes in ROM or HAQ scores over time. We use Long Short-Term Memory (LSTM) units, a type of RNN that helps to mitigate the vanishing gradient problem and can remember long-term dependencies.
The RNN part consists of:
- LSTM Layers: These layers process sequential data, with each LSTM cell maintaining a memory of past inputs, allowing the model to capture temporal patterns in the rehabilitation progress.
- Fully Connected Layers: After passing through the RNN layers, the data is fed into one or more dense layers to make predictions.
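The way an LSTM cell maintains a memory of past inputs can be illustrated with a minimal NumPy forward pass based on the standard LSTM gate equations. The weight layout (input, forget, candidate, and output gates stacked in that order), the dimensions, and the random weights are illustrative assumptions, not the trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_seq, W, U, b):
    """Run one LSTM layer over a sequence and return the final hidden state.
    x_seq: (T, D); W: (D, 4H); U: (H, 4H); b: (4H,)."""
    H = U.shape[0]
    h = np.zeros(H)  # hidden state (short-term output)
    c = np.zeros(H)  # cell state (long-term memory)
    for x_t in x_seq:
        z = x_t @ W + h @ U + b
        i = sigmoid(z[:H])            # input gate
        f = sigmoid(z[H:2 * H])       # forget gate
        g = np.tanh(z[2 * H:3 * H])   # candidate memory
        o = sigmoid(z[3 * H:])        # output gate
        c = f * c + i * g             # blend old memory with new information
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4  # hypothetical: 6 sessions, 3 features, 4 LSTM units
h_final = lstm_forward(rng.normal(size=(T, D)),
                       rng.normal(size=(D, 4 * H)) * 0.1,
                       rng.normal(size=(H, 4 * H)) * 0.1,
                       np.zeros(4 * H))
```

The additive update of the cell state `c` is what lets gradients flow across many time steps, which is the mechanism behind the mitigation of the vanishing gradient problem mentioned above.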
  3.4.3. Hybrid CNN-RNN Architecture
The hybrid CNN-RNN architecture combines the CNN’s ability to extract spatial features with the RNN’s capacity to model temporal sequences. The architecture is structured as follows:
- CNN Layers: To extract spatial features from the structured input data.
- Flattening Layer: The output from the CNN is flattened into a 1D vector, which is then passed to the RNN layers.
- LSTM Layers: These layers process the sequence of data, capturing the temporal dynamics of rehabilitation.
- Dense Layer: A fully connected layer that outputs the final prediction (e.g., rehabilitation outcome).
- Output Layer: The output layer uses a softmax or sigmoid activation function, depending on the task: softmax for multi-class classification, and sigmoid where the output is bounded between 0 and 1 (for example, regression on min-max-scaled targets).
  3.5. Model Training
The model training process involves feeding the preprocessed data into the hybrid CNN-RNN model and optimizing its parameters through backpropagation. The key steps in model training are outlined below:
The choice of loss function depends on the task (regression or classification):
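- For Regression Tasks (e.g., predicting ROM or HAQ scores), the Mean Squared Error (MSE) loss function is used:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for the $i$-th sample, and $n$ is the number of samples.
- For Classification Tasks (e.g., predicting rehabilitation success), the Categorical Cross-Entropy loss function is used:
$$L_{\mathrm{CCE}} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$
where $y_i$ is the actual label and $\hat{y}_i$ is the predicted probability for class $i$, and $C$ is the number of classes.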
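Both loss functions can be written in a few lines of NumPy. In this sketch the cross-entropy is averaged over the samples in a batch, and predicted probabilities are clipped to avoid taking the logarithm of zero; both choices are common conventions assumed here rather than details stated in the text.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy between one-hot labels and predicted class
    probabilities, averaged over samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Hypothetical predictions for three regression targets and two patients.
reg_loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
clf_loss = categorical_cross_entropy(
    [[1, 0, 0], [0, 1, 0]],
    [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]],
)
```

Because the labels are one-hot, only the predicted probability of the true class contributes to each sample's loss.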
 
To minimize the loss function, the Adam optimizer is used, which adapts the step size of each parameter using running estimates of the first and second moments of the gradient. The Adam optimizer is an extension of stochastic gradient descent and has been shown to perform well in deep learning tasks:
$$\theta_{t+1} = \theta_t - \eta\,\frac{m_t}{\sqrt{v_t} + \epsilon}$$
where:
- $\theta_t$ is the parameter at time step $t$,
- $m_t$ is the first-moment estimate (mean of the gradients),
- $v_t$ is the second-moment estimate (uncentered variance of the gradients),
- $\epsilon$ is a small constant to prevent division by zero,
- $\eta$ is the learning rate.
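A single Adam update can be sketched as below. The sketch follows the standard formulation of Adam, which additionally applies bias correction to the two moment estimates; the decay rates (0.9 and 0.999) are the commonly used defaults and, like the toy objective, are assumptions rather than settings reported in the text.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `t` is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, m, v = np.array([0.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.1)
```

Dividing by the square root of the second-moment estimate makes the first step roughly equal to the learning rate regardless of the gradient's magnitude, which is why Adam is comparatively insensitive to feature and gradient scale.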
Hyperparameters such as the learning rate, number of layers, number of filters in the CNN, number of LSTM units, and batch size are tuned using the Grid Search or Random Search approach. The best combination of hyperparameters is selected based on the model’s performance on the validation set.
To prevent overfitting, early stopping is employed. The training process is halted if the validation loss does not improve after a predefined number of epochs (e.g., 10 epochs). This ensures that the model generalizes well to unseen data.
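The early-stopping rule can be expressed independently of any framework. The sketch below takes a list of per-epoch validation losses and reports the epoch at which training would stop and the epoch with the best loss; the function name and the `min_delta` parameter are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based. Training stops once the
    validation loss has failed to improve for `patience` consecutive epochs."""
    best = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve: improvement stalls after the third epoch.
stop_epoch, best_epoch = early_stopping_epoch(
    [1.0, 0.8, 0.7, 0.71, 0.72, 0.73], patience=2
)
```

In Keras, the same behaviour is provided by the `EarlyStopping` callback, whose `patience` argument plays the same role and whose `restore_best_weights` option reverts the model to the best epoch.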
To evaluate the performance of the trained model, the following metrics are used based on the type of task (regression or classification):
For regression tasks, the model is evaluated using the following metrics:
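- Mean Absolute Error (MAE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value for the $i$-th sample.
- R-squared ($R^2$):
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $\bar{y}$ is the mean of the actual values. The $R^2$ score provides an indication of how well the model explains the variance in the data.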
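As a worked example of the two regression metrics, the following NumPy sketch computes MAE and R-squared for a small set of hypothetical values.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical values: one of four predictions is off by 2.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
example_mae = mae(y_true, y_pred)       # 2 / 4 = 0.5
example_r2 = r2_score(y_true, y_pred)   # 1 - 4 / 5 = 0.2
```

The example shows how differently the two metrics react to a single large error: the MAE stays moderate, while R-squared drops sharply because the squared residual is compared with the total variance of the targets.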
Table 2 and Table 3 summarize the hyperparameter configurations explored and their impact on model performance, demonstrating the rationale behind our final model selection:
Balanced Performance: The selected hyperparameter configuration achieved the highest accuracy (84.2%), F1-score (0.86), and lowest MAE (4.25) while maintaining reasonable training efficiency.
Computational Efficiency: While Configuration 3 showed slightly better MAE, it required more training time without a substantial improvement in classification accuracy. The final model provides an optimal trade-off between performance and efficiency.
Generalization Ability: The selected configuration demonstrated better validation performance, indicating reduced overfitting compared to models with excessive LSTM units or larger CNN filters.
This systematic tuning process ensured that the final selected model delivers optimal predictive performance while remaining computationally feasible for real-world applications.
  3.6. Model Evaluation
For classification tasks, the model’s performance is evaluated using several key metrics, which help assess its ability to correctly classify instances into their respective categories. The most commonly used classification metrics are:
- Accuracy: This measures the overall percentage of correct predictions made by the model. It is the most straightforward metric, indicating how often the model’s predictions match the true labels.
- Precision: This measures the model’s ability to correctly identify positive instances, specifically the proportion of true positive predictions out of all positive predictions made by the model. High precision indicates that when the model predicts a positive outcome, it is more likely to be correct.
- Recall: Also known as sensitivity or true positive rate, recall measures how well the model identifies all relevant positive instances. It is the proportion of true positive predictions out of all actual positive instances. High recall means the model is good at identifying most of the positive cases.
- F1-Score: The F1-score is the harmonic mean of precision and recall, providing a balance between the two. It is particularly useful when the data is imbalanced, meaning one class (positive or negative) is much more frequent than the other. A high F1-score indicates that both precision and recall are balanced and perform well.
These metrics are often used together to provide a comprehensive picture of how well the model is performing, especially when dealing with imbalanced datasets or multi-class classification problems.
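For a multi-class problem such as the three success categories, precision, recall, and F1 are computed per class (treating that class as "positive") and then aggregated. The sketch below computes the per-class values, the overall accuracy, and a support-weighted F1-score; the function name, the class abbreviations, and the labels are hypothetical examples.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, labels):
    """Return (accuracy, weighted F1, per-class report)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    report = {}
    for c in labels:
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn}
    accuracy = float(np.mean(y_true == y_pred))
    total = sum(r["support"] for r in report.values())
    # Weight each class's F1 by its number of true instances.
    weighted_f1 = sum(r["f1"] * r["support"] for r in report.values()) / total
    return accuracy, weighted_f1, report

# Hypothetical labels for six patients.
y_true = ["NI", "MI", "SI", "SI", "MI", "SI"]
y_pred = ["NI", "SI", "SI", "SI", "MI", "MI"]
accuracy, weighted_f1, report = per_class_metrics(y_true, y_pred, ["NI", "MI", "SI"])
```

Weighting by support means that frequent classes dominate the aggregate score, so per-class values should be inspected alongside it when a minority class is of particular clinical interest.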
  3.7. Software and Hardware Environment
The experiments in this study were conducted using a robust computational environment designed to support the deep learning models utilized. The software environment was primarily built on Python 3.13, leveraging the TensorFlow and Keras libraries for the implementation and training of the hybrid CNN-RNN model. Other supporting libraries, such as NumPy, Pandas, and Matplotlib 3.7.5, were used for data manipulation, analysis, and visualization, respectively. The models were trained on a machine equipped with an Intel Core i7 processor, 32GB of RAM, and an NVIDIA GTX 1080 GPU, which provided the necessary computational power for processing large volumes of data and performing time-intensive tasks like training deep learning models. The software packages were installed and run on an Ubuntu 20.04 operating system, ensuring a stable and efficient environment for model development and experimentation. This configuration allowed for the seamless execution of complex computations while ensuring quick model training and evaluation.
  4. Results
This section presents the results of the proposed hybrid CNN-RNN model for rehabilitation outcome prediction. The discussion elaborates on the model’s performance in both regression and classification tasks, comparisons with traditional machine learning models, and the implications of its findings. Furthermore, the section includes an analysis of feature importance using SHAP, highlights the model’s implications in clinical contexts, and concludes with limitations and future perspectives.
  4.1. Performance Evaluation of the Models
To evaluate the performance of the hybrid CNN-RNN model, we compared it against several traditional machine learning models: Random Forest, Extra Trees, and Linear Regression for regression tasks, and Random Forest, Extra Trees, and Support Vector Classifier (SVC) for classification tasks. Table 4 summarizes the quantitative performance metrics:
- MAE and MSE: The hybrid CNN-RNN model consistently outperformed the traditional machine learning models in terms of both Mean Absolute Error (MAE) and Mean Squared Error (MSE) for ROM and HAQ predictions. The lower MAE and MSE values indicate smaller prediction errors on both continuous outcomes.
- R² Score: The R² scores of 0.87 (for ROM) and 0.91 (for HAQ) show that the deep learning model explains a large share of the variance in rehabilitation outcomes, notably more than Random Forest (0.78 for ROM) and Linear Regression (0.69 for ROM).
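For reference, the three regression metrics discussed above can be computed as in the following sketch. It uses the standard definitions (R² as one minus the ratio of residual to total sum of squares); the function name and the toy ROM values are hypothetical and do not reproduce the study's results.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variance
    ss_res = sum(e * e for e in errors)                 # unexplained variance
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

# Toy ROM values in degrees (illustrative only, not study data)
rom_true = [90.0, 100.0, 110.0, 120.0]
rom_pred = [92.0, 98.0, 113.0, 119.0]
mae, mse, r2 = regression_metrics(rom_true, rom_pred)
print(mae, mse, r2)
```

Because MSE squares each error, it penalizes a few large misses more heavily than MAE does, which is why the two are usually reported together.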
The classification task aimed to predict rehabilitation success categories: “No Improvement (WO),” “Moderate Improvement (MI),” and “Significant Improvement (SI).” The model’s performance was evaluated using accuracy, precision, recall, F1-score, and AUC-ROC (Table 5).
- Accuracy: The hybrid CNN-RNN model achieved the highest accuracy of 84.2%, outperforming other models such as Random Forest (79.1%) and Support Vector Classifier (75.6%).
- Precision and Recall: The model demonstrated strong precision (0.88) and recall (0.85) values, particularly for predicting “Significant Improvement” (SI), indicating its effectiveness in identifying patients who will have a successful rehabilitation outcome.
- F1-score and AUC-ROC: The F1-score of 0.86 for the “Significant Improvement” category and an AUC-ROC of 0.92 highlight the model’s strong performance across all categories.
We further analyzed the performance of the hybrid CNN-RNN model across different treatment groups (hip arthroplasty, knee arthroplasty, and trauma patients); the results are reported in Table 6.
  4.2. SHAP Analysis
SHAP values were utilized to assess the contribution of each feature to the model’s predictions. This feature importance analysis highlights the variables most influential in predicting rehabilitation outcomes (Table 6).
- The hybrid CNN-RNN model outperformed traditional models across all treatment groups, with a particularly strong performance for Hip Arthroplasty (HIPA) (88.4%) and Knee Arthroplasty (KNEEA) (85.7%). These results suggest that structured rehabilitation protocols and more consistent clinical measures in these treatment groups provide more favorable conditions for model prediction.
- The Trauma Knee and Trauma Hip groups, which exhibit greater variability in patient conditions and rehabilitation protocols, also show improvements in accuracy with the hybrid model, indicating that the deep learning approach is more adaptable to diverse patient populations.
The attention mechanism applied to the hybrid model reveals the features and time steps that have the most significant influence on predictions. Table 7 gives an example of the attention distribution for a “Significant Improvement (SI)” classification:
- Range of Motion (ROM) and Age were the most important features influencing the model’s prediction of “Significant Improvement” for hip arthroplasty patients, which is consistent with clinical understanding that younger patients with better baseline ROM tend to show better recovery outcomes.
- The HAQ Disability Score and WOMAC Pain Score played key roles for knee arthroplasty patients, highlighting that pain levels and functional disability are critical indicators of rehabilitation success.
Feature importance analysis was performed using SHAP values to assess the contribution of each feature to the model’s predictions. The results are as follows (Table 8):
- The SHAP analysis confirms that ROM, Age, and HAQ Disability Score are the top three features influencing the model’s predictions. This analysis supports the model’s ability to focus on clinically relevant factors and provides transparency in understanding how predictions are made.
- TUG Test Time had a smaller contribution in the final prediction, which suggests that while it is an important metric for functional mobility, it may be less influential compared to other features for certain treatment groups.
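To make the idea behind SHAP attributions concrete, the sketch below computes exact Shapley values for a single prediction by enumerating all feature coalitions. This is not the procedure used in the study, which relied on SHAP and its own approximations; the toy linear model, its coefficients, the feature ordering, and the baseline values are all hypothetical and chosen only so that the result is easy to verify by hand.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x relative to a baseline.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this is practical only for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical toy model over [ROM, age, HAQ]; coefficients are made up
def toy_model(z):
    rom, age, haq = z
    return 0.5 * rom - 0.2 * age - 3.0 * haq

x = [110.0, 55.0, 1.0]         # patient of interest (illustrative)
baseline = [100.0, 60.0, 1.5]  # reference patient (illustrative)
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the patient and the prediction for the baseline, which is the additivity property that makes Shapley-based explanations straightforward to read.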
To summarize the overall effectiveness of the hybrid CNN-RNN model, we compare its performance with traditional models across various metrics.
  4.3. CNN-RNN Model Implications
The hybrid CNN-RNN model demonstrated superior performance across both regression and classification tasks, making it a promising tool for rehabilitation outcome prediction. The high accuracy of 84.2% in the classification task and the impressive R² scores of 0.87 for ROM and 0.91 for HAQ indicate that the model effectively captures the complexity of rehabilitation recovery patterns (Table 9).
One key takeaway from the results presented in Table 6 is the model’s robustness across different patient treatment groups, with the highest accuracy observed for Hip Arthroplasty (88.4%) and Knee Arthroplasty (85.7%). These results emphasize the model’s adaptability to structured rehabilitation protocols, suggesting that it can be used as a valuable decision-support tool in clinical settings.
Values in the Accuracy column are percentages, computed by dividing the number of correct predictions by the total number of predictions and multiplying by 100.
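The accuracy calculation described above, applied separately to each treatment group, can be sketched as follows. The function name, the group codes, and the toy labels are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy in percent} = 100 * correct / total per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: 100.0 * correct[g] / total[g] for g in total}

# Toy example (illustrative only, not study data)
groups = ["HIPA", "HIPA", "HIPA", "KNEEA", "KNEEA"]
y_true = ["SI", "MI", "SI", "WO", "MI"]
y_pred = ["SI", "MI", "MI", "WO", "MI"]
acc = accuracy_by_group(y_true, y_pred, groups)
print(acc)
```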
  4.4. Clinical Integration of Predictive Insights
The proposed deep learning model has strong potential for real-world clinical applications in rehabilitation planning. By integrating predictive insights into clinical workflows, healthcare providers can personalize treatment strategies, optimize resource allocation, and enable early intervention for high-risk patients.
One practical approach is integrating the model into Electronic Health Records (EHRs), allowing real-time updates based on new patient data. This would enable automated risk stratification, guiding clinicians in adjusting rehabilitation plans dynamically. Additionally, a visual dashboard displaying patient-specific risk scores and key predictors could improve clinician decision-making by offering clear explanations of why certain patients are at higher risk of poor recovery.
Furthermore, the model could be leveraged to trigger early intervention alerts when a patient’s predicted rehabilitation trajectory deviates significantly from expected outcomes. By identifying at-risk individuals early, clinicians can modify treatment plans accordingly, ensuring that patients receive timely, targeted interventions to maximize recovery success.
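One simple way such an alert could work is sketched below: the predicted trajectory is compared with an expected trajectory at each follow-up point, and a point is flagged when the shortfall exceeds a tolerance. The function name, the tolerance of 10 degrees, and the ROM values are assumptions made for illustration; the study does not specify an alerting rule, and any clinical threshold would need to be set and validated by clinicians.

```python
def flag_early_intervention(predicted, expected, tolerance):
    """Return indices of follow-up points where the predicted value falls
    short of the expected value by more than `tolerance`."""
    return [
        i
        for i, (p, e) in enumerate(zip(predicted, expected))
        if e - p > tolerance
    ]

# Toy ROM trajectories in degrees (illustrative only, not study data)
expected_rom = [90.0, 100.0, 110.0, 120.0]
predicted_rom = [88.0, 97.0, 98.0, 105.0]
alerts = flag_early_intervention(predicted_rom, expected_rom, tolerance=10.0)
print(alerts)
```

In this toy case the shortfalls are 2, 3, 12, and 15 degrees, so only the last two follow-up points would raise an alert.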
Future work should focus on developing user-friendly AI tools that enhance rehabilitation management while maintaining interpretability and clinician trust. Integrating explainable AI (XAI) techniques, such as SHAP-based feature explanations and attention mechanisms, will be crucial for ensuring that model outputs are transparent and actionable for medical professionals.
  4.5. Limitations and Future Perspectives
Despite the promising results, several limitations need to be addressed. First, the model’s reliance on the availability of high-quality patient data, particularly for features like ROM and HAQ, could limit its applicability in settings where such data is sparse. Additionally, the model may struggle with patients who exhibit unusual or highly variable recovery patterns, as seen in the Trauma Knee and Trauma Hip groups.
Another limitation is the need for continuous retraining of the model to adapt to evolving rehabilitation protocols and patient populations. While the current study used historical data, future studies should explore real-time applications and continuous learning approaches to ensure the model remains accurate over time.
Future research could focus on the following directions:
- Data Augmentation: Improving the robustness of the model by augmenting the dataset with synthetic patient data to simulate more diverse recovery patterns.
- Model Expansion: Extending the model to include additional features, such as genetic predisposition, to better predict outcomes in a broader patient population.
- Integration in Clinical Practice: Implementing the model as part of a decision-support system in clinical settings to provide real-time predictions and guide rehabilitation strategies.
  5. Conclusions
The proposed hybrid CNN-RNN deep learning model significantly enhances the prediction of rehabilitation outcomes by combining Convolutional Neural Networks (CNNs) for feature extraction and Recurrent Neural Networks (RNNs) for modeling temporal dependencies in rehabilitation data. This dual approach enables the model to capture both spatial and temporal patterns, making it highly effective in clinical environments where tracking patient progress over time is crucial.
Our model outperforms traditional methods, achieving notable results in both regression (MAE = 4.25, R² = 0.87) and classification tasks (accuracy = 84.2%, AUC-ROC = 0.92). These results demonstrate the model’s ability to make more accurate predictions of rehabilitation outcomes, facilitating more personalized care and tailored rehabilitation plans for patients.
A key strength of the model is its robustness across diverse patient groups, including those undergoing hip and knee arthroplasty. Its generalizability allows for broader application in clinical settings, enhancing its potential for widespread use. Moreover, the model’s attention mechanism contributes to its interpretability, enabling clinicians to understand the importance of different features in making predictions. This is further supported by SHAP analysis, which highlighted significant features like Range of Motion (ROM) and HAQ scores as key predictors of rehabilitation success.
The model’s ability to predict rehabilitation outcomes with high accuracy and provide insights into feature importance makes it a valuable tool for personalized rehabilitation planning. Clinicians can use this model to optimize treatment strategies, improving patient recovery. Its capacity to handle both regression and classification tasks further increases its versatility in clinical practice, providing a comprehensive view of patient progress.
  Future Directions
To further enhance the model’s predictive capabilities and clinical applicability, several key areas should be explored:
- Incorporating medical imaging (e.g., MRI or X-rays) through convolutional layers could provide additional spatial information about musculoskeletal conditions. Additionally, integrating real-time sensor data from wearable devices (e.g., accelerometers and gait analysis sensors) could enable continuous monitoring of patient progress, allowing for dynamic model updates. 
- Future work should focus on optimizing the model for real-time clinical applications by reducing computational complexity and ensuring compatibility with electronic health record (EHR) systems. Strategies such as model quantization and federated learning could help maintain performance while reducing hardware requirements. 
- While SMOTE was used for class balancing, further exploration of advanced techniques such as generative adversarial networks (GANs) for synthetic data augmentation could improve the model’s robustness for underrepresented patient groups. Additionally, external validation using larger, diverse datasets across different healthcare institutions would strengthen its generalizability. 
- Developing user-friendly interpretability tools, such as interactive dashboards for visualizing SHAP values and attention-based feature importance, could improve clinician trust and facilitate the adoption of AI-driven rehabilitation models in practice. 
By addressing these aspects, future research can refine the model’s utility, ensuring it translates effectively from research to real-world clinical settings while maintaining accuracy, efficiency, and interpretability.