Article

Comparative Evaluation of Machine Learning Models Using Structured and Unstructured Clinical Data for Predicting Unplanned General Medicine Readmissions in a Tertiary Hospital in Australia

1 Department of Acute and General Medicine, Flinders Medical Centre, Adelaide, SA 5042, Australia
2 College of Medicine & Public Health, Flinders University, Adelaide, SA 5042, Australia
3 Discipline of Medicine, The University of Adelaide, Adelaide, SA 5005, Australia
4 Clinical Improvement Unit, Flinders Medical Centre, Adelaide, SA 5042, Australia
* Author to whom correspondence should be addressed.
Computers 2026, 15(3), 138; https://doi.org/10.3390/computers15030138
Submission received: 22 January 2026 / Revised: 20 February 2026 / Accepted: 21 February 2026 / Published: 26 February 2026
(This article belongs to the Special Issue Artificial Intelligence (AI) in Medical Informatics)

Abstract

Background: Unplanned 30-day hospital readmissions, a key healthcare quality metric, are common and costly. Prediction models built on structured data often perform modestly, and the added value of unstructured clinical notes remains unclear. Methods: This retrospective cohort study included 4135 general medicine admissions to a tertiary Australian hospital between July 2022 and June 2023. Structured predictors included demographics, comorbidities, frailty, prior healthcare utilisation, length-of-stay, inflammatory markers, socioeconomic indicators, and lifestyle factors. We developed deep learning models using structured data alone, unstructured text alone, and a combined multimodal architecture integrating both modalities. For benchmarking, multiple classical machine learning models trained on structured features were evaluated using identical data splits, including logistic regression, XGBoost, random forest, gradient boosting, extra trees, and HistGradient Boosting. Model performance was assessed on a hold-out test set using ROC-AUC, accuracy, precision, recall, and F1-score. Results: Unplanned readmissions occurred in 24.3% of admissions. Among classical machine learning models, logistic regression achieved the highest discrimination (ROC-AUC 0.64), with no substantial improvement observed from ensemble methods. Structured-only deep learning achieved ROC-AUC 0.62. Unstructured text-only and multimodal models achieved ROC-AUCs of 0.52 and 0.58, respectively. Although overall discrimination of the multimodal model was lower than structured-only performance, it demonstrated improved sensitivity and F1-score for identifying patients who were readmitted. Prior hospitalisations, emergency department visits, and comorbidity burden were the strongest predictors. Conclusions: Structured EMR variables remain the main drivers of 30-day readmission risk. 
More complex classical machine learning models did not outperform logistic regression, and incorporating unstructured clinical text provided only modest improvement in identifying high-risk patients without enhancing overall discrimination.

1. Introduction

Unplanned hospital readmissions remain a major challenge for healthcare systems, imposing substantial clinical and economic burdens on patients and health services [1,2]. Accurate identification of patients at risk of 30-day readmission is essential for implementing targeted interventions such as personalised discharge planning, enhanced follow-up, and community-based support [3]. From a clinical perspective, early risk prediction can improve continuity of care and reduce preventable readmissions [4], while from a health-system perspective, precision risk stratification can optimise resource allocation.
Despite extensive research, predictive model performance for 30-day readmissions has generally been modest. A systematic review of 60 studies (73 models) reported only moderate discrimination (mean AUC 0.70; range 0.21–0.88) [5]. Most traditional models rely solely on structured, human-curated variables [6] or machine-derived features extracted automatically from electronic medical records (EMRs) [7,8]. While structured EMR features capture key clinical and administrative indicators, they may overlook the contextual richness of unstructured clinical narratives that contain semantic information about functional status, clinical reasoning, and social determinants of health [9].

1.1. Related Work

Prior research on readmission prediction has explored structured EMR variables, unstructured clinical text, and multimodal integration. Below, we summarise the key findings across these approaches.

1.1.1. Structured Data and Traditional ML Models

Early hospital readmission prediction models were grounded in classical statistical and machine learning methods applied to structured clinical data. Logistic regression remains a common baseline due to its interpretability, though linear assumptions can limit performance in complex datasets [6]. Tree-based ensembles such as random forests and gradient boosting yield modest improvements but are constrained by feature engineering and limited ability to capture latent nonlinear interactions [10,11]. Studies consistently show that prior healthcare utilisation, comorbidity burden, and frailty are among the most predictive structured features [12,13].

1.1.2. Deep Learning on Structured EMR Variables

Deep learning architectures—including feedforward networks, recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and transformer-based models—capture nonlinear temporal patterns and latent semantic relationships in structured EMR data [14,15]. However, studies using structured EMR features alone often demonstrate only modest incremental gains over well-tuned classical models [16,17,18,19].

1.1.3. Unstructured Clinical Text and Early Natural Language Processing (NLP) Models

Unstructured clinical text complements structured data by encoding semantic and contextual information from narrative notes. Early NLP approaches using bag-of-words or term frequency–inverse document frequency (TF-IDF) with linear classifiers yielded limited improvements [20]. Deep learning models—including convolutional neural networks (CNNs), LSTMs, and domain-adapted variants of Bidirectional Encoder Representations from Transformers (BERT) such as Bio-ClinicalBERT—better capture syntactic and semantic nuances [20,21]. Despite these advances, performance gains remain context-dependent, and structured-data models often outperform text-only models [22,23].

1.1.4. Multimodal Integration

Multimodal frameworks integrate structured and unstructured data to leverage complementary strengths. Fusion strategies—such as feature concatenation, hypergraph integration, and transformer-based embeddings—have improved predictive performance in tasks including 30-day readmission and in-hospital mortality [22,24,25,26]. However, systematic reviews highlight substantial variability in modelling approaches, validation strategies, and reported performance across studies, with many studies lacking rigorous external evaluation [27,28]. Consequently, the generalisable incremental value of unstructured clinical text remains unclear.

1.2. Study Motivation and Contributions

While structured EMR variables and classical machine learning remain strong baselines, evidence regarding the added value of unstructured text remains inconsistent, particularly within general medicine populations. Most prior studies [29,30,31,32] focus on disease-specific cohorts or employ heterogeneous evaluation frameworks, limiting direct comparison across modelling strategies.
Accordingly, in this study, we developed, evaluated, and compared multiple deep learning approaches for predicting 30-day unplanned readmissions among patients admitted under a general medicine service at an Australian tertiary hospital. These included:
  • A structured-only model using curated EMR variables;
  • A text-only model based on fine-tuned Bio-ClinicalBERT embeddings of clinical notes;
  • Several multimodal architectures integrating structured data with text embeddings, including fully connected (feedforward), CNN, and LSTM models.
For benchmarking, we also evaluated multiple classical machine learning models trained on structured features using identical data splits, including logistic regression, XGBoost, Random Forest, Gradient Boosting, Extra Trees, and HistGradient Boosting.
The primary objective was to determine whether incorporating unstructured clinical text embeddings improves predictive performance compared with structured EMR data alone, while providing post hoc explainability of key predictors using SHapley Additive exPlanations (SHAP).
The dataset comprised adult general medicine admissions at an Australian tertiary hospital, including both structured EMR variables and unstructured free-text clinical documentation. Structured features encompassed demographics, comorbidity burden, frailty, laboratory results, prior healthcare utilisation, and socioeconomic indicators, while unstructured data included admission notes, progress notes, allied health documentation, and discharge summaries. The outcome of interest was unplanned all-cause readmission within 30 days of discharge.

2. Materials and Methods

2.1. Study Design and Data Source

This retrospective cohort study utilised de-identified EMR data from general medicine inpatients admitted to a tertiary hospital in Australia between 1 July 2022 and 30 June 2023. All adults ≥18 years admitted under the general medicine service were included; patients missing a Hospital Frailty Risk Score (HFRS) were excluded. As the study used only de-identified retrospective data, it was deemed exempt from formal ethics approval. Both structured and unstructured data were analysed.

2.2. Data Variables, Preprocessing and Feature Representation

Structured variables were selected based on established associations with readmission risk from South Australia’s Sunrise Clinical Manager electronic medical record system, which contains comprehensive clinical information (history, diagnoses, investigations, medications, observations, and My Health Record access). Variables included age, sex, Charlson Comorbidity Index, and frailty (HFRS) [33]. Additional predictors incorporated were Emergency Department (ED) presentations in the preceding six months, hospital admissions in the prior year, index admission length of stay (LOS), C-reactive protein (CRP) at admission, Index of Relative Socio-economic Disadvantage (IRSD), and lifestyle risk factors (smoking, alcohol abuse) [34,35,36,37].
Unstructured variables comprised clinical narratives from EMRs, including admission notes, ward round and progress notes, nursing and allied health documentation, and discharge summaries. These texts capture patient complexity and clinical decision-making not represented in structured data. Text preprocessing was applied exclusively to unstructured clinical narratives prior to language model input. Clinical notes were de-identified at source and concatenated at the admission level to form a single document per hospitalisation. Minimal preprocessing was performed to preserve clinical context, including removal of formatting artefacts, non-printing characters, and excess whitespace. No manual exclusion of templated text was undertaken to avoid loss of prognostic signal.
Structured data preprocessing refers to preparation of structured EMR variables for model development. To prevent data leakage and ensure robust evaluation, each of the three datasets was split into training (70%), validation (10%), and test (20%) sets, stratified by 30-day readmission status to preserve the original class distribution. The test set was held out and never used during model training. Missing values for CRP (6.5%) were imputed using a single imputation approach based on an iterative chained-equations algorithm (MICE). This method is well-suited to continuous laboratory values with low levels of missingness. Patients missing HFRSs (14.1%) were excluded, because the HFRS is a composite frailty index derived from diagnostic codes; when these codes are absent, the frailty status cannot be reliably reconstructed, and imputation would not yield valid estimates.
Standardisation: following the train–test split of the structured dataset and prior to model training, continuous structured features were standardised, using z-score normalisation (mean = 0, SD = 1) based on the training set, while categorical variables were one-hot encoded. Unstructured embeddings were left unscaled during training.
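The leakage-safe ordering described above (split first, then fit imputer and scaler on the training set only) can be sketched as follows. This is a toy illustration on random data, not the study pipeline: the 3-column matrix, missingness pattern, and labels are all simulated placeholders.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data: 100 admissions, 3 continuous features; column 0 plays the role
# of CRP with roughly 6.5% missingness, as in the study cohort.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(100) < 0.065, 0] = np.nan
y = rng.integers(0, 2, size=100)  # binary 30-day readmission label

# 70/10/20 stratified split; the test set is held out.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=2 / 3, stratify=y_tmp, random_state=42)

# Chained-equations (MICE-style) imputer and z-score scaler are both fitted
# on the training set only, then applied unchanged to validation/test data.
imputer = IterativeImputer(random_state=42).fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))
X_train_ready = scaler.transform(imputer.transform(X_train))
X_test_ready = scaler.transform(imputer.transform(X_test))
```

Fitting the imputer and scaler before the split would let test-set statistics leak into training, which is exactly what the ordering above avoids.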
Three datasets were created for model comparison:
  • Structured dataset: Demographic, clinical, utilisation, and laboratory variables.
  • Unstructured dataset: A 768-dimensional [CLS] representation per admission, obtained from end-to-end fine-tuned Bio-ClinicalBERT.
  • Combined dataset: Concatenation of structured variables and 768-dimensional embeddings to form a unified feature representation.
The overall workflow for predicting 30-day unplanned readmissions, including data preprocessing, feature representation, model development, training, and evaluation, is illustrated in Figure 1.

2.3. Text Representation Using Bio-ClinicalBERT

Transformer-based language models have revolutionised natural language processing by enabling contextual understanding of text. Unlike autoregressive models such as the Generative Pretrained Transformer (GPT), which process text left-to-right, BERT models attend to context bidirectionally, yielding richer semantic representations [38]. Bio-ClinicalBERT is a domain-adapted version of BERT further pre-trained on the biomedical literature and clinical notes (e.g., PubMed, MIMIC-III), enhancing its ability to interpret medical terminology and context [39].
In this study, Bio-ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) was fine-tuned end-to-end on de-identified clinical narratives to predict 30-day readmission. Clinical notes were tokenised using the model’s native tokenizer with padding and truncation to a fixed maximum length of 256 tokens. No hierarchical note modelling was performed; each admission was represented as a single concatenated text sequence. The final hidden representation of the [CLS] token from the last transformer layer was passed to a linear classification head for binary prediction. All transformer layers and the classification head were updated during training.
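The final step described above—passing the [CLS] representation to a linear classification head—can be sketched as follows. To keep the example self-contained, the 768-dimensional [CLS] vectors are simulated with random tensors; in the actual pipeline they come from the last layer of the fine-tuned Bio-ClinicalBERT encoder (emilyalsentzer/Bio_ClinicalBERT).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for encoder output: one 768-dim [CLS] vector per admission.
cls_embedding = torch.randn(32, 768)

# Task-specific linear classification head for binary readmission prediction;
# during fine-tuning, gradients flow through both this head and the encoder.
head = nn.Linear(768, 1)
probs = torch.sigmoid(head(cls_embedding))  # per-admission probabilities
```

With truncation at 256 tokens, long admissions lose tail content; the paper accepts this trade-off rather than modelling notes hierarchically.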

2.4. Model Development

Three deep learning models were developed to predict 30-day unplanned hospital readmissions:
  • Structured-only model: A feedforward neural network trained exclusively on structured EMR variables, including demographic, clinical, laboratory, and healthcare utilisation features. Structured features are numerical or categorical variables derived from the database, not free-text notes.
  • Unstructured text-only model: An end-to-end Bio-ClinicalBERT model fine-tuned on concatenated clinical narratives (admission notes, progress notes, allied health documentation, discharge summaries) to generate a 768-dimensional [CLS] embedding per admission, which is used for binary readmission prediction.
  • Combined multimodal model: Integrates the two complementary data types—structured EMR features and unstructured text embeddings—by concatenating the 768-dimensional [CLS] embedding with structured features as input to a feedforward neural network. Here, “multimodal” refers to the combination of numeric/categorical structured features and textual embeddings, not multiple sensory modalities.

Additional Multimodal Architectures

We also explored alternative deep learning architectures applied to the same combined structured plus text representation. Specifically, two additional models were implemented:
  • Combined multimodal (CNN) model: The concatenated structured variables and 768-dimensional [CLS] embedding were passed through one-dimensional convolutional layers with ReLU activation, followed by max-pooling and fully connected layers prior to sigmoid output.
  • Combined multimodal (LSTM) model: The concatenated representation was reshaped and processed through an LSTM network, with the final hidden state used for binary classification via a sigmoid output layer.
Model architecture and hyperparameters are summarised in Table 1. Deep learning models were implemented in Python (version 3.12.4) using PyTorch [40]. Structured data were modelled using a fully connected feedforward neural network, while unstructured clinical text was modelled using an end-to-end fine-tuned Bio-ClinicalBERT transformer.
For the structured-only model, the network architecture comprised:
  • Input layer: 11 structured features;
  • Hidden layers: Two fully connected layers with 256 and 64 neurons, respectively, each using ReLU activation;
  • Output layer: A single neuron with sigmoid activation for binary classification.
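The architecture listed above maps directly onto a small PyTorch module. A minimal sketch (input sizes and layer widths taken from the text; the random input batch is illustrative only):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Structured-only network: 11 inputs -> 256 -> 64 -> 1,
# ReLU hidden activations, sigmoid output for binary classification.
model = nn.Sequential(
    nn.Linear(11, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

x = torch.randn(4, 11)  # four admissions, 11 structured features each
p = model(x)            # predicted readmission probabilities in (0, 1)
```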
For the text-only model, Bio-ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) was fine-tuned with a task-specific classification head. Clinical text was tokenised with padding and truncation to 256 tokens, and the pooled [CLS] representation from the final transformer layer was passed directly to the output layer.
For the combined multimodal model, the 768-dimensional [CLS] embedding from Bio-ClinicalBERT was concatenated with structured variables to form a unified input, which was then processed through fully connected layers identical to the structured model.
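The fusion step is a simple feature concatenation, which can be sketched as below; both tensors are random placeholders standing in for the standardised structured features and the encoder output.

```python
import torch

torch.manual_seed(0)
structured = torch.randn(4, 11)  # standardised structured EMR features
cls_emb = torch.randn(4, 768)    # Bio-ClinicalBERT [CLS] embeddings

# Unified 779-dimensional input for the fully connected layers.
fused = torch.cat([structured, cls_emb], dim=1)
```

Because the text embedding contributes 768 of the 779 input dimensions, the structured signal can be diluted at this stage, one plausible reason the multimodal model underperforms the structured-only model on discrimination.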
Models were trained using binary cross-entropy loss with inverse class-frequency weighting, where class weights were calculated from the training set to upweight the minority class (30-day readmissions) and mitigate class imbalance. This approach was selected to improve sensitivity without altering the underlying data distribution, in preference to oversampling methods that may increase overfitting or more complex loss functions such as focal loss. The Adam optimiser [41] was used with a learning rate of 0.001. All models were trained for 10 epochs with a batch size of 32.
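The class-weighting scheme above can be sketched with PyTorch's built-in `pos_weight` argument, which upweights the positive (readmitted) class by the negative-to-positive ratio computed on the training labels. The four-label toy batch is illustrative only.

```python
import torch
import torch.nn as nn

# Toy training labels: 25% positive, mimicking class imbalance.
y_train = torch.tensor([0., 0., 0., 1.])

# Inverse class-frequency weight: n_negative / n_positive from training data.
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
logits = torch.zeros(4)        # untrained-model logits for illustration
loss = loss_fn(logits, y_train)

# Optimiser settings from the text: Adam with learning rate 0.001.
param = nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)
```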
Model checkpointing: Validation AUC was monitored at each epoch, and the model with the highest validation AUC was saved. This approach mitigates overfitting even though training continued for 10 epochs, effectively serving the role of early stopping. Training and validation loss across epochs were plotted to demonstrate convergence and model selection.
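The checkpointing logic reduces to keeping the best-scoring epoch while training runs to completion. A minimal sketch in plain Python, with simulated per-epoch validation AUCs standing in for real results:

```python
# Keep the model state from the epoch with the highest validation AUC,
# even though training continues for the full schedule.
best_auc, best_state = float("-inf"), None
val_aucs = [0.55, 0.60, 0.62, 0.61, 0.59]  # simulated validation AUCs

for epoch, auc in enumerate(val_aucs):
    if auc > best_auc:  # checkpoint only on improvement
        best_auc, best_state = auc, {"epoch": epoch}

# The retained checkpoint is from the peak epoch, mirroring early stopping
# without actually halting optimisation.
```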
Model evaluation: Final performance was evaluated on the held-out test set. The primary metric was ROC-AUC:
AUC = ∫₀¹ TPR(FPR⁻¹(t)) dt,
where TPR = TP/(TP + FN) and FPR = FP/(FP + TN).
Additional evaluation metrics included:
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall (Sensitivity) = TP/(TP + FN)
F1-score = 2 × (Precision × Recall)/(Precision + Recall)
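A worked example of these metrics from a toy confusion matrix (counts chosen for illustration):

```python
# Toy confusion-matrix counts: true/false positives and negatives.
TP, FP, FN, TN = 30, 20, 10, 40

accuracy = (TP + TN) / (TP + TN + FP + FN)           # 70/100 = 0.70
precision = TP / (TP + FP)                           # 30/50  = 0.60
recall = TP / (TP + FN)                              # 30/40  = 0.75
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean ≈ 0.667
```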
Because the default threshold of 0.5 yielded low precision and recall, the optimal classification threshold was determined using Youden’s J statistic on the validation set:
J = Sensitivity + Specificity − 1
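Because J equals TPR − FPR, the optimal threshold can be read off the ROC curve computed on the validation set. A minimal sketch on hypothetical validation scores (not study data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical validation labels and predicted probabilities.
y_val = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
p_val = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.45, 0.8, 0.9])

# Youden's J = sensitivity + specificity - 1 = TPR - FPR at each threshold.
fpr, tpr, thresholds = roc_curve(y_val, p_val)
best = int(np.argmax(tpr - fpr))
optimal_threshold = float(thresholds[best])
```

On this toy example the maximiser is the threshold 0.6 (TPR 0.8 at FPR 0); the chosen threshold is then applied unchanged to the held-out test set.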
Calibration was assessed using the Brier score:
Brier Score = (1/N) Σᵢ (yᵢ − pᵢ)²
where yi is the observed outcome and pi is the predicted probability for the i-th admission. Calibration plots were also generated to visually assess agreement between predicted and observed outcomes. ROC curves were used to visualise discrimination for all models.
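The Brier score is a one-line computation; a toy example with made-up outcomes and probabilities:

```python
import numpy as np

# Brier score: mean squared difference between predicted probability and
# observed outcome. Lower is better; a constant 0.5 prediction scores 0.25.
y = np.array([1, 0, 1, 0])
p = np.array([0.8, 0.2, 0.6, 0.4])
brier = float(np.mean((y - p) ** 2))
```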

2.5. Classical Machine Learning Baselines, Model Interpretation, and Statistical Analysis

Classical ML baselines: Logistic regression (L2-regularised, balanced class weights) and XGBoost (500 estimators, learning rate 0.05, max depth 3, subsample 0.8, column subsample 0.8, scale_pos_weight for class imbalance) were trained exclusively on structured EMR variables using the same train/validation/test splits. These models provide performance benchmarks against the deep learning approaches.
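The logistic regression baseline maps directly onto scikit-learn; a sketch on random placeholder data (the resulting AUC is meaningless beyond illustrating the pipeline, and the XGBoost configuration in the text would be set up analogously via its own estimator):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Random stand-ins for the 11 structured EMR features and binary outcome.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 11)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 11)), rng.integers(0, 2, 50)

# L2-regularised logistic regression with balanced class weights,
# as described for the baseline model.
clf = LogisticRegression(penalty="l2", class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
```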
Additional ML models for broader evaluation: Several additional classical machine learning models were also implemented on structured EMR variables, including Random Forest, Gradient Boosting, Extra Trees, and HistGradient Boosting. Hyperparameters were tuned using five-fold cross-validation on the training set. This broader evaluation enables a more comprehensive assessment of structured data predictive performance.
Feature Importance and explainability: SHAP values were generated for the structured EMR deep learning model to quantify each feature’s contribution to the predicted outcome, enabling transparent identification of the most important predictors. For this study:
  • The structured model was evaluated using the DeepExplainer [42] on a subset of 500 background samples from the training set.
  • SHAP summary (beeswarm) plots were generated to visualise feature contributions, and mean absolute SHAP values were used to rank the importance of each variable.
Statistical analysis: Continuous variables were presented as mean ± SD or median [IQR]; categorical variables as frequencies (%). Comparisons used independent t-tests or Wilcoxon rank-sum tests for continuous variables and chi-square tests for categorical variables. The outcome was 30-day unplanned hospital readmission (binary: 1 = readmitted, 0 = not readmitted), with p < 0.05 considered statistically significant. Analyses were performed using Python 3.12.4 (PyTorch, HuggingFace Transformers, scikit-learn). The source code for all models is provided in Supplementary File S1.

3. Results

A total of 5689 admissions occurred over the study period, of which 752 (13.2%) represented multiple readmissions; these were excluded to focus on each patient’s first readmission. The HFRS was unavailable for 802 admissions (14.1%), primarily due to incomplete diagnostic coding required for score derivation. These admissions were excluded from model development, resulting in a final analytic cohort enriched for patients with complete administrative data. C-reactive protein (CRP) values were missing for 350 patients (6.5%) and were imputed using the single imputation method derived from the MICE algorithm. The final analytic cohort included 4135 patients, of whom 1006 (24.3%) experienced a 30-day unplanned readmission.
The baseline characteristics of patients with and without 30-day readmission are presented in Table 2. Compared with non-readmitted patients, those readmitted had significantly higher comorbidity (Charlson Comorbidity Index) and frailty (HFRS), were from more socioeconomically disadvantaged areas, had a greater number of emergency department visits in the preceding six months, more hospital admissions in the prior year, and longer index admission LOS. Readmitted patients were also more likely to be smokers and have a history of alcohol abuse (all p < 0.05).

3.1. Model Training and Evaluation

3.1.1. Classical Machine Learning Models

Baseline models
To contextualise deep learning performance, two classical machine learning models were trained using the same structured EMR features: logistic regression and XGBoost.
  • Logistic regression achieved a test ROC-AUC of 0.64, with accuracy 0.58, precision 0.32, recall 0.66, and F1-score 0.43 (Table 3 and Figure 2).
  • XGBoost achieved a test ROC-AUC of 0.60, with accuracy 0.62, precision 0.32, recall 0.51, and F1-score 0.40 (Table 3 and Figure 3).
Additional classical ML models (Table 3 and Figure A1)
To provide broader benchmarking, four additional classical machine learning models were evaluated on the same structured EMR dataset:
  • Random Forest: ROC-AUC 0.61, accuracy 0.75, precision 0.50, recall 0.15, F1-score 0.23.
  • Gradient Boosting: ROC-AUC 0.61, accuracy 0.73, precision 0.40, recall 0.17, F1-score 0.24.
  • Extra Trees: ROC-AUC 0.61, accuracy 0.74, precision 0.41, recall 0.16, F1-score 0.23.
  • HistGradient Boosting: ROC-AUC 0.62, accuracy 0.73, precision 0.40, recall 0.16, F1-score 0.23.
Although these models achieved relatively high overall accuracy (0.73–0.75), this was driven primarily by the correct classification of non-readmitted patients. Sensitivity for readmitted patients remained low (recall 0.15–0.17), resulting in modest F1-scores.
Overall, no classical ML model substantially outperformed logistic regression in terms of discrimination, suggesting that structured EMR variables provided moderate but limited predictive signal for 30-day readmission.
These results indicate that structured EMR features drive the majority of predictive performance, while XGBoost was slightly more conservative than logistic regression, prioritising specificity over sensitivity.
  • Model definitions:
    • Structured EMR: Demographic, clinical, laboratory, and healthcare utilisation variables.
    • Classical machine learning baselines: Logistic regression and XGBoost trained using structured EMR variables only.
    • DL–Structured: Feedforward neural network trained exclusively on structured EMR variables.
    • DL–Text (Bio-ClinicalBERT): End-to-end fine-tuned Bio-ClinicalBERT model trained on concatenated free-text clinical notes.
    • DL–Multimodal: Deep learning model integrating structured EMR variables with Bio-ClinicalBERT text embeddings.
Performance metrics were calculated on the held-out test set using the optimal classification threshold determined on the validation set.

3.1.2. Deep Learning Models

Structured Data Model:
The feedforward neural network trained on structured EMR variables achieved a final test ROC-AUC of 0.62. Classification metrics indicated high specificity for non-readmitted patients but limited sensitivity for readmitted patients. Overall accuracy was 0.74, with a precision of 0.42, recall of 0.14, and F1-score of 0.22 for readmitted patients (Figure 4 and Table 3). The Brier score was 0.247, indicating moderate calibration of predicted probabilities. The calibration curve (Figure A2) for the structured model demonstrated reasonable alignment at lower predicted risk levels, with increasing overprediction at higher probabilities. Despite this, the monotonic increase in observed event rates indicates preserved risk stratification, supporting its use as a baseline predictive model.
These findings suggest that structured EMR variables capture the majority of predictive signal for 30-day unplanned readmissions, providing a solid baseline for comparison with models incorporating unstructured text or multimodal inputs.
Unstructured Text Model:
The model trained exclusively on fine-tuned Bio-ClinicalBERT embeddings from clinical narratives performed poorly, achieving an ROC-AUC of 0.52. Using an optimised threshold, the model achieved an accuracy of 0.46, precision of 0.26, recall of 0.86, and F1-score of 0.39 for readmitted patients, indicating that threshold optimisation markedly improved sensitivity at the cost of precision (Figure 5 and Table 3). The Brier score was 0.23, reflecting poor probability calibration. The calibration plot showed that predicted probabilities consistently lay below the ideal diagonal, reflecting systematic underestimation of readmission risk (Figure A3). Together with the near-chance discrimination, this suggests that unstructured clinical text alone provided limited predictive signal for 30-day hospital readmission.
Combined Structured and Text Model:
The multimodal model integrating structured variables with fine-tuned Bio-ClinicalBERT embeddings achieved an ROC-AUC of 0.58. Using an optimised threshold, the model achieved an accuracy of 0.63, precision of 0.33, recall of 0.49, and F1-score of 0.40 for readmitted patients, suggesting that the addition of unstructured text embeddings provided modest incremental predictive value beyond structured data alone (Figure 6 and Table 3). Calibration assessment yielded a Brier score of 0.22, indicating suboptimal probabilistic performance. The calibration plot (Figure A4) shows that predicted probabilities tend to underestimate readmission risk, particularly at higher predicted probability ranges.
Additional Multimodal Architectures (CNN and LSTM):
Alternative multimodal architectures were evaluated using the same combined structured + text representation.
The CNN-based multimodal model achieved an ROC-AUC of 0.54, with accuracy 0.47, precision 0.26, recall 0.66, and F1-score 0.38 (Table 3 and Figure A5).
The LSTM-based multimodal model achieved an ROC-AUC of 0.54, with accuracy 0.49, precision 0.27, recall 0.63, and F1-score 0.38 (Table 3 and Figure A6).
Neither architecture improved discrimination relative to the feedforward multimodal model, indicating that additional architectural complexity did not enhance predictive performance in this dataset.
Collectively, the findings suggest that while structured clinical variables capture the majority of predictive signal for 30-day unplanned readmissions, incorporating unstructured clinical text embeddings can modestly enhance identification of high-risk patients.

3.2. Feature Importance Analysis (SHAP)

To provide post hoc explainability for the structured EMR model, SHAP values were computed on a 500-subject subset of the data. As shown in Table 4 and illustrated in the summary plots (Figure 7 and Figure 8), prior healthcare utilisation—specifically the total number of hospital admissions in the preceding year and the number of emergency department visits in the previous six months—emerged as the most influential predictors of 30-day unplanned readmission. Comorbidity burden, measured using the Charlson Index, was also a dominant contributor. Additional factors, including LOS, age, and frailty (HFRS), contributed meaningfully to predictions, whereas lifestyle variables such as smoking and alcohol abuse had smaller effects. Overall, the SHAP analysis reinforces the primary finding that structured clinical variables capture the majority of predictive signal for 30-day readmission in this cohort.

4. Discussion

In this study, we developed and evaluated deep learning models using structured EMR data, unstructured clinical text, and a combination of both to predict 30-day unplanned readmissions among general medicine inpatients at a tertiary hospital in Australia. Our findings indicate that models trained solely on structured data achieved moderate predictive performance (ROC-AUC 0.629; accuracy 0.745; F1-score 0.22 for readmitted patients). The unstructured clinical text-only model, using fine-tuned Bio-ClinicalBERT embeddings, performed poorly (ROC-AUC 0.524), and using an optimised threshold (Youden's J = 0.0547; threshold 0.3403), it improved sensitivity for readmitted patients (recall 0.83, F1-score 0.39), although overall discrimination remained limited. The combined model, integrating structured variables with Bio-ClinicalBERT embeddings, achieved the highest ROC-AUC among the multimodal architectures (0.589), but this was lower than the structured-only model (0.629), highlighting a trade-off between improved sensitivity/F1 and overall discrimination; using an optimised threshold (Youden's J = 0.1746; threshold 0.4855), it demonstrated balanced sensitivity and specificity (recall 0.49, F1-score 0.40), indicating that the addition of unstructured text provided modest incremental value in identifying high-risk patients rather than in overall discrimination.
We also implemented various classical machine learning baselines using the same structured features. Logistic regression achieved a test ROC-AUC of 0.64 with F1-score of 0.43, while XGBoost achieved a test ROC-AUC of 0.60 with an F1-score of 0.40. Logistic regression captured a higher proportion of readmitted patients (recall 0.66) compared with XGBoost (recall 0.51), which made more conservative predictions. These results demonstrate that structured EMR features are the primary drivers of predictive performance, while deep learning models provide incremental improvements in sensitivity and F1-score. This direct comparison supports our conclusion that the modest gains from deep learning are due to the architecture’s ability to extract subtle patterns, but the majority of discrimination is determined by feature information rather than model complexity.
These findings align with prior research. For example, in a cohort of 1629 heart failure patients, models using unstructured data alone achieved an AUC of 0.522 compared with 0.649 for structured data and 0.645 for combined models, indicating minimal improvement from text embeddings [22]. Similarly, an Irish study [23] involving over 50,000 hospital admissions across a range of diagnoses compared multiple deep learning approaches using unstructured data to predict 30-day readmissions—including CNN, LSTM, and transformers—and found that their performance (ROC-AUC 0.66 for CNN, 0.68 for LSTM, and 0.67 for transformers) was inferior to that of models based on structured variables, such as logistic regression (ROC-AUC 0.70) and CatBoost (ROC-AUC 0.71). Furthermore, a systematic review of 126 prognostic models found that while combining structured and unstructured data sometimes enhanced prediction, models using only unstructured clinical text rarely outperformed structured-data models, particularly in inpatient settings [20].
Several factors may explain the limited utility of unstructured text in our study. Clinical notes are highly heterogeneous and often include templated or repetitive sections, as well as administrative and non-clinical content, which can obscure meaningful clinical information relevant to readmission risk [33]. Moreover, many key predictors of readmission—such as prior healthcare utilisation, comorbidities, frailty, and socioeconomic status—are already comprehensively represented within structured EMR fields, thereby diminishing the incremental contribution of textual features. This redundancy, combined with the absence of manual feature engineering typically used in other machine learning approaches, may have limited the predictive strength of the text-based models [31]. Furthermore, deep learning models such as LSTM networks are prone to overfitting on smaller datasets, potentially reducing their generalisability and resulting in poorer performance compared to gradient-boosting methods such as XGBoost or LightGBM [34]. Class imbalance may have also contributed to suboptimal performance; despite the use of inverse class weighting, deep learning models often struggle to extract distinctive representations for the minority (readmission) class when trained solely on textual inputs [35].
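The inverse class weighting mentioned above is commonly computed as n_samples / (n_classes × n_c), so that the rarer class contributes proportionally more to the loss. A minimal sketch using the cohort counts from Table 2:

```python
from collections import Counter

def inverse_class_weights(labels):
    """Weight each class c by n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

# cohort sizes from Table 2: 3129 non-readmitted, 1006 readmitted
labels = [0] * 3129 + [1] * 1006
w = inverse_class_weights(labels)
```

Under this scheme the readmission class receives a weight roughly 3.1 times that of the majority class, which is then passed to a class-weighted binary cross-entropy loss.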
Nonetheless, our findings do not discount the potential value of unstructured data in all contexts. In disease-specific cohorts or when structured data are incomplete, embeddings from clinical notes may provide meaningful predictive information. However, for general medicine inpatients with comprehensive structured EMR capture, our results suggest that the majority of predictive signal resides in structured features, with deep learning providing modest incremental improvements.

Strengths and Limitations

The key strengths of this study include the integration of structured and unstructured EMR data using modern deep learning methods, enabling direct comparison of their relative predictive performance in a large, real-world cohort. Fine-tuned Bio-ClinicalBERT embeddings captured semantic information from clinical documentation, while structured variables such as comorbidity, frailty, and prior healthcare utilisation provided a robust baseline. Classical machine learning baselines allowed quantification of incremental value provided by deep learning architectures. Performance was evaluated using standardised metrics (ROC-AUC, precision, recall, F1-score), ensuring comparability with previous studies.
However, several limitations should be noted. The dataset, although sizeable, may have been insufficient for optimal fine-tuning of transformer-based models, which typically require much larger text corpora. Class imbalance likely contributed to reduced sensitivity for predicting readmissions; despite inverse class weighting, deep learning models often struggle to identify subtle patterns associated with the minority (readmission) class when trained solely on textual inputs. The unstructured notes were not manually curated, and the inclusion of templated or administrative content may have diluted the signal-to-noise ratio.
Exclusion of 14.1% of admissions due to missing HFRS may have introduced selection bias and limited generalisability. Similarly, single imputation for missing CRP values (6.5%) could have led to information loss, potentially affecting model performance. Patients with incomplete diagnostic coding may differ systematically in illness severity, healthcare utilisation, or documentation practices; as a result, model performance may be overestimated when applied to settings with less complete administrative data. Future work should explore alternative frailty measures, multiple imputation strategies, or model architectures capable of handling missing frailty data to improve robustness and external validity.
Additionally, only the first eligible admission per patient was included to maintain independence of observations, which may bias the model toward first-time readmission risk and limit generalisability to multiple or recurrent readmissions. Although deep learning provided modest incremental value, the majority of discrimination was driven by structured EMR features, as confirmed by the logistic regression and XGBoost baselines, emphasising the importance of feature quality alongside model architecture.
Finally, the analysis was conducted at a single centre, potentially limiting generalisability to other healthcare systems or patient populations, and model interpretability remains a challenge for deep learning methods, particularly for unstructured text models.
The multimodal architecture employed a late-fusion concatenation strategy, combining Bio-ClinicalBERT embeddings with structured EMR variables through a fully connected network. While more complex fusion approaches exist (e.g., attention-based or hierarchical fusion), our design served as a transparent and reproducible baseline, isolating the incremental value of unstructured text. Broader benchmarking against additional emerging large-scale architectures or hybrid approaches is an important direction for future research.
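A minimal sketch of such a late-fusion baseline in PyTorch, with layer sizes matching Table 1; this is an illustrative reconstruction under those stated dimensions, not the study's exact implementation:

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Concatenate structured EMR features with a 768-d [CLS] text embedding,
    then classify with a small fully connected head (256 -> 64 -> 1)."""

    def __init__(self, n_structured: int = 11, text_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(n_structured + text_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),  # single logit for binary readmission outcome
        )

    def forward(self, structured, text_emb):
        fused = torch.cat([structured, text_emb], dim=1)  # late fusion by concatenation
        return torch.sigmoid(self.head(fused))

model = LateFusionNet()
# batch of 8: random stand-ins for scaled features and BERT [CLS] embeddings
probs = model(torch.randn(8, 11), torch.randn(8, 768))
```

Because fusion happens only at the input to the classification head, the contribution of the text modality can be isolated by simply dropping the second input branch, which is what makes concatenation a transparent baseline relative to attention-based fusion.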

5. Conclusions

In summary, structured EMR variables remain the most informative predictors of 30-day readmission in this population. Deep learning approaches incorporating unstructured clinical text provide modest incremental value, particularly in improving identification of high-risk patients (higher sensitivity and F1 score) but do not substantially enhance overall discrimination (AUC-ROC) when structured data are comprehensive. Future research should focus on hybrid or attention-based models that selectively prioritise clinically relevant textual features, as well as on external validation across diverse hospital settings to assess generalisability and clinical utility.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/computers15030138/s1, Supplementary File S1: Source codes for different models.

Author Contributions

Conceptualization, Y.S. and R.W.; methodology, Y.S.; software, Y.S.; validation, Y.S., C.H. and R.W.; formal analysis, Y.S.; investigation, Y.S.; resources, Y.S.; data curation, C.H.; writing—original draft preparation, Y.S.; writing—review and editing, A.A.M. and C.T.; visualisation; supervision, R.W.; project administration, Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Flinders Foundation, grant number 64945659.

Institutional Review Board Statement

Ethical review and approval were waived for this study by the Southern Adelaide Clinical Human Research Ethics Committee (SAC HREC), as it was deemed a quality improvement project.

Informed Consent Statement

The requirement for informed consent was waived by the ethics committee because this retrospective study used de-identified data and posed minimal risk to participants.

Data Availability Statement

The data are available from the corresponding author on reasonable request, subject to approval by the ethics committee.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUC: Area Under the Receiver Operating Characteristic Curve
BERT: Bidirectional Encoder Representations from Transformers
CNN: Convolutional Neural Network
CRP: C-reactive Protein
ED: Emergency Department
EMR: Electronic Medical Record
F1-score: Harmonic mean of precision and recall
GPT: Generative Pretrained Transformer
HFRS: Hospital Frailty Risk Score
IQR: Interquartile Range
IRSD: Index of Relative Socio-economic Disadvantage
LSTM: Long Short-Term Memory
MICE: Multivariate Imputation by Chained Equations
NLP: Natural Language Processing
PyTorch: Python-based Deep Learning Framework
ReLU: Rectified Linear Unit
RNN: Recurrent Neural Network
ROC: Receiver Operating Characteristic
ROC-AUC: Area Under the Receiver Operating Characteristic Curve
SD: Standard Deviation
SHAP: Shapley Additive exPlanations

Appendix A

Figure A1. Receiver operating characteristic (ROC) curves for Random Forest, Gradient Boosting, Extra Trees and HistGradient Boosting.
Figure A2. Calibration plot for structured model.
Figure A3. Calibration plot for unstructured model.
Figure A4. Calibration plot for combined model.
Figure A5. Receiver operating characteristic (ROC) curve for the combined CNN model integrating structured EMR variables and unstructured clinical text embeddings.
Figure A6. Receiver operating characteristic (ROC) curve for the combined LSTM model integrating structured EMR variables and unstructured clinical text embeddings.

References

  1. Allaudeen, N.; Vidyarthi, A.; Maselli, J.; Auerbach, A. Redefining readmission risk factors for general medicine patients. J. Hosp. Med. 2011, 6, 54–60. [Google Scholar] [CrossRef]
  2. James, J.; Tan, S.; Stretton, B.; Kovoor, J.G.; Gupta, A.K.; Gluck, S.; Gilbert, T.; Sharma, Y.; Bacchi, S. Why do we evaluate 30-day readmissions in general medicine? A historical perspective and contemporary data. Intern. Med. J. 2023, 53, 1070–1075. [Google Scholar] [CrossRef]
  3. Naylor, M.D.; Brooten, D.; Campbell, R.; Jacobsen, B.S.; Mezey, M.D.; Pauly, M.V.; Schwartz, J.S. Comprehensive discharge planning and home follow-up of hospitalized elders: A randomized clinical trial. JAMA 1999, 281, 613–620. [Google Scholar] [CrossRef] [PubMed]
  4. Tsai, T.C.; Orav, E.J.; Jha, A.K. Care fragmentation in the postdischarge period: Surgical readmissions, distance of travel, and postoperative mortality. JAMA Surg. 2015, 150, 59–64. [Google Scholar] [CrossRef] [PubMed]
  5. Zhou, H.; Della, P.R.; Roberts, P.; Goh, L.; Dhaliwal, S.S. Utility of models to predict 28-day or 30-day unplanned hospital readmissions: An updated systematic review. BMJ Open 2016, 6, e011060. [Google Scholar] [CrossRef] [PubMed]
  6. Goldstein, B.A.; Navar, A.M.; Pencina, M.J.; Ioannidis, J.P. Opportunities and challenges in developing risk prediction models with electronic health records data: A systematic review. J. Am. Med. Inform. Assoc. 2017, 24, 198–208. [Google Scholar] [CrossRef] [PubMed]
  7. Xiao, C.; Ma, T.; Dieng, A.B.; Blei, D.M.; Wang, F. Readmission prediction via deep contextual embedding of clinical concepts. PLoS ONE 2018, 13, e0195024. [Google Scholar] [CrossRef]
  8. Farhan, W.; Wang, Z.; Huang, Y.; Wang, S.; Wang, F.; Jiang, X. A Predictive Model for Medical Events Based on Contextual Embedding of Temporal Sequences. JMIR Med. Inform. 2016, 4, e39. [Google Scholar] [CrossRef]
  9. Lybarger, K.; Dobbins, N.J.; Long, R.; Singh, A.; Wedgeworth, P.; Uzuner, Ö.; Yetisgen, M. Leveraging natural language processing to augment structured social determinants of health data in the electronic health record. J. Am. Med. Inform. Assoc. 2023, 30, 1389–1397. [Google Scholar] [CrossRef]
  10. Shickel, B.; Tighe, P.J.; Bihorac, A.; Rashidi, P. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE J. Biomed. Health Inform. 2018, 22, 1589–1604. [Google Scholar] [CrossRef]
  11. Morgan, D.J.; Bame, B.; Zimand, P.; Dooley, P.; Thom, K.A.; Harris, A.D.; Bentzen, S.; Ettinger, W.; Garrett-Ray, S.D.; Tracy, J.K.; et al. Assessment of Machine Learning vs Standard Prediction Rules for Predicting Hospital Readmissions. JAMA Netw. Open 2019, 2, e190348. [Google Scholar] [CrossRef]
  12. Sharma, Y.; Thompson, C.; Mangoni, A.A.; Shahi, R.; Horwood, C.; Woodman, R. Performance of Machine Learning Models in Predicting 30-Day General Medicine Readmissions Compared to Traditional Approaches in Australian Hospital Setting. Healthcare 2025, 13, 1223. [Google Scholar] [CrossRef]
  13. Hasan, O.; Meltzer, D.O.; Shaykevich, S.A.; Bell, C.M.; Kaboli, P.J.; Auerbach, A.D.; Wetterneck, T.B.; Arora, V.M.; Zhang, J.; Schnipper, J.L. Hospital readmission in general medicine patients: A prediction model. J. Gen. Intern. Med. 2010, 25, 211–219. [Google Scholar] [CrossRef]
  14. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and accurate deep learning with electronic health records. Npj Digit. Med. 2018, 1, 18. [Google Scholar] [CrossRef]
  15. Futoma, J.; Morris, J.; Lucas, J. A comparison of models for predicting early hospital readmissions. J. Biomed. Inform. 2015, 56, 229–238. [Google Scholar] [CrossRef]
  16. Ashfaq, A.; Sant’Anna, A.; Lingman, M.; Nowaczyk, S. Readmission prediction using deep learning on electronic health records. J. Biomed. Inform. 2019, 97, 103256. [Google Scholar] [CrossRef] [PubMed]
  17. Lu, H.; Ehwerhemuepha, L.; Rakovski, C. A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance. BMC Med. Res. Methodol. 2022, 22, 181. [Google Scholar] [CrossRef] [PubMed]
  18. Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar] [CrossRef]
  19. Alsentzer, E.; Murphy, J.; Boag, W.; Weng, W.H.; Jindi, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 72–78. Available online: https://aclanthology.org/W19-1909/ (accessed on 19 February 2026).
  20. Seinen, T.M.; Fridgeirsson, E.A.; Ioannou, S.; Jeannetot, D.; John, L.H.; Kors, J.A.; Markus, A.F.; Pera, V.; Rekkas, A.; Williams, R.D.; et al. Use of unstructured text in prognostic clinical prediction models: A systematic review. J. Am. Med. Inform. Assoc. 2022, 29, 1292–1302. [Google Scholar] [CrossRef]
  21. Brown, J.R.; Ricket, I.M.; Reeves, R.M.; Shah, R.U.; Goodrich, C.A.; Gobbel, G.; Stabler, M.E.; Perkins, A.M.; Minter, F.; Cox, K.C.; et al. Information Extraction From Electronic Health Records to Predict Readmission Following Acute Myocardial Infarction: Does Natural Language Processing Using Clinical Notes Improve Prediction of Readmission? J. Am. Heart Assoc. 2022, 11, e024198. [Google Scholar] [CrossRef]
  22. Mahajan, S.M.; Ghani, R. Combining Structured and Unstructured Data for Predicting Risk of Readmission for Heart Failure Patients. In MEDINFO 2019: Health and Wellbeing e-Networks for All; Studies in Health Technology and Informatics; IOS Press: Amsterdam, The Netherlands, 2019; Volume 264, pp. 238–242. [Google Scholar] [CrossRef]
  23. Pham, M.K.; Mai, T.T.; Crane, M.; Ebiele, M.; Brennan, R.; Ward, M.E.; Geary, U.; McDonald, N.; Bezbradica, M. Forecasting Patient Early Readmission from Irish Hospital Discharge Records Using Conventional Machine Learning Models. Diagnostics 2024, 14, 2405. [Google Scholar] [CrossRef]
  24. Zhang, D.; Yin, C.; Zeng, J.; Yuan, X.; Zhang, P. Combining structured and unstructured data for predictive models: A deep learning approach. BMC Med. Inform. Decis. Mak. 2020, 20, 280. [Google Scholar] [CrossRef]
  25. Pandey, S.R.; Tile, J.D.; Oghaz, M.M.D. Predicting 30-day hospital readmissions using ClinicalT5 with structured and unstructured electronic health records. PLoS ONE 2025, 20, e0328848. [Google Scholar] [CrossRef]
  26. Cui, H.; Fang, X.; Xu, R.; Kan, X.; Ho, J.C.; Yang, C. Multimodal Fusion of EHR in Structures and Semantics: Integrating Clinical Records and Notes with Hypergraph and LLM. In MEDINFO 2025—Healthcare Smart × Medicine Deep; Studies in Health Technology and Informatics; IOS Press: Amsterdam, The Netherlands, 2025; Volume 329, pp. 753–757. [Google Scholar] [CrossRef]
  27. Mahmoudi, E.; Kamdar, N.; Kim, N.; Gonzales, G.; Singh, K.; Waljee, A.K. Use of electronic medical records in development and validation of risk prediction models of hospital readmission: Systematic review. BMJ 2020, 369, m958. [Google Scholar] [CrossRef] [PubMed]
  28. Huang, Y.; Talwar, A.; Chatterjee, S.; Aparasu, R.R. Application of machine learning in predicting hospital readmissions: A scoping review of the literature. BMC Med. Res. Methodol. 2021, 21, 96. [Google Scholar] [CrossRef] [PubMed]
  29. Ru, B.; Tan, X.; Liu, Y.; Kannapur, K.; Ramanan, D.; Kessler, G.; Lautsch, D.; Fonarow, G. Comparison of Machine Learning Algorithms for Predicting Hospital Readmissions and Worsening Heart Failure Events in Patients With Heart Failure With Reduced Ejection Fraction: Modeling Study. JMIR Form. Res. 2023, 7, e41775. [Google Scholar] [CrossRef]
  30. Si, Y.; Du, J.; Li, Z.; Jiang, X.; Miller, T.; Wang, F.; Zheng, W.J.; Roberts, K. Deep representation learning of patient data from Electronic Health Records (EHR): A systematic review. J. Biomed. Inform. 2021, 115, 103671. [Google Scholar] [CrossRef]
  31. Shin, S.; Austin, P.C.; Ross, H.J.; Abdel-Qadir, H.; Freitas, C.; Tomlinson, G.; Chicco, D.; Mahendiran, M.; Lawler, P.R.; Billia, F.; et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Heart Fail. 2021, 8, 106–115. [Google Scholar] [CrossRef]
  32. Sarijaloo, F.; Park, J.; Zhong, X.; Wokhlu, A. Predicting 90 day acute heart failure readmission and death using machine learning-supported decision analysis. Clin. Cardiol. 2021, 44, 230–237. [Google Scholar] [CrossRef]
  33. Sharma, Y.; Horwood, C.; Hakendorf, P.; Shahi, R.; Thompson, C. External Validation of the Hospital Frailty-Risk Score in Predicting Clinical Outcomes in Older Heart-Failure Patients in Australia. J. Clin. Med. 2022, 11, 2193. [Google Scholar] [CrossRef]
  34. Hu, J.; Gonsahn, M.D.; Nerenz, D.R. Socioeconomic status and readmissions: Evidence from an urban teaching hospital. Health Aff. 2014, 33, 778–785. [Google Scholar] [CrossRef]
  35. Mudge, A.M.; Kasper, K.; Clair, A.; Redfern, H.; Bell, J.J.; Barras, M.A.; Dip, G.; Pachana, N.A. Recurrent readmissions in medical patients: A prospective study. J. Hosp. Med. 2011, 6, 61–67. [Google Scholar]
  36. Sharma, Y.; Miller, M.; Kaambwa, B.; Shahi, R.; Hakendorf, P.; Horwood, C.; Thompson, C. Factors influencing early and late readmissions in Australian hospitalised patients and investigating role of admission nutrition status as a predictor of hospital readmissions: A cohort study. BMJ Open 2018, 8, e022246. [Google Scholar] [CrossRef]
  37. Brand, C.; Sundararajan, V.; Jones, C.; Hutchinson, A.; Campbell, D. Readmission patterns in patients with chronic obstructive pulmonary disease, chronic heart failure and diabetes mellitus: An administrative dataset analysis. Intern. Med. J. 2005, 35, 296–299. [Google Scholar] [CrossRef]
  38. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  39. Thompson, D.C.; Mofidi, R. Natural Language Processing framework for identifying abdominal aortic aneurysm repairs using unstructured electronic health records. Sci. Rep. 2025, 15, 26388. [Google Scholar] [CrossRef]
  40. Novac, O.C.; Chirodea, M.C.; Novac, C.M.; Bizon, N.; Oproescu, M.; Stan, O.P.; Gordan, C.E. Analysis of the Application Efficiency of TensorFlow and PyTorch in Convolutional Neural Network. Sensors 2022, 22, 8872. [Google Scholar] [CrossRef]
  41. Rumpf, S.; Zufall, N.; Rumpf, F.; Gschwendtner, A. A Performance Comparison of Different YOLOv7 Networks for High-Accuracy Cell Classification in Bronchoalveolar Lavage Fluid Utilising the Adam Optimiser and Label Smoothing. J. Imaging Inform. Med. 2025, 38, 2367–2380. [Google Scholar] [CrossRef]
  42. Chiang, Y.-Y.; Chen, C.-L.; Chen, Y.-H. Deep learning evaluation of glaucoma detection using fundus photographs in highly myopic populations. Biomedicines 2024, 12, 1394. [Google Scholar] [CrossRef]
Figure 1. Workflow for predicting 30-day unplanned hospital readmissions. The figure illustrates the overall process from data collection, preprocessing, and feature representation to model development, training, evaluation, and interpretation for structured-only, text-only, and combined multimodal models.
Figure 2. Receiver operating characteristic (ROC) curve for logistic regression model.
Figure 3. Receiver operating characteristic (ROC) curve for XGBoost.
Figure 4. Receiver operating characteristic (ROC) curve for the structured data model.
Figure 5. Receiver operating characteristic (ROC) curve for the unstructured text model.
Figure 6. Receiver operating characteristic (ROC) curve for the combined model integrating structured EMR variables and unstructured clinical text embeddings.
Figure 7. SHAP summary plot depicting the relative importance of structured EMR features in predicting 30-day unplanned hospital readmissions.
Figure 8. Bar plot of mean absolute SHAP values for structured EMR features, illustrating the overall contribution of each variable to the prediction of 30-day unplanned hospital readmissions.
Table 1. Deep learning model architectures.

Model | Input Features | Hidden Layers | Output | Activation | Notes
Structured-only | 11 structured EMR variables | 2 fully connected layers: 256, 64 neurons | 1 | Sigmoid | ReLU activation in hidden layers; class-weighted binary cross-entropy loss
Text-only (Bio-ClinicalBERT) | Clinical notes (Bio-ClinicalBERT embeddings) | None (transformer-based) | 1 | Sigmoid | End-to-end fine-tuned Bio-ClinicalBERT with task-specific classification head; max token length = 256
Combined multimodal (Feedforward) | Structured variables + 768-d Bio-ClinicalBERT [CLS] embedding | 2 fully connected layers: 256, 64 neurons | 1 | Sigmoid | Concatenated structured + text embeddings processed same as structured-only model
Combined multimodal (CNN) | Structured variables + 768-d Bio-ClinicalBERT [CLS] embedding | 1D convolutional layer(s) + max-pooling + fully connected layer(s) | 1 | Sigmoid | Convolution applied to concatenated feature vector before classification
Combined multimodal (LSTM) | Structured variables + 768-d Bio-ClinicalBERT [CLS] embedding | LSTM layer + fully connected layer | 1 | Sigmoid | Final LSTM hidden state used for binary classification

EMR, electronic medical record; CNN, convolutional neural network; LSTM, long short-term memory; [CLS], classification token; ReLU, rectified linear unit.
Table 2. Characteristics of patients according to 30-day readmission status.

Variable | Total (n = 4135) | No Readmission (n = 3129) | 30-Day Readmission (n = 1006) | p-Value
Age in years, median (IQR) | 74.0 (59.0–83.0) | 73.0 (59.0–83.0) | 74.0 (60.0–83.0) | 0.416
Sex male, n (%) | 1918 (46.4) | 1452 (46.4) | 466 (46.3) | 0.850
Charlson index, median (IQR) | 1.0 (1.0–2.0) | 1.0 (0.0–2.0) | 1.0 (0.0–3.0) | <0.001
HFRS, median (IQR) | 4.9 (2.6–7.3) | 4.9 (2.6–7.1) | 5.2 (2.6–8.1) | <0.001
IRSD, mean (SD) | 997.3 (59.1) | 998.9 (58.1) | 992.4 (62.1) | 0.002
ED visits in previous 6 months, median (IQR) | 1.0 (0.0–2.0) | 1.0 (0.0–2.0) | 1.5 (0.0–4.0) | <0.001
Hospital admissions in last 1 year, median (IQR) | 0.0 (0.0–2.0) | 0.0 (0.0–1.0) | 1.0 (0.0–3.0) | <0.001
Smokers, n (%) | 297 (7.2) | 196 (6.3) | 101 (10.0) | <0.001
Alcohol abuse, n (%) | 451 (10.9) | 275 (8.8) | 176 (17.5) | <0.001
CRP, median (IQR) | 18.2 (3.8–63.1) | 18.6 (3.7–64.4) | 16.4 (4.0–60.6) | 0.190
LOS, median (IQR) | 3.3 (1.8–6.5) | 3.2 (1.8–6.1) | 3.9 (2.0–7.5) | <0.001

IQR, interquartile range; HFRS, Hospital Frailty Risk Score; SD, standard deviation; ED, emergency department; CRP, C-reactive protein; LOS, length of hospital stay.
Table 3. Performance metrics of machine learning and deep learning models for predicting 30-day unplanned readmissions.

Model Category | Model | Data Modality | ROC-AUC | Accuracy | Precision | Recall | F1 Score
Classical ML baselines | Logistic regression | Structured EMR only | 0.64 | 0.58 | 0.32 | 0.66 | 0.43
Classical ML baselines | XGBoost | Structured EMR only | 0.60 | 0.62 | 0.32 | 0.51 | 0.40
Additional ML models | Random Forest | Structured EMR only | 0.61 | 0.75 | 0.50 | 0.15 | 0.23
Additional ML models | Gradient Boosting | Structured EMR only | 0.61 | 0.73 | 0.40 | 0.17 | 0.24
Additional ML models | Extra Trees | Structured EMR only | 0.61 | 0.74 | 0.41 | 0.16 | 0.23
Additional ML models | HistGradient Boosting | Structured EMR only | 0.62 | 0.73 | 0.40 | 0.16 | 0.23
Deep learning models | DL-Structured | Structured EMR only | 0.62 | 0.74 | 0.42 | 0.14 | 0.22
Deep learning models | DL-Unstructured-text (Bio-ClinicalBERT) | Unstructured clinical text only | 0.52 | 0.46 | 0.25 | 0.83 | 0.39
Deep learning models | DL-Multimodal | Structured variables + clinical text | 0.58 | 0.63 | 0.33 | 0.49 | 0.40
Deep learning models | CNN (combined) | Structured variables + clinical text | 0.54 | 0.47 | 0.26 | 0.66 | 0.38
Deep learning models | LSTM (combined) | Structured variables + clinical text | 0.54 | 0.49 | 0.27 | 0.63 | 0.38

ROC-AUC, area under the receiver operating characteristic curve; EMR, electronic medical records; XGBoost, extreme gradient boosting; DL, deep learning; CNN, convolutional neural network; LSTM, long short-term memory.
Table 4. SHAP-derived feature importance for structured variables predicting 30-day unplanned readmissions.

Feature | Mean SHAP Value * | Relative Importance (%)
Total hospital admissions (prior year) | 0.0627 | 17.1
Number of ED visits in previous 6 months | 0.0627 | 17.1
Charlson index | 0.0318 | 8.7
Hospital length of stay | 0.0241 | 6.6
Age | 0.0164 | 4.5
HFRS | 0.0163 | 4.5
IRSD | 0.0105 | 2.9
CRP | 0.0094 | 2.6
Sex | 0.0081 | 2.2
Alcohol abuse | 0.0058 | 1.6
Smoking | 0.0037 | 1.0

ED, emergency department; HFRS, Hospital Frailty Risk Score; IRSD, Index of Relative Socio-economic Disadvantage; CRP, C-reactive protein. * SHAP values represent the mean absolute contribution of each feature to the model predictions. Features are ranked from most to least important based on mean SHAP value. Relative importance is calculated as each feature's SHAP value divided by the sum of all features' SHAP values.

Share and Cite

Sharma, Y.; Thompson, C.; Mangoni, A.A.; Horwood, C.; Woodman, R. Comparative Evaluation of Machine Learning Models Using Structured and Unstructured Clinical Data for Predicting Unplanned General Medicine Readmissions in a Tertiary Hospital in Australia. Computers 2026, 15, 138. https://doi.org/10.3390/computers15030138
