Article

Predicting Mortality and Readmission in Obstructive Sleep Apnea via LLM-Expanded Clinical Concepts

1 Hood College, Frederick, MD 21701, USA
2 San Juan Bautista School of Medicine, Caguas, PR 00727, USA
3 Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Big Data Cogn. Comput. 2026, 10(3), 97; https://doi.org/10.3390/bdcc10030097
Submission received: 19 December 2025 / Revised: 6 March 2026 / Accepted: 19 March 2026 / Published: 21 March 2026

Abstract

Obstructive Sleep Apnea (OSA) is a common sleep disorder associated with serious health risks. This study leverages large language models (LLMs) to process and interpret clinical narratives in electronic health records. It develops clinically meaningful lexicons for predicting mortality and readmission risk, as well as for multiclass diagnostic classification in OSA patients. Using LLM-expanded lexicons, logistic regression models achieved ROC–AUC scores of 0.844 for 6-month all-cause post-discharge mortality, 0.817 for 1-year all-cause post-discharge mortality, and 0.729 for all-cause hospital readmissions following the first discharge. Diagnostic performance was highest with smaller n-gram representations, indicating that additional contextual length did not improve performance. Compared with frequency-based n-gram models, LLM-expanded lexicons yielded sparser feature sets with lower computational cost and comparable performance. Our findings highlight the potential of LLM-expanded lexicons to enhance OSA diagnosis and clinical risk stratification.

1. Introduction

Obstructive sleep apnea (OSA) is a sleep-related breathing disorder characterized by complete (apnea) or partial (hypopnea) obstruction of the upper airways, leading to sleep disruptions and intermittent hypoxemia, i.e., a low level of oxygen in the blood. In the United States, the overall prevalence is 32.4% among adults aged 20 years and older, with 39.1% among males and 26.0% among females, after adjusting for obesity [1], and almost one billion people are affected globally [2]. In addition, OSA is associated with an increased risk of various cardiovascular (CV) [3,4,5,6], metabolic [5], and neurological conditions [7,8].
The substantial morbidity and mortality associated with OSA and the cost of treatment require accurate risk stratification to optimize patient care and resource allocation. Current clinical risk assessment tools, such as the STOP-BANG questionnaire [9] and the NoSAS score [10], are designed primarily for OSA screening rather than mortality or readmission prediction. Although traditional polysomnographic parameters such as the apnea-hypopnea index (AHI) provide diagnostic information, they have limited prognostic value for predicting adverse clinical outcomes in OSA patients.
Previous machine learning approaches for OSA outcome prediction have predominantly relied on structured patient data, such as demographics, vital signs, and laboratory values. However, using solely structured Electronic Health Records (EHR) data can lead to biased results and suboptimal predictive performance, as substantial clinical information often resides in unstructured clinical narratives [11]. Recent studies have shown that unstructured clinical notes improve the prediction of mortality and hospital readmission [12,13]. Artificial intelligence (AI)-based medical systems increasingly use natural language processing (NLP) with pretrained language models to process and interpret clinical narratives in EHRs [14,15,16].
Large language models (LLMs) offer several advantages over traditional machine learning (ML) and conventional NLP approaches for healthcare applications. First, LLMs learn contextualized representations that capture semantic meaning, nuanced language variability, and complex clinical phenomena—including long-range dependencies, negation, and temporality [17,18,19]. Second, through large-scale pretraining on biomedical and clinical corpora, clinical LLMs enable effective transfer learning and improved performance on downstream healthcare tasks with limited labeled data, outperforming classical ML models that rely heavily on manual feature engineering or fixed vocabularies [20,21]. Finally, LLMs have shown the ability to generalize across diverse medical tasks without requiring task-specific architectures and training, providing a flexible framework for clinical NLP applications [22].
These advantages motivated our use of pretrained clinical language models to enrich the representation of clinical concepts. Starting from physician-provided seed terms for categorizing OSA and its comorbidities, we expanded the lexicon by identifying additional relevant medical terms based on their semantic similarity to the seed terms using LLMs. Unlike traditional NLP pipelines, which typically rely on fixed vocabularies, rule-based mapping, or task-specific feature engineering, this approach identifies synonymous, abbreviated, and context-dependent expressions commonly found in discharge documentation. Thus, the main advantages of our approach include broader coverage of clinical terminology, reduced reliance on manual feature curation, and leveraging knowledge from pretrained language models.
Similar terminology expansion strategies using distributional semantics and embedding-based nearest-neighbor retrieval have been studied in clinical text [23,24,25]. However, prior work has not systematically combined expert-curated seed terms with LLM-based expansion and applied the resulting lexicon to OSA-focused predictive analyses. Our contributions include (i) a framework for using LLMs to expand clinical lexicons and then using them to represent unstructured discharge notes; (ii) demonstrating that LLM-expanded lexicons transfer across tasks, from diagnosis to outcome prediction; and (iii) showing that LLM-expanded lexicons enable sparser, more efficient models than traditional frequency-based n-gram approaches.
The remainder of this paper is organized as follows. Section 2 reviews related work, Section 3 describes the dataset, LLM-based lexicon expansion process, and modeling framework, Section 4 presents experimental results and comparisons with n-gram baselines, and Section 5 discusses the main findings, limitations, and future directions.

2. Related Work

We review prior work on LLMs in healthcare in Section 2.1 and NLP methods for predicting clinical outcomes in Section 2.2. While LLMs have been widely applied to clinical prediction, few studies have examined whether domain-specific LLM-expanded lexicons can transfer across tasks. This study assesses whether lexicons developed for OSA diagnosis can effectively predict mortality and hospital readmission.

2.1. LLMs in Health Care

LLMs are a type of AI model trained on vast amounts of typically unlabeled data [26]. While traditional AI models are often single-task systems, foundation models (FMs), i.e., LLMs trained on large datasets, can subsequently be fine-tuned to perform many different downstream tasks. FMs represent a paradigm shift in AI model development [26]: a single LLM can be reused across a range of tasks with minimal adaptation or retraining. However, LLMs typically have substantially more parameters than traditional AI models, sometimes in the hundreds of billions, which requires significant computational resources for training [27].
Recent advances in LLMs, the exponential growth of medical literature, and the widespread availability of large-scale EHRs have set the stage for clinical LLMs to revolutionize medical practice. Noteworthy applications of LLMs in healthcare include named entity recognition and relation extraction (e.g., BioBERT [19] and BlueBERT [28]), medical question answering and inference (e.g., GatorTron [20] and Med-PaLM [21]), discharge summaries (e.g., ChatGPT-3.5 [29]), diagnosis classification (e.g., ClinicalBERT [18]), and various others [20,30]. Recent systematic reviews and studies demonstrate LLMs’ efficacy in interview dialogue summarization [31] and disease diagnosis and treatment [32], and highlight the need for human-centric LLMs for personalized medicine and equitable development and access [33].
Embeddings are numerical representations of words, phrases, or sentences that capture contextual information and understand relationships within large segments of text. They have been used in various tasks, such as text retrieval and ranking [34], text classification [35], sentiment analysis [36], clinical concept extraction [37], and patient risk stratification [38].
Building on previous work [39], this study focuses on embeddings extracted from LLMs. Starting with a set of initial medical terms (“seed terms”) for categorizing OSA and its associated comorbidities, we aim to expand the lexicon using LLMs. Specifically, we identify additional relevant medical terms by computing the cosine similarity between their embeddings and the seed terms.

2.2. Predicting Mortality and Hospital Readmission Through NLP Techniques

Recent studies have explored the application of NLP techniques to predict mortality and hospital readmission in healthcare settings. Some approaches used unstructured clinical notes only [40,41,42,43]. For example, Huang et al. pretrained BERT on clinical notes and fine-tuned it for improved 30-day hospital readmission prediction.
Others integrate clinical text, vital signs, time series measurements, and imaging to create a comprehensive profile of a patient’s health status for more accurate predictions [44,45,46,47,48]. For example, Jin et al. performed named entity extraction and negation detection on clinical notes and trained a multimodal neural network that integrated time series signals and unstructured clinical text representations for predicting in-hospital mortality risk in ICU patients.
More recently, LLM ensembles have been applied directly to mortality prediction using unstructured medical notes from MIMIC-IV [49]. LLMs have also been used to annotate and extract structured variables from unstructured clinical narratives to support downstream outcome prediction [15,16]. These approaches treat LLMs as domain-aware information extractors, highlighting a shift from using LLMs as black-box generators to clinically-guided feature engineering tools.
Despite prior work on leveraging LLMs in clinical prediction tasks, few studies have explored whether domain-specific concepts—automatically expanded using LLMs—can be repurposed across clinical tasks such as outcome prediction. In this study, we aim to assess the predictive effectiveness of LLM-expanded medical terms, originally developed for OSA diagnosis, in predicting mortality and hospital readmission risks.

3. Materials and Methods

This section describes the data, methods, and evaluation framework. We present the MIMIC-IV dataset in Section 3.1, the software platform in Section 3.2, the LLM-based lexicon expansion and classification workflow in Section 3.3, and model training procedures in Section 3.4.

3.1. Dataset

The Medical Information Mart for Intensive Care (MIMIC)-IV database is utilized for this project [50]. It comprises deidentified EHRs for patients admitted to the Beth Israel Deaconess Medical Center intensive care unit (ICU) between 2008 and 2019. MIMIC-IV v2.2, released in January 2023, consists of records for 299,712 patients and 431,231 admissions.
In addition to OSA, we examined the following associated comorbidities: diabetes mellitus type 2 (T2DM), hypertension (HTN), heart failure (HF), and atrial fibrillation (AF). Table 1 shows the number of International Classification of Diseases (ICD) codes and seed terms identified by physicians for each health condition.
As an example, the following ICD codes are used to identify patients with OSA: 327.20 (Organic sleep apnea, unspecified), 327.23 (Obstructive sleep apnea [adult, pediatric]), 327.29 (Other organic sleep apnea), 780.51 (Insomnia with sleep apnea), 780.53 (Hypersomnia with sleep apnea), 780.57 (Sleep apnea [NOS]), G4730 (Sleep apnea, unspecified), G4733 (Obstructive sleep apnea [adult, pediatric]), and G4739 (Other sleep apnea). A patient is considered to have a positive diagnosis for a specific health condition if they possess at least one corresponding ICD code. Table 2 provides basic demographic data of patients of interest.
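The one-code inclusion rule described above can be sketched in a few lines of pandas; the table below is a toy stand-in for MIMIC-IV's diagnoses_icd table, and the code subset is illustrative:

```python
import pandas as pd

# Hypothetical subset of the OSA ICD codes listed above (dots removed,
# as codes are stored without them in MIMIC-IV).
OSA_CODES = {"32720", "32723", "G4733"}

# Toy stand-in for MIMIC-IV's diagnoses_icd table.
diagnoses = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "icd_code":   ["32723", "4019", "G4733", "25000"],
})

# A patient is OSA-positive if they have at least one matching code.
osa_patients = set(diagnoses.loc[diagnoses["icd_code"].isin(OSA_CODES), "subject_id"])
```

The same filter, applied per condition, yields the cohorts summarized in Table 2.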
The dataset described in this section serves as the source population for all analyses in this study. Task-specific cohorts for mortality prediction, readmission prediction, and the n-gram size analysis are derived from this shared population using outcome-based inclusion criteria and are described in detail in Section 4.2, Section 4.3, and Section 4.4, respectively.

3.2. Software and Platform

The major software packages used in this study include pandas 2.3.3, NumPy 2.1.2, matplotlib 3.10.6, mpi4py 4.1.0, PyTorch 2.8.0+cu129, and scikit-learn 1.7.2. Pretrained biomedical language models included BlueBERT, GatorTron-medium, and BioClinicalBERT, all used as released without additional fine-tuning. Computing resources were provided by the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, which supported large-scale language model inference, complex machine learning workloads, and processing of EHR data.

3.3. Process Flow

Our previous study demonstrated LLM-based lexicon development for OSA sub-phenotyping [51]. In this study, we focus on applying these LLM-expanded lexicons for mortality and readmission prediction. The flowchart in Figure 1 outlines the sequence of tasks or processes that are executed to achieve the objective described in Section 1. Next, we describe these tasks.

3.3.1. Bi/Tri/Four-Grams Extracted from Discharge Notes

The initial step is to extract bigrams (pairs of consecutive words), trigrams (triplets of consecutive words), and four-grams (four consecutive words) from all patient discharge notes to capture commonly used phrases. MIMIC-IV v2.2 contains 331,794 discharge notes. The mean number of characters per note is 10,551. The longest and shortest discharge notes have 60,381 and 353 characters, respectively. The numbers of bigrams, trigrams, and four-grams extracted from this step are 3,096,096, 5,407,839, and 4,792,806, respectively. These n-grams (i.e., bigrams, trigrams, and four-grams) are candidates for expanding lexicons in this study.
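A minimal sketch of this extraction step, assuming simple whitespace tokenization (the study's actual preprocessing pipeline may differ):

```python
from collections import Counter

def extract_ngrams(text: str, n: int) -> list[str]:
    """Return all runs of n consecutive lowercase word tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy discharge-note fragment.
note = "patient reports excessive daytime sleepiness and loud snoring"
bigram_counts = Counter(extract_ngrams(note, 2))
trigram_counts = Counter(extract_ngrams(note, 3))
fourgram_counts = Counter(extract_ngrams(note, 4))
```

Aggregating such counters over all 331,794 notes produces the candidate bigram, trigram, and four-gram pools.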

3.3.2. Seed Terms Provided by Physicians

Physicians involved in this study provided seed terms, i.e., relevant medical terms or phrases, for OSA and comorbidities of interest. The number of seed terms per condition is listed in Table 1. For example, the following terms are among the 38 terms for OSA: poorly refreshing sleep, obstructive sleep apnea (OSA), obesity hypoventilation syndrome (OHS), unrefreshing sleep, sleepiness, excessive daytime sleepiness (EDS), and snoring.

3.3.3. Expanding Lexicon via LLMs

The goal of this step is to identify informative n-grams from discharge notes for categorizing OSA and its associated comorbidities. The general approach to selection is by comparing how similar these n-grams are to the seed terms of corresponding conditions. The similarity between an n-gram and a seed term is measured by the cosine similarity between their embeddings, which are semantic representations extracted from an LLM. Specifically, given an n-gram t and a seed term s, the cosine similarity between t and s is measured using
S_c(t, s) = ( V(t) · V(s) ) / ( ‖V(t)‖₂ × ‖V(s)‖₂ )
where V(t) and V(s) are the LLM embedding vectors of the n-gram t and seed term s, respectively, and ‖·‖₂ denotes the Euclidean norm.
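The similarity computation can be sketched as follows, with small toy vectors standing in for real LLM embedding vectors:

```python
import numpy as np

def cosine_similarity(v_t: np.ndarray, v_s: np.ndarray) -> float:
    """S_c(t, s): dot product of the embeddings divided by the
    product of their Euclidean norms."""
    return float(v_t @ v_s / (np.linalg.norm(v_t) * np.linalg.norm(v_s)))

# Toy 3-dimensional stand-ins for LLM embeddings of an n-gram and a seed term.
v_ngram = np.array([1.0, 2.0, 3.0])
v_seed = np.array([2.0, 4.0, 6.0])  # same direction, different magnitude

score = cosine_similarity(v_ngram, v_seed)  # direction only, so score is 1.0
```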

3.3.4. Expanded Lexicon

For each n-gram, there are 88 similarity scores, corresponding to the 88 seed terms (Table 1). The importance or relevance of each n-gram to a specific health condition (OSA or comorbidity) is measured by the average of all similarity scores between the n-gram and the seed terms associated with that condition. As a result, each n-gram has five similarity scores, one for each health condition.
The similarity scores of n-grams are then ranked individually for each condition, and the rankings of bigrams, trigrams, and four-grams are separated as well. Thus, each health condition yields three distinct ranked lists, from which a number of n-grams are selected as textual features for prediction tasks.
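The averaging-and-ranking step might look like the following; the n-grams, seed-term similarities, and cutoff k are all illustrative:

```python
import numpy as np

# Toy similarity matrix: rows = candidate n-grams,
# columns = seed terms of one condition (here, two OSA seed terms).
ngrams = ["loud snoring", "chest pain", "daytime sleepiness"]
sims = np.array([
    [0.9, 0.8],   # loud snoring
    [0.2, 0.3],   # chest pain
    [0.7, 0.9],   # daytime sleepiness
])

# Relevance of each n-gram = mean similarity to the condition's seed terms.
relevance = sims.mean(axis=1)

# Rank by relevance and keep the top k as lexicon entries.
k = 2
top_k = [ngrams[i] for i in np.argsort(relevance)[::-1][:k]]
```

Repeating this per condition and per n-gram size yields the three ranked lists described above.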

3.3.5. Patient Discharge Notes Extracted Using ICD Codes

Discharge notes for patients with OSA and/or comorbidities are extracted based on ICD codes (see Table 2 for a summary). The process involved merging information from multiple tables or files. Discharge notes are long-form narratives that describe the reason for a patient’s admission to the hospital, their hospital course, and any relevant discharge instructions. Each discharge note corresponds to a single hospital stay, and a patient may have multiple discharge notes if they have more than one hospital stay.

3.3.6. Classifying with Logistic Regression

Logistic Regression (LR) is a statistical technique in machine learning used to model the relationship between a set of independent variables and a categorical dependent variable. LR estimates the probability of an observation belonging to each class, handling both binary and multiclass classification through multinomial extensions. It is selected for this study due to its simplicity and its ability to provide interpretable insights into the relative importance of different text features (i.e., n-grams) for predicting class membership.
Each discharge note was labeled based on the study-specific classification. For example, in the mortality study, each discharge note was labeled based on the patient’s status as either alive or deceased.
To represent each discharge note as input features for LR, a “bag-of-ngrams” encoding was applied, treating each n-gram as a binary presence/absence feature variable. The selection of n-grams was determined by the health conditions under study. For instance, in a study involving patients with OSA and HF using trigrams, the feature set comprised the top-n trigrams from each condition’s sorted trigram list, merged with duplicates removed. Each discharge note was then encoded as a feature vector based on the presence or absence of these selected trigrams.
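A minimal sketch of the bag-of-ngrams encoding and LR setup, assuming scikit-learn's CountVectorizer with a fixed lexicon vocabulary (the notes, labels, and lexicon entries below are toy examples, not the study's actual feature set):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical merged lexicon: top-ranked trigrams from the OSA and HF lists.
lexicon = [
    "obstructive sleep apnea",
    "excessive daytime sleepiness",
    "acute heart failure",
]

notes = [
    "history of obstructive sleep apnea with excessive daytime sleepiness",
    "admitted for acute heart failure exacerbation",
]
labels = [0, 1]

# binary=True yields presence/absence features rather than counts;
# a fixed vocabulary restricts features to the expanded lexicon.
vectorizer = CountVectorizer(vocabulary=lexicon, ngram_range=(3, 3), binary=True)
X = vectorizer.fit_transform(notes)

clf = LogisticRegression().fit(X, labels)
```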

3.3.7. Prediction Results

The expanded lexicons for characterizing OSA and associated comorbidities were evaluated for their predictive power in three classification tasks: (1) mortality prediction (alive vs. deceased), (2) readmission prediction (readmitted vs. not readmitted), and (3) n-gram size impact study using diagnostic labels (OSA only, HF only, and OSA & HF), as assessed in Section 4.

3.4. Model Training and Evaluation

LR hyperparameters were optimized using repeated stratified 5-fold cross-validation, which ensures balanced class distribution across folds and prevents data leakage. Grid search evaluated regularization strength (C ∈ {0.001, 0.01, 0.05, 0.1, 1}), penalty (L2), solver (LBFGS and SAGA), and class weighting (None or balanced), selecting the configuration that maximized the mean area under the receiver operating characteristic curve (ROC-AUC) for binary tasks and weighted AUROC (wAUC) for multiclass tasks across validation folds. wAUC was selected over macro AUC because it weights each class by its prevalence, providing a clinically relevant measure that reflects actual patient distribution, whereas macro AUC treats all classes equally regardless of size. The same hyperparameter search space, random seed (random_state = 42), and cross-validation splits were held constant across all classification tasks and feature representations to ensure reproducibility. Final models were trained using scikit-learn’s LogisticRegression with the selected hyperparameters.
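The tuning procedure described above can be sketched with scikit-learn's GridSearchCV and RepeatedStratifiedKFold; the small synthetic dataset stands in for the real bag-of-ngrams feature matrices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# Synthetic stand-in for the binary bag-of-ngrams features and outcome labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

param_grid = {
    "C": [0.001, 0.01, 0.05, 0.1, 1],
    "penalty": ["l2"],
    "solver": ["lbfgs", "saga"],
    "class_weight": [None, "balanced"],
}

# Repeated stratified 5-fold CV preserves class balance in every fold.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=cv,
).fit(X, y)

best_auc = search.best_score_  # mean ROC-AUC of the best configuration
```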
Pretrained language models were used only for medical lexicon expansion; no fine-tuning or task-specific training of LLM parameters was performed.

4. Results and Discussion

We present results for three prediction tasks: mortality in Section 4.2, hospital readmission in Section 4.3, and diagnostic classification examining n-gram size effects in Section 4.4. Section 4.1 provides an overview of the LLMs, cohort construction, and model comparison approach used across experiments.

4.1. Overview

The following LLMs were used for lexicon expansion. For mortality, we used GatorTron Medium (3.9 B parameters, trained on EHR data from the University of Florida Health system, PubMed, and MIMIC) and BlueBERT (336 M parameters, trained on MIMIC and PubMed) to generate trigrams. BlueBERT was also used to generate four-grams for readmission. Additionally, BioClinicalBERT (110 M parameters, initialized from BioBERT and trained on MIMIC-III) was used to examine the impact of n-gram size on diagnostic performance for OSA. These models were selected to balance model complexity and compatibility with MIMIC-IV data.
As outlined in Section 3.3.6, for each task-specific study cohort, an equal number of LLM-selected n-grams were chosen from OSA and related comorbidities, then merged into a unified feature set. These n-grams were treated as binary features based on their presence in discharge notes. Models were trained and evaluated following the same process as described in Section 3.4.
To further investigate the impact of LLM-expanded n-grams on mortality and readmission prediction, we include a generic n-gram baseline, referred to as the Top-ngram approach. Rather than using LLM-selected trigrams or four-grams, this method relies on the most frequently occurring trigrams or four-grams in discharge notes from the MIMIC-IV dataset. We compared the LLM-expanded n-gram approach with the Top-ngram model in both predictive performance and computational characteristics. Model complexity is assessed using three indicators: the total n-gram count, the number of non-zero coefficients after model fitting, and empirical runtime. These measures represent feature-space dimensionality, model sparsity, and computational cost, respectively.

4.2. Mortality Prediction

To build the cohort, we extracted patients and their hospital admissions related to OSA, T2DM, or HTN using ICD codes determined by physicians. Table 3 summarizes the number of patients and corresponding hospital admissions for each group. Note that each admission is associated with a single discharge note.
For the mortality study cohort, only the last hospital admission for each patient was included. Discharge notes were labeled as deceased if the patient died within either 6 months or 1 year following discharge. The counts reported in Table 4 represent the number of alive and deceased cases across the different patient groups.
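The labeling rule might be implemented as below; the column names (dischtime, dod) mirror MIMIC-IV conventions, but the rows are toy data and the 6-month cutoff of 182 days is an assumption:

```python
import pandas as pd

# Toy stand-in: one row per patient's last admission.
last_admissions = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "dischtime": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-01"]),
    # Date of death; NaT means the patient is alive.
    "dod": pd.to_datetime(["2019-04-01", "2020-06-01", None]),
})

days_to_death = (last_admissions["dod"] - last_admissions["dischtime"]).dt.days

# Deceased labels for the two horizons; NaT comparisons evaluate to False.
last_admissions["deceased_6mo"] = days_to_death <= 182
last_admissions["deceased_1yr"] = days_to_death <= 365
```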
The mortality cohort was evaluated using stratified 5-fold cross-validation with three repeats. Because only one admission per patient was included, the evaluation is patient-independent by construction. The same splits were applied consistently across all models and feature representations.
Two approaches were explored, as described in the Overview section. The Top-ngram approach selected the 200,000 most frequent trigrams from all discharge notes as features. In contrast, the LLM-based approach (i.e., GatorTron Medium and BlueBERT) merged top-ranked trigrams from each condition-specific LLM-expanded list, yielding 206,858 features for GatorTron Medium and 204,003 features for BlueBERT.
Table 5 shows that the Top-ngram approach outperformed both the GatorTron Medium and BlueBERT models in predicting 6-month and 1-year mortality. To better understand these results, we further examined the model parameters. Specifically, we analyzed the number of unique trigrams used in each model (i.e., Total Trigrams) and the number of trigrams with non-zero coefficients in the fitted LR models (i.e., Non-zero) (Table 5). A non-zero coefficient indicates that the trigram contributes to the model’s predictions.
The Top-ngram approach retained nearly all 200,000 trigrams (99.9% non-zero coefficients), whereas GatorTron Medium and BlueBERT produced sparser models with only 78–79% non-zero coefficients. Higher sparsity (more coefficients set to zero) is desirable as it indicates the model relies on a smaller, semantically focused feature subset, improving interpretability and computational efficiency. Empirically, models using LLM-expanded lexicons were approximately twice as fast as the Top-ngram approach, consistent with their sparser feature representations.
Despite substantial class imbalance (12.6–19.4% mortality; Table 4), Table 6 shows that all approaches achieved good performance, with precision more than three times the baseline prevalence and high specificity (0.81–0.91). LLM approaches achieved comparable ROC-AUC with better computational efficiency, while Top-ngram showed a better precision–recall balance, which may be important in clinical settings where both unnecessary interventions and missed high-risk patients carry costs.

4.3. Hospital Readmission Prediction

To build the readmission cohort, we extracted patients with OSA or AF using physician-assigned ICD codes. Table 7 provides information on the patient composition for each group.
As part of the preprocessing for the readmission analysis, hospital admissions were first chronologically ordered for each patient to establish a timeline of visits. The final admission for each patient was identified, and a distinction was made between intermediate and last admissions. To ensure that the analysis focused only on admissions where readmission was possible, records corresponding to a patient’s final admission were excluded if the patient had died following that visit. This step helps avoid bias by removing cases where readmission was not a possibility due to death. Patients’ discharge notes were labeled as 1 if they were readmitted and labeled as 0 if not readmitted or deceased. The counts reported in Table 8 represent the number of discharge notes across the different patient groups.
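One way to sketch this preprocessing in pandas; the admissions table and died flag are toy stand-ins, and the study's exact exclusion logic may differ in detail:

```python
import pandas as pd

# Toy admissions table; `died` marks admissions after which the patient died.
adm = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "admittime": pd.to_datetime(
        ["2019-01-01", "2019-03-01", "2019-02-01", "2019-05-01"]
    ),
    "died": [False, False, False, True],
})

# Order admissions chronologically per patient.
adm = adm.sort_values(["subject_id", "admittime"])

# An admission is "readmitted" if the same patient has a later admission;
# counting remaining admissions per group makes the last one get 0.
adm["readmitted"] = adm.groupby("subject_id").cumcount(ascending=False) > 0

# Exclude final admissions of patients who died (readmission impossible).
adm = adm[~(~adm["readmitted"] & adm["died"])]
```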
For readmission prediction, we applied stratified group 5-fold cross-validation (three repeats) with patient-level grouping to prevent data leakage. Model training and hyperparameter optimization followed the procedures described in Section 3.4.
Two approaches were used to generate four-gram features for logistic regression. The Top-ngram approach selected the 200,000 most frequent four-grams from all discharge notes as features. The LLM-based approach merged top-ranked four-grams from each condition-specific BlueBERT-expanded list, yielding 208,455 features after removing duplicates. Figure 2 shows that the predictive power of four-grams for readmission prediction improves with longer windows, suggesting that they better capture chronic disease burden and long-term risk. “Anytime” here refers to hospital admissions that occur at any time after the initial hospital discharge.
Table 9 shows that Top-ngram and BlueBERT approaches achieved comparable performance (ROC-AUC 0.736 vs. 0.729). As observed in the mortality study, the BlueBERT approach yielded a sparser model (72.13% non-zero coefficients vs. 99.83% for Top-ngram). The computational cost also differed substantially: the BlueBERT-expanded model was approximately five times faster than the Top-ngram model.
With a more balanced class distribution (62.7% readmission rate), Table 10 shows both approaches performed well. Precision was 0.76–0.77 and F1-scores were 0.72–0.74. Top-ngram showed higher recall (0.719 vs. 0.687) but slightly lower specificity (0.624 vs. 0.649).

4.4. Comparing the Impact of N-Gram Size

The goal of this analysis is to examine how the size of LLM-expanded n-grams influences diagnostic performance when the language model is held fixed. BioClinicalBERT was selected based on our prior systematic evaluation of six pretrained language models for OSA classification [51], which identified it as best balancing representational quality and computational efficiency. BioClinicalBERT [18] is used throughout this analysis to isolate the effect of n-gram size.
Similar to the mortality and readmission studies, the patient cohort was extracted based on OSA and HF ICDs (Table 11).
An equal number of bi/tri/four-grams was chosen from the ranked bi/tri/four-gram lists of OSA and HF. As in the previous sections, each selected n-gram serves as a unique feature for the bag-of-words classification model, capturing the presence or absence of each n-gram in a given discharge note. The discharge notes were labeled as OSA only (i.e., without HF), HF only (i.e., without OSA), or both OSA & HF based on the ICD codes. LR was then employed for 3-class one-vs-rest classification. Classification performance was measured by wAUC rather than macro AUC to account for class imbalance (Table 11), as wAUC weights classes by prevalence to better reflect the actual patient distribution.
Stratified 5-fold cross-validation (three repeats) at the admission level was applied to this task-specific cohort, consistently across all n-gram sizes. Figure 3 shows wAUC values ranging from approximately 0.74 to 0.9. It also suggests that the wAUC scores for all three types of n-grams increase as the number of selected n-grams grows. The bigram-based model generally outperforms the trigram- and four-gram-based models, indicating that the additional context provided by trigrams and four-grams does not contribute to higher predictive performance for this analysis. This is somewhat unexpected, as trigram- and four-gram-based models were initially assumed to offer richer contextual information due to their longer phrase structure.
For diagnosis prediction, we applied stratified patient-level grouped 5-fold cross-validation (three repeats) to prevent data leakage from the same patient. Model training and hyperparameter optimization followed the procedures described in Section 3.4. Table 12 shows wAUC is consistently higher than macro AUROC across all n-gram types, reflecting class imbalance and supporting the use of wAUC as the primary metric. Bigrams outperform trigrams and four-grams across all measures, including wAUC (0.863 vs. 0.820 vs. 0.803) and weighted F1 (0.747 vs. 0.696 vs. 0.683). The gap between weighted and macro-averaged metrics indicates difficulty in predicting minority classes, and the monotonic decline in performance with increasing n-gram size confirms that additional context from longer n-grams does not improve diagnostic performance.

5. Conclusions

Research indicates that OSA increases the risk of cardiovascular and metabolic complications. This study used LLMs and NLP to develop a lexicon specific to OSA and its associated comorbidities for predicting patient mortality and hospital readmission risk, as well as for performing multiclass diagnostic classification of OSA and HF.

5.1. Major Findings

LLMs can identify informative lexicons for predicting mortality and hospital readmission in OSA patients, achieving ROC-AUC scores of 0.844 for 6-month post-discharge mortality (Table 5) and 0.729 for all-cause hospital readmissions following the first discharge (Table 9). In the mortality study, the GatorTron Medium-expanded lexicon performed slightly better than the lexicon expanded with BlueBERT, which could be because more complex models are more adept at selecting higher-quality n-grams, leading to a more accurate characterization of health status.
The Top-ngram approach achieved comparable or better results but required greater computational cost. This increased computational cost stems from using nearly all available n-grams, whereas LLM-expanded lexicons focus on a smaller, semantically informed feature subset (Table 5 and Table 9). Although the LLM approach began with a similar number of n-grams, only a smaller fraction had non-zero coefficients, yielding sparser models. This sparsity likely explains why LLM models ran approximately two to five times faster than Top-ngram models.
LLMs can also identify informative lexicons for OSA diagnosis, achieving wAUC scores of 0.9 or slightly higher (Figure 3). wAUC scores for all three n-gram models increase as the number of n-grams grows, with the bigram model outperforming the trigram and four-gram models. This suggests that the extra context provided by trigrams and four-grams did not improve predictive performance.
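The wAUC vs. macro AUROC comparison can be sketched as follows for an imbalanced three-class problem (e.g., OSA only, HF only, OSA & HF); the class proportions and scores here are synthetic, purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.choice([0, 1, 2], size=300, p=[0.3, 0.6, 0.1])   # imbalanced classes
scores = rng.random((300, 3))
scores[np.arange(300), y_true] += 0.5                          # make scores informative
probs = scores / scores.sum(axis=1, keepdims=True)             # normalize to probabilities

# weighted average gives larger classes more influence; macro treats classes equally
wauc = roc_auc_score(y_true, probs, multi_class="ovr", average="weighted")
mauc = roc_auc_score(y_true, probs, multi_class="ovr", average="macro")
```

With imbalanced classes the two averages diverge, which is why the paper reports both and uses wAUC as the primary metric.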

5.2. Limitations and Future Work

In this study, we explored the effectiveness of employing LLM-expanded lexicons for predicting patient outcomes and performing multiclass diagnostic classification of OSA and HF. Several limitations should be acknowledged.
Data and Labeling Limitations. First, discharge notes for patients with OSA and comorbidities were labeled using ICD codes, primarily designed for billing purposes and not necessarily indicative of the patient’s final diagnosis. Second, our analysis relied solely on data from a single dataset (MIMIC-IV), which may limit generalizability to other healthcare settings with different patient populations, documentation practices, or coding conventions. To enhance model validation, we plan to incorporate two or more instances of diagnostic codes for a specific health condition when labeling [52], collaborate with physicians to create a ground truth dataset [53], and validate our approach on other datasets pending data availability and access agreements.
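The planned two-instance labeling rule could be prototyped as below; the DataFrame schema and column names are hypothetical, and MIMIC-IV's actual diagnosis tables differ.

```python
import pandas as pd

# toy diagnosis table: one row per recorded ICD code instance
diagnoses = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3, 3],
    "icd_code":   ["G47.33", "G47.33", "G47.33", "I50.9", "I50.9", "G47.33"],
})

# keep a (patient, code) label only when the code appears at least twice
counts = diagnoses.groupby(["patient_id", "icd_code"]).size()
confirmed = counts[counts >= 2].reset_index(name="n_codes")
```

Here patient 2's single OSA code and patient 3's single G47.33 instance would be excluded, reducing the chance of labeling from a one-off billing code.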
Model and Evaluation Limitations. We employed logistic regression to support interpretability, but this approach may not capture complex nonlinear relationships between features. In addition, our models use only unstructured clinical notes. While we reported standard performance metrics, we did not assess clinical utility or real-world deployment feasibility. Future work will explore more advanced models, combine structured data (e.g., demographics, time series) with unstructured data from EHRs, and conduct real-world evaluations.
Generalizability and N-gram Size. The n-gram size study demonstrated that bigrams outperform trigrams and four-grams for OSA/HF diagnosis, but this finding may not generalize to other patient populations or prediction tasks. We plan to systematically investigate how n-gram size affects performance across different comorbidity combinations (OSA with T2DM, HTN, AF) and prediction tasks (mortality, readmission). Additionally, we will explore lexicons using combinations of n-gram sizes rather than single n-gram types, and develop methods to automatically select optimal n-gram sizes based on task characteristics.
Interpretability and Clinical Adoption. While sparse models offer some interpretability through feature weights, understanding which specific clinical concepts drive predictions and how they interact remains challenging. We plan to develop explainable methods that improve interpretability of model outputs, including feature attribution techniques and visualization tools for clinicians, and conduct user studies with physicians to evaluate clinical acceptability and actionability of model predictions.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bdcc10030097/s1, Supplementary Materials S1: Complete set of International Classification of Diseases (ICD) codes used to identify patients with obstructive sleep apnea (OSA), type 2 diabetes mellitus (T2DM), hypertension (HTN), heart failure (HF), and atrial fibrillation (AF). Supplementary Materials S2: All clinician-selected seed terms associated with OSA, T2DM, HTN, HF, and AF.

Author Contributions

Data curation, formal analysis, and methodology, A.A., A.R., C.W., I.K., R.Z.-R., and A.D.; writing—original draft preparation, A.D.; writing—review and editing, S.C.; conceptualization, D.M., R.Z.-R., A.D., and S.C.; supervision, A.D. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Energy Research Scientific Computing Center, the U.S. Department of Energy, the Sustainable Research Pathways Program, and the Hood College Volpe Scholarship.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. This study used deidentified electronic health records from the MIMIC-IV database.

Data Availability Statement

The MIMIC-IV database is available at https://physionet.org/content/mimiciv/ (accessed on 15 December 2025) upon completion of required training and approval. The Supplementary Materials associated with this article are provided alongside the manuscript.

Acknowledgments

We are grateful to the physicians from the Veterans Affairs Million Veteran Program project MVP063, and in particular to the PI, Eilis Boudreau, for providing medical guidance.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
AF: Atrial Fibrillation
AHI: Apnea-Hypopnea Index
CV: Cardiovascular
EDS: Excessive Daytime Sleepiness
EHR: Electronic Health Records
FM: Foundation Model
HF: Heart Failure
HTN: Hypertension
ICD: International Classification of Diseases
ICU: Intensive Care Unit
LLM: Large Language Model
LR: Logistic Regression
MIMIC: Medical Information Mart for Intensive Care
NERSC: National Energy Research Scientific Computing Center
NLP: Natural Language Processing
OHS: Obesity Hypoventilation Syndrome
OSA: Obstructive Sleep Apnea
T2DM: Type 2 Diabetes Mellitus
wAUC: Weighted Area Under the Curve

References

  1. Sönmez, I.; Dupuy, A.V.; Kristina, S.Y.; Cronin, J.; Yee, J.; Azarbarzin, A. Unmasking obstructive sleep apnea: Estimated prevalence and impact in the United States. Respir. Med. 2025, 248, 108348. [Google Scholar] [CrossRef] [PubMed]
  2. Benjafield, A.V.; Ayas, N.T.; Eastwood, P.R.; Heinzer, R.; Ip, M.S.; Morrell, M.J.; Nunez, C.M.; Patel, S.R.; Penzel, T.; Pépin, J.L.; et al. Estimation of the global prevalence and burden of obstructive sleep apnoea: A literature-based analysis. Lancet Respir. Med. 2019, 7, 687–698. [Google Scholar] [CrossRef] [PubMed]
  3. Loke, Y.K.; Brown, J.W.L.; Kwok, C.S.; Niruban, A.; Myint, P.K. Association of obstructive sleep apnea with risk of serious cardiovascular events: A systematic review and meta-analysis. Circ. Cardiovasc. Qual. Outcomes 2012, 5, 720–728. [Google Scholar] [CrossRef] [PubMed]
  4. Seiler, A.; Camilo, M.; Korostovtseva, L.; Haynes, A.G.; Brill, A.K.; Horvath, T.; Egger, M.; Bassetti, C.L. Prevalence of sleep-disordered breathing after stroke and TIA: A meta-analysis. Neurology 2019, 92, e648–e654. [Google Scholar] [CrossRef]
  5. Sulit, L.; Storfer-Isser, A.; Kirchner, H.L.; Redline, S. Differences in polysomnography predictors for hypertension and impaired glucose tolerance. Sleep 2006, 29, 777–783. [Google Scholar] [CrossRef]
  6. Xia, W.; Huang, Y.; Peng, B.; Zhang, X.; Wu, Q.; Sang, Y.; Luo, Y.; Liu, X.; Chen, Q.; Tian, K. Relationship between obstructive sleep apnoea syndrome and essential hypertension: A dose-response meta-analysis. Sleep Med. 2018, 47, 11–18. [Google Scholar] [CrossRef] [PubMed]
  7. Olaithe, M.; Bucks, R.S.; Hillman, D.R.; Eastwood, P.R. Cognitive deficits in obstructive sleep apnea: Insights from a meta-review and comparison with deficits observed in COPD, insomnia, and sleep deprivation. Sleep Med. Rev. 2018, 38, 39–49. [Google Scholar] [CrossRef]
  8. Yaffe, K.; Laffan, A.M.; Harrison, S.L.; Redline, S.; Spira, A.P.; Ensrud, K.E.; Ancoli-Israel, S.; Stone, K.L. Sleep-Disordered Breathing, Hypoxia, and Risk of Mild Cognitive Impairment and Dementia in Older Women. JAMA 2011, 306, 613–619. [Google Scholar] [CrossRef]
  9. Chung, F.; Yegneswaran, B.; Liao, P.; Chung, S.A.; Vairavanathan, S.; Islam, S.; Khajehdehi, A.; Shapiro, C.M. STOP questionnaire: A tool to screen patients for obstructive sleep apnea. Anesthesiology 2008, 108, 812–821. [Google Scholar] [CrossRef]
  10. Marti-Soler, H.; Hirotsu, C.; Marques-Vidal, P.; Vollenweider, P.; Waeber, G.; Preisig, M.; Tafti, M.; Tufik, S.B.; Bittencourt, L.; Tufik, S.; et al. The NoSAS score for screening of sleep-disordered breathing: A derivation and validation study. Lancet Respir. Med. 2016, 4, 742–748. [Google Scholar] [CrossRef]
  11. Weiskopf, N.G.; Weng, C. Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research. J. Am. Med. Inform. Assoc. 2013, 20, 144–151. [Google Scholar] [CrossRef]
  12. Chiu, C.C.; Wu, C.M.; Chien, T.N.; Kao, L.J.; Li, C.; Chu, C.M. Integrating structured and unstructured EHR data for predicting mortality by machine learning and latent Dirichlet allocation method. Int. J. Environ. Res. Public Health 2023, 20, 4340. [Google Scholar] [CrossRef] [PubMed]
  13. Zhang, D.; Yin, C.; Zeng, J.; Yuan, X.; Zhang, P. Combining structured and unstructured data for predictive models: A deep learning approach. BMC Med. Inform. Decis. Mak. 2020, 20, 280. [Google Scholar] [CrossRef]
  14. Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 2018, 1, 18. [Google Scholar] [CrossRef] [PubMed]
  15. Fensore, C.; Carrillo-Larco, R.M.; Patel, S.A.; Morris, A.A.; Ho, J.C. Large Language Models for Integrating Social Determinant of Health Data: A Case Study on Heart Failure 30-Day Readmission Prediction. arXiv 2024, arXiv:2407.09688. [Google Scholar] [CrossRef]
  16. Park, S.; Wee, C.W.; Choi, S.H.; Kim, K.H.; Chang, J.S.; Yoon, H.I.; Lee, I.J.; Kim, Y.B.; Cho, J.; Keum, K.C.; et al. RT-Surv: Improving Mortality Prediction After Radiotherapy with Large Language Model Structuring of Large-Scale Unstructured Electronic Health Records. arXiv 2024, arXiv:2408.05074. [Google Scholar]
  17. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  18. Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. arXiv 2019, arXiv:1904.03323. [Google Scholar] [CrossRef]
  19. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  20. Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Costa, A.B.; Flores, M.G.; et al. A large language model for electronic health records. NPJ Digit. Med. 2022, 5, 194. [Google Scholar] [CrossRef]
  21. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
  22. Nori, H.; King, N.; McKinney, S.M.; Carignan, D.; Horvitz, E. Capabilities of GPT-4 on medical challenge problems. arXiv 2023, arXiv:2303.13375. [Google Scholar] [CrossRef]
  23. Ahltorp, M.; Skeppstedt, M.; Kitajima, S.; Henriksson, A.; Rzepka, R.; Araki, K. Expansion of medical vocabularies using distributional semantics on Japanese patient blogs. J. Biomed. Semant. 2016, 7, 58. [Google Scholar] [CrossRef] [PubMed]
  24. Fan, Y.; Pakhomov, S.; McEwan, R.; Zhao, W.; Lindemann, E.; Zhang, R. Using word embeddings to expand terminology of dietary supplements using clinical notes. JAMIA Open 2019, 2, 246–253. [Google Scholar] [CrossRef]
  25. Kugic, A.; Pfeifer, B.; Schulz, S.; Kreuzthaler, M. Embedding-based terminology expansion via secondary data sources. J. Biomed. Inform. 2023, 147, 104497. [Google Scholar] [CrossRef]
  26. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  27. Le Scao, T.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; Gallé, M.; et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv 2022, arXiv:2211.05100. [Google Scholar]
  28. Peng, Y.; Yan, S.; Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv 2019, arXiv:1906.05474. [Google Scholar] [CrossRef]
  29. Patel, S.B.; Lam, K. ChatGPT: The future of discharge summaries? Lancet Digit. Health 2023, 5, e107–e108. [Google Scholar] [CrossRef]
  30. Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings Bioinform. 2022, 23, bbac409. [Google Scholar] [CrossRef] [PubMed]
  31. Kumar, V.; Rajawat, P.S.; Ntoutsi, E. Mitigating Semantic Drift: Evaluating LLMs’ Efficacy in Psychotherapy through MI Dialogue Summarization Leveraging MITI Code. In Proceedings of the International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2025; pp. 1–8. [Google Scholar] [CrossRef]
  32. Yang, X.; Li, T.; Su, Q.; Liu, Y.; Kang, C.; Lyu, Y.; Zhao, L.; Nie, Y.; Pan, Y. Application of Large Language Models in Disease Diagnosis and Treatment. Chin. Med. J. 2025, 138, 130–142. [Google Scholar] [CrossRef] [PubMed]
  33. Zhang, K.; Meng, X.; Yan, X.; Ji, J.; Liu, J.; Xu, H.; Zhang, H.; Liu, D.; Wang, J.; Wang, X.; et al. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J. Med. Internet Res. 2025, 27, e59069. [Google Scholar] [CrossRef]
  34. Qadrud-Din, J.; Rabiou, A.B.; Walker, R.; Soni, R.; Gajek, M.; Pack, G.; Rangaraj, A. Transformer based language models for similar text retrieval and ranking. arXiv 2020, arXiv:2005.04588. [Google Scholar] [CrossRef]
  35. Chae, Y.; Davidson, T. Large Language Models for Text Classification: From Zero-Shot Learning to Fine-Tuning; Open Science Foundation: Charlottesville, VA, USA, 2023. [Google Scholar]
  36. Savelka, J.; Ashley, K.D. The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts. Front. Artif. Intell. 2023, 6, 1279794. [Google Scholar] [CrossRef]
  37. Si, Y.; Wang, J.; Xu, H.; Roberts, K. Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc. 2019, 26, 1297–1304. [Google Scholar] [CrossRef]
  38. Hernandez, B.; Stiff, O.; Ming, D.K.; Ho Quang, C.; Nguyen Lam, V.; Nguyen Minh, T.; Nguyen Van Vinh, C.; Nguyen Minh, N.; Nguyen Quang, H.; Phung Khanh, L.; et al. Learning meaningful latent space representations for patient risk stratification: Model development and validation for dengue and other acute febrile illness. Front. Digit. Health 2023, 5, 1057467. [Google Scholar] [CrossRef]
  39. Morrow, E.; Zamora-Resendiz, R.; Beckham, J.C.; Kimbrel, N.A.; McMahon, B.H.; Crivelli, S. Life events extraction from healthcare notes for veteran acute suicide risk prediction. J. Am. Med. Inform. Assoc. (JAMIA) 2026, ocaf197. [Google Scholar] [CrossRef] [PubMed]
  40. Boag, W.; Doss, D.; Naumann, T.; Szolovits, P. What’s in a note? Unpacking predictive value in clinical note representations. AMIA Summits Transl. Sci. Proc. 2018, 2018, 26. [Google Scholar]
  41. Huang, K.; Altosaar, J.; Ranganath, R. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
  42. Wu, J.; Ye, X.; Mou, C.; Dai, W. Fineehr: Refine clinical note representations to improve mortality prediction. In Proceedings of the 2023 11th International Symposium on Digital Forensics and Security (ISDFS); IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
  43. Ye, J.; Yao, L.; Shen, J.; Janarthanam, R.; Luo, Y. Predicting mortality in critically ill patients with diabetes using machine learning and clinical notes. BMC Med. Inform. Decis. Mak. 2020, 20, 295. [Google Scholar] [CrossRef]
  44. Ashfaq, A.; Sant’Anna, A.; Lingman, M.; Nowaczyk, S. Readmission prediction using deep learning on electronic health records. J. Biomed. Inform. 2019, 97, 103256. [Google Scholar] [CrossRef]
  45. Chen, P.F.; Chen, L.; Lin, Y.K.; Li, G.H.; Lai, F.; Lu, C.W.; Yang, C.Y.; Chen, K.C.; Tzu-Yu, L. Predicting postoperative mortality with deep neural networks and natural language processing: Model development and validation. JMIR Med. Inform. 2022, 10, e38241. [Google Scholar] [CrossRef]
  46. Jin, M.; Bahadori, M.T.; Colak, A.; Bhatia, P.; Celikkaya, B.; Bhakta, R.; Senthivel, S.; Khalilia, M.; Navarro, D.; Zhang, B.; et al. Improving hospital mortality prediction with medical named entities and multimodal learning. arXiv 2018, arXiv:1811.12276. [Google Scholar] [CrossRef]
  47. Khadanga, S.; Aggarwal, K.; Joty, S.; Srivastava, J. Using clinical notes with time series data for ICU management. arXiv 2019, arXiv:1909.09702. [Google Scholar]
  48. Parreco, J.; Hidalgo, A.; Kozol, R.; Namias, N.; Rattan, R. Predicting mortality in the surgical intensive care unit using artificial intelligence and natural language processing of physician documentation. Am. Surg. 2018, 84, 1190–1194. [Google Scholar] [CrossRef]
  49. Nazih, W.; Abuhmed, T.; Alharbi, M.; El-Sappagh, S. Mortality Prediction for ICU Patients with Mental Disorders Using Large Language Models Ensemble and Unstructured Medical Notes. PLoS ONE 2025, 20, e0332134. [Google Scholar] [CrossRef]
  50. Johnson, A.; Bulgarelli, L.; Pollard, T.; Horng, S.; Celi, L.A.; Mark, R. MIMIC-IV (version 2.2). PhysioNet 2023. [Google Scholar] [CrossRef]
  51. Ahmed, A.; Rispoli, A.; Wasieloski, C.; Khurram, I.; Zamora-Resendiz, R.; Morrow, D.; Dong, A.; Crivelli, S. Deep Phenotyping of Obstructive Sleep Apnea and Comorbidities with Large Language Models. In Proceedings of the AIME24, Salt Lake City, UT, USA, 9–12 July 2024. [Google Scholar]
  52. Keenan, B.T.; Kirchner, H.L.; Veatch, O.J.; Borthwick, K.M.; Davenport, V.A.; Feemster, J.C.; Gendy, M.; Gossard, T.R.; Pack, F.M.; Sirikulvadhana, L.; et al. Multisite validation of a simple electronic health record algorithm for identifying diagnosed obstructive sleep apnea. J. Clin. Sleep Med. 2020, 16, 175–183. [Google Scholar] [CrossRef] [PubMed]
  53. Cade, B.E.; Hassan, S.M.; Dashti, H.S.; Kiernan, M.; Pavlova, M.K.; Redline, S.; Karlson, E.W. Sleep apnea phenotyping and relationship to disease in a large clinical biobank. JAMIA Open 2022, 5, ooab117. [Google Scholar] [CrossRef] [PubMed]
Figure 1. A Step-by-step Process Flow.
Figure 2. Comparison of ROC-AUC Scores across Five Primary Readmission Windows.
Figure 3. LLM Performance with Different N-gram Models. wAUC scores improve as the number of n-grams increases, with the bigram-based model achieving higher performance than the trigram- and four-gram-based models.
Table 1. Summary of ICD Codes and Seed Terms.
Condition | ICD Codes, N | Seed Terms, N
OSA | 9 | 38
T2DM | 157 | 13
HTN | 68 | 10
HF | 61 | 15
AF | 16 | 12
Total | 311 | 88
Note: The full list of ICD codes is provided in Supplementary Materials S1, and the complete list of seed terms in Supplementary Materials S2.
Table 2. Demographic Data for Patients with OSA and Comorbidities.
Characteristic | OSA | T2DM | HTN | HF | AF
Patients, N | 13,942 | 21,666 | 74,080 | 21,076 | 25,743
Discharge Notes, N | 29,892 | 53,446 | 161,245 | 49,479 | 55,418
Women, N (%) | 5628 (40.4) | 10,088 (46.6) | 36,486 (49.3) | 9944 (47.2) | 11,215 (44.0)
White, N (%) | 9895 (71.0) | 13,802 (63.7) | 51,238 (69.2) | 15,269 (72.4) | 19,980 (77.6)
Black, N (%) | 1966 (14.1) | 3688 (17.0) | 9959 (13.4) | 2476 (11.7) | 1776 (6.9)
Other, N (%) | 2081 (14.9) | 4176 (19.3) | 12,883 (17.4) | 3331 (15.9) | 3967 (15.5)
Table 3. Patient Composition for Mortality Prediction: OSA, T2DM, and HTN.
Group | Patients (N) | Hospital Admissions (N)
OSA Only (w/o T2DM and HTN) | 6392 | 11,266
T2DM Only (w/o OSA and HTN) | 5561 | 9372
HTN Only (w/o OSA and T2DM) | 56,111 | 107,081
OSA & T2DM & HTN | 2828 | 6000
Other + | 23,809 | 49,432
Total | 81,096 * | 183,151
+ Includes patients with two of the three health conditions. * A patient with multiple admissions or discharge notes may appear in more than one of the categories listed above at different points in time.
Table 4. The Mortality Study Cohort.
Status | 6-Month Post-Discharge | 1-Year Post-Discharge
Alive | 66,813 | 61,623
Deceased | 9606 | 14,796
Table 5. Mortality Prediction Results.
Approach | 6-Month AUC | Total Trigrams | Non-Zero, N (%) | 1-Year AUC | Total Trigrams | Non-Zero, N (%)
Top-ngram | 0.899 | 200,000 | 199,936 (99.97) | 0.871 | 200,000 | 199,939 (99.97)
GatorTron Medium | 0.844 | 206,858 | 163,867 (79.22) | 0.817 | 206,858 | 164,123 (79.34)
BlueBERT | 0.821 | 204,003 | 159,980 (78.42) | 0.803 | 204,003 | 159,763 (78.31)
AUC: ROC-AUC.
Table 6. Additional Performance Metrics for Mortality Prediction.
Outcome | Approach | Accuracy | Precision | Recall | Specificity | F1-Score
6-month Post-discharge | Top-ngram | 0.881 ± 0.003 | 0.520 ± 0.007 | 0.696 ± 0.010 | 0.907 ± 0.003 | 0.595 ± 0.007
6-month Post-discharge | GatorTron | 0.846 ± 0.002 | 0.421 ± 0.006 | 0.611 ± 0.013 | 0.879 ± 0.002 | 0.499 ± 0.007
6-month Post-discharge | BlueBERT | 0.834 ± 0.003 | 0.390 ± 0.008 | 0.574 ± 0.012 | 0.871 ± 0.003 | 0.465 ± 0.008
1-year Post-discharge | Top-ngram | 0.808 ± 0.003 | 0.502 ± 0.008 | 0.760 ± 0.008 | 0.819 ± 0.004 | 0.605 ± 0.007
1-year Post-discharge | GatorTron | 0.786 ± 0.004 | 0.463 ± 0.008 | 0.656 ± 0.009 | 0.817 ± 0.004 | 0.543 ± 0.008
1-year Post-discharge | BlueBERT | 0.776 ± 0.003 | 0.446 ± 0.008 | 0.639 ± 0.009 | 0.809 ± 0.003 | 0.525 ± 0.009
Table 7. Patient Composition for Hospital Readmission: OSA and AF.
Group | Patients (N) | Discharge Notes (N)
OSA Only (w/o AF) | 11,287 | 22,698
AF Only (w/o OSA) | 23,539 | 48,224
OSA & AF | 3405 | 7194
Total | 38,231 | 78,116
Table 8. The Readmission Study Cohort.
Readmitted | Not Readmitted | Total
42,124 | 25,037 | 67,161
Table 9. Readmission Prediction Results.
Approach | AUC | Total Four-Grams, N | Non-Zero, N (%)
Top-ngram | 0.736 | 200,000 | 199,629 (99.81)
BlueBERT | 0.729 | 208,455 | 150,376 (72.13)
AUC: ROC-AUC.
Table 10. Additional Performance Metrics for Readmission Prediction. ROC-AUC values are reported in Table 9.
Approach | Accuracy | Precision | Recall | Specificity | F1-Score
Top-ngram | 0.684 ± 0.004 | 0.763 ± 0.006 | 0.719 ± 0.006 | 0.624 ± 0.008 | 0.741 ± 0.004
BlueBERT | 0.673 ± 0.003 | 0.767 ± 0.005 | 0.687 ± 0.006 | 0.649 ± 0.007 | 0.725 ± 0.004
Table 11. Patient Cohort for OSA and HF.
Diagnosis Label | Discharge Notes (N)
OSA Only (w/o HF) | 21,929
HF Only (w/o OSA) | 41,516
OSA & HF | 7963
Total | 71,408
Table 12. Comprehensive Performance Metrics for OSA/HF Diagnosis Across N-gram Types.
N-Gram | wAUC a | mAUC b | Accuracy | Precision c | Recall c | F1 c | Precision d | Recall d | F1 d
Bigrams | 0.863 ± 0.004 | 0.836 ± 0.004 | 0.751 ± 0.004 | 0.645 ± 0.008 | 0.644 ± 0.005 | 0.642 ± 0.006 | 0.746 ± 0.005 | 0.751 ± 0.004 | 0.747 ± 0.004
Trigrams | 0.820 ± 0.005 | 0.793 ± 0.005 | 0.699 ± 0.007 | 0.579 ± 0.010 | 0.582 ± 0.007 | 0.579 ± 0.008 | 0.695 ± 0.007 | 0.699 ± 0.007 | 0.696 ± 0.007
Four-grams | 0.803 ± 0.005 | 0.775 ± 0.005 | 0.688 ± 0.006 | 0.566 ± 0.006 | 0.565 ± 0.004 | 0.563 ± 0.005 | 0.680 ± 0.008 | 0.688 ± 0.006 | 0.683 ± 0.007
a wAUC: weighted AUC. b mAUC: macro AUC. c Macro-averaged metrics. d Weighted metrics (account for class imbalance).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ahmed, A.; Rispoli, A.; Wasieloski, C.; Khurram, I.; Zamora-Resendiz, R.; Morrow, D.; Dong, A.; Crivelli, S. Predicting Mortality and Readmission in Obstructive Sleep Apnea via LLM-Expanded Clinical Concepts. Big Data Cogn. Comput. 2026, 10, 97. https://doi.org/10.3390/bdcc10030097

