A Bayesian Survival Analysis on Long COVID and Non-Long COVID Patients: A Cohort Study Using National COVID Cohort Collaborative (N3C) Data

Jiang, Sihang; Loomba, Johanna; Zhou, Andrea; Sharma, Suchetha; Sengupta, Saurav; Liu, Jiebei; Brown, Donald; on behalf of N3C Consortium,

doi:10.3390/bioengineering12050496


by Sihang Jiang ¹,*, Johanna Loomba ², Andrea Zhou ², Suchetha Sharma ³, Saurav Sengupta ³, Jiebei Liu ¹, Donald Brown ³ and on behalf of N3C Consortium †

¹ School of Engineering and Applied Science, University of Virginia, Charlottesville, VA 22903, USA
² Integrated Translational Health Research Institute of Virginia (iTHRIV), University of Virginia, Charlottesville, VA 22903, USA
³ School of Data Science, University of Virginia, Charlottesville, VA 22903, USA
* Author to whom correspondence should be addressed.
† Membership of the N3C Consortium is provided in the Acknowledgments.

Bioengineering 2025, 12(5), 496; https://doi.org/10.3390/bioengineering12050496

Submission received: 9 April 2025 / Revised: 5 May 2025 / Accepted: 6 May 2025 / Published: 7 May 2025

(This article belongs to the Section Biosignal Processing)


Abstract

Since the outbreak of the COVID-19 pandemic in 2020, numerous studies have focused on the long-term effects of COVID infection. On 1 October 2021, the Centers for Disease Control and Prevention (CDC) implemented a new code in the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) for reporting ‘Post COVID-19 condition, unspecified (U09.9)’. This change indicated that the CDC recognized Long COVID as a real illness with associated chronic conditions. The National COVID Cohort Collaborative (N3C) provides researchers with abundant electronic health record (EHR) data by harmonizing EHR data across more than 80 clinical organizations in the United States. This paper describes the creation of a COVID-positive N3C cohort balanced by the presence or absence of Long COVID (U09.9) and evaluates whether documented Long COVID (U09.9) is associated with decreased survival length.

Keywords: Bayesian survival analysis; log-normal model; Markov chain Monte Carlo; long COVID; N3C

1. Introduction

The COVID-19 pandemic, which began in 2020, has affected populations worldwide, and it is now important to focus on the long-term effects of COVID-19. According to the Centers for Disease Control and Prevention (CDC), Long COVID is broadly defined as signs, symptoms, and conditions that continue or develop after acute COVID-19 infection [1]. Some conditions can last weeks, months, or years. The diagnosis code (U09.9) for Long COVID, implemented by the CDC in October 2021 [2], makes it much easier to identify patients whom clinicians believe are suffering from this illness.

With the stewardship of the National Center for Advancing Translational Sciences (NCATS) and data contributions from more than 80 institutions, the National COVID Cohort Collaborative (N3C) [3] is one of the largest publicly accessible collections of clinical data related to COVID-19 patients in the United States, including more than 8 million COVID-positive patients and more than 30 billion rows of electronic health record (EHR) data for cohort studies. Since the implementation of the Long COVID diagnosis code (U09.9), N3C researchers have focused on risk factors and features of Long COVID using machine learning methods [4,5,6,7,8]. However, there is limited research and knowledge related to the mortality risk of Long COVID within the confirmed COVID-19 population. This paper focuses on the survival trends of Long COVID (U09.9) patients and a matched cohort of COVID-positive control patients, and it analyzes factors influencing survival length under a Bayesian framework for parameter estimation. It addresses a critical gap in current Long COVID research by quantifying survival trends with probabilistic methods: it provides empirical estimates of the probability distribution of survival durations among N3C patients with and without a Long COVID (U09.9) diagnosis, and it offers a foundational statistical framework for understanding the characteristics of Long COVID, especially the mortality risk.

2. Related Work

We previously developed Bayesian approaches to our survival analysis and also build on prior work related to risk factors for both mortality and Long COVID.

Bayesian survival analysis is a statistical approach focusing on time-to-event data in various fields such as public health and epidemiology, incorporating prior information and uncertainty under a Bayesian risk analysis framework [9]. Common methods in Bayesian survival analysis include parametric and semi-parametric models, proportional and non-proportional hazards models, frailty models, and cure rate models. Bayesian methods offer flexibility in choosing model structures and make uncertainty quantification convenient, and Markov chain Monte Carlo (MCMC) algorithms provide practical tools for sampling from the posterior and predictive distributions.

Various studies on the survival of COVID patients have been conducted since the pandemic. A study in Brazilian patients with COVID showed that old age and cardiovascular disease are associated with higher mortality [10], and a study in patients with COVID admitted to the intensive care unit showed heterogeneity in survival [11]. Another study on COVID and kidney diseases reported that acute kidney disease and chronic kidney disease are associated with a higher risk of death, and that COVID might lead to chronic kidney disease in survivors [12]. The Charlson Comorbidity Index (CCI) [13] is a weighted sum of common chronic conditions (including kidney and heart disease) that are predictive of in-hospital and post-discharge mortality. In our analysis, we use the CCI score as part of our matching criteria to account for the impact of chronic comorbidities on survival.

According to the CDC [1], Long COVID is a disease that can result in chronic conditions that include respiratory and heart symptoms, neurological symptoms, digestive symptoms, etc. For some people, these symptoms can last for weeks, months, or even years. A 2022 study applied a variety of modeling approaches to identify risk factors for Long COVID [4]. Their models consistently showed that middle age (40 to 59 years) and hospitalization at the time of COVID-19 are associated with a higher likelihood of Long COVID diagnosis. In our survival analysis of Long COVID patients and matched controls, these risk factors of Long COVID are used both in the matching process while creating the cohort (e.g., age) and as features in the model (visit severity at the time of COVID).

In contrast to studies on risk factors of contracting Long COVID [4], this paper applies a Bayesian survival analysis approach to individuals diagnosed with Long COVID and other COVID patients, and it identifies the features associated with mortality risks in the Long COVID population. In addition, this paper investigates the distribution of survival lengths of COVID-positive Long COVID patients and COVID-positive control patients; and thus, it provides more comprehensive insights into the survival patterns.

3. Methodology

3.1. Dataset

We used patient records from N3C that span 1 January 2018 to 14 November 2024 (v186 release). N3C sites’ data were included in the analysis if and only if the site contributed data in 2024. Because data contribution dates still varied between sites, we used matching to control for differences in observation periods between sites.

3.2. Cohort Identification and Feature Development

In this study, the cohort consists of a group of COVID patients diagnosed with Long COVID (U09.9) and a matched cohort of COVID patients without this diagnosis. Patients of both groups have either a COVID-19 diagnosis (U07.1), at least one positive result on a SARS-CoV-2 polymerase chain reaction (PCR) or antigen (AG) test, or both. The earliest date of either is used as their COVID index date. All features used in this analysis are derived from the source data using the N3C Logic Liaison Templates [14]. The two-step matching approach [15] described below creates a balanced cohort by assigning each COVID-positive Long COVID patient to a COVID-positive control. In addition to carefully constructing our matched cohort, we also took steps to minimize attrition and immortal time biases as described below.

3.2.1. COVID-Positive Long COVID Group Inclusion Criteria

COVID-19-positive patient, as defined by a positive SARS-CoV-2 PCR/AG test or a recorded U07.1 diagnosis. The earliest date of either is their COVID index date.
Patient was 18 years old or older at the time of their COVID index date.
Patient had a U09.9 code in their available health records. The first date of U09.9 was after their COVID index date and after 1 October 2021 (the date U09.9 was published for use).

3.2.2. COVID-Positive Control Group Inclusion Criteria (No Documented Long COVID)

COVID-19-positive patient, as defined by a positive SARS-CoV-2 PCR/AG test or a recorded U07.1 diagnosis. The earliest date of either is their COVID index date.
Patient was 18 years old or older at the time of their COVID index date.
Patient had no U09.9 code in their available health records.
Patient was selected as a matched control based on the matching process described below. This ensured a balance of cases and controls with regard to age, COVID era, patient engagement, and healthcare system (site).

3.2.3. Matching Process

In cohort studies, the sample is often not representative of the population at large and, thus, can introduce selection bias [16]. We know this to be true in observational health data where individuals pursuing healthcare are by definition sicker than the population at large. Similarly, patients who more frequently engage with the healthcare institutions are by default more likely to have a death recorded in the electronic health record. In addition, immortal time bias [17] should be taken into consideration in survival analysis. Some of the research participants are ‘immortal’ during a certain time period in that they must survive long enough to receive the intervention being studied. To obtain treated and control groups with similar covariate distributions, choosing well-matched samples of the original treated and control groups is a common and powerful technique to reduce bias due to the covariates. Various matching methods [18] have been applied while using observational data, and below we describe how the two-step matching process generates a balanced cohort while reducing selection bias and immortal time bias.

The first step is a 1-to-5 matching process without replacement using matching variables including patients’ age, health system site id, COVID index date, number of clinical visits prior to COVID, and Charlson comorbidity index (CCI) score [13], resulting in 1 Long COVID patient being matched with 5 COVID positive controls, and satisfying the following conditions:

The case and control are from the same health system, minimizing selection bias related to health system population and care practices.
The age difference between the case and control is less than or equal to 10 years, minimizing mortality and Long COVID diagnosis biases related to age.
The difference in the COVID index date between the case and control is less than or equal to 45 days, minimizing the bias related to the different phases of the pandemic. This also helps ensure similar post-COVID potential observation periods in our survival analysis.
The difference in the log of the pre-index number of visits between the case and control must be less than or equal to 1. This ensures a similar level of engagement with the health system, which can both be a proxy for acuity as well as predictive of the capture of mortality events.
The difference in the log of CCI score must be less than or equal to 0.5 where the CCI score is based on comorbidities recorded before and up through the COVID index date. This helps balance pre-existing mortality risk between the cases and controls.

The second step is to select 1 final COVID-positive control patient out of the 5 matched COVID-positive control patients for each Long COVID patient. In each matching group, we select the control candidate with a documented visit that is closest to the Long COVID patient’s diagnosis date. The visit difference must be no more than 30 days. We therefore obtain a 1-to-1 cohort of Long COVID patients and matched COVID-positive control patients with similar attributes and observed survival through the Long COVID patient’s diagnosis date.
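The two-step matching rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the N3C implementation: the dictionary field names (`site_id`, `age`, `covid_index`, `pre_index_visits`, `cci`, `u099_date`, `visit_dates`) are hypothetical, and the use of `log1p` (log of 1 plus the value) to keep the log defined when a count or CCI score is zero is our assumption, since the text does not say how zeros are handled.

```python
import math
from datetime import date


def is_candidate_control(case, control):
    """Step 1: caliper checks between a Long COVID case and a candidate control."""
    return (
        case["site_id"] == control["site_id"]                                # same health system
        and abs(case["age"] - control["age"]) <= 10                          # age within 10 years
        and abs((case["covid_index"] - control["covid_index"]).days) <= 45   # index within 45 days
        # ASSUMPTION: log1p instead of log so that zero visits / CCI = 0 are defined.
        and abs(math.log1p(case["pre_index_visits"])
                - math.log1p(control["pre_index_visits"])) <= 1
        and abs(math.log1p(case["cci"]) - math.log1p(control["cci"])) <= 0.5
    )


def pick_final_control(case, candidates):
    """Step 2: keep the candidate whose visit is closest to the case's U09.9 date.

    Returns None when no candidate has a visit within 30 days.
    """
    best, best_gap = None, None
    for cand in candidates:
        gaps = [abs((v - case["u099_date"]).days) for v in cand["visit_dates"]]
        if not gaps:
            continue
        gap = min(gaps)
        if gap <= 30 and (best_gap is None or gap < best_gap):
            best, best_gap = cand, gap
    return best
```

In this sketch, step 1 would be applied to every case and control pair to collect up to five candidates without replacement, and step 2 then reduces each group to a single control.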

3.2.4. Survival Timeline Considerations

Although the matching process helps mitigate bias in multiple ways related to patient selection, we also had to be very careful when selecting start and end dates for survival timelines.

According to the CDC, Long COVID can first be identified several weeks after COVID infection [1]. In this analysis of Long COVID patients and COVID-positive control patients, if the COVID index date were chosen as day 0 of the survival analysis, then the fact that Long COVID patients have to survive long enough to receive a U09.9 code would introduce immortal time bias. Thus, we use the U09.9 code date of the Long COVID case as day 0 of the survival timeline for both the case and the matched control. We drop a matched pair if either patient in the pair has a death date earlier than the U09.9 code date.

We also recognize that the capture of death data in any electronic health record data is incomplete. In order to enhance our capture of this outcome, we leveraged augmented death data that are made available in N3C through Privacy Preserving Record Linkage (PPRL) [19]. External death dates found in obituaries and government records are linked to N3C patient records using these privacy-preserving techniques. Thus, our N3C death data are a combination of death records as recorded by sites in the electronic health record and harmonized to OMOP [20] and PPRL death records [19]. When more than one death date is reported, the earliest reasonable date is used. Death dates must be within our study period, which ended 14 November 2024 (date of the data extraction used).

The outcome variable, survival length, is defined as the time difference between each patient’s day 0 of survival (U09.9 diagnosis date used for each matched pair) and the patient’s death date or the end of the study period.
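As a small illustration of this definition, the sketch below derives the survival length and a right-censoring indicator for one patient. The function name and the 1/0 event coding are assumptions made for the example; only the study end date of 14 November 2024 and the rule about deaths before day 0 come from the text.

```python
from datetime import date

STUDY_END = date(2024, 11, 14)  # end of the study period stated in the text


def survival_record(day0, death_date, study_end=STUDY_END):
    """Return (survival_length_in_days, event) for one patient.

    event = 1 if a death is observed within the study period,
    event = 0 if the patient is right-censored at the end of the study.
    """
    if death_date is not None and death_date < day0:
        # Such a patient's matched pair is dropped before modeling.
        raise ValueError("death date precedes day 0")
    if death_date is not None and death_date <= study_end:
        return (death_date - day0).days, 1
    return (study_end - day0).days, 0
```

For example, a patient with day 0 on 1 January 2022 who died on 31 January 2022 contributes a length of 30 days with the event observed, whereas a patient with no recorded death contributes the time from day 0 to the end of the study as a censored observation.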

3.3. Survival Analysis

Survival analysis focuses on time-to-event data. Following the notation in [9], we introduce several important variables and functions in survival analysis under the Bayesian paradigm. Let $T$ be a continuous non-negative random variable representing the survival times of individuals in a population, defined over the interval $[0, \infty)$. Let $f(t)$ denote the probability density function (pdf) of $T$. The cumulative distribution function (cdf) of $T$ is

$$F(t) = P(T \le t) = \int_{0}^{t} f(u)\,du \tag{1}$$

and the survivor function, which describes the probability of surviving beyond time $t$, is

$$S(t) = 1 - F(t) = P(T > t). \tag{2}$$

The hazard function $h(t)$, which is the instantaneous rate of failure at time $t$, is defined as

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t} = \frac{f(t)}{S(t)}. \tag{3}$$
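The second equality in Equation (3) follows from the definition of conditional probability, and the same step links the hazard function directly to the survivor function. A brief derivation, using only the quantities defined above:

```latex
\begin{aligned}
h(t) &= \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t)}{\Delta t \, P(T > t)}
      = \frac{1}{S(t)} \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t}
      = \frac{f(t)}{S(t)}, \\
h(t) &= -\frac{d}{dt} \log S(t)
\quad \Longrightarrow \quad
S(t) = \exp\left( -\int_{0}^{t} h(u)\,du \right),
\end{aligned}
```

where the last line uses $S'(t) = -f(t)$ and $S(0) = 1$.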

Censoring is common in survival data, and it occurs when incomplete information is available about the survival time of some individuals [21]. An observation is said to be right-censored at $c$ if the exact value of the observation is not known but only that it is greater than or equal to $c$; an observation is said to be left-censored at $c$ if it is known only that the observation is less than or equal to $c$; an observation is said to be interval-censored if it is known only that the observation is in the interval $(c_{1}, c_{2})$. Type-I censoring and Type-II censoring [22,23] are commonly used in different parametric models as well as in survival analysis.

3.4. Bayesian Parametric Models

Survival analysis [9] offers a wide range of parametric distributions, such as exponential, Weibull, and log-normal distributions, each suited to modeling different hazard rate patterns and survival time characteristics. The log-normal distribution is widely used in survival analysis [24,25] because its hazard function is inherently non-monotonic, rising to a peak before declining. Unlike simpler models such as the exponential distribution, which assumes a constant hazard rate, or the Weibull distribution, which assumes a strictly monotonic (increasing or decreasing) hazard, the log-normal model accommodates more complex risk dynamics over time. This flexibility is particularly important given the limited existing research on the survival characteristics of patients with Long COVID (U09.9).
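The contrast between a constant hazard and the rise-then-fall shape of the log-normal hazard can be checked numerically. The sketch below evaluates $h(t) = f(t)/S(t)$ for a log-normal distribution using only the Python standard library; the parameter values ($\mu = 0$, $\sigma = 1$) and the time grid are arbitrary illustrative choices, not estimates from this study.

```python
import math


def lognormal_hazard(t, mu, sigma):
    """Hazard h(t) = f(t) / S(t) of a log-normal(mu, sigma) survival time."""
    z = (math.log(t) - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))
    surv = 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - Phi(z)
    return pdf / surv


def exponential_hazard(t, rate):
    """Hazard of an exponential survival time: constant in t."""
    return rate


# Illustrative grid and parameters (assumptions for the example only).
grid = [0.1, 0.5, 1.0, 2.0, 5.0, 20.0]
hazards = [lognormal_hazard(t, mu=0.0, sigma=1.0) for t in grid]
peak = hazards.index(max(hazards))
```

On this grid the log-normal hazard peaks at an interior point and declines afterwards, while `exponential_hazard` returns the same value at every time point.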

Under the Bayesian paradigm [9], given unknown parameters, we first specify a prior distribution and then combine it with the likelihood function to obtain the posterior distribution of the parameters. Markov chain Monte Carlo (MCMC) methods, such as Gibbs sampling, the Metropolis–Hastings algorithm, and Hamiltonian Monte Carlo [26,27], are widely used for sampling from complicated distributions. PyMC [28] is a Python module that allows users to implement Bayesian statistical models with different parameters, prior distributions, and likelihood functions and to compute numerical posterior estimates of the parameters. In this work, PyMC 5.3.0 is used in a Python 3.8 environment.

Suppose we have independent and identically distributed (i.i.d.) data for survival time $y = (y_{1}, \ldots, y_{n})^{T}$ with a right-censoring indicator $\nu = (\nu_{1}, \ldots, \nu_{n})^{T}$, where $\nu_{i} = 1$ if $y_{i}$ is an observed failure time and $\nu_{i} = 0$ if $y_{i}$ is right-censored. Let $D = (n, y, \nu)$ denote the observed data; we consider the following models.

Log-Normal Model

The log-normal model is a two-parameter model. The survival time $y_{i}$ has a log-normal distribution defined on $(0, +\infty)$ with density function, mean, and variance

$$\begin{aligned} f(y_{i} \mid \mu, \sigma) &= (2\pi)^{-\frac{1}{2}} (y_{i}\sigma)^{-1} \exp\left(-\frac{1}{2\sigma^{2}} (\log y_{i} - \mu)^{2}\right) \\ E(y_{i}) &= \exp\left(\mu + \frac{\sigma^{2}}{2}\right) \\ \mathrm{Var}(y_{i}) &= \left[\exp(\sigma^{2}) - 1\right] \exp(2\mu + \sigma^{2}) \end{aligned} \tag{4}$$

and survival function

$$S(y_{i} \mid \mu, \sigma) = 1 - \Phi\left(\frac{\log y_{i} - \mu}{\sigma}\right) \tag{5}$$

with $\Phi(\cdot)$ representing the cdf of the standard normal distribution. The likelihood is

$$\begin{aligned} L(\mu, \sigma \mid D) &= \prod_{i=1}^{n} f(y_{i} \mid \mu, \sigma)^{\nu_{i}} \, S(y_{i} \mid \mu, \sigma)^{1 - \nu_{i}} \\ &= (2\pi\sigma^{2})^{-\frac{d}{2}} \exp\left(-\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \nu_{i} (\log y_{i} - \mu)^{2}\right) \\ &\quad \times \prod_{i=1}^{n} y_{i}^{-\nu_{i}} \left(1 - \Phi\left(\frac{\log y_{i} - \mu}{\sigma}\right)\right)^{1 - \nu_{i}} \end{aligned} \tag{6}$$

where $d = \sum_{i=1}^{n} \nu_{i}$ is the number of observed failure times.
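To make the censored likelihood concrete, the following sketch evaluates its logarithm term by term: observed failures contribute the log density from Equation (4), and right-censored observations contribute the log survival function from Equation (5). The function names are our own, and the code uses only the Python standard library; it is meant as a numerical illustration rather than the implementation used in this study.

```python
import math


def std_normal_sf(z):
    """Upper tail 1 - Phi(z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def lognormal_censored_loglik(y, nu, mu, sigma):
    """Log of the right-censored log-normal likelihood.

    y  : survival or censoring times (all > 0)
    nu : 1 for an observed failure time, 0 for a right-censored time
    """
    total = 0.0
    for yi, vi in zip(y, nu):
        z = (math.log(yi) - mu) / sigma
        if vi == 1:
            # log f(y_i | mu, sigma)
            total += -0.5 * math.log(2.0 * math.pi) - math.log(yi * sigma) - 0.5 * z * z
        else:
            # log S(y_i | mu, sigma)
            total += math.log(std_normal_sf(z))
    return total
```

With $\mu = 0$ and $\sigma = 1$, a single observed failure at $y = 1$ contributes $-\tfrac{1}{2}\log(2\pi)$, and a single observation censored at $y = 1$ contributes $\log(0.5)$.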

Let $\tau = 1/\sigma^{2}$. A common prior distribution $p(\mu, \tau)$ assumes a normal distribution on $\mu$ and a gamma distribution on $\tau$. The posterior distribution is given by

$$p(\mu, \tau \mid D) \propto p(\mu, \tau) \, L(\mu, \tau \mid D). \tag{7}$$

To build a regression model, we introduce covariates through $\mu$ and write $\mu_{i} = x_{i}^{T} \beta$. Common prior distributions of $\beta$ include a uniform improper prior and a normal prior.

4. Results

4.1. Cohort Summary

Using the v186 release tables (released on 14 November 2024) in N3C and sites with data contribution in 2024, there are 7,223,164 confirmed COVID-positive patients, of whom only 74,873 have the U09.9 diagnosis code. Cohort characteristics before and after the matching process are both included below. Table 1 and Table 2 are the attrition tables before the matching process, and Table 3 and Table 4 show characteristics of patients before the matching process and in the final cohort.

After the matching process, the final cohort has 94,874 patients in total, with 47,437 Long COVID (U09.9) patients and 47,437 COVID-positive control patients.

4.2. Modeling

Following the notation in the methodology section, $y_{i}$ is the survival time of patient $i$ in the cohort, with the distribution

$$\begin{aligned} y_{i} &\sim \mathrm{LogNormal}(\mu_{i}, \sigma^{2}) \\ \tau &= \frac{1}{\sigma^{2}} \end{aligned} \tag{8}$$

where

$$\mu_{i} = \beta_{0} + \beta_{1} x_{i1} + \beta_{2} x_{i2} + \beta_{3} x_{i3} \tag{9}$$

and $x_{i1}$, $x_{i2}$, $x_{i3}$ represent three features in the model: whether the patient has Long COVID (U09.9), whether the patient has obesity (BMI greater than or equal to 30 [29]), and whether the patient has mild severity at the time of COVID (mild COVID is defined as no emergency department visit or hospitalization around the COVID index date). The prior distributions of the parameters are as follows:

$$\begin{aligned} \beta_{0} &\sim N(0, 2) \\ \beta_{1} &\sim N(0, 2) \\ \beta_{2} &\sim N(0, 2) \\ \beta_{3} &\sim N(0, 2) \\ \tau &\sim \mathrm{Gamma}(1, 0.5) \end{aligned} \tag{10}$$
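For readers who want to see how the pieces of Equations (8) to (10) combine, the sketch below writes out the unnormalized log posterior and samples from it with a plain random-walk Metropolis algorithm. This is a standard-library stand-in for illustration only; it is not the PyMC code used in this study. Two readings are assumptions on our part: $N(0, 2)$ is taken to mean a standard deviation of 2, and $\mathrm{Gamma}(1, 0.5)$ is taken to mean shape 1 and rate 0.5. The function names, step size, and starting values are also hypothetical.

```python
import math
import random


def log_posterior(beta, tau, rows):
    """Unnormalized log posterior for the log-normal regression model.

    rows: iterable of (y, event, x1, x2, x3); event = 1 observed, 0 right-censored.
    """
    if tau <= 0:
        return -math.inf
    sigma = 1.0 / math.sqrt(tau)
    # ASSUMPTION: N(0, 2) read as sd = 2; Gamma(1, 0.5) read as shape = 1, rate = 0.5.
    lp = sum(-0.5 * (b / 2.0) ** 2 for b in beta)
    lp += -0.5 * tau  # log Gamma(shape=1, rate=0.5) density, up to a constant
    for y, event, x1, x2, x3 in rows:
        mu = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3
        z = (math.log(y) - mu) / sigma
        if event:
            lp += -math.log(y * sigma) - 0.5 * z * z       # log density, up to a constant
        else:
            surv = 0.5 * math.erfc(z / math.sqrt(2.0))     # 1 - Phi(z)
            lp += math.log(max(surv, 1e-300))              # guard against log(0)
    return lp


def metropolis(rows, n_draws=2000, step=0.05, seed=0):
    """Random-walk Metropolis over (beta0, beta1, beta2, beta3, tau)."""
    rng = random.Random(seed)
    state = [0.0, 0.0, 0.0, 0.0, 1.0]
    current = log_posterior(state[:4], state[4], rows)
    draws = []
    for _ in range(n_draws):
        proposal = [s + rng.gauss(0.0, step) for s in state]
        cand = log_posterior(proposal[:4], proposal[4], rows)
        # Accept with probability min(1, exp(cand - current)).
        if rng.random() < math.exp(min(0.0, cand - current)):
            state, current = proposal, cand
        draws.append(list(state))
    return draws
```

A gradient-based sampler such as the one PyMC selects by default would generally mix far better than this random walk on a cohort of this size; the sketch is intended only to show which terms enter the posterior.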

4.3. Parameter Estimation

Given the prior distributions and likelihood specified above, the posterior of the parameters was numerically estimated using Markov chain Monte Carlo (MCMC) in PyMC [28]. Table 5 shows the posterior mean, standard deviation, and 95% highest density interval (HDI) of the parameters, and Figure 1 shows the density plots and trace plots of the parameters in the MCMC sampling process. In the trace plots, all chains follow a similar pattern and occupy a similar region of the parameter space, and the lines look like fuzzy noise with no long flat areas or trends. The density plots are smooth and unimodal with a clear shape of the posterior, and each chain contributes similarly to the overall distribution. As a result, Figure 1 indicates that these samples show adequate mixing in the posterior estimation.

5. Discussion

According to the posterior estimates of the parameters, patients with the Long COVID (U09.9) indicator are more likely to have shorter survival lengths at the 95% credible level, patients with mild visit severity at the time of COVID are more likely to have longer survival lengths at the 95% credible level, and the obesity indicator is not significant at the 95% credible level with respect to survival length.

To diagnose MCMC convergence [30], several statistical and visual tools are commonly used. The Gelman–Rubin diagnostic compares variance within and between chains, with values close to 1.0 indicating convergence. The effective sample size (ESS) assesses the number of effectively independent draws, with a higher ESS suggesting better mixing. Other available tools include spectral density-based methods, the Raftery–Lewis diagnostic, kernel density-based methods, and autocorrelation plots. Figure 1 shows visual evidence for adequate mixing in the posterior estimation, and more quantitative MCMC diagnostic tools could be explored in PyMC.
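As an example of such a quantitative check, the classic Gelman–Rubin statistic can be computed from the within-chain and between-chain variances in a few lines. The sketch below implements the original (non-split) form of the statistic with the Python standard library and assumes chains of equal length; current software typically reports refined variants, so values may differ slightly from library output.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic Gelman-Rubin statistic for one parameter.

    chains: list of equal-length lists of posterior draws, one list per chain.
    """
    n = len(chains[0])
    within = mean([variance(c) for c in chains])        # W: mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return math.sqrt(pooled / within)
```

Chains that explore the same region give a value near 1, whereas chains stuck in different regions give a value well above 1. For instance, `gelman_rubin([[0, 1, 2, 3], [10, 11, 12, 13]])` is far greater than 1.1 because the two chain means differ by much more than the within-chain spread.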

The Kaplan–Meier estimator is a non-parametric method to estimate the survival probability from lifetime data [31]. Figure 2 shows the Kaplan–Meier curves of the two groups in the cohort. The survival probability of patients in both groups decreases over time, and the Long COVID group has a lower survival probability than the COVID-positive control group from day 0 of survival through the maximum observed survival length. The significance of the difference between the two survival curves could be further examined using the log-rank test [32].
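The product-limit calculation behind a Kaplan–Meier curve is short enough to write out directly. The following sketch is a minimal standard-library version for illustration; the function name and the toy inputs in the note below are our own and are unrelated to the cohort data.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  : observed survival or censoring times
    events : 1 if the time is an observed death, 0 if right-censored
    Returns a list of (time, survival_probability) at each distinct death time.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at time t
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve
```

With times `[1, 2, 2, 3]` and events `[1, 1, 0, 1]`, the estimate steps down to 0.75 at time 1, to 0.5 at time 2 (one death among three at risk), and to 0 at time 3. Applying the function separately to the Long COVID and control groups would give the two curves being compared.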

Since the CDC issued the U09.9 code in October 2021, the maximum possible survival length of a Long COVID patient is less than 1000 days. According to Figure 3, the distribution of survival length is left-skewed. This paper uses a parametric method [9], assuming the observations of survival length are from a log-normal distribution. Other distributions should be taken into consideration to accommodate the left skewness and the non-monotonicity of the hazard function.

In a parametric model, the cumulative distribution function of the survival length is assumed to be differentiable, but this assumption might not hold. Semi-parametric models mainly focus on the baseline hazard or cumulative hazard, and a common example is the piecewise constant hazard model, where the hazard rate might differ in time periods and different subgroups [33,34].
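To illustrate the piecewise constant hazard idea, the sketch below computes the survival function implied by a hazard that is constant within each time interval, using $S(t) = \exp(-\int_{0}^{t} h(u)\,du)$. The cut points and rates in the usage note are arbitrary illustrative values, and the function name is our own.

```python
import math


def piecewise_survival(t, cut_points, rates):
    """Survival probability S(t) under a piecewise constant hazard.

    cut_points : increasing interior interval boundaries [s_1, ..., s_{J-1}]
    rates      : J hazard rates, one per interval [0, s_1), [s_1, s_2), ..., [s_{J-1}, inf)
    """
    edges = [0.0] + list(cut_points) + [math.inf]
    cumulative_hazard = 0.0
    for lo, hi, rate in zip(edges[:-1], edges[1:], rates):
        if t <= lo:
            break
        # Time spent in this interval up to t, multiplied by the interval's hazard.
        cumulative_hazard += rate * (min(t, hi) - lo)
    return math.exp(-cumulative_hazard)
```

For example, with a single cut point at day 10 and rates 0.1 and 0.2, the cumulative hazard at day 15 is $0.1 \times 10 + 0.2 \times 5 = 2$, so $S(15) = e^{-2}$. In a Bayesian semi-parametric model, each interval rate would receive its own prior.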

As mentioned in the cohort identification section, the purpose of the matching process and of the choice of day 0 for each matched pair in the survival analysis is to create a balanced cohort and to control selection bias and immortal time bias [16,17]. In this analysis, the matching process ensures that the Long COVID patient and the COVID-positive control patient in each matched pair have similar health risks as represented by the CCI score, age, etc. However, we acknowledge other limitations related to the nature of observational health data. For example, because health records are incomplete and some data are not captured in our structured data, some comorbidities are not reflected in our CCI scores. More importantly, the rates of the U09.9 code in the pre-matched sample are consistent with other work [5] suggesting that Long COVID is underrepresented by this diagnosis code. Computable phenotypes of Long COVID provide another means of identifying Long COVID, making it possible to create a cohort with many more Long COVID patients than U09.9 patients. We elected not to use an N3C computable phenotype [35,36] because of the survival bias it would introduce, related primarily to the fact that post-COVID visit frequency is an important feature in that model.

In a survival analysis, a competing risk, i.e., an event whose occurrence precludes the occurrence of the primary event of interest, could lead to inaccurate results. In studies on cardiovascular disease and depression, respectively [37,38], failure to account correctly for competing events can result in unexpected consequences, including an overestimation of the probability of the event and a mis-estimation of the magnitude of relative effects of covariates on the outcome. In this paper, the reasons for death in Long COVID patients and COVID-positive control patients are largely unavailable, and this remains a limitation in observational survival analysis.

6. Conclusions

This survival analysis established a balanced cohort of COVID-positive Long COVID patients and matched COVID-positive control patients with similar comorbidity risks and explored the distribution of survival length and possible factors influencing it. Our analyses revealed that the Long COVID (U09.9) diagnosis is associated with a shorter survival length at the 95% credible level, and that mild visit severity at the time of COVID is associated with a longer survival length at the 95% credible level. The obesity indicator is not significant, suggesting that the comorbidity burden was effectively balanced between cases and controls. These results provide a clearer picture of the survival trends of Long COVID (U09.9) patients and control patients, and they help with the further understanding of the clinical implications of Long COVID.

New methods to control selection bias and immortal time bias in survival analyses, new Long COVID computable phenotypes, and other Bayesian parametric and semi-parametric methods in survival analyses can provide opportunities for further investigation.

Author Contributions

Conceptualization, D.B.; Data curation, N3C Consortium, S.J. and A.Z.; Funding acquisition, J.L. (Johanna Loomba) and D.B.; Methodology, S.J., J.L. (Johanna Loomba), S.S. (Suchetha Sharma), S.S. (Saurav Sengupta), and J.L. (Jiebei Liu); Project administration, J.L. (Johanna Loomba) and D.B.; Resources, A.Z. and S.S. (Suchetha Sharma); Software, S.J.; Supervision, J.L. (Johanna Loomba) and D.B.; Validation, S.S. (Saurav Sengupta) and J.L. (Jiebei Liu); Writing—original draft, S.J.; Writing—review and editing, J.L. (Johanna Loomba) and D.B. All authors have read and agreed to the published version of the manuscript.

Funding

The analyses described in this publication were conducted with data or tools accessed through the NCATS N3C Data Enclave https://covid.cd2h.org (accessed on 9 April 2025) and N3C Attribution & Publication Policy v 1.2-2020-08-25b supported by NCATS Contract No. 75N95023D00001, Axle Informatics Subcontract: NCATS-P00438-B, and iTHRIV Integrated Translational health Research Institute of Virginia UL1TR003015. This research was possible because of the patients whose information is included within the data and the organizations (https://ncats.nih.gov/n3c/resources/data-contribution/data-transfer-agreement-signatories (accessed on 9 April 2025)) and scientists who have contributed to the on-going development of this community resource [3].

Institutional Review Board Statement

The N3C Publication committee confirmed that this manuscript MSID:1996.791 is in accordance with N3C data use and attribution policies; however, this content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the N3C program.

Informed Consent Statement

The N3C data transfer to NCATS is performed under a Johns Hopkins University Reliance Protocol # IRB00249128 or individual site agreements with NIH. The N3C Data Enclave is managed under the authority of the NIH; information can be found at https://ncats.nih.gov/n3c/resources (accessed on 9 April 2025).

Data Availability Statement

Patient data are stored in N3C enclave, DUR-7937888. https://covid.cd2h.org/enclave/ (accessed on 9 April 2025).

Acknowledgments

We gratefully acknowledge the following core contributors to N3C: Adam B. Wilcox, Adam M. Lee, Alexis Graves, Alfred (Jerrod) Anzalone, Amin Manna, Amit Saha, Amy Olex, Andrea Zhou, Andrew E. Williams, Andrew Southerland, Andrew T. Girvin, Anita Walden, Anjali A. Sharathkumar, Benjamin Amor, Benjamin Bates, Brian Hendricks, Brijesh Patel, Caleb Alexander, Carolyn Bramante, Cavin Ward-Caviness, Charisse Madlock-Brown, Christine Suver, Christopher Chute, Christopher Dillon, Chunlei Wu, Clare Schmitt, Cliff Takemoto, Dan Housman, Davera Gabriel, David A. Eichmann, Diego Mazzotti, Don Brown, Eilis Boudreau, Elaine Hill, Emily Carlson Marti, Emily R. Pfaff, Evan French, Farrukh M Koraishy, Federico Mariona, Fred Prior, George Sokos, Greg Martin, Harold Lehmann, Heidi Spratt, Hemalkumar Mehta, J.W. Awori Hayanga, Jami Pincavitch, Jaylyn Clark, Jeremy Richard Harper, Jessica Islam, Jin Ge, Joel Gagnier, Johanna Loomba, John Buse, Jomol Mathew, Joni L. Rutter, Julie A. McMurry, Justin Guinney, Justin Starren, Karen Crowley, Katie Rebecca Bradwell, Kellie M. Walters, Ken Wilkins, Kenneth R. Gersing, Kenrick Dwain Cato, Kimberly Murray, Kristin Kostka, Lavance Northington, Lee Allan Pyles, Lesley Cottrell, Lili Portilla, Mariam Deacy, Mark M. Bissell, Marshall Clark, Mary Emmett, Matvey B. Palchuk, Melissa A. Haendel, Meredith Adams, Meredith Temple-O’Connor, Michael G. Kurilla, Michele Morris, Nasia Safdar, Nicole Garbarini, Noha Sharafeldin, Ofer Sadan, Patricia A. Francis, Penny Wung Burgoon, Philip R.O. Payne, Randeep Jawa, Rebecca Erwin-Cohen, Rena Patel, Richard A. Moffitt, Richard L. Zhu, Rishi Kamaleswaran, Robert Hurley, Robert T. Miller, Saiju Pyarajan, Sam G. Michael, Samuel Bozzette, Sandeep Mallipattu, Satyanarayana Vedula, Scott Chapman, Shawn T. O’Neil, Soko Setoguchi, Stephanie S. Hong, Steve Johnson, Tellen D. Bennett, Tiffany Callahan, Umit Topaloglu, Valery Gordon, Vignesh Subbian, Warren A. Kibbe, Wenndy Hernandez, Will Beasley, Will Cooper, William Hillegass, Xiaohan Tanner Zhang. Details of contributions are available at https://covid.cd2h.org/contributors/ (accessed on 9 April 2025).

The following are institutions whose data have been released or are pending: Available: Advocate Health Care Network—UL1TR002389: The Institute for Translational Medicine (ITM) · Aurora Health Care Inc—UL1TR002373: Wisconsin Network For Health Research · Boston University Medical Campus—UL1TR001430: Boston University Clinical and Translational Science Institute · Brown University—U54GM115677: Advance Clinical Translational Research (Advance-CTR) · Carilion Clinic—UL1TR003015: iTHRIV Integrated Translational health Research Institute of Virginia · Case Western Reserve University—UL1TR002548: The Clinical & Translational Science Collaborative of Cleveland (CTSC) · Charleston Area Medical Center—U54GM104942: West Virginia Clinical and Translational Science Institute (WVCTSI) · Children’s Hospital Colorado—UL1TR002535: Colorado Clinical and Translational Sciences Institute · Columbia University Irving Medical Center—UL1TR001873: Irving Institute for Clinical and Translational Research · Dartmouth College—None (Voluntary) · Duke University—UL1TR002553: Duke Clinical and Translational Science Institute · George Washington Children’s Research Institute—UL1TR001876: Clinical and Translational Science Institute at Children’s National (CTSA-CN) · George Washington University—UL1TR001876: Clinical and Translational Science Institute at Children’s National (CTSA-CN) · Harvard Medical School—UL1TR002541: Harvard Catalyst · Indiana University School of Medicine—UL1TR002529: Indiana Clinical and Translational Science Institute · Johns Hopkins University—UL1TR003098: Johns Hopkins Institute for Clinical and Translational Research · Louisiana Public Health Institute—None (Voluntary) · Loyola Medicine—Loyola University Medical Center · Loyola University Medical Center—UL1TR002389: The Institute for Translational Medicine (ITM) · Maine Medical Center—U54GM115516: Northern New England Clinical & Translational Research (NNE-CTR) Network · Mary Hitchcock Memorial Hospital & Dartmouth Hitchcock Clinic—None (Voluntary) · Massachusetts General Brigham—UL1TR002541: Harvard Catalyst · Mayo Clinic Rochester—UL1TR002377: Mayo Clinic Center for Clinical and Translational Science (CCaTS) · Medical University of South Carolina—UL1TR001450: South Carolina Clinical & Translational Research Institute (SCTR) · MITRE Corporation—None (Voluntary) · Montefiore Medical Center—UL1TR002556: Institute for Clinical and Translational Research at Einstein and Montefiore · Nemours—U54GM104941: Delaware CTR ACCEL Program · NorthShore University HealthSystem—UL1TR002389: The Institute for Translational Medicine (ITM) · Northwestern University at Chicago—UL1TR001422: Northwestern University Clinical and Translational Science Institute (NUCATS) · OCHIN—INV-018455: Bill and Melinda Gates Foundation grant to Sage Bionetworks · Oregon Health & Science University—UL1TR002369: Oregon Clinical and Translational Research Institute · Penn State Health Milton S. Hershey Medical Center—UL1TR002014: Penn State Clinical and Translational Science Institute · Rush University Medical Center—UL1TR002389: The Institute for Translational Medicine (ITM) · Rutgers, The State University of New Jersey—UL1TR003017: New Jersey Alliance for Clinical and Translational Science · Stony Brook University—U24TR002306 · The Alliance at the University of Puerto Rico, Medical Sciences Campus—U54GM133807: Hispanic Alliance for Clinical and Translational Research (The Alliance) · The Ohio State University—UL1TR002733: Center for Clinical and Translational Science · The State University of New York at Buffalo—UL1TR001412: Clinical and Translational Science Institute · The University of Chicago—UL1TR002389: The Institute for Translational Medicine (ITM) · The University of Iowa—UL1TR002537: Institute for Clinical and Translational Science · The University of Miami Leonard M. Miller School of Medicine—UL1TR002736: University of Miami Clinical and Translational Science Institute · The University of Michigan at Ann Arbor—UL1TR002240: Michigan Institute for Clinical and Health Research · The University of Texas Health Science Center at Houston—UL1TR003167: Center for Clinical and Translational Sciences (CCTS) · The University of Texas Medical Branch at Galveston—UL1TR001439: The Institute for Translational Sciences · The University of Utah—UL1TR002538: Uhealth Center for Clinical and Translational Science · Tufts Medical Center—UL1TR002544: Tufts Clinical and Translational Science Institute · Tulane University—UL1TR003096: Center for Clinical and Translational Science · The Queens Medical Center—None (Voluntary) · University Medical Center New Orleans—U54GM104940: Louisiana Clinical and Translational Science (LA CaTS) Center · University of Alabama at Birmingham—UL1TR003096: Center for Clinical and Translational Science · University of Arkansas for Medical Sciences—UL1TR003107: UAMS Translational Research Institute · University of Cincinnati—UL1TR001425:
Center for Clinical and Translational Science and Training · University of Colorado Denver, Anschutz Medical Campus—UL1TR002535: Colorado Clinical and Translational Sciences Institute · University of Illinois at Chicago—UL1TR002003: UIC Center for Clinical and Translational Science · University of Kansas Medical Center—UL1TR002366: Frontiers: University of Kansas Clinical and Translational Science Institute · University of Kentucky—UL1TR001998: UK Center for Clinical and Translational Science · University of Massachusetts Medical School Worcester—UL1TR001453: The UMass Center for Clinical and Translational Science (UMCCTS) · University Medical Center of Southern Nevada—None (voluntary) · University of Minnesota—UL1TR002494: Clinical and Translational Science Institute · University of Mississippi Medical Center—U54GM115428: Mississippi Center for Clinical and Translational Research (CCTR) · University of Nebraska Medical Center—U54GM115458: Great Plains IDeA-Clinical & Translational Research · University of North Carolina at Chapel Hill—UL1TR002489: North Carolina Translational and Clinical Science Institute · University of Oklahoma Health Sciences Center—U54GM104938: Oklahoma Clinical and Translational Science Institute (OCTSI) · University of Pittsburgh—UL1TR001857: The Clinical and Translational Science Institute (CTSI) · University of Pennsylvania—UL1TR001878: Institute for Translational Medicine and Therapeutics · University of Rochester—UL1TR002001: UR Clinical & Translational Science Institute · University of Southern California—UL1TR001855: The Southern California Clinical and Translational Science Institute (SC CTSI) · University of Vermont—U54GM115516: Northern New England Clinical & Translational Research (NNE-CTR) Network · University of Virginia—UL1TR003015: iTHRIV Integrated Translational health Research Institute of Virginia · University of Washington—UL1TR002319: Institute of Translational Health Sciences · University of 
Wisconsin-Madison—UL1TR002373: UW Institute for Clinical and Translational Research · Vanderbilt University Medical Center—UL1TR002243: Vanderbilt Institute for Clinical and Translational Research · Virginia Commonwealth University—UL1TR002649: C. Kenneth and Dianne Wright Center for Clinical and Translational Research · Wake Forest University Health Sciences—UL1TR001420: Wake Forest Clinical and Translational Science Institute · Washington University in St. Louis—UL1TR002345: Institute of Clinical and Translational Sciences · Weill Medical College of Cornell University—UL1TR002384: Weill Cornell Medicine Clinical and Translational Science Center · West Virginia University—U54GM104942: West Virginia Clinical and Translational Science Institute (WVCTSI)· Submitted: Icahn School of Medicine at Mount Sinai—UL1TR001433: ConduITS Institute for Translational Sciences · The University of Texas Health Science Center at Tyler—UL1TR003167: Center for Clinical and Translational Sciences (CCTS) · University of California, Davis—UL1TR001860: UCDavis Health Clinical and Translational Science Center · University of California, Irvine—UL1TR001414: The UC Irvine Institute for Clinical and Translational Science (ICTS) · University of California, Los Angeles—UL1TR001881: UCLA Clinical Translational Science Institute · University of California, San Diego—UL1TR001442: Altman Clinical and Translational Research Institute · University of California, San Francisco—UL1TR001872: UCSF Clinical and Translational Science Institute· NYU Langone Health Clinical Science Core, Data Resource Core, and PASC Biorepository Core—OTA-21-015A: Post-Acute Sequelae of SARS-CoV-2 Infection Initiative (RECOVER)· Pending: Arkansas Children’s Hospital—UL1TR003107: UAMS Translational Research Institute · Baylor College of Medicine—None (Voluntary) · Children’s Hospital of Philadelphia—UL1TR001878: Institute for Translational Medicine and Therapeutics · Cincinnati Children’s Hospital Medical Center—UL1TR001425: 
Center for Clinical and Translational Science and Training · Emory University—UL1TR002378: Georgia Clinical and Translational Science Alliance · HonorHealth—None (Voluntary) · Loyola University Chicago—UL1TR002389: The Institute for Translational Medicine (ITM) · Medical College of Wisconsin—UL1TR001436: Clinical and Translational Science Institute of Southeast Wisconsin · MedStar Health Research Institute—None (Voluntary) · Georgetown University—UL1TR001409: The Georgetown-Howard Universities Center for Clinical and Translational Science (GHUCCTS) · MetroHealth—None (Voluntary) · Montana State University—U54GM115371: American Indian/Alaska Native CTR · NYU Langone Medical Center—UL1TR001445: Langone Health’s Clinical and Translational Science Institute · Ochsner Medical Center—U54GM104940: Louisiana Clinical and Translational Science (LA CaTS) Center · Regenstrief Institute—UL1TR002529: Indiana Clinical and Translational Science Institute · Sanford Research—None (Voluntary) · Stanford University—UL1TR003142: Spectrum: The Stanford Center for Clinical and Translational Research and Education · The Rockefeller University—UL1TR001866: Center for Clinical and Translational Science · The Scripps Research Institute—UL1TR002550: Scripps Research Translational Institute · University of Florida—UL1TR001427: UF Clinical and Translational Science Institute · University of New Mexico Health Sciences Center—UL1TR001449: University of New Mexico Clinical and Translational Science Center · University of Texas Health Science Center at San Antonio—UL1TR002645: Institute for Integration of Medicine and Science · Yale New Haven Hospital—UL1TR001863: Yale Center for Clinical Investigation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Long COVID or Post-COVID Conditions. 2021. Available online: https://www.cdc.gov/covid/long-term-effects/ (accessed on 9 April 2025).
New ICD-10-CM Code for Post-COVID Conditions, Following the 2019 Novel Coronavirus (COVID-19). 2021. Available online: https://www.cdc.gov/nchs/data/icd/announcement-new-icd-code-for-post-covid-condition-april-2022-final.pdf (accessed on 9 April 2025).
Haendel, M.A.; Chute, C.G.; Bennett, T.D.; Eichmann, D.A.; Guinney, J.; Kibbe, W.A.; Payne, P.R.; Pfaff, E.R.; Robinson, P.N.; Saltz, J.H.; et al. The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment. J. Am. Med. Inform. Assoc. 2021, 28, 427–443.
Hill, E.L.; Mehta, H.B.; Sharma, S.; Mane, K.; Xie, C.; Cathey, E.; Loomba, J.; Russell, S.; Spratt, H.; DeWitt, P.E.; et al. Risk factors associated with post-acute sequelae of SARS-CoV-2 in an EHR cohort: A National COVID Cohort Collaborative (N3C) analysis as part of the NIH RECOVER program. medRxiv 2022.
Pfaff, E.R.; Madlock-Brown, C.; Baratta, J.M.; Bhatia, A.; Davis, H.; Girvin, A.; Hill, E.; Kelly, E.; Kostka, K.; Loomba, J.; et al. Coding long COVID: Characterizing a new disease through an ICD-10 lens. BMC Med. 2023, 21, 58.
Reese, J.T.; Blau, H.; Casiraghi, E.; Bergquist, T.; Loomba, J.J.; Callahan, T.J.; Laraway, B.; Antonescu, C.; Coleman, B.; Gargano, M.; et al. Generalisable long COVID subtypes: Findings from the NIH N3C and RECOVER programmes. EBioMedicine 2023, 87, 104413.
Jiang, S.; Loomba, J.; Sharma, S.; Brown, D. Vital measurements of hospitalized COVID-19 patients as a predictor of long COVID: An EHR-based cohort study from the RECOVER program in N3C. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 3023–3030.
Antony, B.; Blau, H.; Casiraghi, E.; Loomba, J.J.; Callahan, T.J.; Laraway, B.J.; Wilkins, K.J.; Antonescu, C.C.; Valentini, G.; Williams, A.E.; et al. Predictive models of long COVID. EBioMedicine 2023, 96, 104777.
Ibrahim, J.G.; Chen, M.H.; Sinha, D. Bayesian Survival Analysis; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2.
Sousa, G.; Garces, T.; Cestari, V.; Florêncio, R.; Moreira, T.; Pereira, M. Mortality and survival of COVID-19. Epidemiol. Infect. 2020, 148, e123.
Neville, T.H.; Hays, R.D.; Tseng, C.H.; Gonzalez, C.A.; Chen, L.; Hong, A.; Yamamoto, M.; Santoso, L.; Kung, A.; Schwab, K.; et al. Survival after severe COVID-19: Long-term outcomes of patients admitted to an intensive care unit. J. Intensive Care Med. 2022, 37, 1019–1028.
Long, J.D.; Strohbehn, I.; Sawtell, R.; Bhattacharyya, R.; Sise, M.E. COVID-19 Survival and its impact on chronic kidney disease. Transl. Res. 2022, 241, 70–82.
Quan, H.; Li, B.; Couris, C.M.; Fushimi, K.; Graham, P.; Hider, P.; Januel, J.M.; Sundararajan, V. Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries. Am. J. Epidemiol. 2011, 173, 676–682.
Zhou, A.; French, E.; Moffitt, R.; Loomba, J. N3C Logic Liaison Templates: Transforming Complex Health Data Into Analytics Ready Tables. 2023. Available online: https://doi.org/10.18130/rv3j-ke83 (accessed on 9 April 2025).
Stuart, E.A.; King, G.; Imai, K.; Ho, D. MatchIt: Nonparametric preprocessing for parametric causal inference. J. Stat. Softw. 2011, 42, 1–28.
Jager, K.J.; Tripepi, G.; Chesnaye, N.C.; Dekker, F.W.; Zoccali, C.; Stel, V.S. Where to look for the most frequent biases? Nephrology 2020, 25, 435–441.
Yadav, K.; Lewis, R.J. Immortal time bias in observational studies. JAMA 2021, 325, 686–687.
Stuart, E.A. Matching methods for causal inference: A review and a look forward. Stat. Sci. Rev. J. Inst. Math. Stat. 2010, 25, 1.
N3C Privacy-Preserving Record Linkage. Available online: https://covid.cd2h.org/PPRL/ (accessed on 9 April 2025).
OMOP Common Data Model. Available online: https://ohdsi.github.io/CommonDataModel/ (accessed on 9 April 2025).
Leung, K.M.; Elashoff, R.M.; Afifi, A.A. Censoring issues in survival analysis. Annu. Rev. Public Health 1997, 18, 83–104.
Joarder, A.; Krishna, H.; Kundu, D. Inferences on Weibull parameters with conventional type-I censoring. Comput. Stat. Data Anal. 2011, 55, 1–11.
Dutta, S.; Dey, S.; Kayal, S. Bayesian survival analysis of logistic exponential distribution for adaptive progressive Type-II censored data. Comput. Stat. 2023, 39, 2109–2155.
Royston, P. The lognormal distribution as a model for survival time in cancer, with an emphasis on prognostic factors. Stat. Neerl. 2001, 55, 89–104.
Gupta, R.C.; Kannan, N.; Raychaudhuri, A. Analysis of lognormal survival data. Math. Biosci. 1997, 139, 103–115.
Betancourt, M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv 2017, arXiv:1701.02434.
Hoffman, M.D.; Gelman, A. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 2014, 15, 1593–1623.
Patil, A.; Huard, D.; Fonnesbeck, C.J. PyMC: Bayesian stochastic modelling in Python. J. Stat. Softw. 2010, 35, 1.
Adult BMI. Available online: https://www.cdc.gov/bmi/faq/index.html (accessed on 9 April 2025).
Roy, V. Convergence diagnostics for Markov chain Monte Carlo. Annu. Rev. Stat. Its Appl. 2020, 7, 387–412.
Rich, J.T.; Neely, J.G.; Paniello, R.C.; Voelker, C.C.; Nussenbaum, B.; Wang, E.W. A practical guide to understanding Kaplan-Meier curves. Otolaryngol.—Head Neck Surg. 2010, 143, 331–336.
Bland, J.M.; Altman, D.G. The logrank test. BMJ 2004, 328, 1073.
Bouman, P.; Dukic, V.; Meng, X.L. A Bayesian multiresolution hazard model with application to an AIDS reporting delay study. Stat. Sin. 2005, 15, 325–357.
Dukić, V.; Dignam, J. Bayesian hierarchical multiresolution hazard model for the study of time-dependent failure patterns in early stage breast cancer. Bayesian Anal. 2007, 2, 591.
Pfaff, E.R.; Girvin, A.T.; Bennett, T.D.; Bhatia, A.; Brooks, I.M.; Deer, R.R.; Dekermanjian, J.P.; Jolley, S.E.; Kahn, M.G.; Kostka, K.; et al. Identifying who has long COVID in the USA: A machine learning approach using N3C data. Lancet Digit. Health 2022, 4, e532–e541.
Crosskey, M.; McIntee, T.; Preiss, A.J.; Brannock, M.D.; Yoo, Y.J.; Hadley, E.C.; Blanceró, F.; Chew, R.; Loomba, J.; Bhatia, A.; et al. Reengineering a machine learning phenotype to adapt to the changing COVID-19 landscape: A study from the N3C and RECOVER consortia. medRxiv 2023.
Austin, P.C.; Lee, D.S.; Fine, J.P. Introduction to the analysis of survival data in the presence of competing risks. Circulation 2016, 133, 601–609.
Schuster, N.A.; Hoogendijk, E.O.; Kok, A.A.; Twisk, J.W.; Heymans, M.W. Ignoring competing events in the analysis of survival data may lead to biased results: A nonmathematical illustration of competing risk analysis. J. Clin. Epidemiol. 2020, 122, 42–48.

Figure 1. Posterior density plots and trace plots of parameters by MCMC.

Figure 2. Kaplan–Meier curves of two groups in the cohort.

Figure 3. Survival length of the cohort.

Table 1. Long COVID group attrition table before matching.

Inclusion Criteria	Number of Patients
COVID-positive patients	7,223,164
patients with U09.9 code	74,873
U09.9 date later than COVID index date and 1 October 2021	72,881
age greater than or equal to 18	70,569

Table 2. COVID-positive control group attrition table before matching.

Inclusion Criteria	Number of Patients
COVID-positive patients	7,223,164
age greater than or equal to 18	6,074,612
patients with no U09.9 code	5,998,558
patients from clinical sites reporting U09.9	5,310,527

Table 3. Summary table before matching.

	COVID-Positive Control Group	Long COVID Group
number of patients	5,310,527	70,569
age (mean (SD))	48.22 (18.96)	54.87 (16.52)
sex (%)
Female	3,016,285 (56.8)	46,238 (65.5)
Male	2,288,951 (43.1)	24,297 (34.4)
race and ethnicity (%)
Asian Non-Hispanic	146,001 (2.7)	1917 (2.7)
Black or African American Non-Hispanic	617,374 (11.6)	7798 (11.1)
White Non-Hispanic	3,409,546 (64.2)	49,671 (70.4)
Other Non-Hispanic	200,365 (3.8)	1390 (2.0)
Hispanic or Latino Any Race	638,650 (12.0)	7061 (10.0)
Missing/Unknown	298,591 (5.6)	2732 (3.9)
CCI category (%)
0	3,517,745 (66.2)	29,447 (41.7)
1–2	1,040,193 (19.6)	22,266 (31.6)
3–4	372,760 (7.0)	9745 (13.8)
4+	311,516 (5.9)	8849 (12.5)
age category (%)
18–20	258,611 (4.9)	823 (1.2)
21–45	2,265,771 (42.7)	20,442 (29.0)
46–65	1,679,984 (31.6)	29,612 (42.0)
66+	1,106,161 (20.8)	19,692 (27.9)

Table 4. Final cohort summary table.

	COVID-Positive Control Group	Long COVID Group
number of patients	47,437	47,437
age (mean (SD))	56.35 (16.22)	56.24 (16.40)
sex (%)
Female	29,079 (61.3)	31,340 (66.1)
Male	18,344 (38.7)	16,078 (33.9)
race and ethnicity (%)
Asian Non-Hispanic	1472 (3.1)	1256 (2.6)
Black or African American Non-Hispanic	5671 (12.0)	5254 (11.1)
White Non-Hispanic	32,995 (69.6)	33,928 (71.5)
Other Non-Hispanic	887 (1.9)	904 (1.9)
Hispanic or Latino Any Race	4653 (9.8)	4396 (9.3)
Missing/Unknown	1759 (3.7)	1699 (3.6)
CCI category (%)
0	18,222 (38.4)	17,006 (35.8)
1–2	14,035 (29.6)	15,447 (32.6)
3–4	8103 (17.1)	7833 (16.5)
4+	7004 (14.8)	7052 (14.9)
age category (%)
18–20	380 (0.8)	434 (0.9)
21–45	12,134 (25.6)	12,371 (26.1)
46–65	20,006 (42.2)	20,039 (42.2)
66+	14,917 (31.4)	14,593 (30.8)
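
One common way to quantify how far matching moved the two groups toward balance is the standardized mean difference (SMD). The sketch below is illustrative only: it recomputes SMDs from the rounded summary values printed in Tables 3 and 4, with the Long COVID group as the reference for the sign. The function names are hypothetical, the 0.1 threshold is a widely used rule of thumb rather than a criterion stated in this paper, and the code is not presented as the balance diagnostic the authors ran.

```python
import math


def smd_continuous(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference for a continuous covariate.

    Uses the square root of the average of the two group variances
    as the standardizing denominator.
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


def smd_proportion(p_a, p_b):
    """Standardized mean difference for a binary covariate given as proportions."""
    pooled_sd = math.sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / 2)
    return (p_a - p_b) / pooled_sd


# Age, mean (SD): Long COVID group vs. COVID-positive control group.
age_before = smd_continuous(54.87, 16.52, 48.22, 18.96)  # Table 3
age_after = smd_continuous(56.24, 16.40, 56.35, 16.22)   # Table 4

# Share of patients in CCI category 0, as proportions taken from the tables.
cci0_before = smd_proportion(0.417, 0.662)  # Table 3
cci0_after = smd_proportion(0.358, 0.384)   # Table 4

for label, value in [
    ("age, before matching", age_before),
    ("age, after matching", age_after),
    ("CCI = 0, before matching", cci0_before),
    ("CCI = 0, after matching", cci0_after),
]:
    # |SMD| < 0.1 is an assumed rule-of-thumb threshold, not taken from the paper.
    flag = "balanced" if abs(value) < 0.1 else "imbalanced"
    print(f"{label}: SMD = {value:+.3f} ({flag})")
```

Because the inputs are rounded table entries, the results are approximate; values close to the threshold should be recomputed from patient-level data before drawing conclusions.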

Table 5. Posterior estimation of parameters.

Parameter	Mean	Standard Deviation	95% Highest Density Interval
$β_{0}$ (intercept)	6.449	0.006	[6.439, 6.459]
$β_{1}$ (U09.9)	−0.023	0.004	[−0.031, −0.014]
$β_{2}$ (obesity)	0.003	0.005	[−0.006, 0.011]
$β_{3}$ (mild visit severity)	0.164	0.005	[0.155, 0.173]
$τ$ (precision)	2.616	0.014	[2.588, 2.639]
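
To read Table 5 on the time scale, the following sketch converts the coefficients into multiplicative effects on median survival time. It assumes the log-normal model has the form log T ~ Normal(μ, 1/τ) with μ = β0 + β1·x1 + β2·x2 + β3·x3 and 0/1 indicator covariates; under that assumption exp(β_k) is the ratio of median survival times between patients with and without covariate k. The parameterization, the helper names, and the use of posterior means as plug-in values are assumptions made for illustration. Exponentiating the interval endpoints gives a valid 95% credible interval for the ratio, although not necessarily a highest density one.

```python
import math

# (posterior mean, lower bound, upper bound) copied from Table 5.
coefficients = {
    "beta1_u099": (-0.023, -0.031, -0.014),
    "beta2_obesity": (0.003, -0.006, 0.011),
    "beta3_mild_severity": (0.164, 0.155, 0.173),
}
beta0_mean = 6.449  # intercept, posterior mean
tau_mean = 2.616    # precision of log T, posterior mean

# Under the assumed model, exp(beta) is a ratio of median survival times.
time_ratios = {
    name: tuple(math.exp(value) for value in triple)
    for name, triple in coefficients.items()
}

# Standard deviation of log T implied by the precision.
sigma = 1.0 / math.sqrt(tau_mean)


def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - Phi((ln t - mu) / sigma) for a log-normal survival time."""
    z = (math.log(t) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


for name, (mean, low, high) in time_ratios.items():
    print(f"{name}: time ratio {mean:.3f} (interval {low:.3f} to {high:.3f})")

# Baseline median on the model's time scale (all indicators equal to 0);
# the time unit is whatever unit the survival length was recorded in.
baseline_median = math.exp(beta0_mean)
print(f"sigma = {sigma:.3f}, baseline median = {baseline_median:.1f}")
print(f"S(baseline median) = {lognormal_survival(baseline_median, beta0_mean, sigma):.2f}")
```

Read this way, the U09.9 coefficient corresponds to a time ratio slightly below 1 whose interval excludes 1, while the obesity interval spans 1, which matches the signs and intervals reported in the table.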

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

