1. Background
Since its emergence, the COVID-19 pandemic has transitioned from an acute global emergency to a persistent public health challenge, defined mainly by the enduring burden of Post-Acute Sequelae of SARS-CoV-2 (PASC), commonly known as long COVID. Characterised by a constellation of debilitating symptoms, including fatigue, dyspnoea, and cognitive dysfunction, that persist for months or years after the initial infection, long COVID affects an estimated 10–20% of survivors [1, 2]. As the virus becomes endemic, understanding modifiable risk factors for these sequelae is essential for reducing the long-term burden on healthcare systems.
The rapid development and deployment of COVID-19 vaccines, utilising novel mRNA and adenoviral vector technologies, proved highly effective in preventing hospitalisation and mortality during the acute phase of the pandemic [3]. However, the extent to which vaccination protects against the downstream development of long COVID remains a subject of debate. Large-scale registry studies suggest that vaccination prior to infection offers a partial protective effect, reducing the risk of long COVID by approximately 15–50% [4, 5, 6]. Biological plausibility for this protection exists; pre-existing immunity may accelerate viral clearance and reduce the systemic inflammation associated with PASC [1]. Conversely, other observational studies have reported negligible benefits or, paradoxically, higher rates of long COVID symptoms among vaccinated cohorts [5].
These conflicting findings are likely driven by methodological limitations inherent to observational research, specifically temporal ambiguity and indication bias. In real-world datasets, “vaccinated” cohorts often include individuals immunised before infection, as well as those who sought vaccination after recovering from the acute phase. This conflation introduces reverse causality, where individuals with lingering symptoms are more likely to seek vaccination as a therapeutic measure, thereby artificially inflating the apparent risk in the vaccinated group [3, 5]. Furthermore, while immunological data suggest that heterologous vaccine regimens may induce broader neutralising antibody responses than homologous regimens [7, 8], it remains unclear whether this translates into superior clinical protection against long COVID.
This study aims to resolve these inconsistencies by analysing electronic health records from a cohort of adults hospitalised with COVID-19 in London. Unlike previous studies that pooled vaccination status, we employ a rigorous temporal stratification to distinguish between the preventative effects of pre-infection immunity and the health-seeking behaviours associated with post-infection vaccination. Utilising Bayesian logistic regression to handle small subgroups and potential data separation, we sought to (1) determine whether vaccination prior to infection reduces the odds of long COVID in a high-severity cohort; (2) quantify the extent of reverse causality by analysing post-infection vaccination patterns; and (3) investigate whether heterologous vaccine regimens offer superior protection compared to homologous regimens.
2. Materials and Methods
2.1. Study Design and Data Source
This retrospective observational cohort study was conducted at a single university hospital in London, United Kingdom. We analysed anonymised Electronic Patient Records (EPR) of adults diagnosed with COVID-19 between April 2020 and December 2022. Data entry and management were facilitated using Castor Electronic Data Capture (EDC) software v2021.4 (Amsterdam, The Netherlands), and the study was reported according to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines.
Ethical approval for the research was obtained from the Health Research Authority (HRA) England and Health and Care Research Wales (HCRW), under REC reference 23/HRA/1637.
2.2. Study Population and Eligibility
The study included adult patients (≥18 years) with a SARS-CoV-2 infection confirmed by PCR or rapid antigen testing who required hospital admission during the acute phase of illness. Participants were excluded if they lacked valid dates for infection, hospital admission, or vaccination. Additionally, individuals who received their first vaccine dose within 0 to 13 days before infection were excluded. This criterion was implemented to eliminate cases with ambiguous immune status, specifically those in which the vaccine had been administered but had not yet provided biological protection, or in which the participant might have been incubating the virus at the time of vaccination.
2.3. Variable Definitions and Data Engineering
2.3.1. Outcome Variable
The primary outcome was long COVID, defined as a binary variable (Yes/No). This was determined based on clinician-documented diagnosis or patient-reported persistence of symptoms such as fatigue, dyspnoea, and cognitive dysfunction (brain fog) beyond 12 weeks post-infection, consistent with the World Health Organisation (WHO) clinical case definition. Where available, specific ICD-10 codes associated with post-COVID conditions, such as U09.9 “Post COVID-19 condition, unspecified,” were used to support classification. The combination of symptom duration, clinical descriptors, and ICD coding ensured consistent and reproducible identification of long COVID cases for analysis and comparison with other studies.
2.3.2. Exposure Variables and Cohort Stratification
To account for reverse causality, where symptomatic individuals may be more likely to seek vaccination, and to minimise immortal time bias, the participants were not analysed as a single pooled group. Instead, we divided the study population into two analytical cohorts based on the timing of vaccination relative to infection. The first cohort was the prevention analysis (Analysis A), which compared individuals who had been vaccinated at least 14 days before their initial confirmed COVID-19 infection with those who remained unvaccinated, thereby assessing the biological protective effect of prior immunity. The second cohort was the post-acute analysis (Analysis B), which compared individuals who received vaccination after their acute infection with unvaccinated individuals to examine health-seeking behaviours and the relationship between persistent symptoms and subsequent vaccine uptake. In the prevention cohort, vaccinated participants were further classified by vaccine regimen (homologous or heterologous) to evaluate potential differences in protective effect. The heterologous regimens were defined as those in which participants received COVID-19 vaccines from two or more different platforms, including mRNA-based vaccines (such as Pfizer–BioNTech BNT162b2 or Moderna mRNA-1273), adenoviral vector vaccines (such as Oxford-AstraZeneca ChAdOx1 nCoV-19), or protein subunit vaccines. In contrast, homologous regimens comprised participants who received all vaccine doses from the same platform, regardless of manufacturer. Assignment to the heterologous or homologous groups was based on documented vaccine brand information in the EPR.
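The temporal stratification described above can be illustrated with a short sketch. It is not the study's code (the analysis was run in R); the function name, the return labels, and the use of the first-dose date as the reference point for every cohort are assumptions made for illustration.

```python
from datetime import date
from typing import Optional

# Minimum gap (days) between vaccination and infection for the prevention cohort.
GRACE_DAYS = 14


def assign_cohort(infection: date, first_dose: Optional[date]) -> str:
    """Classify a participant by timing of first dose relative to infection.

    Returns one of: 'unvaccinated', 'prevention', 'post_acute', 'excluded'.
    Assumption: the first-dose date is the only vaccination date considered.
    """
    if first_dose is None:
        return "unvaccinated"
    # Positive values mean the dose was given before the infection date.
    days_before = (infection - first_dose).days
    if days_before >= GRACE_DAYS:
        return "prevention"   # Analysis A: vaccinated >= 14 days before infection
    if days_before >= 0:
        return "excluded"     # 0 to 13 days before infection: ambiguous immune status
    return "post_acute"       # Analysis B: first dose after infection
```

Under these assumptions, a participant infected on 15 June 2021 with a first dose on 1 June 2021 (14 days earlier) falls in the prevention cohort, whereas a first dose on 2 June 2021 (13 days earlier) leads to exclusion.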
2.3.3. Covariates and Justification for Inclusion/Exclusion
To estimate the independent effect of vaccination while controlling for confounding factors, several covariates were engineered and included in the final models. Acute severity was assessed using the length of hospital stay (LoS), calculated as the number of days between admission and discharge, since all participants were hospitalised and a binary hospitalisation status variable would not provide statistical variance. To address convergence issues caused by extreme outliers, such as stays longer than 100 days, the length of hospital stay was dichotomised into short (<4 days) and long (≥4 days) based on the cohort median. This dichotomised measure served as the primary proxy for acute disease severity. The initial comorbidity count, ranging from 0 to 12, yielded sparse data; for example, no unvaccinated patients had 11 comorbidities, rendering the model unstable. As a result, comorbidities were transformed into a binary variable, distinguishing between none (zero comorbidity) and any (one or more comorbidities).
Standard demographic confounders were also included in the analysis: age and body mass index (BMI), both as continuous numeric variables, and gender as a categorical factor. Smoking status and ethnicity were initially included in exploratory models; however, these variables resulted in “perfect separation,” with zero event counts in specific subgroups. This issue arose due to the relatively modest sample size. To maintain model stability and ensure sufficient statistical power to assess the primary exposure of interest, which was vaccination, these covariates were excluded from the final adjusted model.
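The covariate transformations described in this subsection amount to two simple recodings. The sketch below shows one way they might be expressed; the function and constant names are hypothetical, and the 4-day cut-off is taken from the text rather than recomputed from data.

```python
from datetime import date

# Cut-off reported in the text (cohort median); not derived here.
LOS_CUTOFF_DAYS = 4


def length_of_stay(admission: date, discharge: date) -> int:
    """Length of hospital stay in days between admission and discharge."""
    return (discharge - admission).days


def los_category(los_days: int) -> str:
    """Dichotomise LoS: 'short' (< 4 days) or 'long' (>= 4 days)."""
    return "long" if los_days >= LOS_CUTOFF_DAYS else "short"


def comorbidity_category(count: int) -> str:
    """Collapse the 0-12 comorbidity count to 'none' or 'any'."""
    return "any" if count >= 1 else "none"
```

Dichotomising removes the influence of extreme stays (for example, more than 100 days) at the cost of discarding information about gradations of severity within each category.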
2.4. Statistical Analysis
Data analysis was conducted using R Statistical Software (version 4.5.2). Continuous variables were summarised using means and standard deviations (SD), while categorical variables were presented as frequencies and percentages. Differences between groups were assessed using the Wilcoxon rank-sum test for continuous variables and Chi-square or Fisher’s exact tests for categorical variables, as appropriate.
2.4.1. Outcome Modelling Strategy
To evaluate the association between vaccination and long COVID, we employed multivariable Bayesian logistic regression using the bayesglm function from the arm package. This method was selected over standard maximum likelihood estimation (GLM) and Propensity Score Weighting to address issues of perfect separation (zero cell counts) and small sample sizes within specific subgroups, such as particular comorbidity profiles or extreme LoS. The Bayesian approach uses weakly informative priors to stabilise coefficient estimates, yielding robust Adjusted Odds Ratios (aORs) and 95% Confidence Intervals (CIs).
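To illustrate why a weakly informative prior stabilises estimates under separation, the following toy sketch fits a one-predictor logistic model to invented data in which the exposed group has zero events. It is not a reimplementation of bayesglm: it finds a posterior mode by plain gradient ascent, does not rescale predictors, and assumes independent Cauchy priors with scale 10 on the intercept and 2.5 on the slope. The data and all names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cauchy_log_prior_grad(beta: float, scale: float) -> float:
    """Derivative of the log Cauchy(0, scale) density with respect to beta."""
    return -2.0 * beta / (scale ** 2 + beta ** 2)


def fit_logistic(x, y, use_prior=True, steps=5000, lr=0.1):
    """Gradient ascent on the log-likelihood (plus log-prior if use_prior).

    Returns (intercept, slope) after a fixed number of steps. Without the
    prior and under separation, the slope has no finite maximum, so the
    returned value simply reflects where the iteration was stopped.
    """
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, y):
            resid = yi - sigmoid(a + b * xi)
            grad_a += resid
            grad_b += resid * xi
        if use_prior:
            grad_a += cauchy_log_prior_grad(a, 10.0)  # assumed intercept scale
            grad_b += cauchy_log_prior_grad(b, 2.5)   # assumed slope scale
        a += lr * grad_a
        b += lr * grad_b
    return a, b


# Invented data: 3/10 events when x = 0, 0/8 events when x = 1 (zero cell).
x = [0] * 10 + [1] * 8
y = [1] * 3 + [0] * 7 + [0] * 8

a_map, b_map = fit_logistic(x, y, use_prior=True)
a_flat, b_flat = fit_logistic(x, y, use_prior=False)
print("slope with prior:", b_map, "| slope without prior:", b_flat)
```

With the prior, the slope settles at a finite value, so an odds ratio can be reported. Without it, the slope keeps drifting towards minus infinity as the number of steps increases, which is the failure mode of maximum likelihood that the text describes.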
2.4.2. Primary Cohort Analysis
Two distinct multivariable models were developed to separate the preventative effect of vaccination from the possibility of reverse causality. The first, the prevention model, assessed the odds of developing long COVID among individuals who had received a vaccine at least 14 days before their initial infection, compared with those who remained unvaccinated. The second was the post-acute model, which examined the odds of long COVID in participants who were vaccinated only after their infection had begun, again using the unvaccinated group as the baseline for comparison.
2.4.3. Subgroup Analysis: Vaccine Regimen
To investigate whether heterologous vaccination provided superior protection compared with homologous regimens, a subgroup analysis was performed within the prevention cohort. The exposure variable was categorised into three levels: unvaccinated (reference group), homologous (all doses from the same vaccine platform), and heterologous (doses from two or more different platforms), as defined in Section 2.3.2. This model assessed whether the heterologous and homologous groups differed significantly from the unvaccinated baseline or from each other, while adjusting for the same set of covariates.
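The three-level exposure variable can be derived from recorded vaccine brands using the platform-based definition in Section 2.3.2. The sketch below is illustrative only: the lookup table covers just the brands named in that section, and the function name and labels are hypothetical.

```python
# Brand-to-platform lookup, limited to the vaccines named in Section 2.3.2.
PLATFORM_BY_BRAND = {
    "BNT162b2": "mRNA",
    "mRNA-1273": "mRNA",
    "ChAdOx1 nCoV-19": "adenoviral vector",
}


def classify_regimen(brands):
    """Return 'unvaccinated', 'homologous', or 'heterologous'.

    brands: list of brand names for all doses received before infection.
    Raises KeyError for a brand missing from the lookup table, so that
    unrecognised records are reviewed rather than silently classified.
    """
    if not brands:
        return "unvaccinated"
    platforms = {PLATFORM_BY_BRAND[brand] for brand in brands}
    return "homologous" if len(platforms) == 1 else "heterologous"
```

Note that under the platform-based definition, a BNT162b2 dose followed by an mRNA-1273 dose counts as homologous, because both are mRNA vaccines despite coming from different manufacturers.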
2.4.4. Model Adjustments
All regression models were adjusted to account for several potential confounding variables. Specifically, demographic factors included age (continuous), gender (male or female), and body mass index (BMI; continuous). Clinical status was represented by the presence of comorbidities, coded as a binary variable distinguishing between individuals with no comorbidities and those with at least one comorbidity. The severity of the acute phase was controlled for by including the LoS, classified as short (fewer than 4 days) or long (4 days or more). Statistical significance was determined by a 95% confidence interval that did not include 1.0, and all p-values were two-tailed.
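The significance rule stated above can be made concrete with a small sketch that converts a log-odds coefficient and its standard error into an odds ratio with a 95% interval. A Wald-type normal approximation is assumed here; the text does not state how the intervals were derived. The numeric inputs in the usage lines are invented and are not study results.

```python
import math

# Standard normal quantile for a two-sided 95% interval.
Z_95 = 1.959964


def odds_ratio_ci(beta: float, se: float):
    """Return (odds ratio, lower bound, upper bound) from a log-odds estimate."""
    return (
        math.exp(beta),
        math.exp(beta - Z_95 * se),
        math.exp(beta + Z_95 * se),
    )


def excludes_null(lower: float, upper: float) -> bool:
    """True when the interval does not contain an odds ratio of 1.0."""
    return lower > 1.0 or upper < 1.0


# Hypothetical coefficients, for illustration only.
or_a, lo_a, hi_a = odds_ratio_ci(beta=-0.5, se=0.2)
or_b, lo_b, hi_b = odds_ratio_ci(beta=-0.2, se=0.3)
print(excludes_null(lo_a, hi_a), excludes_null(lo_b, hi_b))
```

In the first hypothetical case the whole interval lies below 1.0, so the association would be declared significant. In the second, the interval spans 1.0, so the association would be reported as a non-significant trend.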
4. Discussion
This retrospective cohort study of 627 hospitalised adults resolves a critical epidemiological paradox by distinguishing the temporal effects of vaccination on long COVID. By stratifying participants into distinct prevention and post-acute cohorts, the study disentangles the biological protective effects of vaccination from behavioural confounders. The primary finding is that while vaccination administered prior to infection shows a protective trend against long COVID (aOR 0.81), the elevated odds observed with post-infection vaccination (aOR 3.41) are attributable to reverse causality, in which patients with persistent symptoms are significantly more likely to seek vaccination after their acute infection.
Our observation that pre-infection vaccination is associated with a reduction in the odds of long COVID aligns with large-scale epidemiological evidence. Registry studies, such as those involving U.S. Veterans Affairs data [4], UK community cohorts by the Office for National Statistics [5], and other studies [9, 10, 11], have reported risk reductions ranging from 15% to 50%. Our adjusted odds ratio of 0.81, corresponding to a 19% reduction in odds, falls within the 15–41% protective range identified in systematic reviews [6]. However, our study extends this work by resolving the “paradox” of higher vaccination rates among long COVID patients; while unadjusted comparisons in our dataset initially mirrored conflicting reports, our stratified analysis confirms that these associations disappear when proper temporal sequencing is applied, validating the principles of causal inference described by Hernán and Robins [12]. Crucially, our study differs from population-level studies by focusing exclusively on hospitalised survivors. For vaccinated individuals in our prevention cohort, these admissions represent ‘breakthrough’ severe infections. Observing a protective trend (aOR 0.81) even within this high-severity group suggests that vaccination may offer a ‘protective floor’ against downstream sequelae, even when it fails to prevent the acute hospitalisation itself.
The lack of statistical significance for the protective effect in the prevention cohort (aOR 0.81) may plausibly reflect effect modification by baseline disease severity. In a population already sufficiently ill to require hospital admission, the dominant drivers of long COVID risk—specifically comorbidity burden (aOR 2.78) and prolonged LoS (aOR 1.82)—likely overwhelm the marginal downstream protective effect of vaccination. Consequently, the vaccine’s ability to modulate post-acute sequelae may be harder to detect in this high-severity context compared to community populations where it prevents the initial cascade of severe disease entirely.
A notable and novel finding in our study is the lack of significant difference between homologous and heterologous vaccine regimens in preventing long COVID symptoms. While early immunological studies suggested that heterologous boosting might induce superior broad-spectrum immunity [7, 8], our clinical data indicate that for the specific outcome of long COVID prevention, the timing of immunity, established before infection, is more critical than the platform combination used. This suggests that the complexity of the vaccine schedule is less important than the binary state of being immunised prior to viral exposure. However, given the limited sample sizes within these specific regimen subgroups, these findings should be interpreted with caution and viewed as hypothesis-generating rather than definitive evidence of equivalence.
The observed protective trend in the prevention cohort supports the biological hypothesis that pre-existing adaptive immunity facilitates rapid viral clearance, thereby limiting the viral persistence and tissue damage implicated in long COVID pathogenesis [13, 14]. Pre-existing immunity likely primes the adaptive immune response to clear viral reservoirs faster, potentially reducing the inflammatory cascade associated with post-acute sequelae. This biological mechanism stands in stark contrast to the findings in our post-acute cohort, which require a behavioural rather than biological explanation.
The strong association observed in the post-acute cohort (aOR 3.41) highlights the phenomenon of indication bias. Patients suffering from debilitating sequelae likely view vaccination as a necessary therapeutic intervention or seek it urgently to prevent reinfection, given their perceived vulnerability [3]. This mirrors patterns observed in other post-viral syndromes, in which symptomatic individuals interact more frequently with healthcare systems. Consequently, the “risk” observed in cross-sectional studies is a marker of health-seeking behaviour, distinguishing it fundamentally from the protective signal observed when vaccination precedes infection.
Beyond vaccination status, the identification of acute disease severity (LoS ≥ 4 days) as a dominant predictor of long COVID (aOR 1.82) has practical diagnostic implications. Clinicians should recognise that hospitalised patients requiring supplemental oxygen or extended admissions are at elevated risk for post-acute sequelae regardless of their vaccination status. Consequently, discharge planning for these high-severity patients should proactively include screening for functional limitations and fatigue at follow-up intervals, rather than waiting for patient self-report [15].
For policymakers, these findings reinforce the imperative to maintain high vaccination coverage, particularly among populations with comorbidities, as preventing severe acute disease is the most effective lever for reducing downstream long COVID prevalence [3]. Operationally, health systems should not prioritise complex heterologous vaccine supply chains solely for long COVID prevention, as standard homologous schedules appear equally effective. Stakeholders can immediately use these data to counter vaccine hesitancy narratives that falsely claim vaccines cause long COVID, focusing instead on the reduction in acute severity as the primary preventative mechanism.
The strength of our interpretation relies on the use of Bayesian logistic regression, which resolved issues of quasi-complete separation arising from zero event counts in specific subgroups, a methodological limitation that caused standard maximum likelihood models to fail. Furthermore, our use of LoS as a proxy for acute severity allowed us to adjust for the most significant confounder in hospitalised populations [3]. This analytic approach increases confidence that the null findings in the prevention cohort are due to statistical power limitations rather than methodological bias. Despite these strengths, we must acknowledge that our single-centre retrospective design limits the sample size, resulting in wide confidence intervals that prevented the protective trend in the prevention cohort from reaching statistical significance. Selection bias is inherent to the study: the cohort consisted entirely of hospitalised patients, so our findings reflect the impact of vaccination on severe disease phenotypes and cannot be generalised to the majority of long COVID sufferers, who had mild, non-hospitalised acute infections. Consequently, our estimates likely differ from community-based studies where the vaccine’s ability to prevent infection entirely plays a larger role. Additionally, reliance on electronic health records may introduce measurement bias, as symptoms are recorded only when patients report them to clinicians, potentially underestimating prevalence compared with prospective symptom diaries.
5. Conclusions
In this retrospective cohort study, vaccination administered prior to infection was associated with a non-significant protective trend against long COVID, whereas vaccination administered post-infection was strongly associated with increased symptom reporting. This divergence provides critical empirical evidence that previously reported “risks” of vaccination are attributable to reverse causality and indication bias—specifically, symptomatic survivors seeking vaccination—rather than biological harm. Our findings reinforce the established efficacy of COVID-19 vaccination, demonstrating that long-term complications do not offset its acute benefits. While temporal stratification strengthens our inference, residual confounding inherent to observational methodology cannot be fully excluded. Consequently, clinical attention should shift away from unfounded concerns regarding vaccine safety and toward managing the independent predictors of sequelae identified here: comorbidities and acute severity. Specifically, patients with acute hospital stays exceeding four days warrant targeted longitudinal monitoring for functional decline, regardless of their vaccination history. From a policy perspective, maintaining high vaccine coverage in comorbid populations remains the most viable strategy to reduce long COVID incidence by mitigating its primary driver—severe acute disease. Future research should utilise large-scale prospective registries to validate these trends across emerging variants, but current evidence supports the continued prioritisation of vaccination and acute severity reduction as the foundation for minimising the pandemic’s long-term burden.