Characterizing and Predicting Post-Acute Sequelae of SARS CoV-2 Infection (PASC) in a Large Academic Medical Center in the US

Background: A growing number of Coronavirus Disease-2019 (COVID-19) survivors are affected by post-acute sequelae of SARS CoV-2 infection (PASC). Using electronic health record data, we aimed to characterize PASC-associated diagnoses and develop risk prediction models. Methods: In our cohort of 63,675 patients with a history of COVID-19, 1724 (2.7%) had a recorded PASC diagnosis. We used a case–control study design and phenome-wide scans to characterize PASC-associated phenotypes of the pre-, acute-, and post-COVID-19 periods. We also integrated PASC-associated phenotypes into phenotype risk scores (PheRSs) and evaluated their predictive performance. Results: In the post-COVID-19 period, known PASC symptoms (e.g., shortness of breath, malaise/fatigue) and musculoskeletal, infectious, and digestive disorders were enriched among PASC cases. We found seven phenotypes in the pre-COVID-19 period (e.g., irritable bowel syndrome, concussion, nausea/vomiting) and sixty-nine phenotypes in the acute-COVID-19 period (predominantly respiratory, circulatory, neurological) associated with PASC. The derived pre- and acute-COVID-19 PheRSs stratified risk well, e.g., the combined PheRSs identified a quarter of the cohort with a history of COVID-19 with a 3.5-fold increased risk (95% CI: 2.19, 5.55) for PASC compared to the bottom 50%. Conclusions: The uncovered PASC-associated diagnoses across categories highlighted a complex arrangement of presenting and likely predisposing features, some with potential for risk stratification approaches.


Study Cohort
The study included Michigan Medicine (MM) patients with a recorded COVID-19 diagnosis or a positive real-time reverse transcriptase polymerase chain reaction (RT-PCR) test for SARS-CoV-2 infection performed/recorded at MM between 10 March 2020 and 31 August 2022. Diagnoses were recorded at clinic visits and hospital encounters. RT-PCR testing data were collected for routine screening at hospital admission, before procedures, and for employee screening. Tests included both symptomatic and asymptomatic individuals.
For each subject, the date of their first COVID-19 diagnosis or RT-PCR positive test, whichever came first, was considered the index date. Dates were regarded as protected health information and operationalized as days since birth; however, the quarter of the year of the index date was obtained. To allow sufficient follow-up time for diagnosing PASC, we limited the analysis to patients with encounters at least two months after being COVID-19 positive and stratified them into PASC cases (had a recorded PASC diagnosis) and PASC controls (had no recorded PASC diagnosis).
PASC diagnoses were either based on an entry of PASC in the diagnosis section of the EHR database's Problem Summary List (PSL, Table S1) or on observations of the ICD-10-CM (International Classification of Diseases codes, tenth edition with clinical modifications) codes U09.9 ("Post COVID-19 condition, unspecified") or B94.8 ("Sequelae of other specified infectious and parasitic diseases"). The CDC recommended the latter as a temporary alternative to the PASC-specific U09.9 code, which was implemented on 1 October 2021 [34]. PSL diagnoses represent active and resolved patient problems entered by healthcare providers. The age at the first observed ICD- or PSL-based PASC diagnosis was considered the age of onset of PASC. PASC cases (see definition below) without a prior positive test were excluded because the timepoint of the test was crucial for defining the pre-COVID-19 and acute-COVID-19 time periods (Figure 1).
Figure 1. Schematic of the study design. Three time periods were defined relative to the first positive COVID-19 test or diagnosis (index date): pre-COVID-19 until −14 days, acute-COVID-19 from −14 to +28 days, and post-COVID-19 from +28 days onwards. The post-COVID-19 PheWAS is used to validate features of PASC cases compared to COVID-19 cases without PASC diagnoses. The pre-COVID-19 and acute-COVID-19 PheWAS on the training data (index date in 2020-2021) inform phenotype risk scores (PheRS) that will be used to predict PASC in the testing data (index date in 2022).
We also categorized PASC patients based on ICD10 diagnoses concurrently recorded with their first PASC diagnosis and mapped them to 29 phenotype concepts previously reported as common PASC symptoms [3]. In addition, we manually mapped detailed PSL diagnoses to these 29 concepts (Tables S1 and S2).

Definition of Demographics, Socioeconomic Status, and Other Covariates
To examine and adjust for confounding by patient characteristics, socioeconomic status, and other variables, we obtained the following data for each participant: age, self-reported gender, self-reported race/ethnicity, neighborhood disadvantage index (NDI) without proportion of Black (coded as quartiles, with larger quartiles representing more disadvantaged communities) [35,36], and population density measured in persons per square mile (operationalized as quartiles).
Additional covariates included vaccination status, the Elixhauser comorbidity score [37,38], COVID-19 severity (non-severe (not hospitalized) and severe (hospitalized or deceased)), healthcare worker (HCW) status, the timespan of records in the EHR before and after the COVID-19 test/diagnosis, and the timespan of records in the EHR before 2020 (referred to as "pre-pandemic" time period). These timespans were based on the first or last recorded encounter in the EHR data. Additional details and definitions of these covariates can be found in Appendix A and Table S3.
We assumed that the covariates included in our adjusted analyses were missing completely at random and performed complete case analyses for each adjustment.

Time-Restricted Phenomes
We constructed each subject's medical phenome by extracting available ICD9 and ICD10 codes from the EHR and mapping them to 1813 broader phenotype concepts (PheCodes) using the R package "PheWAS" [39,40]. In short, individuals with ICD codes that map to a specific PheCode were coded as "1", then individuals with ICD codes that map to the PheCode's specific exclusion criteria were coded as missing, and finally, all remaining individuals were coded as "0" for that particular PheCode (further details are described elsewhere [40]). We created three time-restricted phenomes relative to the index date: pre-COVID-19 (predating −14 days), acute-COVID-19 (between −14 and +28 days), and post-COVID-19 (+28 days to +6 months; Figure 1).
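The three-level coding and the time windows described above can be sketched as follows. This is only an illustrative sketch: the function names, the toy ICD-to-PheCode mapping, and the half-open window boundaries are assumptions made for the example and are not taken from the R package "PheWAS" or from the study code.

```python
from typing import Iterable, Optional, Set, Tuple


def restrict_to_window(events: Iterable[Tuple[str, int]], index_day: int,
                       start: float, end: float) -> Set[str]:
    """Keep ICD codes whose offset (in days) from the index date lies in [start, end).

    Days are expressed as days since birth, as in the study. Treating the
    windows as half-open intervals is an assumption of this sketch.
    """
    return {code for code, day in events if start <= day - index_day < end}


def code_phecode(patient_icd: Set[str], include: Set[str],
                 exclude: Set[str]) -> Optional[int]:
    """Return 1 (has the PheCode), None (missing due to exclusion), or 0."""
    if patient_icd & include:
        return 1
    if patient_icd & exclude:
        return None
    return 0


# Hypothetical patient: (ICD code, day since birth); index date at day 1000.
events = [("R06.02", 1010), ("K58.9", 600)]
pre_codes = restrict_to_window(events, 1000, float("-inf"), -14)
acute_codes = restrict_to_window(events, 1000, -14, 28)

# Toy mapping for one PheCode (illustrative only, not the real PheCode map).
include, exclude = {"R06.02"}, {"J44.9"}
acute_status = code_phecode(acute_codes, include, exclude)
pre_status = code_phecode(pre_codes, include, exclude)
```

Here the patient is coded "1" in the acute-COVID-19 phenome and "0" in the pre-COVID-19 phenome for the toy PheCode.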

Matching
To minimize confounding when comparing PASC (case) versus no PASC (control), we matched each PASC case to up to 10 PASC controls using the R package "MatchIt" [41]. Nearest neighbor covariate matching was applied for age at index date, pre-COVID-19 years in EHR, and post-COVID-19 years in EHR without applying a caliper. Exact matching was used for sex, primary care visit at Michigan Medicine within the last two years (yes/no), race/ethnicity, and year quarter of the index date. We retained the case-control matching throughout all analyses.
To characterize diagnoses enriched in COVID-19 patients with PASC, we also conducted a PheWAS to identify phenotypes associated with PASC in the post-COVID-19 period (at least 28 days after the COVID-19 index date, see Figure 1) using Firth bias-corrected logistic regression, fitting a separate model for each PheCode of the post-COVID-19 period phenome. Covariates were the pre-COVID-19 Elixhauser Score (AHRQ), NDI, population density, healthcare worker status (HCW), vaccination status, and severity; details are summarized in Appendix A and Table S3.
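The matching step (exact matching on categorical covariates, nearest neighbor matching on continuous ones, up to 10 controls per case, no caliper) can be illustrated with a small greedy procedure. This is a simplified sketch and not the algorithm implemented in "MatchIt": the standardized Euclidean distance, the greedy case order, and all field names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def match_controls(cases, controls, exact_keys, numeric_keys, ratio=10):
    """Greedy nearest-neighbor matching without replacement inside exact strata.

    cases / controls are lists of dicts with an "id" field (hypothetical layout).
    Returns {case_id: [matched control ids]} with at most `ratio` controls each.
    """
    # Standardize numeric covariates so that no single scale dominates the distance.
    pool = cases + controls
    scale = {}
    for key in numeric_keys:
        values = [p[key] for p in pool]
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        scale[key] = sd or 1.0

    def distance(a, b):
        return math.sqrt(sum(((a[k] - b[k]) / scale[k]) ** 2 for k in numeric_keys))

    # Controls are only eligible for cases with identical exact-match covariates.
    strata = defaultdict(list)
    for control in controls:
        strata[tuple(control[k] for k in exact_keys)].append(control)

    matches = {}
    for case in cases:
        candidates = strata[tuple(case[k] for k in exact_keys)]
        candidates.sort(key=lambda c: distance(case, c))
        matches[case["id"]] = [c["id"] for c in candidates[:ratio]]
        del candidates[:ratio]  # matched controls are not reused
    return matches


cases = [{"id": "c1", "sex": "F", "quarter": "2021Q1", "age": 50.0}]
controls = [
    {"id": "k1", "sex": "F", "quarter": "2021Q1", "age": 52.0},
    {"id": "k2", "sex": "F", "quarter": "2021Q1", "age": 30.0},
    {"id": "k3", "sex": "M", "quarter": "2021Q1", "age": 50.0},
]
one_to_one = match_controls(cases, controls, ["sex", "quarter"], ["age"], ratio=1)
up_to_ten = match_controls(cases, controls, ["sex", "quarter"], ["age"], ratio=10)
```

In the toy data, the male control is never eligible for the female case, and a case with fewer than 10 eligible controls simply receives fewer matches.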

Pre-Disposing PheCodes
We conducted PheWAS to identify PheCodes pre-disposing to PASC using either PheCodes from the pre-COVID-19 period or PheCodes from the acute-COVID-19 period. We ran Firth bias-corrected logistic regression, fitting a separate model for each PheCode of the corresponding time-restricted phenome, and applied a similar set of covariate adjustments as before (Table S3). The phenomes were split into a training set (index dates in 2020 and 2021) and a testing set (index date in 2022). This choice retains the spirit of predicting future outcomes from past data. The training set was used to identify predisposing PheCodes in phenome-wide association studies (PheWAS), while the testing set was used to evaluate prediction models based on the PheWAS results.
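One plausible way to write the per-PheCode model is given below. It is a sketch under the assumption that PASC status is the outcome and the PheCode is the predictor of interest; the symbols $\mathbf{x}$ (covariate vector) and $\boldsymbol{\beta}_{x}$ are notation introduced here for illustration.

```latex
\operatorname{logit} P\left(\mathrm{PASC} = 1 \mid \mathrm{PheCode}, \mathbf{x}\right)
  = \beta_0 + \beta_{\mathrm{PheCode}}\,\mathrm{PheCode} + \boldsymbol{\beta}_{x}^{\top}\mathbf{x}
```

Under this form, $\exp(\beta_{\mathrm{PheCode}})$ is the covariate-adjusted odds ratio for PASC associated with the presence of the PheCode.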
To evaluate the robustness of effect sizes of predisposing PheCodes, we performed several sensitivity analyses: (1) females only, (2) males only, (3) index date in 2020, (4) index date in 2021, (5) non-severe outcomes (not hospitalized), (6) severe outcomes (hospitalized or deceased), (7) recorded within two years before the index date, and (8) pre-pandemic (before 2020). For the acute-COVID-19 PheWAS, we excluded PASC cases whose first recorded PASC diagnosis was observed less than 28 days after the index date. The sample sizes of the complete case analyses for various analyses are listed in Table S4.
PheWASs were restricted to PheCodes observed at least five times among cases and among controls. For all PheWAS, we excluded PheCode 136 "Other infectious and parasitic diseases" as it included the ICD-10 code "B94.8" which was used to record a PASC diagnosis.
For each PheWAS, we applied a Bonferroni correction adjusting for the number of analyzed PheCodes (Table S4). In Manhattan plots, we present −log10(p-value) corresponding to tests for association of the underlying phenotype. Directional triangles on the PheWAS plot indicate whether a trait was positively (pointing up) or negatively (pointing down) associated with PASC.
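The PheCode filter and the multiple-testing threshold described above amount to simple arithmetic, sketched here. The function names and the toy counts are hypothetical; only the minimum count of five, the exclusion of PheCode 136, and the Bonferroni rule come from the text.

```python
import math


def eligible_phecodes(counts, min_count=5, excluded=("136",)):
    """Keep PheCodes observed at least `min_count` times among cases AND controls.

    counts: {phecode: (n_cases_with_code, n_controls_with_code)}
    """
    return [
        code for code, (n_cases, n_controls) in counts.items()
        if code not in excluded and n_cases >= min_count and n_controls >= min_count
    ]


def bonferroni_threshold(n_tests, alpha=0.05):
    """Significance threshold after Bonferroni correction for n_tests tests."""
    return alpha / n_tests


def neg_log10(p_value):
    """Value plotted on the y-axis of a Manhattan plot."""
    return -math.log10(p_value)


toy_counts = {"512.7": (40, 120), "117.4": (6, 2), "136": (50, 60)}
tested = eligible_phecodes(toy_counts)
threshold = bonferroni_threshold(len(tested))
```

With a single eligible PheCode in the toy data the threshold stays at 0.05; with 49 tests it would be 0.05/49, roughly 0.001.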
We also tested for differences between effect sizes of three subgroup comparisons (non-severe vs. severe outcome, female vs. male, and index date in 2020 vs. 2021) using the following t-statistic: $t = (\beta_A - \beta_B) / \sqrt{SE(\beta_A)^2 + SE(\beta_B)^2}$, where $\beta_A$ and $\beta_B$ are the subgroup-specific beta-estimates with corresponding standard errors $SE(\beta_A)$ and $SE(\beta_B)$.
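As a worked example of such a subgroup comparison, the sketch below assumes that the statistic is the difference of the two beta-estimates divided by the square root of the sum of their squared standard errors, and that it is referred to a standard normal distribution. It uses the odds ratios for shortness of breath reported in the Results (2020: OR = 2.20 [1.60, 2.99]; 2021: OR = 4.59 [3.62, 5.81]). Back-calculating standard errors from the 95% confidence limits assumes symmetric Wald-type intervals on the log-odds scale, which may not hold exactly for Firth-corrected estimates.

```python
import math


def se_from_ci(lower, upper, z=1.959964):
    """Approximate SE of a log odds ratio from its 95% confidence limits."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def subgroup_difference(or_a, ci_a, or_b, ci_b):
    """Test statistic and two-sided p-value for a difference between two log ORs."""
    beta_a, beta_b = math.log(or_a), math.log(or_b)
    se_a, se_b = se_from_ci(*ci_a), se_from_ci(*ci_b)
    t = (beta_a - beta_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided p-value under a standard normal approximation.
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


t_stat, p_diff = subgroup_difference(2.20, (1.60, 2.99), 4.59, (3.62, 5.81))
```

Because the inputs are rounded, the resulting p-value only approximates the reported P difference of 0.000234, but it falls in the same range.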

Phenotype Risk Scores (PheRS)

PheRS Generation
To generate the phenotype risk score or PheRS, we first screened the PheWAS for PheCodes that were phenome-wide significant at a Bonferroni corrected threshold in a one-at-a-time analysis in terms of their association with PASC (after adjusting for covariates). Next, we ran a joint multivariate model with all phenome-wide significant PheCodes using ridge penalized logistic regression (R Package "glmnet" [42,43]) to obtain the adjusted coefficients/weights per PheCode from the training data before calculating the PheRS in the testing data. More specifically, we weighted the presence of PheCodes with their adjusted coefficients from the multivariate ridge penalized logistic regression and calculated the PheRS as the weighted sum. For subject $j$, the PheRS was of the form $\mathrm{PheRS}_j = \sum_i \hat{\beta}_i \, \mathrm{PheCode}_{ij}$, where the sum extends over all included PheCodes, $\hat{\beta}_i$ are the adjusted ridge regression coefficients for PheCode $i$ from the multivariate model, and $\mathrm{PheCode}_{ij}$ denotes the presence/absence (coded as 1 and 0) of PheCode $i$ in subject $j$. We used ridge regression because it has been shown to offer good performance when there is multicollinearity between features and when prediction is the goal [44].
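Once the weights are available, the score itself is a weighted sum over the PheCodes present in a subject. The sketch below illustrates only that final step; the weights are made-up numbers rather than the coefficients from File S1J, and treating a missing PheCode status as absent is an assumption of this example.

```python
def phers(presence, weights):
    """Weighted sum of PheCode indicators for one subject.

    presence: {phecode: 1, 0, or None}; weights: {phecode: ridge coefficient}.
    Missing or unrecorded PheCodes contribute 0 (assumption of this sketch).
    """
    return sum(weight * (presence.get(code) or 0) for code, weight in weights.items())


# Hypothetical weights for three PheCodes, for illustration only.
toy_weights = {"A": 0.5, "B": 0.25, "C": 0.1}
subject = {"A": 1, "B": 0, "C": 1}
score = phers(subject, toy_weights)
```

A subject with none of the weighted PheCodes receives a score of 0, so the score distribution is typically concentrated at or near zero.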

PheRS Evaluation
To evaluate each of the PheRS, we fit a Firth bias-corrected logistic regression model adjusting for age, gender, race/ethnicity, Elixhauser Score, population density, NDI, HCW, vaccination status, pre-COVID-19 years in EHR, and severity using a complete case analysis. For each PheRS, we assessed the following performance measures relative to the PASC status: (1) overall performance with Nagelkerke's pseudo-R² using the R package "rcompanion" [45]; (2) accuracy with the Brier score using the R package "DescTools" [46]; and (3) ability to discriminate between PASC cases and matched controls as measured by the area under the covariate-adjusted receiver operating characteristic (AROC; semiparametric frequentist inference) curve (denoted AAUC) using the R package "ROCnReg" [47]. Firth's bias reduction method was used to resolve the problem of separation in logistic regression (R package "brglm2") [48].
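Two of these measures are easy to state explicitly. The sketch below computes the Brier score and a plain, unadjusted AUC on toy data; it does not reproduce the covariate-adjusted AAUC from "ROCnReg", and the function names and inputs are assumptions made for illustration.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, scores):
    """Unadjusted AUC: probability that a random case outranks a random control.

    Ties between a case and a control count as half a concordant pair.
    """
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    controls = [s for y, s in zip(y_true, scores) if y == 0]
    concordant = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return concordant / (len(cases) * len(controls))


y = [1, 0, 1, 0]
predicted = [0.8, 0.2, 0.4, 0.6]
toy_brier = brier_score(y, predicted)
toy_auc = auc(y, predicted)
```

An AUC of 0.5 corresponds to no discrimination, which gives context for values in the 0.55 to 0.62 range.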
To also evaluate models with both predictors (PheRS1-Ridge + PheRS2-Ridge), we combined them by first fitting a logistic regression with the predictors in the training set to obtain the linear predictors that we used to obtain the combined score in the testing data.

Patient Characteristics
Among 63,675 patients with a history of COVID-19 who were seen in MM at least two months after their first record of COVID-19, 1724 (2.7%) received a PASC diagnosis. The PASC prevalence within three months of a COVID-19 infection ranged from 0.18% (Q3 of 2020) to 1.8% (Q3 of 2021). The largest number of PASC cases was observed in Q4/2021 (n = 134), coinciding with the second peak of positive tests at MM (Table 1; Figure S1).
We observed that PASC cases compared to controls were on average older at their index date (mean age 47.9 versus 41.7 years), had a slightly longer timespan covered in the pre-test EHRs (11.7 versus 10.4 years), were more likely female (64.5% versus 56.7%), more likely to have received primary care at MM in the last two years (60.7% versus 46.4%) and showed different distributions across the year quarters over time (Table 1). To adjust for these observed differences, we performed nearest neighbor matching (age at index date, pre-test years in EHR, post-test years in EHR) and exact matching (gender, primary care at MM, race/ethnicity, quarter of year at COVID-19 index date). All significant differences in covariates became non-significant after matching (Table 1).

Pre-COVID-19 PheWAS
Of the 1724 individuals, 163 had incomplete covariate data. The 1561 remaining individuals were split into a training set (1212 individuals whose first positive test/diagnosis was recorded before 2022) and a testing set (349 individuals whose first positive test/diagnosis was recorded in 2022; also see flowchart in Figure S2). To identify potential PASC-predisposing conditions, we performed a PheWAS using the pre-COVID-19 [...] (Figure 3, File S1B). Additional sensitivity analyses indicated robust associations across various settings (females only, males only, 2020 only, 2021 only, non-severe outcome, severe outcomes, within two years before the index date, or before the pandemic; Figure S3A-G, File S1D-F).
Our sensitivity analyses indicated robust associations across various settings (females only, males only, 2020 only, 2021 only, non-severe outcomes, severe outcomes), where most associations remained nominally significant in each sub-analysis or had overlapping confidence intervals. However, effect sizes were not as consistent (Figure S4A-AK, File S1G-I). Notably, the effect size for shortness of breath differed significantly between index dates in 2020 and 2021 (2020: OR = 2.20 [1.60, 2.99], p = 7.8 × 10^−7 compared to 2021: OR = 4.59 [3.62, 5.81], p = 9.37 × 10^−37; P difference = 0.000234), though it was significantly associated with PASC in both years (Figure S4AA, File S1C,I). Despite low numbers of individuals with severe outcomes (160 PASC cases and 150 controls), 6 of the 69 significantly associated phenotypes (aspergillosis, bacterial pneumonia, MRSA pneumonia, hyperosmolality and/or hypernatremia, septic shock, and voice disturbances) only had sufficient observations among the subset with severe outcomes but not among the non-severe outcome subset (724 PASC cases and 6799 controls; Table S4 and File S1C,G). This suggested that these phenotypes might be hospital-acquired complications. None of the 49 significantly associated phenotypes that were tested among individuals with non-severe outcomes and individuals with severe outcomes showed significant effect size differences (P difference ≥ 0.001 [0.05/49 tests]). All phenotypes with nominal effect size differences between non-severe and severe outcomes (P difference < 0.05) were strongly and positively associated in individuals with non-severe outcomes and were thus unlikely to merely represent hospital-acquired complications (File S1G).

Comparison of "Pre-PASC" Associated PheCodes across Three PheWAS
To investigate whether the associated phenotypes of the pre- and acute-COVID-19 periods ("pre-PASC" phenotypes) are associated with novel PASC symptoms or if they become long-term features that manifest as PASC, we explored their frequencies and their association signals across all three PheWAS (Figure S5). Interestingly, almost all associated "pre-PASC" phenotypes were also significantly enriched in the post-COVID-19 PheWAS, except for "allergic reaction to food" of the pre-COVID-19 PheWAS and "candidiasis" and "inflammation and edema of the lung" in the acute-COVID-19 PheWAS. However, their ORs were all positive (File S1A-C). While we observed similarities between pre-existing conditions and presenting PASC features, further analyses using rigorous causal inference methods are needed to evaluate their causal role in developing PASC. The current analysis is merely correlative and a prediction exercise.

Developing Phenotype Risk Scores for Predicting PASC
The pre- and acute-COVID-19 PheWASs indicated pre-disposing conditions for PASC. To study whether these conditions might be helpful in predicting PASC among patients with a history of COVID-19, we generated two PheRSs: a pre-COVID-19 PheRS "PheRS1" and an acute-COVID-19 PheRS "PheRS2". We avoided overfitting by using PheWAS results and PheRS weights obtained from individuals with index dates in 2020 or 2021, while the evaluations were performed in individuals with index dates in 2022 (Figures 1 and S2 and File S1J). To limit the impact of potential hospital-acquired complications of an acute-COVID-19 infection, we excluded the six phenotypes that were only tested/observed in the individuals with severe outcomes (see "acute-COVID-19 PheWAS" above).
We found that PheRS1 and PheRS2 could discriminate cases and controls, yet only with low accuracy (AAUC < 0.7). PheRS1 performance was comparable in the complete testing data (AAUC PheRS1 = 0.548 [95% CI: 0.516, 0.580]) and the testing data that were reduced to PASC cases that had at least 28 days between their index date and the PASC diagnosis (AAUC PheRS1 = 0.555 [95% CI: 0.496, 0.612]). PheRS2 was only analyzed in the latter data (AAUC PheRS2 = 0.605 [95% CI: 0.549, 0.663]) but performed better than PheRS1, which was also evident from its pseudo-R², which was almost five-fold higher (0.0547 for PheRS2 versus 0.0116 for PheRS1). A combination score further improved the discrimination of cases and controls, but its accuracy remained low (AAUC Combined = 0.615 [0.561, 0.670]; Table 2).
We also explored if PheRSs based on additional suggestively associated PheCodes (defined as p < 1 × 10^−3) could further improve the prediction of PASC but found their individual or combined predictive ability slightly worse compared to the PheRSs that were based on phenome-wide significant hits (e.g., AAUC Combined = 0.601 [0.548, 0.658]; Table S7).
While the use for individual-level prediction seemed very limited, we found that PheRS1 and PheRS2 could significantly enrich PASC cases in their top 10% and top 10-25% risk bins compared to the lower 50% of their distributions (Table 3; Figure 5).
Table 2. PheRS Evaluation in the testing data (COVID-19 positive in 2022). PheRS1 was based on the significant hits of the PheWAS with the pre-COVID-19 training data (1256 cases and 11,674 controls; COVID-19 positive in 2020/2021) while PheRS2 was based on the significant hits of the PheWAS with the acute-COVID-19 training data (874 cases and 8144 controls; COVID-19 positive in 2020/2021 and at least 28 days between first COVID-19 and first PASC diagnosis). Underlying weights can be found in File S1J and Table S8.
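The risk-bin comparison can be illustrated with an unadjusted calculation: rank subjects by score, assign them to bins, and compare the odds of PASC in each upper bin with the odds in the bottom 50%. This sketch is a simplification, since the study estimated these contrasts with covariate-adjusted models; the bin labels, the tie handling (by sort order), and the toy data are assumptions.

```python
from collections import Counter


def risk_bin_odds_ratios(scores, labels):
    """Unadjusted odds ratios for the top 10% and top 10-25% bins vs. the bottom 50%.

    scores: risk score per subject; labels: 1 = PASC case, 0 = control.
    Returns None for a bin if any cell of its 2x2 table is empty.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    cells = Counter()
    for rank, i in enumerate(order):
        fraction = rank / n  # share of subjects ranked above this one
        if fraction < 0.10:
            risk_bin = "top10"
        elif fraction < 0.25:
            risk_bin = "top10_25"
        elif fraction < 0.50:
            risk_bin = "mid25_50"
        else:
            risk_bin = "bottom50"
        cells[(risk_bin, labels[i])] += 1

    ref_cases, ref_controls = cells[("bottom50", 1)], cells[("bottom50", 0)]
    result = {}
    for risk_bin in ("top10", "top10_25"):
        n_cases, n_controls = cells[(risk_bin, 1)], cells[(risk_bin, 0)]
        if 0 in (n_cases, n_controls, ref_cases, ref_controls):
            result[risk_bin] = None
        else:
            result[risk_bin] = (n_cases / n_controls) / (ref_cases / ref_controls)
    return result


# Toy cohort of 20 subjects ordered from highest to lowest score.
toy_scores = list(range(20, 0, -1))
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_or = risk_bin_odds_ratios(toy_scores, toy_labels)
```

In the toy cohort the top 10% bin has odds of 1/1 and the bottom 50% has odds of 2/8, which gives an odds ratio of 4.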

Discussion
In this study, we used data from a relatively large cohort of patients with a history of COVID-19 from Michigan Medicine. We applied a PheWAS approach across time-restricted phenomes to identify phenotypes that may predispose to PASC. We found seven phenotypes (IBS, concussion, nausea and vomiting, shortness of breath, respiratory abnormalities, allergic reaction to food, and general circulatory disease) of the pre-COVID-19 period and 69 phenotypes (predominantly respiratory and circulatory symptoms) of the acute-COVID-19 period to be significantly enriched among PASC cases. Most of them were also enriched among PASC cases in the post-COVID-19 period, indicating that some of these phenotypes might have become longer-lasting or even chronic conditions. When incorporating these findings into PheRSs, we found that both the pre-COVID-19 PheRS and the acute-COVID-19 PheRS could predict PASC only with low accuracy among patients with a history of COVID-19, even when combined.
Possible explanations could be the random variation due to the small number of PASC cases, or differences due to different waves of coronavirus variants, the effect of vaccines, and changes in treatment and care of severe cases. Temporal trends in PASC diagnosis and management make this forward-looking prediction exercise much harder. We noted differences in the feature distributions between the training and testing sets, e.g., "nausea and vomiting" among the pre-COVID-19 features or "anxiety" among the acute-COVID-19 features showed less pronounced differences between PASC cases and "No PASC" controls in the testing set (File S1J,K). However, the combination of both PheRSs could identify a quarter of patients with a history of COVID-19 in the testing cohort with a 3.5-fold increased risk of PASC (95% CI: 2.19, 5.55) compared to the bottom 50%. This observation highlighted the clinical utility of existing EHR data on pre-existing and acute COVID-19 symptoms for risk stratification and the identification of a large group of vulnerable individuals who might benefit from stricter protective measures or earlier interventions.
A comparison of our findings with previous studies confirmed many pre-existing conditions that predispose to PASC. For example, in the pre-COVID-19 period PheWAS, we identified several respiratory symptoms that predisposed to PASC, including shortness of breath and other respiratory abnormalities. These findings are consistent with previous works [15,27,50]. The literature on IBS as a pre-disposing diagnosis for PASC seems sparse; however, there might be a connection between gut microbiota and the clinical course of COVID-19 [51] and mediation of risk factor effects for COVID-19 [52,53]. Similarly, little seems to be known about concussion as a pre-disposing diagnosis for PASC; yet, pre-existing cognitive risk factors such as mild traumatic brain injury were reported as enriched among cognitive PASC cases compared to non-cognitive PASC patients [54]. Future studies are needed to substantiate our findings and investigate how pre-disposing diagnoses relate to PASC. In addition to the results from the pre-COVID-19 period conditions, our findings from the acute-COVID-19 period also accord with previous studies. Among the 69 PASC-associated phenotypes, the majority were respiratory symptoms and in line with earlier reports (e.g., cough [55,56], dyspnea [57], respiratory insufficiency [58]). Additionally, the identified muscle-related symptoms, including myalgia, malaise, and fatigue, were supported by previous PASC studies [59,60]. Similar to a previous study, we found circulatory diseases to play an essential role as a predisposing factor for PASC [61]. While not all observed associations were previously reported, our sensitivity analyses indicated overall robustness across various settings [62,63].
An overlap between the enriched symptoms in the three periods implies the possibility of PASC being recurring symptoms of pre-existing conditions [17]. The difference in subsiding rate between cases and controls in some symptoms (e.g., respiratory symptoms) potentially indicates the development of chronic conditions [9,64].
There are several limitations to our analysis. First, we focused on predisposing diagnoses and performed matching, including on age, gender, and race/ethnicity, to adjust for potential confounding; however, these demographic characteristics were previously implicated as pre-disposing factors [65-67]. So, while matching and adjusting for these covariates might have effectively increased the power to identify pre-existing phenotypes that increase the risk for PASC, we disregarded these demographic factors as PASC predictors. Future studies are needed to evaluate the combined contributions of these variables in more comprehensive prediction models. Second, although a clinical diagnosis of PASC was used, many reported symptoms are non-specific to PASC, and defining PASC consistently across the time period of this study is nearly impossible [68]. The uncertainty around the definition of PASC is reflected in an initial lack of CDC-approved ICD10 codes. For example, the code "U09.9" ("Post COVID-19 condition, unspecified") was first introduced in October 2021, while it was recommended to also accompany this new code with existing codes for specific conditions and/or identified symptoms [69]. Before the approval of this code, the CDC encouraged providers to use an alternative but COVID-19-unrelated code, namely "B94.8" ("Sequelae of other specified infectious and parasitic diseases") [70]. The use of PSL diagnoses enabled us to detect PASC cases before any CDC recommendations were implemented. This covers the period of March 2020 to October 2021, a pre-vaccination period where PASC incidence was possibly higher. In addition, the various descriptions in the PSL diagnoses we used to define PASC cases (see Supplementary Table S1) reflect the developing language and awareness of PASC, e.g., "Post-COVID-19 syndrome", "COVID-19 long hauler" and "Multiple persistent symptoms after COVID-19". Furthermore, many of the PASC-related PSL diagnoses offered specific information about the underlying conditions and symptoms.
The post-COVID-19 PheWAS validated our definition of PASC in that it identified many of the established PASC symptoms. Yet, awareness of PASC increased only recently, and PASC might still be underdiagnosed [71,72]. For example, we only observed 2.7% PASC-diagnosed patients in our COVID-19 positive cohort, which is far lower than the estimates from PASC studies in the US, which reported a prevalence between 19% and 35% [73]. As a result, our predictions of PASC might be overly conservative. The available diagnosis codes for PASC lacked specificity to stratify PASC cases into PASC subtypes reliably. Future studies that incorporate natural language processing of clinical notes and that have larger sample sizes will likely improve the identification of PASC cases and subtypes [74]. Third, the analysis was restricted to the patients with a history of COVID-19 who were also seen at MM during the pre-COVID-19 and post-COVID-19 periods; due to this selection bias, both cases and controls might be less healthy and older compared to randomly chosen individuals with a history of COVID-19 [75].
Moreover, it has been reported that around 15%-40% of the confirmed COVID-19 population were asymptomatic [76,77]. Using data from a health system likely enriched our cohort for symptomatic COVID-19 patients, while asymptomatic COVID-19 cases may be underrepresented. Such biases and omissions might limit the generalizability to the overall population. Although this study included a large number of COVID-19 patients, future work should expand and diversify the collection and analysis of data.
Our study used a clinical definition of PASC. In addition to the commonly used ICD codes U09.9 ("Post COVID-19 condition, unspecified") and B94.8 ("Sequelae of other specified infectious and parasitic diseases"), we applied the information from the EHR internal problem list database (PSL, Table S1) to categorize PASC patients, which enabled us to include patients whose diagnoses were recorded even before official ICD-10 recommendations/codes became available. The post-COVID-19 period PheWAS validated our PASC definition in that we found enriched diagnoses consistent with subtypes of PASC that were previously reported (e.g., shortness of breath, neurological disorders, malaise, fatigue, and dysphagia) [3,74,78]. Furthermore, given the benefit of rich retrospective EHR data, we could adjust for essential confounders in our models, including race, Elixhauser comorbidity score, vaccination status, etc., that might have affected PASC outcomes. We expect that our approach and the resulting prediction models will improve over time with increasing sample sizes and, by doing so, will likely facilitate earlier detection of PASC cases or improve risk stratification. Furthermore, a better characterization of PASC mechanisms might inform on distinct PASC forms that differ in their profiles of pre-existing conditions.

Conclusions
PASC represents a worldwide public health challenge affecting millions of people. While effective therapies for PASC are still in development [79-82], prediction and risk models can help to identify individuals at increased risk for PASC and its subcategories more reliably and potentially inform preventive or therapeutic efforts.
The present research aimed to identify PASC pre-disposing diagnoses from the pre- and acute-COVID-19 medical phenomes and to explore them as predictors for PASC. We identified known and potentially novel associations across various disease categories in both phenomes. These phenotypes, when aggregated into PheRSs, have predictive properties for PASC, especially when considered for risk stratification approaches. Future studies might consider applying more complex non-linear models such as machine learning to improve prediction models. The next opportunity will be to incorporate additional, more complex data such as laboratory measurements or medication data into such prediction models, as they have proven relevant for PASC but have yet to be fully investigated [2,83,84]. The presented PheRS framework can also be adapted to explore alternative outcomes such as survival and, by doing so, offer comprehensive insights into the long-term consequences of COVID-19.