Metabolomics-Based Machine Learning Diagnostics of Post-Acute Sequelae of SARS-CoV-2 Infection

Ethan Cai; Valentina L. Kouznetsova; Igor F. Tsigelny

doi:10.3390/metabo15120801

,

and

¹

REHS Program, San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093, USA

²

San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093, USA

³

CureScience Institute, San Diego, CA 92121, USA

⁴

BiAna Institute, San Diego, CA 92117, USA

Metabolites2025, 15(12), 801;https://doi.org/10.3390/metabo15120801

This article belongs to the Special Issue Metabolomics in Human Diseases and Health: 2nd Edition

Version Notes

Order Reprints

Abstract

Background: COVID-19 has taken millions of lives and continues to affect people worldwide. Post-Acute Sequelae of SARS-CoV-2 Infection (also known as Post-Acute Sequelae of COVID-19 (PASC) or more commonly, Long COVID) occurs in the aftermath of COVID-19 and is poorly understood despite its widespread effects. Methods: We created a machine-learning model that distinguishes PASC from PASC-similar diseases. The model was trained to recognize PASC-dysregulated metabolites (p ≤ 0.05) using molecular descriptors. Results: Our multi-layer perceptron model accurately recognizes PASC-dysregulated metabolites in the independent testing set, with an AUC-ROC of 0.8991, and differentiates PASC from myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS), Lyme disease, postural orthostatic tachycardia syndrome (POTS), and irritable bowel syndrome (IBS). However, it was unable to differentiate fibromyalgia (FM) from PASC. Conclusions: By creating and testing models pairwise on each of these diseases, we elucidated the unique strength of the similarity between FM and PASC relative to other PASC-similar diseases. Our approach is unique to PASC diagnosis, and our use of molecular descriptors enables our model to work with any metabolite where molecular descriptors can be identified, as these descriptors can be generated and compared for any metabolite. Our study presents a novel approach to PASC diagnosis that partially circumvents the lengthy process of exclusion, potentially facilitating faster interventions and improved patient outcomes.

Keywords:

PASC; Long COVID; ME/CFS; POTS; IBS; fibromyalgia; machine learning; metabolites

1. Introduction

COVID-19 is a highly contagious disease caused by the novel coronavirus, SARS-CoV-2, characterized by symptoms such as fever, chills, and sore throat, which typically appear within a week of exposure and last up to two weeks. It has had a profound and unprecedented impact on the modern world. According to the World Health Organization (WHO), as of 28 September 2025, a total of 778.7 million cases and 7.1 million deaths have been recorded globally [1].

Due to the widespread urgency of the COVID-19 pandemic, its diagnosis, causes, and treatment have been thoroughly studied and investigated. In contrast, Post-Acute Sequelae of SARS-CoV-2 Infection (PASC), a condition that follows COVID-19, remains poorly understood, despite its widespread impact [2].

The WHO defines Long COVID as “the continuation or development of new symptoms three months after the initial SARS-CoV-2 infection, with these symptoms lasting for at least two months with no other explanation.” Symptoms and severity may differ from those in the acute phase, and more than 200 symptoms have been reported across multiple bodily systems, with fatigue, cognitive dysfunction, and shortness of breath among the most common [2,3]. Terminology for the condition varies; many sources use Long COVID, post-COVID-19 syndrome (PCS), and PASC interchangeably [4], while others distinguish between them. The WHO also uses the term post-COVID-19 condition (PCC), and the Researching COVID to Enhance Recovery (RECOVER) Initiative [5] defines it as Post-Acute Sequelae of SARS-CoV-2 Infection. In this study, we use “Long COVID” and “PASC” interchangeably, following the definition set out by the WHO [2].

Currently, Long COVID remains poorly understood and defined, in terms of terminology prevalence, occurrence, and biomarkers. The etiology of PASC remains unclear, although several hypotheses have been proposed. One suggestion is that PASC is caused by remnants of the coronavirus persisting in some organs and continuing to provoke an immune response even after the immune system has eliminated the virus, leading to chronic inflammation and tissue damage [6]. Another proposed hypothesis claims PASC sources from the reactivation of latent viruses, such as the Epstein-Barr virus (EBV) or the suggestion that PASC sources may stem from the reactivation of latent viruses, such as the Epstein-Barr virus (EBV) or Human Herpesvirus 6 (HHV6) [6,7]. It has also been proposed that PASC is caused by a combination of several coexisting factors, rather than a single factor.

As of January 2025, there is no formal diagnostic test for PASC [8]. Diagnosing PASC is particularly challenging because its symptoms are inconsistent and may manifest in different ways across multiple organs, making it difficult to identify and differentiate from other conditions [7,8,9]. The RECOVER study found that these symptoms change, disappear, and recur unpredictably [5]. Deer et al. reported that the symptoms of PASC are “extremely heterogeneous” and that among 15 studies that identified 55 symptoms, none were consistently present in every study [10]. The complexity of the disease and the lack of specific biomarkers make diagnosis cumbersome [11]. Statistics and understandings of this condition vary widely, with the estimated percentage of COVID-19 patients who developed PASC ranging from 10% to 60% depending on the source.

Another complication with PASC diagnosis is the similarity of symptoms between PASC and other diseases. For instance, fibromyalgia (FM) and myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) share many overlapping symptoms. One study found that “one-third of long COVID patients met the diagnostic criteria for CFS/ME at six months after the initial infection” [12] and that “Two different surveys found that 20–31% of long COVID patients met classification criteria for FM more than six months after COVID-19” [12]. There is also evidence that links PASC, FM, ME/CFS, and, in addition, irritable bowel syndrome (IBS), finding them to be associated with the reactivation of endogenous viruses, particularly the latent EBV [12]. Furthermore, other studies have also identified significant similarities in symptoms causing confusion in distinguishing between PASC and other illnesses, including postural orthostatic tachycardia syndrome (POTS) and Lyme disease [13,14].

Currently, PASC is often diagnosed through a lengthy process of exclusion that relies heavily on observing a patient’s medical history. Most proposed diagnostic processes start by excluding a list of similar medical conditions, analyzing medical history, and then observing common symptoms of PASC, such as dyspnea or anosmia/dysgeusia. Depending on the case, special tests such as X-rays or pulmonary function assessments may be employed as needed [15,16].

To address the complexity and inefficiency of such diagnostics, some researchers have turned to AI-based tools to accelerate the process. Several studies use traditional exclusion methods, supported by analyzing patients’ past medical history and tracking the development of symptoms over time to diagnose long-term COVID-19 [17]. Others bypass exclusion entirely. As an example, Buonsenso et al. applied proteomics to predict long COVID with 93% accuracy [18].

Our aim is to evaluate the application of a molecular descriptor-based approach to distinguish between PASC and PASC-similar diseases, thereby improving the efficiency of PASC diagnosis. Specifically, we propose a machine learning (ML) system to develop a diagnostic system for PASC that utilizes molecular descriptors of dysregulated plasma metabolites to distinguish PASC patients from both healthy individuals and those with PASC-like symptoms, thereby reducing reliance on exclusion-based methods. We hypothesize that an ML model trained to identify PASC-associated plasma metabolites can aid in the elucidation of patients’ diagnoses. To improve specificity, we tested our model against other diseases often misdiagnosed as PASC, aiming to distinguish between PASC and PASC-like diseases.

This approach, using molecular descriptors, is novel in the context of PASC but has been applied to other diseases. Yao et al. utilized molecular descriptors (calculated using PaDEL-Descriptor software, version 2.21 [19]) of urinary metabolites to diagnose ovarian cancer, employing a random forest model that yields a specificity of 0.86 and sensitivity of 0.97 with 10-fold cross-validation [20]. Yang et al. employed molecular descriptors (calculated using ChemDes software, version 3.0 [21]) of serum metabolites to identify colorectal cancer, achieving up to 95.21% 10-fold cross-validated accuracy with a bagging algorithm [22].

2. Methods

2.1. Approach Overview

The method’s flowchart is represented in Figure 1.

Figure 1. Flowchart of methods. We (1) acquired datasets for PASC patients, (2) found dysregulated metabolites using a t-test, and (3) calculated the molecular descriptors for the metabolite datasets. Then we (4) built an ML model and (5) tested it on diseases with similar symptoms (6) to define a threshold for dysregulated metabolites classified as PASC to classify the disease itself.

We collected plasma metabolic data from open sources, including PubMed, Google Scholar, and the National Library of Medicine. Significant metabolites were selected using univariate t-test analysis (p ≤ 0.05). Then, we retrieved the same number of random metabolites from the Human Metabolite Database (HMDB) [23], preprocessed them in the same manner, and combined them with the dysregulated metabolites. Next, we calculated a set of molecular descriptors using the e-Dragon software, version 5.4 [24], and then performed dimensionality reduction using CfsSubsetEval and BestFirst search from the Waikato Environment for Knowledge Analysis (WEKA) [25]. Utilizing the prepared dataset, we trained several machine learning algorithms in Python, version 3.13.5, as implemented by Scikit-Learn, version 1.7.2 [26], using 5-fold cross-validation. After selecting the best ones, we validated them using metabolic data from diseases with PASC-like symptoms. We then determined a threshold to convert the classifier’s output score into a final, definitive class prediction, striking a balance between sensitivity and specificity.

2.2. Data Collection and Preprocessing

The plasma PASC metabolites datasets used in this study for model training and testing were all obtained from publicly accessible, peer-reviewed sources, namely from PubMed and the National Library of Medicine [4,27]. Both sets contained metabolites of healthy individuals and PASC patients. We used this data to determine the metabolites dysregulated by PASC by running a two-tailed two-sample t-test, assuming unequal variance, and selecting metabolites with a p-value ≤ 0.05. Because we are looking for any metabolites that are significantly dysregulated, regardless of the degree of such dysregulation, we use a t-test to examine metabolite dysregulation in relation to the rest of each dataset. Demographic data for the training and testing sets are shown in Table 1. Our analysis identified 123 and 36 dysregulated metabolites from the training and testing sets, respectively. We also utilized dysregulated metabolite sets associated with other diseases to evaluate the model’s ability to distinguish between PASC and PASC-like diseases. The diseases known to have similar symptomology to PASC are FM, ME/CFS, Lyme disease, POTS, and IBS [28,29,30,31,32]. To utilize these metabolites for training the model to predict whether dysregulation was due to PASC, we randomly selected from HMDB 5.0 [23] the same number of metabolites used as a negative control to PASC-dysregulated metabolites. In doing so, we balance our dataset, thereby preventing model variance caused by dataset imbalance. In the case that PASC-related metabolites from our selected list appear in the random set, those are excluded. To confirm that the selection of negative metabolites has a minimal impact on model performance, we performed a 90/10% stratified split on the training dataset. Then, for 100 bootstrapping iterations, we created a new negative control set of the same size by resampling the original 90% split of negative controls with replacement. By concatenating this set with the original 90% split of PASC metabolites, we trained a random forest model and tested it on the independent testing set. Indeed, the standard deviation of cross-validated accuracy across 100 bootstrapped runs was 0.0109, indicating minimal variance due to randomness in the selection of negative controls.

Table 1. Quantitative data is presented as median (Q1–Q3) and qualitative data is presented as n (%). The papers do not provide other forms of demographic data comparing PASC and healthy controls.

2.3. Molecular Descriptor Calculation

We first converted each metabolite into its Simplified Molecular Input Line Entry System (SMILES) [33] form using HMDB [23], PubChem [34], and the NCI/CADD Group’s CACTUS (CADD Group Chemoinformatic Tools and User Services) API [35] to streamline this process. Then, we used e-Dragon Software [36] to generate 4179 molecular descriptors from each metabolite. To reduce noise and prevent overfitting, we utilized WEKA’s version 3.9.6 correlation-based feature selection (CfsSubsetEval) algorithm, which resulted in 28 descriptors from 4179 [25]. This algorithm selects a subset of molecular descriptors such that the correlation between the molecular descriptors and the presence of PASC is maximized, while minimizing the correlation between each of the molecular descriptors within the subset, thereby reducing redundancy.

2.4. Machine Learning Model Training

Using the molecular descriptors of the PASC-dysregulated and randomly selected metabolites, we applied several machine learning classifiers (as implemented by Scikit-Learn [26]) to create ML models that utilize the molecular descriptors of each classifier to predict whether the metabolite was dysregulated by PASC. We tested the following classifiers: multi-layer perceptron (MLP), random forest (RF), logistic regression (LR), support vector machine (SVM), and gradient boosting (GB). For each model, we employed hyperparameter selection using grid search along with stratified 5-fold cross-validation. The core statistic used to measure model performance was the AUC-ROC, which was evaluated using repeated 5-fold cross-validation. We report results with a mean AUC-ROC and the corresponding 95% confidence interval. Here, we use the area under the receiver operating characteristic curve (AUC-ROC) because it gives a broad summary of model performance across all possible classification thresholds.

2.5. Validation on Independent Testing Sets and Other Diseases

We selected the top-performing models and tested them against dysregulated metabolite sets for an independent PASC testing set. We then tested these top-performing models against dysregulated metabolite sets for an independent PASC testing set, along with diseases that are often confused with PASC. For each disease set, the proportion of metabolites classified as PASC-dysregulated was recorded, along with the mean and confidence interval. The confidence interval was obtained by bootstrapping the testing set, repeating the resampling process 500 times, and calculating the 95% confidence interval from the middle. Note that, because the model is trained to flag each individual metabolite as PASC-dysregulated using each metabolite’s 28 consistent descriptors, the asymmetry in the number of metabolites in the training and testing sets does not impact performance

2.6. Threshold Determination

A crucial aspect of this study is the diagnosis of PASC and differentiating it from other disorders with similar symptoms. For this purpose, we used the previously calculated confidence intervals to determine a threshold that maximizes the separation between the confidence intervals of percent metabolites classified as PASC and those of similar diseases. When the proportion of dysregulated metabolites classified by the model as being PASC-dysregulated exceeds this threshold, a confident PASC diagnosis can be suggested.

3. Results

We identified 123 PASC-dysregulated metabolites (p ≤ 0.05) and combined them with 123 randomly selected metabolites from HMDB 5.0 to form a 246-metabolite training set. For each metabolite, we calculated 4179 molecular descriptors using e-Dragon, and attribute selection narrowed the feature space down to a set of 28 meaningful molecular descriptors. A list of the selected descriptors is included in the Supplementary Materials.

Next, we built several ML models using our training set. All models demonstrated strong and acceptable performance in cross-validation (Figure 2). Therefore, we selected a model with good overall performance, specifically the multi-layer perceptron, which had the second-highest AUC-ROC lower confidence bound (all upper confidence bounds were at 1) and a higher mean than logistic regression, which was equal in terms of lower confidence bounds. The random forest model was not selected due to extremely poor performance when testing on just the independent PASC descriptors after cross-validated training, indicating likely overfitting.

Figure 2. AUC-ROC of the different models in 5-fold cross-validation on the initial training set (A) and the independent testing set (B). AUC-ROC is the area under the ROC curve, which is a graphical representation of a model’s performance at various classification thresholds; AUC-ROC quantifies the overall ability of the model to distinguish between the positive and negative cases. Training confidence intervals are obtained through repeated stratified 5-fold cross-validation, performed ten times, and the middle 95% of the results are used. The testing confidence intervals are obtained through bootstrapping.

For further investigation, we visualize the ROC curves of the training and testing sets side by side. The ROC curves shown in Figure 3, with consistently high and increasing performance over threshold changes, demonstrate that, indeed, a molecular descriptors-based approach with Weka’s feature selection algorithm is conducive to predictive data and robust machine learning algorithms over all classification thresholds. Additionally, proficient testing results, which accompany training results, demonstrate the model’s ability to generalize to unseen datasets.

Figure 3. Receiver operating characteristic (ROC) curves for the training set and independent testing set. ROC curves demonstrate the tradeoff between true and false positive rates at various classification thresholds.

With each of these models, we further tested them against dysregulated metabolite sets from an independent PASC set, as well as other diseases often confused with PASC due to similar symptomology: FM [28], Lyme disease [29], ME/CFS [30], POTS [31], and IBS [32]. For testing on independent sets, we bootstrapped the testing sets for five hundred samples, returning the mean and 95% confidence interval for each, in order to clarify in the impact of stochastic variation. The results of this testing are demonstrated in Figure 4.

Figure 4. Mean accuracy and bootstrapped confidence intervals of identification based on the percent of dysregulated metabolites of PASC-similar disease classified as PASC by the ML model trained to recognize PASC as the dysregulation source using molecular descriptors of dysregulated metabolites.

Observing this graph, we see that the PASC model classifies the dysregulated metabolites of different diseases in a way that confidently separates PASC from Lyme disease, ME/CFS, and IBS. It is able to weakly separate PASC from POTS, with the average PASC classification exceeding that of POTS. However, the 95% confidence intervals show a sizeable overlap between the lower PASC interval and the upper POTS interval. Finally, we chose a threshold, focusing on the proportion of POTS metabolites classified as PASC, as POTS was the most challenging to differentiate from PASC, with the highest proportion of dysregulated metabolites classified as PASC, excluding FM, which none of the models can differentiate from PASC. We lower the confidence level until the PASC and POTS intervals are disjointed, and we find that this occurs at a confidence level of 0.868 and a classification proportion of 0.912. This threshold clearly differentiates PASC from all other PASC-similar diseases, except POTS, which can be differentiated from with weak confidence, and FM.

To better understand how the molecular descriptors of the metabolites are dysregulated in each of the PASC-similar diseases, we trained a model on the training set of each disease and tested it on the training sets of the other diseases. To ensure consistent results, all models used the same MLP model and training pipeline as the main PASC model. Classification proportions for diseases are shown in Figure 5.

Figure 5. Proportion of dysregulated metabolite training sets for each PASC-similar disease (columns) that are classified as positive by the ML model trained to recognize if each other’s PASC-similar disease (rows) is the dysregulation source using molecular descriptors of dysregulated metabolites. A gradient going from white to blue indicates the relative magnitude of the positive classification proportion. Confidence intervals are 95% Wilson intervals.

The classification rate for many diseases is high, reaching at least 0.9 classification proportion. However, this is only the case for a few pairs of PASC-similar diseases. One example of unidirectional is with Lyme disease and PASC. While the Lyme disease model easily mistakes many of PASC’s dysregulated metabolites as Lyme disease at a proportion of 0.9837, PASC only classifies 0.6786 of the dysregulated metabolites of Lyme disease as PASC.

4. Discussion

We found that molecular descriptors of each plasma metabolite provide sufficient information to classify, with high accuracy, whether it is dysregulated by long COVID. This is evidenced by many machine learning models performing well using 5-fold cross-validation on the training sets, with an average AUC-ROC of up to 98.37%. The ability of molecular descriptors to predict a relationship with PASC can be generalized beyond the training set, as demonstrated by an AUC-ROC of 87.7% in our independent testing set. In addition to being able to separate PASC patients from healthy individuals, our results show that molecular descriptors of dysregulated metabolites are sufficient to differentiate between patients with PASC and PASC-similar diseases, a task that is very challenging in traditional diagnoses without exclusion. For Lyme disease, ME/CFS, and POTS, this is by a large margin.

A significant anomaly in our study was the inability of this approach to separate FM and PASC. The PASC models consistently classified FM as PASC at a rate as high or higher than they classified the independent PASC testing set. This strong overlap is further emphasized by the results of the pairwise classification task. The heatmap demonstrated a uniquely strong overlap between FM and PASC classifications, as one of the only two pairs of PASC-similar diseases where both models classify the other disease’s dysregulated metabolites as its own with a proportion greater than 0.9, indicating an especially strong connection in terms of molecular descriptors between these pairs of diseases. What is most notable is that PASC and all diseases with PASC-similar symptoms, besides FM, have a viral origin and are distinguishable by our set of ML models. FM is a central sensitization disorder with neuroendocrine, mitochondrial, and genetic contributions [37,38,39,40], even though both conditions involve mitochondrial dysfunction [41,42,43,44,45].

This result reinforces and aligns with the existing literature, which supports a connection between the two diseases. FM and PASC are known to exhibit many of the same pathophysiological mechanisms, including autonomic nervous system dysfunction and immune dysregulation [46]. Indeed, these shared manifestations have a significant effect on diagnosis—independent cohorts of PASC patients found 20 to 70% of patients met diagnostic criteria for fibromyalgia, indicating significant overlap within the symptoms of these two conditions [12,37]. Both conditions are associated with low-grade immune activation and persistent inflammation, accompanied by elevated pro-inflammatory cytokines interleukin-6 and tumor necrosis factor [47].

It remains to be seen whether molecular descriptors can distinguish the metabolites dysregulated by these two diseases or if the metabolites dysregulated by these diseases are too similar. In future research, it may be worthwhile to explore further dysregulated metabolite sets for PASC and FM to more confidently ascertain if this approach can distinguish PASC from FM. Additionally, combining this ML/metabolites-based approach with other diagnostic methods, such as transcriptomics, proteomics, and clinical symptoms, may reveal an additional diagnostic strength. It also advised investigating other biomarkers, for example, small non-coding RNAs.

5. Limitations

One significant limitation of this study is that our analysis and selection of a threshold depend heavily on the datasets themselves, which may vary due to demographic differences. Most significantly, there is a nontrivial confounder for age between the training and testing datasets, with the training dataset having noticeably older patients, and both datasets having older PASC cohorts than healthy controls. Because the papers from which the datasets were drawn do not provide demographic data paired with biological data, it is difficult to identify the true confounding effect of age. There may have been other confounders within the data, but complete demographics are not provided by the papers from which the data is sourced.

Before applying our work in a clinical context, more datasets should be used for training and testing to reduce the impact of chance variance in demographics. Working with more sets encompassing diverse demographics may better validate the robust generalizability of our model and the threshold that we defined for classifying the whole set as PASC.

6. Conclusions

A critical roadblock in PASC diagnosis is the reliance on exclusion, stemming from its lack of distinctive biomarkers, which makes diagnosis challenging and time-consuming. We hypothesized that we could remediate this issue by utilizing the molecular descriptors of dysregulated metabolites in PASC patients to create a machine-learning model that can identify PASC-associated metabolic dysregulation and support the differentiation between PASC and PASC-like conditions based on metabolomics profiles.

Our findings support this hypothesis: we were able to create a model that not only differentiates PASC from healthy controls, but also successfully differentiates PASC from several similar conditions. In principle, our approach to PASC diagnosis is assay-agnostic and should offer because input descriptors can be generated regardless of the metabolite measurement assay used. Yet, the datasets used to train our model from Wang and López-Hernández et al. exclusively used LC-MS/MS and FIA-MS/MS metabolite assays. Thus, validation with further datasets is imperative to evaluate this method’s potential for generalizability.

Our model demonstrates the feasibility of automating the exclusion of multiple PASC-similar diseases, representing a novel step in PASC diagnosis. Refinements of this process could be used in conjunction with other techniques to alleviate the burdens of manual exclusion processes and enhance the efficiency and accuracy of diagnoses. Ultimately, diagnostic improvements could lead to earlier interventions and better outcomes for patients with PASC.

Beyond demonstrating that a molecular descriptors-based approach allows us to automatically distinguish PASC from many PASC-similar diseases, our study found a unique connection between PASC and FM: PASC and FM consistently overlapped in their classification by descriptors-based machine learning models, despite the fact that they have different origins, even more so than other PASC-similar diseases. This uniquely strong overlap suggests that the two diseases are strongly intertwined at a molecular level, reinforcing the literature that supports shared mechanisms and symptomology. Ultimately, this highlights fibromyalgia as a key comparator in PASC-related studies and vice versa.

Supplementary Materials

The following supporting information can be downloaded at: https://github.com/Ethanc143/Metabolomics-Based-Machine-Learning-Diagnostics-of-Post-Acute-Sequelae-of-SARS-CoV-2-Infection (accessed on 10 November 2025).

Author Contributions

E.C.: Data mining and preparation for ML model, calculations of the ML model, writing the manuscript. V.L.K.: Concept of the study, selection of methods and tools, editing the manuscript. I.F.T.: Concept of the study, selection of methods and tools, editing the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to this study involved secondary analysis of fully anonymized and de-identified human data obtained from previously published studies. All of the studies from which data were retrieved [4,26,27,28,29,30,35] used data collected under the approval of respective ethics committees. This study did not collect or interact with identifiable patient data.

Informed Consent Statement

Patient consent was waived due to this study involved secondary analysis of fully anonymized and de-identified human data obtained from previously published studies. All of the studies from which data were retrieved [4,27,28,29,30,31,32] used data collected under the approval of respective ethics committees. This study did not collect or interact with identifiable patient data.

Data Availability Statement

The original contributions presented in this study are included in the article and Supplementary Materials (archived on Github at https://github.com/Ethanc143/Metabolomics-Based-Machine-Learning-Diagnostics-of-Post-Acute-Sequelae-of-SARS-CoV-2-Infection, 10 November 2025). Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

API	application programming interface
AUC	area under curve
CACTUS	CADD Group Chemoinformatic Tools and User Services
CADD	computer-aided design and drafting
FM	fibromyalgia
HMDB	Human Metabolite Data Base
IBS	irritable bowel syndrome
LMT	logistic model tree
ME/CFS	myalgic encephalomyelitis/chronic fatigue syndrome
ML	machine learning
MLP	multi-layer perceptron
NCI	National Cancer Institute
PASC	Post-Acute Sequelae of COVID-19
PCC	post-COVID-19 condition
PCS	post-COVID-19 syndrome
POTS	postural orthostatic tachycardia syndrome
PR	precision–recall
RECOVER	Researching COVID to Enhance Recovery
ROC	receiver operating characteristic
SGD	stochastic gradient descent
SMILES	Simplified Molecular Input Line Entry System
SMO	sequential minimal optimization
SVM	support vector machine
WEKA	Waikato Environment for Knowledge Analysis
WHO	World Health Organization

References

Global COVID-19 Overview|WHO COVID-19 Dashboard. Available online: https://data.who.int/dashboards/covid19/summary (accessed on 5 November 2025).
Post COVID-19 Condition (Long COVID). Available online: https://www.who.int/europe/news-room/fact-sheets/item/post-COVID-19-condition (accessed on 5 November 2025).
Nalbandian, A.; Sehgal, K.; Gupta, A.; Madhavan, M.V.; McGroder, C.; Stevens, J.S.; Cook, J.R.; Nordvig, A.S.; Shalev, D.; Sehrawat, T.S.; et al. Post-Acute COVID-19 Syndrome. Nat. Med. 2021, 27, 601–615. [Google Scholar] [CrossRef] [PubMed]
López-Hernández, Y.; Monárrez-Espino, J.; López, D.A.G.; Zheng, J.; Borrego, J.C.; Torres-Calzada, C.; Elizalde-Díaz, J.P.; Mandal, R.; Berjanskii, M.; Martínez-Martínez, E.; et al. The Plasma Metabolome of Long COVID Patients Two Years after Infection. Sci. Rep. 2023, 13, 12420. [Google Scholar] [CrossRef] [PubMed]
New Insights on Long COVID Symptoms in Adults Highlight Update to RECOVER Study Findings|RECOVER COVID Initiative. Available online: https://recovercovid.org/news/new-insights-long-covid-symptoms-adults-highlight-update-recover-study-findings (accessed on 5 November 2025).
Long COVID (Post-COVID Conditions, PCC). Available online: https://www.yalemedicine.org/conditions/long-covid-post-covid-conditions-pcc (accessed on 5 November 2025).
Babalola, T.K.; Clouston, S.A.P.; Sekendiz, Z.; Chowdhury, D.; Soriolo, N.; Kawuki, J.; Meliker, J.; Carr, M.; Valenti, B.R.; Fontana, A.; et al. SARS-CoV-2 Re-Infection and Incidence of Post-Acute Sequelae of COVID-19 (PASC) among Essential Workers in New York: A Retrospective Cohort Study. Lancet Reg. Health Am. 2025, 42, 100984. [Google Scholar] [CrossRef]
Long COVID. Available online: https://www.health.ny.gov/diseases/long_covid/ (accessed on 5 November 2025).
Frontera, J.A.; Thorpe, L.E.; Simon, N.M.; De Havenon, A.; Yaghi, S.; Sabadia, S.B.; Yang, D.; Lewis, A.; Melmed, K.; Balcer, L.J.; et al. Post-Acute Sequelae of COVID-19 Symptom Phenotypes and Therapeutic Strategies: A Prospective, Observational Study. PLoS ONE 2022, 17, e0275274. [Google Scholar] [CrossRef]
Deer, R.R.; Rock, M.A.; Vasilevsky, N.; Carmody, L.; Rando, H.; Anzalone, A.J.; Basson, M.D.; Bennett, T.D.; Bergquist, T.; Boudreau, E.A.; et al. Characterizing Long COVID: Deep Phenotype of a Complex Condition. eBioMedicine 2021, 74, 103722. [Google Scholar] [CrossRef]
O’Hare, A.M.; Vig, E.K.; Iwashyna, T.J.; Fox, A.; Taylor, J.S.; Viglianti, E.M.; Butler, C.R.; Vranas, K.C.; Helfand, M.; Tuepker, A.; et al. Complexity and Challenges of the Clinical Diagnosis and Management of Long COVID. JAMA Netw. Open 2022, 5, e2240332. [Google Scholar] [CrossRef]
Goldenberg, D.L. How to Understand the Overlap of Long COVID, Chronic Fatigue Syndrome/Myalgic Encephalomyelitis, Fibromyalgia and Irritable Bowel Syndromes. Semin. Arthritis Rheum. 2024, 67, 152455. [Google Scholar] [CrossRef]
POTS or Long COVID? How to Tell the Difference|Brain|COVID|Heart|UT Southwestern Medical Center. Available online: https://utswmed.org/medblog/pots-long-covid-research/ (accessed on 5 November 2025).
Thor, D.C.; Suarez, S. Corona With Lyme: A Long COVID Case Study. Cureus 2023, 15, e36624. [Google Scholar] [CrossRef]
Sisó-Almirall, A.; Brito-Zerón, P.; Conangla Ferrín, L.; Kostov, B.; Moragas Moreno, A.; Mestres, J.; Sellarès, J.; Galindo, G.; Morera, R.; Basora, J.; et al. Long COVID-19: Proposed Primary Care Clinical Guidelines for Diagnosis and Disease Management. Int. J. Environ. Res. Public Health 2021, 18, 4350. [Google Scholar] [CrossRef]
Raveendran, A.V. Long COVID-19: Challenges in the Diagnosis and Proposed Diagnostic Criteria. Diabetes Metab. Syndr. Clin. Res. Rev. 2021, 15, 145–146. [Google Scholar] [CrossRef] [PubMed]
Azhir, A.; Hügel, J.; Tian, J.; Cheng, J.; Bassett, I.V.; Bell, D.S.; Bernstam, E.V.; Farhat, M.R.; Henderson, D.W.; Lau, E.S.; et al. Precision Phenotyping for Curating Research Cohorts of Patients with Unexplained Post-Acute Sequelae of COVID-19. Med 2025, 6, 100532. [Google Scholar] [CrossRef]
Buonsenso, D.; Cotugno, N.; Amodio, D.; Pascucci, G.R.; Di Sante, G.; Pighi, C.; Morrocchi, E.; Pucci, A.; Olivieri, G.; Colantoni, N.; et al. Distinct Pro-Inflammatory/pro-Angiogenetic Signatures Distinguish Children with Long COVID from Controls. Pediatr. Res. 2025, 98, 928–935. [Google Scholar] [CrossRef] [PubMed]
Yap, C.W. PaDEL-descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466–1474. [Google Scholar] [CrossRef]
Yao, J.Z.; Tsigelny, I.F.; Kesari, S.; Kouznetsova, V.L. Diagnostics of Ovarian Cancer via Metabolite Analysis and Machine Learning. Integr. Biol. 2023, 15, zyad005. [Google Scholar] [CrossRef]
Dong, J.; Cao, D.-S.; Miao, H.-Y.; Liu, S.; Deng, B.-C.; Yun, Y.-H.; Wang, N.-N.; Lu, A.-P.; Zeng, W.-B.; Chen, A.F. ChemDes: An Integrated Web-Based Platform for Molecular Descriptor and Fingerprint Computation. J. Cheminform 2015, 7, 60. [Google Scholar] [CrossRef]
Yang, R.; Tsigelny, I.F.; Kesari, S.; Kouznetsova, V.L. Colorectal Cancer Detection via Metabolites and Machine Learning. Curr. Issues Mol. Biol. 2024, 46, 4133–4146. [Google Scholar] [CrossRef]
Wishart, D.S.; Guo, A.; Oler, E.; Wang, F.; Anjum, A.; Peters, H.; Dizon, R.; Sayeeda, Z.; Tian, S.; Lee, B.L.; et al. HMDB 5.0: The Human Metabolome Database for 2022. Nucleic Acids Res. 2022, 50, D622–D631. [Google Scholar] [CrossRef]
Tetko, I.V.; Gasteiger, J.; Todeschini, R.; Mauri, A.; Livingstone, D.; Ertl, P.; Palyulin, V.A.; Radchenko, E.V.; Zefirov, N.S.; Makarenko, A.S.; et al. Virtual Computational Chemistry Laboratory—Design and Description. J. Comput. Aided Mol. Des. 2005, 19, 453–463. [Google Scholar] [CrossRef] [PubMed]
Witten, I.H.; Frank, E.; Hall, M.A.; Pal, C.J. Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Morgan Kaufmann Series in Data Management Systems; Morgan Kaufmann: San Francisco, CA, USA, 2017; ISBN 9780128042915. [Google Scholar]
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. Available online: https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html (accessed on 8 December 2025).
Wang, K.; Khoramjoo, M.; Srinivasan, K.; Gordon, P.M.K.; Mandal, R.; Jackson, D.; Sligl, W.; Grant, M.B.; Penninger, J.M.; Borchers, C.H.; et al. Sequential Multi-Omics Analysis Identifies Clinical Phenotypes and Predictive Biomarkers for Long COVID. Cell Rep. Med. 2023, 4, 101254. [Google Scholar] [CrossRef] [PubMed]
Marino, C.; Grimaldi, M.; Sabatini, P.; Amato, P.; Pallavicino, A.; Ricciardelli, C.; D’Ursi, A.M. Fibromyalgia and Depression in Women: An 1H-NMR Metabolomic Study. Metabolites 2021, 11, 429. [Google Scholar] [CrossRef]
Molins, C.R.; Ashton, L.V.; Wormser, G.P.; Hess, A.M.; Delorey, M.J.; Mahapatra, S.; Schriefer, M.E.; Belisle, J.T. Development of a Metabolic Biosignature for Detection of Early Lyme Disease. Clin. Infect. Dis. 2015, 60, 1767–1775. [Google Scholar] [CrossRef] [PubMed]
Germain, A.; Barupal, D.K.; Levine, S.M.; Hanson, M.R. Comprehensive Circulatory Metabolomics in ME/CFS Reveals Disrupted Metabolism of Acyl Lipids and Steroids. Metabolites 2020, 10, 34. [Google Scholar] [CrossRef]
Li, Y.; Bai, B.; Wang, H.; Wu, H.; Deng, Y.; Shen, C.; Zhang, Q.; Shi, L. Plasma Metabolomic Profile in Orthostatic Intolerance Children with High Levels of Plasma Homocysteine. Ital. J. Pediatr. 2024, 50, 52. [Google Scholar] [CrossRef]
Han, L.; Zhao, L.; Zhou, Y.; Yang, C.; Xiong, T.; Lu, L.; Deng, Y.; Luo, W.; Chen, Y.; Qiu, Q.; et al. Altered Metabolome and Microbiome Features Provide Clues in Understanding Irritable Bowel Syndrome and Depression Comorbidity. ISME J. 2022, 16, 983–996. [Google Scholar] [CrossRef]
Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. [Google Scholar] [CrossRef]
Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B.A.; Thiessen, P.A.; Yu, B.; et al. PubChem 2025 Update. Nucleic Acids Res. 2025, 53, D1516–D1525. [Google Scholar] [CrossRef] [PubMed]
NCI/CADD Group Chemoinformatics Tools and User Services. Available online: https://cactus.nci.nih.gov/index.html (accessed on 5 November 2025).
E-Dragon Software. Available online: https://vcclab.org/lab/edragon/ (accessed on 5 November 2025).
Khoja, O.; Mulvey, M.; Astill, S.; Tan, A.L.; Sivan, M. New-Onset Chronic Musculoskeletal Pain Following COVID-19 Infection Fulfils the Fibromyalgia Clinical Syndrome Criteria: A Preliminary Study. Biomedicines 2024, 12, 1940. [Google Scholar] [CrossRef]
Leonardo, S.; Fregni, F. Exploring the Inflammatory Basis of Fibromyalgia and Long-COVID: Insights from Human and Animal Studies. Princ. Pract. Clin. Res. 2025, 1, 15022. [Google Scholar] [CrossRef]
Häuser, W.; Ablin, J.; Fitzcharles, M.-A.; Littlejohn, G.; Luciano, J.V.; Usui, C.; Walitt, B. Fibromyalgia. Nat. Rev. Dis. Primers 2015, 1, 15022. [Google Scholar] [CrossRef]
Merino, A.S.; Simon, D. Fatores genéticos associados à fibromialgia: Uma revisão narrativa. Res. Soc. Dev. 2022, 11, e11211326421. [Google Scholar] [CrossRef]
Nunn, A.V.W.; Guy, G.W.; Brysch, W.; Bell, J.D. Understanding Long COVID; Mitochondrial Health and Adaptation—Old Pathways, New Problems. Biomedicines 2022, 10, 3113. [Google Scholar] [CrossRef] [PubMed]
Molnar, T.; Lehoczki, A.; Fekete, M.; Varnai, R.; Zavori, L.; Erdo-Bonyar, S.; Simon, D.; Berki, T.; Csecsei, P.; Ezer, E. Mitochondrial Dysfunction in Long COVID: Mechanisms, Consequences, and Potential Therapeutic Approaches. GeroScience 2024, 46, 5267–5286. [Google Scholar] [CrossRef] [PubMed]
Cordero, M.D.; De Miguel, M.; Moreno Fernández, A.M.; Carmona López, I.M.; Garrido Maraver, J.; Cotán, D.; Gómez Izquierdo, L.; Bonal, P.; Campa, F.; Bullon, P.; et al. Mitochondrial Dysfunction and Mitophagy Activation in Blood Mononuclear Cells of Fibromyalgia Patients: Implications in the Pathogenesis of the Disease. Arthritis Res. Ther. 2010, 12, R17. [Google Scholar] [CrossRef]
Cordero, M.D.; De Miguel, M.; Moreno-Fernández, A.M. La disfunción mitocondrial en la fibromialgia y su implicación en la patogénesis de la enfermedad. Med. Clínica 2011, 136, 252–256. [Google Scholar] [CrossRef]
Marino, Y.; Inferrera, F.; D’Amico, R.; Impellizzeri, D.; Cordaro, M.; Siracusa, R.; Gugliandolo, E.; Fusco, R.; Cuzzocrea, S.; Di Paola, R. Role of Mitochondrial Dysfunction and Biogenesis in Fibromyalgia Syndrome: Molecular Mechanism in Central Nervous System. Biochim. Biophys. Acta (BBA) Mol. Basis Dis. 2024, 1870, 167301. [Google Scholar] [CrossRef]
COVID and Fibromyalgia: Is There a Link? Available online: https://www.healthline.com/health/fibromyalgia/covid-and-fibromyalgia (accessed on 5 November 2025).
Davis, L.; Higgs, M.; Snaith, A.; Lodge, T.A.; Strong, J.; Espejo-Oltra, J.A.; Kujawski, S.; Zalewski, P.; Pretorius, E.; Hoerger, M.; et al. Dysregulation of Lipid Metabolism, Energy Production, and Oxidative Stress in Myalgic Encephalomyelitis/Chronic Fatigue Syndrome, Gulf War Syndrome and Fibromyalgia. Front. Neurosci. 2025, 19, 1498981. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Flowchart of methods. We (1) acquired datasets for PASC patients, (2) found dysregulated metabolites using a t-test, and (3) calculated the molecular descriptors for the metabolite datasets. Then we (4) built an ML model and (5) tested it on diseases with similar symptoms (6) to define a threshold for dysregulated metabolites classified as PASC to classify the disease itself.

Figure 2. AUC-ROC of the different models in 5-fold cross-validation on the initial training set (A) and the independent testing set (B). AUC-ROC is the area under the ROC curve, which is a graphical representation of a model’s performance at various classification thresholds; AUC-ROC quantifies the overall ability of the model to distinguish between the positive and negative cases. Training confidence intervals are obtained through repeated stratified 5-fold cross-validation, performed ten times, and the middle 95% of the results are used. The testing confidence intervals are obtained through bootstrapping.

Figure 3. Receiver operating characteristic (ROC) curves for the training set and independent testing set. ROC curves demonstrate the tradeoff between true and false positive rates at various classification thresholds.

Figure 4. Mean accuracy and bootstrapped confidence intervals of identification based on the percent of dysregulated metabolites of PASC-similar disease classified as PASC by the ML model trained to recognize PASC as the dysregulation source using molecular descriptors of dysregulated metabolites.

Figure 5. Proportion of dysregulated metabolite training sets for each PASC-similar disease (columns) that are classified as positive by the ML model trained to recognize if each other’s PASC-similar disease (rows) is the dysregulation source using molecular descriptors of dysregulated metabolites. A gradient going from white to blue indicates the relative magnitude of the positive classification proportion. Confidence intervals are 95% Wilson intervals.

Table 1. Quantitative data is presented as median (Q1–Q3) and qualitative data is presented as n (%). The papers do not provide other forms of demographic data comparing PASC and healthy controls.

Characteristic	PASC Training (n = 117)	Healthy Training (n = 28)	PASC Testing (n = 48)	Healthy Testing (n = 37)
Age (years)	62 (53–73)	55 (52–59)	51.5 (43.5–60.8)	40.5 (37–53.3)
Male	66 (56.4)	16 (57.1)	28 (58.3)	17 (44.7)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Metabolomics-Based Machine Learning Diagnostics of Post-Acute Sequelae of SARS-CoV-2 Infection

Abstract

1. Introduction

2. Methods

2.1. Approach Overview

2.2. Data Collection and Preprocessing

2.3. Molecular Descriptor Calculation

2.4. Machine Learning Model Training

2.5. Validation on Independent Testing Sets and Other Diseases

2.6. Threshold Determination

3. Results

4. Discussion

5. Limitations

6. Conclusions

Supplementary Materials

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Article Metrics

Citations

Article Access Statistics