Stratification of Pediatric COVID-19 Cases Using Inflammatory Biomarker Profiling and Machine Learning

While pediatric COVID-19 is rarely severe, a small fraction of children infected with SARS-CoV-2 go on to develop multisystem inflammatory syndrome (MIS-C), with substantial morbidity. An objective method with high specificity and high sensitivity to identify current or imminent MIS-C in children infected with SARS-CoV-2 is highly desirable. Our aim was to learn an interpretable model, based on a novel cytokine/chemokine assay panel, that provides such an objective classification. This retrospective study was conducted on four groups of pediatric patients seen at multiple sites of Texas Children’s Hospital, Houston, TX, who consented to provide blood samples to our COVID-19 Biorepository. Standard laboratory markers of inflammation and a novel cytokine/chemokine array were measured in blood samples of all patients. Group 1 consisted of 72 COVID-19, 70 MIS-C and 63 uninfected control patients seen between May 2020 and January 2021 and predominantly infected with pre-alpha variants. Group 2 consisted of 29 COVID-19 and 43 MIS-C patients seen between January and May 2021 infected predominantly with the alpha variant. Group 3 consisted of 30 COVID-19 and 32 MIS-C patients seen between August and October 2021 infected with alpha and/or delta variants. Group 4 consisted of 20 COVID-19 and 46 MIS-C patients seen between October 2021 and January 2022 infected with delta and/or omicron variants. Group 1 was used to train an L1-regularized logistic regression model, which was tested using five-fold cross-validation and then separately validated against the remaining naïve groups. The area under the receiver operating characteristic curve (AUROC) and the F1-score were used to quantify the performance of the cytokine/chemokine assay-based classifier.
Standard laboratory markers predicted MIS-C with a five-fold cross-validated AUROC of 0.86 ± 0.05 and an F1 score of 0.78 ± 0.07, while the cytokine/chemokine panel predicted MIS-C with a five-fold cross-validated AUROC of 0.95 ± 0.02 and an F1 score of 0.91 ± 0.04, with only sixteen of the forty-five cytokines/chemokines sufficient to achieve this performance. Tested on Group 2, the cytokine/chemokine panel yielded AUROC = 0.98 and F1 = 0.93; on Group 3, it yielded AUROC = 0.89 and F1 = 0.89; and on Group 4, AUROC = 0.99 and F1 = 0.97. Adding standard laboratory markers to the cytokine/chemokine panel did not improve performance. A top-10 subset of these 16 cytokines achieved equivalent performance on the validation data sets. Our findings demonstrate that a sixteen-cytokine/chemokine panel, as well as its top-ten subset, provides a highly sensitive and specific method to identify MIS-C in patients infected with SARS-CoV-2 of all the major variants identified to date.


Introduction
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection is typically milder in children than in adults, yet a significant number of patients still need hospitalization [1]. A life-threatening consequence of the infection in children is multisystem inflammatory syndrome (MIS-C). The WHO definition of MIS-C describes this condition as occurring in patients under 19 years of age, presenting within 12 weeks following SARS-CoV-2 primary infection or exposure with high fever for at least 3 days, two or more clinical features, disease activity and at least one elevated laboratory marker of inflammation [2]. While the risk factors that predispose some children to develop MIS-C are not fully understood, identification of MIS-C is important because severe organ dysfunction and death have been reported in these patients [3,4]. A prediction of MIS-C at initial presentation enables early initiation of immunotherapies that reduce severity and improve outcomes. Acute respiratory distress syndrome, mainly triggered by acute uncontrolled release of pro-inflammatory cytokines and referred to as cytokine storm, is a leading cause of severity and morbidity in SARS-CoV-2 infected patients, and a recent meta-analysis of studies in adults [5] suggests that the cytokine storm in SARS-CoV-2 infected patients is directly linked to disease severity. In this study, we therefore aimed to characterize the cytokine/chemokine profile of pediatric patients with MIS-C compared to those with SARS-CoV-2 infection (COVID-19) and to derive a method to stratify pediatric patients presenting with COVID-19 by their risk of MIS-C.

Study Design
Serum and plasma samples were obtained from the Texas Children's Hospital COVID-19 Biorepository (TCB) established in April 2020 under protocol H-48474 approved by the Institutional Review Board at Baylor College of Medicine. This repository holds serum/plasma samples from patients admitted with COVID-19 and/or MIS-C to any of the sites of the Texas Children's Hospital (TCH) system. In addition, samples from patients with no known inflammatory condition were included as controls. The TCH system serves the greater Houston metropolitan area of about 10,000 square miles, which incorporates 9 counties with a combined population of 8.2 million. This study used a training cohort of 205 patients who presented at one of the sites of TCH between April 2020 and January 2021. Of this cohort, 70 had a diagnosis of MIS-C by the CDC criteria, 72 were diagnosed with COVID-19 but did not develop MIS-C, and 63 were controls. The training cohort was obtained when the pre-alpha strain of the SARS-CoV-2 virus was predominant in the population. Three additional validation sets were also obtained. Validation cohort 1 constituted 92 new patients (43 MIS-C, 29 COVID-19 and 20 controls) treated in the TCH system between January 2021 and May 2021. Validation cohort 2 constituted 78 new patients (32 MIS-C, 30 COVID-19 and 16 controls) treated in the TCH system between August 2021 and October 2021. Validation cohort 3 had 76 patients (20 COVID-19, 46 MIS-C and 10 controls) treated in the TCH system between October 2021 and January 2022. While the variants infecting the specific patients from whom these samples were drawn were not recorded, the distribution of strains in TCH system patients as a function of time was recorded (Supplementary Figure S1a,b). Based on this temporal distribution, we infer that the training cohort was infected with predominantly pre-alpha variants, validation cohort 1 predominantly alpha, validation cohort 2 predominantly delta and validation cohort 3 predominantly delta and omicron variants of SARS-CoV-2. Supplementary Figure S2 shows the distribution of time intervals between a patient receiving a diagnosis of MIS-C and the time of blood sampling, where negative times indicate that the sample was drawn after the diagnosis had already been made, while positive times indicate that the patient received a diagnosis of MIS-C after the blood sample was drawn. Of note is that roughly half of the MIS-C patients had neither met CDC criteria nor received a diagnosis at the time the sample was drawn.

Serum Protein/Laboratory Data/EHR Data Collection
Frozen aliquots of serum/plasma from the TCB were used for biomarker analysis. Values of thirteen clinically used laboratory markers of disease activity and inflammation (C-Reactive Protein (CRP), Procalcitonin, D-Dimer, B-type Natriuretic Peptide (BNP), Sodium, Platelet count, Albumin, Fibrinogen, Protime, Neutrophil to Lymphocyte ratio (NLR), Total CO2, Ferritin and Troponin I) were measured since they have been shown previously to be associated with both adult and pediatric SARS-CoV-2 infection [5,6]. These laboratory markers were measured in blood samples collected at the same time as the biorepository samples. Demographic data (age, gender, race and ethnicity), vitals and results of SARS-CoV-2 antibody and PCR testing were all gathered from the patients' electronic health records (EHR). In addition, specifics of treatments administered (IVIG, anti-IL1RA, steroids, etc.) during the hospital stay were gathered from the EHR. Parameters marking the severity of disease, including length of stay in the hospital, length of stay in the ICU, and use of ventilators, ECMO, oxygen and CPAP, were also obtained from the EHR.

Confirmation of SARS-CoV-2 Infection
All patients whose nasopharyngeal swab tested positive for SARS-CoV-2 using either a transcription-mediated amplification assay or a reverse-transcriptase PCR assay were considered confirmed cases of SARS-CoV-2 infection. For the clinical definition of MIS-C, the Brighton classification by Vogel et al. was applied [2].

Cytokine/Chemokine Profiling
Cytokines/chemokines were analyzed using a 48-plex Millipore MAP Human Cytokine/Chemokine Magnetic Bead Panel (Millipore, St. Louis, MO, USA). Each sample was run in duplicate in separate measurements on a Luminex® MAGPIX instrument. Both kit-derived quality controls (QCs) and an in-house sample pool were used to control for lot-to-lot variability. A total of 45 of the 48 cytokines/chemokines, those with less than 10% inter- and intra-assay coefficients of variation and with less than 15% difference in duplicate readings, were retained for analysis.
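The retention criteria above can be illustrated with a minimal sketch. The function names, the made-up readings, and the choice to express the duplicate difference relative to the duplicate mean are assumptions for illustration; the sketch also collapses inter- and intra-assay variation into a single coefficient of variation and is not the authors' quality-control pipeline.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation (sample SD / mean), as a percentage."""
    return 100.0 * stdev(values) / mean(values)

def duplicate_diff_percent(a, b):
    """Absolute difference between duplicate readings, relative to their mean."""
    return 100.0 * abs(a - b) / mean([a, b])

def passes_qc(qc_runs, duplicates, cv_limit=10.0, diff_limit=15.0):
    """Hypothetical rule: keep an analyte if the CV of its QC runs is below
    cv_limit and every duplicate pair differs by less than diff_limit."""
    return cv_percent(qc_runs) < cv_limit and all(
        duplicate_diff_percent(a, b) < diff_limit for a, b in duplicates
    )

# Made-up readings (arbitrary units) for two hypothetical analytes
stable = passes_qc([100, 104, 98, 101], [(50, 53), (210, 200)])
noisy = passes_qc([100, 140, 70, 95], [(50, 80)])
print(stable, noisy)
```

In this toy example the first analyte would be retained and the second dropped.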

Computation
All calculations and computation were performed in, and all algorithms implemented in, Anaconda Python 3.9 with the sklearn and scipy packages.

Univariate Discrimination with the Wilcoxon Rank-Sum Test
The ability of laboratory and cytokine/chemokine analytes to discriminate COVID-19 from MIS-C groups was assessed using the non-parametric Wilcoxon rank-sum test on the training data (72 COVID-19 samples, 66 MIS-C samples). We tested the null hypothesis that there is no difference in the medians of the biomarker values between the COVID-19 and MIS-C cohorts. A threshold of p ≤ 0.05 was used to indicate significant differences in the distributions of the analyte in the two populations.
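As a minimal sketch of this univariate test, the snippet below applies `scipy.stats.ranksums` to synthetic data. The group sizes follow the text, but the log-normal values, the random seed and the variable names are assumptions for illustration only and do not represent study data.

```python
import numpy as np
from scipy.stats import ranksums

# Synthetic analyte levels (arbitrary units); NOT study data
rng = np.random.default_rng(0)
covid = rng.lognormal(mean=1.0, sigma=0.5, size=72)  # 72 COVID-19 samples
misc = rng.lognormal(mean=2.5, sigma=0.5, size=66)   # 66 MIS-C samples

# Two-sided Wilcoxon rank-sum test (normal approximation)
stat, p_value = ranksums(covid, misc)
significant = p_value <= 0.05  # threshold stated in the text

print(f"statistic = {stat:.2f}, p = {p_value:.2e}, significant = {significant}")
```

A negative statistic here indicates that the first group (COVID-19) tends to have lower values than the second (MIS-C). `scipy.stats.mannwhitneyu` implements the equivalent Mann-Whitney U formulation and could be used instead.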

Multivariate Cross-Validated L1 Regularized Logistic Regression Classification
An L1-regularized logistic regression, which augments the cross-entropy loss function with a penalty term proportional to the sum of the absolute values of the regression coefficients [7,8], was used for supervised learning. Five-fold cross-validation was used to assess generalization performance within the training set, characterized by the area under the receiver operating characteristic curve (AUROC) and the F1-score [9].
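The following is a minimal sketch of such a model in scikit-learn, run on synthetic data. The feature standardization step, the `liblinear` solver, the regularization strength `C=0.1`, the stratified shuffled folds and the simulated data are all assumptions made for illustration; the text does not specify these settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 45-analyte panel; NOT study data
rng = np.random.default_rng(0)
n_covid, n_misc, n_features = 72, 70, 45
X = rng.normal(size=(n_covid + n_misc, n_features))
y = np.array([0] * n_covid + [1] * n_misc)  # 0 = COVID-19, 1 = MIS-C
X[y == 1, :5] += 1.5  # only the first five synthetic features carry signal

# The L1 penalty drives uninformative coefficients to exactly zero
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)

# Five-fold cross-validation scored by AUROC and F1
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv, scoring=("roc_auc", "f1"))
auroc_mean, auroc_sd = scores["test_roc_auc"].mean(), scores["test_roc_auc"].std()
f1_mean, f1_sd = scores["test_f1"].mean(), scores["test_f1"].std()

# Refit on all training data to inspect which features are retained
model.fit(X, y)
coefs = model[-1].coef_.ravel()
n_selected = int(np.count_nonzero(coefs))

print(f"AUROC = {auroc_mean:.2f} ± {auroc_sd:.2f}, F1 = {f1_mean:.2f} ± {f1_sd:.2f}")
print(f"{n_selected} of {n_features} features have non-zero coefficients")
```

The mean and standard deviation across folds correspond to the "value ± spread" form in which cross-validated performance is reported; a single model refit on all training data is what would then be applied to held-out validation sets.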

Multidimensional Data Representation
The Uniform Manifold Approximation and Projection (UMAP) technique [10–12] was used to reduce the dimensionality of the dataset and visualize multidimensional cytokine/chemokine data in an unsupervised manner.

Network Analysis
The STRING database [13] of protein-protein interactions, including both physical and functional associations, was used to identify protein interaction networks associated with the cytokine/chemokine combinations characteristic of COVID-19 and MIS-C [14–16]. Interactions with confidence score > 0.9 were selected for network construction. In addition, we added connections between any two cytokines/chemokines that had high correlations (Pearson correlation > 0.8) in our training data. To visualize networks of proteins affected by either COVID-19 or MIS-C, we computed subnetworks composed of all cytokines elevated or depressed two-fold or more when comparing the median level for patients with either COVID-19 or MIS-C to the levels in control patients. To place the subnetworks in a more global context, we also added proteins that were not measured into the network if they interacted with at least four measured cytokines/chemokines in the network. Graph construction and visualization were performed using custom-written Python 3.9 scripts.
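The construction rules above can be sketched in plain Python as follows. This is an illustrative reading of the three rules, not the authors' custom scripts: the function names, the placeholder node names (`cyt1`, `hubX`, `lowY`), the confidence scores and the measurement vectors are all invented, and the sketch treats every analyte in `levels` as part of the subnetwork rather than first filtering on two-fold change.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def build_network(string_scores, levels, score_cut=0.9, corr_cut=0.8, min_links=4):
    """Return a set of undirected edges (frozensets of two node names)."""
    measured = set(levels)
    strong = {frozenset(pair) for pair, s in string_scores.items() if s > score_cut}
    # Rule 1: high-confidence database interactions between measured analytes
    edges = {e for e in strong if e <= measured}
    # Rule 2: strongly correlated pairs of measured analytes
    for a, b in combinations(sorted(measured), 2):
        if pearson(levels[a], levels[b]) > corr_cut:
            edges.add(frozenset((a, b)))
    # Rule 3: unmeasured proteins linked to at least min_links measured analytes
    links = {}
    for e in strong:
        outside, inside = e - measured, e & measured
        if len(outside) == 1 and len(inside) == 1:
            links.setdefault(next(iter(outside)), set()).update(inside)
    for protein, partners in links.items():
        if len(partners) >= min_links:
            edges.update(frozenset((protein, p)) for p in partners)
    return edges

# Invented confidence scores and measurements (placeholders, NOT STRING data)
string_scores = {
    ("cyt1", "cyt2"): 0.95, ("cyt2", "cyt3"): 0.99, ("cyt1", "cyt5"): 0.50,
    ("hubX", "cyt1"): 0.95, ("hubX", "cyt2"): 0.92, ("hubX", "cyt3"): 0.93,
    ("cyt4", "hubX"): 0.91, ("lowY", "cyt1"): 0.97,
}
levels = {
    "cyt1": [1, 2, 3, 4, 5], "cyt2": [2, 1, 4, 3, 3], "cyt3": [5, 3, 4, 1, 2],
    "cyt4": [1.1, 2.1, 2.9, 4.2, 5.1], "cyt5": [3, 3, 2, 4, 3],
}
edges = build_network(string_scores, levels)
for edge in sorted(tuple(sorted(e)) for e in edges):
    print(edge)
```

In this toy input, `hubX` is added because it interacts with four measured nodes, while `lowY` is left out because it interacts with only one.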

Clinical Characteristics of the Training Cohort and Validation Sets
Table 1a summarizes clinical and demographic characteristics for the training cohort, together with results of the Wilcoxon rank-sum test of the null hypothesis that there is no difference in each variable between MIS-C and COVID-19 patients. Table 1b summarizes the same characteristics for validation sets 1, 2 and 3. In the training cohort, significant differences between the MIS-C and COVID-19 groups were noted in median age (9 vs. 14 years), sex (62% male vs. 46%), median hospital stay duration (7.8 days vs. 5.5 days), median ICU stay (3.4 days vs. 0 days), ventilator use (30.8% vs. 17.1%) and incidence of acute kidney injury (AKI) (35.3% vs. 11.4%). There were no significant differences in the distribution of race, ethnicity and BMI values in the two cohorts. The demographics of the three validation cohorts mirror those of the training set, except for the COVID-19 cohort of validation set 3, which has only 20 patients. The differences between the COVID-19 and MIS-C cohorts in median hospital stay duration, median ICU stay, ventilator use and AKI that are observed in the training cohort are preserved in the three validation sets. However, the median values of these quantities decrease over time, which is consistent with evolution of the virus to higher infectivity/lower disease severity, as well as improvements in disease management and use of vaccines prior to validation set 3. For example, the median length of stay in hospital for the COVID-19 group goes down from 5.5 days in the training cohort to 3.8 days in validation sets 1 and 2 and to 0.5 days in validation set 3. A corresponding drop is observed in the length of hospital stay for the MIS-C cohort, from 7.8 days in the training cohort to 6.7 days, 6.0 days and 5.8 days in validation sets 1, 2 and 3, respectively. Similar drops over time are seen for median ICU length of stay, ventilator use, CPAP and AKI in both MIS-C and COVID-19 cohorts in the three validation sets.
1c provides these statistics for validation sets 1, 2 and 3.
Markers that were significantly elevated in the MIS-C training cohort compared to the COVID-19 training cohort are CRP (median 15.9 mg/dL, p = 5.2 × 10⁻¹⁰), ferritin (median 333.3 mg/L, p = 7.2 × 10⁻³), procalcitonin (median 4.7 ng/mL, p = 4.6 × 10⁻⁹), fibrinogen (median 513 mg/dL, p = 1.8 × 10⁻⁵), protime (median 15.4 s, p = 1.6 × 10⁻⁴), D-dimer (median 3.0 µg/mL, p = 1.4 × 10⁻⁷), BNP (median 221.6 pg/mL, p = 1.5 × 10⁻⁷) and the neutrophil to lymphocyte ratio (NLR) (median 8.0, p = 3.3 × 10⁻⁴). Markers that were significantly lowered in the MIS-C training cohort compared to the COVID-19 training cohort are albumin (median 3.3 g/dL, p = 9.7 × 10⁻⁷), platelet counts (median 158/µL, p = 7.0 × 10⁻⁷) and sodium (median 134, p = 2.5 × 10⁻⁷). Troponin I levels were not significantly different between the two cohorts.
In the three validation sets, for the MIS-C cohort, ferritin, fibrinogen, protime, D-Dimer, albumin and sodium levels remain about the same as in the training set. However, other inflammation markers improve from the training MIS-C cohort (pre-alpha) to the MIS-C cohort in validation set 3 (delta and omicron variants): platelet counts increase from 158/µL in the training MIS-C cohort to 199/µL in validation set 3, CRP declines from 15.9 mg/dL to 7.9 mg/dL, procalcitonin drops from 4.7 ng/mL to 1.8 ng/mL and BNP decreases from 221.6 pg/mL to 143 pg/mL. In the three validation sets, for the COVID-19 cohort, there are significant changes in four laboratory markers of inflammation from the training group to validation set 3: albumin drops from 4.2 g/dL to 3.6 g/dL (closer to the MIS-C group in the training set), BNP increases from 64 pg/mL to 169 pg/mL, procalcitonin changes from 1.6 to 0.4 and CRP rises from 2.1 mg/dL to 13.9 mg/dL. These data are consistent with MIS-C cases getting milder as the virus evolves, and COVID-19 becoming more severe, relative to the initial pre-alpha manifestation. Further, these trends suggest that standard laboratory markers of inflammation change with the severity of the variant/disease course and cannot reliably distinguish between MIS-C and COVID-19.

Cytokine/Chemokine Profiles of the Training Cohort
Table 1c shows the levels of 45 cytokines/chemokines ordered by the p-value from the univariate Wilcoxon rank-sum test differentiating COVID-19 from MIS-C in the training cohort. A total of 34 of the 45 cytokines/chemokines were statistically significantly different in the univariate sense. The top sixteen of these markers are soluble IL2 receptor, IP-10, MIG, IL-10, IL-15, IL-3, IL-1RA, TNF-α, IL-13, IFN-γ, IL-22, IL-2, TGF, G-CSF, IL-6 and IL-27, all with p < 3 × 10⁻⁶. Also shown in Table 1c are the specificity and sensitivity of each individual biomarker with five-fold cross-validation on the training data. No single cytokine/chemokine has both specificity and sensitivity over 0.9.

Machine Learning Models Differentiating COVID-19 from MIS-C
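Supplementary Figure S3 shows an L1 regularized logistic regression model trained with five-fold cross-validation using the 13 standard lab biomarkers. The model selects markers, omitting Troponin I. Performance statistics of this model on the training set as well as the three validation sets are in Supplementary Table S1. On the training data, the model exhibits a five-fold cross-validated AUC of 0.86 ± 0.05 and an F1 score of 0.78 ± 0.07. On validation set 1 of 29 COVID-19 and 43 MIS-C patients, the model makes 15 errors (5 COVID-19 and 10 MIS-C) with an AUC of 0.85 and an F1 of 0.81. On validation set 2 of 30 COVID-19 and 32 MIS-C patients, the model makes 14 errors (3 COVID-19 and 11 MIS-C) with an AUC of 0.84 and an F1 of 0.75. On validation set 3 of 20 COVID-19 and 46 MIS-C patients, the model makes 16 errors (4 COVID-19 and 12 MIS-C) with an AUC of 0.83 and an F1 of 0.71. ROC curves for the above are also depicted in Supplementary Figure S5.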
Figure 1 shows an L1 regularized logistic regression model trained with five-fold cross-validation on the cytokine/chemokine data obtained from the training cohort. The coefficients of each of the five models are shown in sorted order, with each coefficient representing the change in log-odds of MIS-C corresponding to a unit change in the value of the cytokine/chemokine. The model uses a total of 16 of the available 45 cytokines/chemokines to achieve a cross-validated AUC of 0.95 ± 0.02 and F1 score of 0.91 ± 0.04. The performance of L1-regularized models built with both cytokines/chemokines and laboratory biomarkers as predictors was identical to that of models built with cytokines/chemokines alone. The addition of laboratory biomarkers to the overall set of predictors was therefore determined not to improve the performance of the models.
To understand the effectiveness of our multivariate model, a two-dimensional UMAP projection of the 45-dimensional cytokine/chemokine vector representing each sample in the training cohort is constructed. Figure 2A (top) shows the UMAP projection. Two clusters become apparent: COVID-19 patients (purple dots) to the left and the MIS-C patients (red dots) to the right. The separation of the two cohorts in a low-dimensional projection of the data explains the excellent cross-validated performance of the logistic regression model. Note that the separation is not perfect: the logistic regression model misclassifies some of the MIS-C patients with a less severe form of the disease, as reflected in the cytokine/chemokine profiles shown in Supplementary Figure S4a,b.

Network Analysis of the Cytokine/Chemokine Training Data
The cytokine profiles in MIS-C and COVID-19 are significantly different. Figure 3A shows the elevated cytokines/chemokines in COVID-19 and in MIS-C in a network where the edges denote protein-protein interactions derived from the STRING database. An edge represents either a direct or an indirect (via a longer pathway) protein-protein interaction. Both networks reflect inflammation and immune activation, but the MIS-C network is far more extensive, involving many more cytokines/chemokines displaying orders of magnitude higher levels of inflammation. Compared to controls with no known inflammatory condition, the median levels of eight chemokines/cytokines in the COVID-19 patients in our training cohort exhibit a roughly two-fold elevation: IL-6 (2.3), GROα (2.2), MIG (2.3), IP-10 (2.0), IL-15 (2.1), IL-12p70 (2.6), sCD40L (2.4) and IL-
Medians and interquartile ranges for the thirteen laboratory biomarkers for each of the clusters are shown in Supplementary Table S2. As expected, the values of the lab markers in these clusters correlate well with the median cytokine/chemokine profiles from Figure 2A. In addition, we computed medians of the length of stay (days) and the ICU length of stay (days). Cluster 4 and Cluster 1 (the MIS-C clusters) are associated with longer stays in both the ICU and the hospital compared to Cluster 2 and Cluster 3 (the COVID-19 clusters). To assess severity of disease, we computed the fraction of patients on respiratory support (ECMO, ventilator, CPAP) for the four clusters. Cluster 4 has the highest usage of ECMO, ventilator and CPAP, while Cluster 2 has the lowest usage, confirming the severity/risk ordering of the COVID-19 and MIS-C patients.

Generalizability of Model to New Validation Sets
Table 2 shows that the model trained on the initial cohort of 72 COVID-19 and 70 MIS-C patients performs well on three new validation sets from local patients. To understand the classification errors made by the model, we project each of the validation sets into the UMAP coordinate frame defined by the training data.
Table 2. Performance of the logistic regression model trained on cytokines/chemokines of the training cohort and tested on three de novo validation sets gathered as the virus evolved in time. Note that the training set performance is judged by five-fold cross-validation, and thus there is a mean and standard deviation associated with each performance measure. The validation sets are evaluated in a standard train/test configuration: a single model is built with all the training data, and the validation sets are evaluated in turn against this model. Hence there is a single number characterizing the performance of the model along each metric.

Figure 2D shows validation set 3 projected on the training data UMAP. Of the twenty COVID-19 and forty-six MIS-C patients, one COVID-19 and two MIS-C patients were misclassified. An overall milder disease profile is consistent with the accuracy of this classification.
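As a worked check of how a single F1 value arises from the misclassification counts in a train/test evaluation, the sketch below recomputes F1 for validation sets 1 and 3 from the error counts reported in the text. It assumes MIS-C is treated as the positive class, which the text does not state explicitly; the function name is hypothetical.

```python
def f1_from_errors(n_misc, covid_errors, misc_errors):
    """F1 with MIS-C as the positive class (an assumption).

    n_misc       -- number of MIS-C patients in the set
    covid_errors -- COVID-19 patients classified as MIS-C (false positives)
    misc_errors  -- MIS-C patients classified as COVID-19 (false negatives)
    """
    tp = n_misc - misc_errors
    return 2 * tp / (2 * tp + covid_errors + misc_errors)

# Validation set 1: 43 MIS-C patients, 0 COVID-19 and 6 MIS-C errors
f1_set1 = f1_from_errors(43, covid_errors=0, misc_errors=6)
# Validation set 3: 46 MIS-C patients, 1 COVID-19 and 2 MIS-C errors
f1_set3 = f1_from_errors(46, covid_errors=1, misc_errors=2)

print(f"validation set 1: F1 = {f1_set1:.3f}")  # 74/80
print(f"validation set 3: F1 = {f1_set3:.3f}")  # 88/91
```

Under this assumption the values come out at about 0.925 and 0.967, in line with the reported F1 scores of 0.93 and 0.97 for these two sets.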

These projections of the validation sets onto the UMAP of the training data are consistent with the cytokine/chemokine measurements yielding highly accurate predictions of MIS-C even as the disease evolved. The 16 cytokine/chemokine model, as well as the minimal set of 10 cytokines/chemokines, appears to be robust and maintains its accuracy over time.
To understand the role of elevated cytokines/chemokines in the context of other unmeasured proteins, we augmented the network with proteins that interact with at least four of the measured biomarkers. The augmented networks for both COVID-19 and MIS-C are shown in Figure 3B. These longer-range interactions connect the disconnected components in both networks to the core subnetworks. In the COVID-19 network in Figure 3B, IL-15, sCD40L and IL-12p70 connect to the MIG/IP-10/IL-6 subnetwork via IL-4. In addition, IL-22 connects to the core subnetwork in COVID-19 in Figure 3 via IL-6. In the MIS-C network, IL-22 connects to the core connected component in Figure 3 via IL-10. Figure 3B demonstrates the essential connectedness of the networks shown in Figure 3, the extent and scope of immune system dysfunction in MIS-C and the signaling pathways affected by the disease.
Examining differential expression of the cytokines/chemokines between COVID-19 and MIS-C patients in the training cohort, Figure 3C shows cytokines/chemokines for which the ratio of the median level in the MIS-C group to the median level in the COVID-19 group is greater than 2, i.e., a two-fold or more elevation in MIS-C. A total of 14 cytokines/chemokines are observed to be differentially overexpressed: IL-1RA (92.7), IP-10 (27.7), MIG (11.1), sIL2R (4.2), IL-10 (3.9), IL-27 (3.2), VEGF (2.4), TNF-α (2.4), IL-18 (2.3), MCP-1 (2.3), IL-6 (2.2), IL-3 (2.1), TGF-β (2.1) and IL-17F (2.1). The protein-protein interaction network corresponding to these differentially expressed cytokines/chemokines identifies sIL2R → IL-6 → IP-10 → IL-10 → MIG as the network path with the highest inflammation in MIS-C relative to COVID-19. These cytokines/chemokines are therefore potential targets for therapeutic intervention. Of note is that one of the direct branches of this pathway with extremely high differential expression in MIS-C (IL-1RA) is the target of the drug anakinra, currently used to treat this condition. Figure 3C also shows the differentially expressed cytokines/chemokines in the context of other proteins in the STRING database, revealing additional signaling pathways that could be targeted for therapeutics.
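The ratio-of-medians criterion described above can be illustrated with a short sketch. The analyte names and values are invented placeholders, not study measurements, and the function name is hypothetical.

```python
from statistics import median

def median_fold_changes(group_a, group_b):
    """Ratio of median level in group_a to median level in group_b, per analyte."""
    return {
        name: median(group_a[name]) / median(group_b[name])
        for name in group_a
        if name in group_b
    }

# Invented levels for three placeholder analytes (arbitrary units)
misc = {"analyte_1": [120, 300, 250], "analyte_2": [10, 12, 11], "analyte_3": [40, 90, 66]}
covid = {"analyte_1": [10, 20, 15], "analyte_2": [9, 10, 11], "analyte_3": [25, 35, 30]}

ratios = median_fold_changes(misc, covid)

# Keep analytes with a MIS-C/COVID-19 median ratio greater than 2, largest first
overexpressed = {
    name: ratio
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
    if ratio > 2
}
for name, ratio in overexpressed.items():
    print(f"{name}: {ratio:.1f}-fold")
```

Using medians rather than means keeps the ratio robust to the extreme values that are common in cytokine measurements.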
The sixteen-cytokine/chemokine panel, as well as the top ten subset, generates useful system-level hypotheses about the pathogenesis of MIS-C. By deriving a protein-protein network using the STRING database, with elevated cytokines/chemokines in COVID-19 and MIS-C as nodes, we visualize the affected signaling pathways in both these conditions. Network analysis of the cytokines and chemokines elevated in COVID-19 versus MIS-C reveals major differences in the scope of the inflammatory response. Eight of the forty-five measured cytokines/chemokines are elevated in COVID-19 and fifteen are elevated in MIS-C. Both networks show immune system dysfunction, but the MIS-C network is far more extensive, involving many more cytokines/chemokines displaying orders of magnitude higher levels of inflammation. In both networks, IL-6, MIG and IP-10 contribute the most to the cytokine storm. However, high levels of soluble IL2R, IL-1RA, IL-10, IL-18, IL-8, IFN-γ, IL-27, IL-17F and TNF-α appear unique to MIS-C and could serve as markers for disease severity.
The immunological features of pediatric COVID-19 and MIS-C are the subject of numerous investigations [15–25]. Our patient cohort is one of the largest that has been studied to date, and our set of 45 cytokines/chemokines is one of the most comprehensive panels to be analyzed. Among the investigators who have studied differences between COVID-19 and related diseases and MIS-C, one of the first to report clear differences between Kawasaki disease (KD) and MIS-C using IL-6, IP-10 and IL-17A was a small study [25] of 13 MIS-C patients. Recently, other studies [16–19,26–28] have also shown differential cytokine/chemokine expression in MIS-C compared to either controls, COVID-19 or Kawasaki disease. In a small cohort [29] of seven MIS-C patients, differentiation between COVID-19 and MIS-C patients was achieved using a combination of IL-10, IL-1RA, IL-18, IL-6, TNF and IFN-gamma. Subsequently, in a larger cohort of 118 subjects [30], significant elevations in IL-6, IL-10, IL-17A and IFN-gamma were demonstrated and correlated to length of hospital stay.
Network analysis reveals that the top ten cytokines/chemokines selected by the robust L1-regularized logistic model for differentiating COVID-19 from MIS-C include a subset (MIG, IP-10 and IL-15) which are three of the five cytokines/chemokines elevated in both conditions, with significantly greater elevation in MIS-C. The biomarkers sIL2R, IL-1RA and IL-8 are elevated only in MIS-C. Of the other four biomarkers included in the model, MDC (1.6) and PDGF-AB/BB (1.4) are elevated in COVID-19 relative to MIS-C, while G-CSF (2.36) and FLT-3L (1.1) are elevated in MIS-C, consistent with the model's robust performance across a range of de novo validation sets gathered even as the disease itself evolved. The L1-regularized logistic regression model based on the measurement of as few as 10 novel cytokine/chemokine analytes provides a highly sensitive and specific method to predict MIS-C at initial presentation of SARS-CoV-2 infected patients, while also providing key insights into potential therapeutic targets.

Conclusions
Several investigations have examined different laboratory markers of inflammation, disease activity and cytokine/chemokine signatures for MIS-C and COVID-19. These studies suggest that there is no single laboratory or cytokine/chemokine biomarker that can differentiate MIS-C and COVID-19. This motivated our quest for a multianalyte profile with algorithmic interpretation. Our model, which was derived from a training cohort infected predominantly with pre-alpha strains, is accurate in predicting disease status for all major COVID-19 variants to date (alpha, delta, omicron) in large validation cohorts, and it identifies MIS-C with few errors. A summary of this study is depicted in Supplementary Figure S6, from the training cohort to the L1 regularized logistic regression models, including why the model gives an excellent prediction of MIS-C in subsequent cohorts and plausible mechanistic pathways. Given the change over time in the course and severity of the disease, as well as in its management, it is important to note that standard laboratory markers do not add any significant information to our model. Notably, the model appears to work for both diagnosis and

Figure 1. L1 regularized logistic regression model trained with 5-fold cross-validation on the cytokine/chemokine data obtained from the training cohort. The model uses a total of 16 of the available 45 cytokines/chemokines. Each bar is a graphical representation of the five logistic models obtained with 5-fold cross-validation. The coefficients of each model are presented in sorted order. Each coefficient represents the change in log-odds of MIS-C with unit change in the value of that cytokine/chemokine.

On validation set 1 of 29 COVID-19 and 43 MIS-C patients, the model makes six errors (0 COVID-19 and 6 MIS-C) with an AUC of 0.98 and an F1 of 0.93. On validation set 2 of 30 COVID-19 and 32 MIS-C patients, the model makes eight errors (5 COVID-19 and 3 MIS-C) with an AUC of 0.89 and an F1 of 0.88. The drop in performance on the second validation set is consistent with the prevalence of the delta variant in this cohort, with inflammation in COVID-19 being more severe compared to the pre-alpha training cohort. On the third validation set of 20 COVID-19 and 46 MIS-C patients, the model makes three errors (1 COVID-19 and 2 MIS-C) with an AUC of 0.99 and an F1 of 0.97. The confusion matrices and ROC curves for the three validation cohorts are presented in Supplementary Figure S3.
Importantly, these results do not change significantly even when the model is restricted to 10 cytokines/chemokines composed of the top five predictive of COVID-19 and the top five predictive of MIS-C from the original 16 biomarker model. These 10 cytokines/chemokines are soluble IL2R, IP-10 (CXCL-10), IL-1RA, IL-15, MIG (CXCL-9), MDC (CCL22), IL-8, G-CSF, FLT-3L and PDGF-AB/BB. There is some overlap with the cytokines/chemokines reported by Sacco et al. [17], who find IP-10, IL-2, MDC and IL-15 significant at the univariate level for discriminating MIS-C from COVID-19. However, as shown in Table 1c, none of these markers has high specificity and high sensitivity at an individual level. Our multivariate analysis reveals a combination of 16 (or a subset of 10) with excellent discriminative performance on a large patient cohort gathered over time, as the virus evolved.
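The restriction to the top five analytes predictive of each class can be sketched from the sign and magnitude of the fitted coefficients. The snippet below assumes that positive coefficients favor MIS-C and negative coefficients favor COVID-19 (consistent with coefficients being changes in the log-odds of MIS-C); the marker names and coefficient values are invented placeholders, and a smaller `k` is used to keep the example short.

```python
def top_k_per_class(coefficients, k=5):
    """Split non-zero coefficients by sign and keep the k largest in magnitude per sign.

    Returns (covid_markers, misc_markers): names with the most negative and the
    most positive coefficients, respectively.
    """
    positive = sorted(
        (item for item in coefficients.items() if item[1] > 0),
        key=lambda item: item[1],
        reverse=True,
    )[:k]
    negative = sorted(
        (item for item in coefficients.items() if item[1] < 0),
        key=lambda item: item[1],
    )[:k]
    return [name for name, _ in negative], [name for name, _ in positive]

# Invented log-odds coefficients for placeholder markers
coefficients = {
    "m1": 1.2, "m2": -0.8, "m3": 0.3, "m4": -0.1,
    "m5": 0.9, "m6": -0.5, "m7": 0.0,
}
covid_markers, misc_markers = top_k_per_class(coefficients, k=2)
print("COVID-19 markers:", covid_markers)
print("MIS-C markers:", misc_markers)
```

Markers with a zero coefficient (here `m7`) are ignored, mirroring the fact that the L1 penalty has already removed them from the model.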

Figure 2. (A) Top: a two-dimensional UMAP projection of the 45-dimensional cytokine/chemokine vector representing each sample in our training cohort. The COVID-19 patients in the validation set are colored purple, while the MIS-C patients are colored red. (A) Bottom: the medians of key cytokines/chemokines for each of the four clusters. (B) The first validation set projected back into the UMAP coordinates derived from the training data. Misclassified COVID-19 and MIS-C patients are called out by dotted circles of purple and red. For the first validation set with 29 COVID-19 and 43 MIS-C patients, only 6 MIS-C patients were misclassified. Five of them fall in the COVID-19 clusters defined by the training cohort, and these MIS-C patients were confirmed by chart review to be mild cases. (C) The second validation set projected back into the UMAP coordinates derived from the training data. Misclassified COVID-19 and MIS-C patients are called out by dotted circles of purple and red. For the second validation set with 32 COVID-19 and 30 MIS-C patients, 5 COVID-19 and 3 MIS-C patients were misclassified. In total, 4 of the COVID-19 patients fall in the MIS-C clusters defined by the training cohort, showing the evolution of the disease, with the COVID-19 patients having more severe disease compared to the initial training cohort. The 3 misclassified MIS-C patients have a mild version of the disease and fall into the low-risk COVID-19 cluster defined by the training data. (D) The third validation set projected back into the UMAP coordinates derived from the training data. Misclassified COVID-19 and MIS-C patients are called out by dotted circles of purple and red. For the third validation set with 20 COVID-19 and 46 MIS-C patients, 1 COVID-19 and 2 MIS-C patients were misclassified.

The COVID-19 and MIS-C patients are further stratified into two clusters each. Cluster 2 (50 COVID-19, 9 MIS-C) and Cluster 3 (16 COVID-19, 6 MIS-C) are predominantly COVID-19 clusters, while Cluster 1 (4 COVID-19, 31 MIS-C) and Cluster 4 (2 COVID-19, 24 MIS-C) are predominantly MIS-C clusters. As shown in Figure 2A (bottom), Cluster 3 represents patients in the COVID-19 cohort with higher median levels of inflammation in the measured cytokines/chemokines. Relative to Cluster 2, Cluster 3 patients have elevated median levels of IL-18, IL-27, PDGF-AB/BB, and FGF-2. Cluster 4 represents patients in the MIS-C cohort with higher median levels of IP-10, MIG, TNF-α, and IFN-γ, relative to Cluster 1. Cluster 1 is characterized by an elevation of IL-1RA relative to Cluster 4, potentially reflecting treatment with IVIg/steroids/anakinra. The overall risk stratification/disease severity

M: MIS-C; Val: validation; AUC: area under the curve.

Figure 2B shows Validation set 1 projected on the training data UMAP. Of the twenty-nine COVID-19 and forty-three MIS-C patients in this validation set, only six MIS-C patients were misclassified. Note that all the new COVID-19 patients fall within COVID-19 clusters 2 and 3 defined by the training data. Five of the six misclassified MIS-C patients fall in the COVID-19 clusters defined by the training cohort, and these MIS-C patients were confirmed by chart review to be mild cases.

Figure 2C shows Validation set 2 projected on the training data UMAP. Of the thirty-two COVID-19 and thirty MIS-C patients, five COVID-19 and three MIS-C patients were misclassified. Four of the COVID-19 patients fall in the MIS-C clusters defined by the training cohort, consistent with this set containing more delta-variant-infected patients with severe disease compared to the initial training cohort. The three misclassified MIS-C patients have a mild version of the disease and fall into the low-risk COVID-19 cluster defined by the training data.

Figure 2D shows Validation set 3 projected on the training data UMAP. Of the twenty COVID-19 and forty-six MIS-C patients, one COVID-19 and two MIS-C patients were misclassified. An overall milder disease profile is consistent with the accuracy of this classification. These projections of the validation sets onto the UMAP of the training data are consistent with the cytokine/chemokine measurements yielding highly accurate predictions.
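The misclassification counts quoted above are enough to recompute an F1 score for each validation set. The sketch below does so under one assumption that the text does not state explicitly: MIS-C is treated as the positive class and F1 is the usual harmonic mean of precision and recall for that class. The function name and the tuple layout are illustrative.

```python
def f1_from_counts(n_covid, n_misc, covid_errors, misc_errors):
    # Assumption: MIS-C is the positive class.
    tp = n_misc - misc_errors   # MIS-C patients correctly identified
    fp = covid_errors           # COVID-19 patients called MIS-C
    fn = misc_errors            # MIS-C patients called COVID-19
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# (COVID-19 patients, MIS-C patients, COVID-19 errors, MIS-C errors),
# taken from the counts stated in the text for each validation set.
validation_sets = {
    "Validation set 1": (29, 43, 0, 6),
    "Validation set 2": (32, 30, 5, 3),
    "Validation set 3": (20, 46, 1, 2),
}

f1_scores = {name: f1_from_counts(*counts) for name, counts in validation_sets.items()}
for name, f1 in f1_scores.items():
    print(f"{name}: F1 = {f1:.3f}")
```

Under this convention the first and third sets come out at 74/80 and 88/91, in line with the reported 0.93 and 0.97. The second set comes out at 54/62, about 0.87, slightly below the reported 0.88; the gap may reflect rounding, a different F1 convention, or the cohort sizes, since the abstract lists 30 COVID-19 and 32 MIS-C patients for this group.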

Figure 3. (A) Ratio of median cytokine levels in COVID-19 patients to healthy controls and in MIS-C patients to controls for all cytokines/chemokines that are elevated/depressed two-fold or more. The network uses protein-protein interactions (black) from the STRING database to determine connections. Nodes are colored based on the fold change, and the actual fold change is also included in the node label. Many more cytokines/chemokines are dysregulated in MIS-C than in COVID-19.
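The fold-change filter described in this caption can be sketched as follows: take the ratio of the group median to the control median for each analyte, and keep only those whose ratio is at least two or at most one half. The analyte names and all values below are made up for illustration and are not study data; the function name and the `threshold` parameter are likewise assumptions.

```python
from statistics import median

def fold_changes(group, controls, threshold=2.0):
    # Ratio of medians per analyte, keeping only changes of `threshold`-fold
    # or more in either direction (elevated or depressed).
    kept = {}
    for name, values in group.items():
        ratio = median(values) / median(controls[name])
        if ratio >= threshold or ratio <= 1.0 / threshold:
            kept[name] = ratio
    return kept

# Hypothetical concentrations for three hypothetical analytes.
controls = {
    "analyte_A": [100.0, 120.0, 140.0],
    "analyte_B": [10.0, 12.0, 14.0],
    "analyte_C": [800.0, 900.0, 1000.0],
}
patients = {
    "analyte_A": [900.0, 1200.0, 1500.0],  # strongly elevated
    "analyte_B": [11.0, 13.0, 15.0],       # essentially unchanged
    "analyte_C": [300.0, 400.0, 500.0],    # depressed more than two-fold
}

patient_changes = fold_changes(patients, controls)
for name, ratio in patient_changes.items():
    print(f"{name}: {ratio:.2f}-fold relative to controls")
```

Using medians rather than means keeps the ratio robust to the extreme values that are common in cytokine measurements.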


Table 1b also summarizes statistics for all the measured laboratory markers for the training cohort, and Table