Investigating Cellular Trajectories in the Severity of COVID-19 and Their Transcriptional Programs Using Machine Learning Approaches

Single-cell RNA sequencing of the bronchoalveolar lavage fluid (BALF) samples from COVID-19 patients has enabled us to examine gene expression changes of human tissue in response to the SARS-CoV-2 virus infection. However, the underlying mechanisms of COVID-19 pathogenesis at single-cell resolution, its transcriptional drivers, and dynamics require further investigation. In this study, we applied machine learning algorithms to infer the trajectories of cellular changes and identify their transcriptional programs. Our study generated cellular trajectories that show the COVID-19 pathogenesis of healthy-to-moderate and healthy-to-severe on macrophages and T cells, and we observed more diverse trajectories in macrophages compared to T cells. Furthermore, our deep-learning algorithm DrivAER identified several pathways (e.g., xenobiotic pathway and complement pathway) and transcription factors (e.g., MITF and GATA3) that could be potential drivers of the transcriptomic changes for COVID-19 pathogenesis and the markers of the COVID-19 severity. Moreover, macrophages-related functions corresponded more to the disease severity compared to T cells-related functions. Our findings more proficiently dissected the transcriptomic changes leading to the severity of a COVID-19 infection.


Introduction
The novel SARS-CoV-2 virus has caused a total of 22 million COVID-19 cases and nearly 380,000 deaths in the United States since 21 January 2020 [1]. COVID-19 patient symptoms exhibit significant variation, ranging from being asymptomatic to death [2]. COVID-19 infection mortality risk factors such as age, smoking status, gender, diabetes, and hypertension have been identified [3]. While the majority of patients experience no symptoms to moderate symptoms such as loss of taste or smell, fever, and chills, some patients develop respiratory failure and require hospitalization. Furthermore, there are patients that develop a severe infection despite having no known risk factors at all [4]. In addition to patients developing clinical psychological sequelae such as post-traumatic stress disorder, depression, and anxiety after a COVID-19 infection, the unpredictability of these infections continues to drive the worsening mental health of the general public [5].
As the number of cases and fatalities continue to rise, identifying the key mechanisms that modulate the severity of an infection is essential. There is growing evidence that immunopathology may play a significant role in the development of severe clinical sequelae in Genes 2021, 12, 635 2 of 13 infected patients [6]. Patients may develop cytokine storms resulting in various symptoms such as disseminated intravascular coagulation, respiratory failure, and shock [7].
Single cell RNA sequencing (scRNA-seq) is a high-throughput technique that enables the examination of gene expression of the cellular heterogeneity at an individual cell level [8]. It has been applied to COVID-19 studies to understand the mechanisms of disease, although such data is currently very limited due to the unavailability of human tissues from the patients. Liao et al. generated and analyzed bronchoalveolar lavage fluid (BALF) scRNA-seq data to reveal the landscape of bronchoalveolar immune cells in COVID-19 patients [9]. They collected 66,452 cells in total with 13 samples, and the study identified the presence of proinflammatory monocyte-derived macrophages in severe patients and CD8+ T cells in moderate cases. Several groups followed up with additional analyses using other data. For example, Liu et al. reanalyzed the same data in addition to bulk-tissue RNA-seq data. They detected SARS-CoV-2 gene expression in as many as eight immune-cell types including macrophages, CD8+ T cells, and NK cells [10]. Moreover, they identified an abundance of ORF10 and a high ORF10/N ratio in severe cases, which had not been previously reported [10]. Xu et al. performed another reanalysis of Liao et al.'s data by integrating the data with peripheral blood mononuclear cell scRNA-seq data [11]. They reported anomalous activation of BALF monocyte-macrophages in severe COVID-19. Although the previous work identified cellular features that are differentiated by the disease severity, it is still unknown what are the cellular trajectories of COVID-19 disease progression, i.e., healthy-to-moderate and healthy-to-severe changes, as well as what cellular features drive differentiation of the pathogenesis. In addition, as SARS-CoV-2 continues to rapidly evolve, identifying driver genes and transcriptional programs that respond to an infection will be useful for predicting the near future trend of the pandemic [12,13].
In this study, we hypothesize that there are linear cellular trajectories and transcriptional programs involved in the pathogenesis of COVID-19, which could be inferred from COVID-19 BALF scRNA-seq data. We further hypothesize that there are different cellular trajectories according to cell types and disease severity, and there are transcriptional programs that differentiate the cellular trajectories. To this end, we applied novel bioinformatics and machine learning methods to the Liao et al. scRNA-seq data to uncover the underlying biological mechanisms of the COVID-19 pathogenesis at single-cell resolution. We first applied the Slingshot algorithm [14] to infer the cellular trajectory from healthy-to-disease (moderate or severe) patients in two cell types: T cells and macrophages, both of which were found correlations between their abundance and COVID-19 severity in Liao et al. [9]. Other cell types were not used due to an insufficient number of cells. We subsequently assessed the biological pathways and transcriptional programs in each cell type using DrivAER (Driving transcriptional programs based on AutoEncoder derived Relevance scores), a deep learning algorithm developed in our lab [15]. These findings provided some important insights into the cellular changes during the infection and toward patient severity.

COVID-19 BALF scRNA-Seq Data
We retrieved COVID-19 BALF scRNA-seq data of 13 patients (severe (n = 6), moderate (n = 3), and healthy (n = 4) cases) generated by Liao et al. [9] from the UCSC Cell Browser [16]. The severity of infection was defined by the patient's symptoms. Severe COVID-19 infection was defined as requiring ventilation support and/or the presence of pneumonia in the lungs as opposed to the moderate patients, which only exhibited symptoms such as fever, chills, and nausea [9]. More detailed information is available in Liao et al.'s publication. We retained 49,417 macrophage cells and 7716 T cells based on the scRNA-seq data and annotations provided by the original paper. We then performed data filtration, normalization, dimensional reduction (Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) [17]), data integration, and clustering using Seurat version 3.2.3 for each cell type [18].
We processed Liao et al.'s data as follows: First, we identified 2000 highly variable genes per sample using the FindVariableFeatures function in Seurat. Second, with the identified highly variable genes, we performed canonical correlation analysis for data integration using the IntegrateData function implemented in Seurat [18]. Next, we normalized and integrated the gene-expression data using the NormalizedData and ScaleData functions in Seurat. Afterward, we performed PCA to calculate 30 principal components (PCs) of the integrated data via RunPCA and RunUMAP functions in Seurat. The first 20 PCs were used for UMAP embedding and clustering. At the end of the process, cell clustering was performed with the FindNeighbors and FindClusters functions. As a result, the process generated three types of outputs: a pre-processed gene-expression matrix, a cluster label of each cell, and cell embeddings. The cluster information and embeddings were used in the trajectory inference step, and the pre-processed gene-expression matrix was used during the DrivAER analysis.

Trajectory Inference Using Slingshot
Differentiation of the pseudotime of a lineage between sample groups could indicate a transition of the disease status (e.g. healthy-to-severe), as demonstrated in Fu et al. [19]. Slingshot is a top-performer for cellular trajectory from single-cell RNA-seq data, according to the recent benchmarking by Saelens et al. [20]. Saelens et al. also suggested using it to infer a linear trajectory (i.e., a disease progression), which suits our assumptions. We used Slingshot to infer cellular trajectories (i.e., pseudotime of cells) between two groups, and the following paired disease transitions were considered: healthy-to-moderate ( H → M) and healthy-to-severe ( H → S) . Slingshot requires a dimensional-reduced cell matrix and cluster labels of each cell as input, which we generated by using Seurat. For each trajectory inference, we set the parameters start.clus (the cluster containing the starting point of the trajectory) and end.clus (the cluster with the ending point of the trajectory) to infer the expected cellular trajectory (i.e. from the first group to the second group) based on the cell population of each group. As a result, slingshot generated multiple lineages. A single lineage represented a path in a cellular trajectory (i.e., possible COVID-19 pathogenesis on the cell type) and the cells belonging to the lineage were ordered by pseudotime calculated by slingshot. Since the next step (i.e., transcriptional program analysis) required us to select a lineage, we empirically selected a lineage based on the following criteria: (1) the number of cells covered by the lineage, preferably more cells, and (2) pseudotime distribution between the two groups, preferably a lineage with a higher distribution of the disease group (moderate/severe). Wilcox signed-rank test and fold-change calculation of pseudotimes between two groups were performed to evaluate the second criterion.

Transcriptional Program Identification
In this study, we applied DrivAER, a deep-learning based algorithm to identify transcriptional programs that potentially regulate the inferred cellular trajectories [15]. DrivAER is a transcriptional program identification method that utilizes deep learning and machine learning techniques. Transcriptional programs are sets of genes that are coregulated (e.g., targets of a transcription factor (TF)) or have common biological functions (e.g., genes in a pathway), and their functions potentially determine a cellular response such as cellular trajectory or condition. A previous study demonstrated that DrivAER accurately identifies transcriptional programs driving cellular response [21].
To test whether a transcriptional program is a potential regulator of the cellular response, DrivAER works as follows: First, DrivAER reduces the gene expressions of the transcriptional program into a low dimensional data manifold using the deep count autoencoder (DCA) [22]. Secondly, DrivAER uses random forest models to calculate a relevance score quantifying the association between the cellular manifold coordinates and their cellular response. If the cell responses are continuous like pseudotime values, R 2 , the coefficient of determination between prediction by a random forest and the observed cellular response, will be calculated as a relevance score. A higher relevance score indicates the transcriptional program potentially regulates the cellular response and vice versa. Transcriptional program annotations from MSigDB [23] and TRRUST [24] were collected and used in this study.
Since we have two different trajectories for each cell type ( H → S and H → M ), we designed ∆ RS = RS H→S − RS H→M , where RS H→S and RS H→M represent relevance scores of a transcriptional program in H → S and H → M trajectories, respectively. ∆RS scores the differentiation of a transcriptional program between two conditions. When ∆RS is greater than 0, it indicates that the transcriptional program may drive COVID-19 infections towards a severe state, and ∆RS < 0 indicates that the role of the transcriptional program may stabilize the patient's infection severity. When ∆RS is close to or equal to 0, it means there is no differentiation between two trajectories.

Macrophages and T Cells Exhibited More Diverse Cellular Trajectories from Healthy Controls to COVID-19 Patients
We first performed trajectory inference for four cases: H → M and H → S on macrophages and H → M and H → S T cells using the Slingshot algorithm ( Figure 1, Table 1, Supplementary Figure S1, and Supplementary Table S1). Slingshot detected cellular trajectories with multiple lineages for every case. We observed that the H → S of both macrophages and T cells showed more diverse cellular trajectories that could differentiate the disease status (six and four lineages), while the H → M only revealed a smaller number of the differentiated lineages (three and one linages). Next, we selected the top lineage of each transition and found that H → S of macrophages showed better differentiation than T cells (0.94 fold-change vs. 0.56 fold-change respectively, Figure 1A,C), and H → M of T cells showed better differentiation than macrophages (0.47 fold-change vs. 0.37 fold-change respectively, Figure 1B,D). This result suggested that studying macrophages would be informative to understand what cellular features/mechanisms aggravate the COVID-19 symptoms, and T cells would be helpful to study how the immune system reduces COVID-19 infection severity.

SARS-CoV-2 Gene-Expression Pattern Could Dissect the Healthy-to-Severe Trajectories into Three Different Stages of COVID-19
To investigate how cellular trajectories correlated with the SARS-CoV-2 virus, we explored the patterns of SARS-CoV-2 gene expression, which was quantified in Liu et al. [10], across the pseudotime in the H → S trajectories for macrophages and T cells (Supplementary Figure S2). We were unable to perform the analysis for the H → M trajectories because only a small number of cells contained SARS-CoV-2 gene transcripts in the moderate cell population [10]. We observed most of the gene expressions were located in the middle stage for both trajectories of macrophages and T cells ( Table 2), suggesting that the inferred healthy-to-severe pathogenesis at the single cell level consists of three stages: (1) cells without infection, (2) cells with SARS-CoV-2 virus infection, and (3) development of the disease and symptoms. Moreover, we found an enrichment of the ORF7a gene-expression at the late stage in macrophages (Supplementary Figure S2). We assumed that ORF7a expression at the late stage might indicate macrophages' ingestion of infected

The Inferred Cellular-Trajectories Model the Cell Type-Specific Immune Response
Next, using the sub-cell type information of T cells and macrophages from Liao et al. [9], we observed the subtype population changes of each cellular trajectory throughout pseudotime to examine whether the trajectory could model the cell type-specific immune response. (Figure 2). We observed a proper immune response in the healthy-to-moderate trajectory of T cells, which ultimately resulted in an adaptive immune response by cytotoxic T cells (CD8 + T cells), resulting in infection clearance. In contrast, T cells' healthy-to-severe trajectory showed the subpopulation responded early in the infection, at a similar time point, and decreased in numbers as the infection progressed ( Figure 2A). Immune cell exhaustion has been observed in severe COVID-19 patients, and the T cell dynamics reflect it. In addition, macrophage subpopulations in the healthy-to-severe trajectory had an increased number of FABP4 + (a sub-cell type with high-expression of fatty acidbinding protein 4) and FCN1 -SPP1 + (a sub-cell type with low-expression of ficolin-1 and high-expression of secreted phosphoprotein 1) throughout the trajectory ( Figure 2B). This reflects the recruitment of proinflammatory monocytes towards the lungs, which has been observed in severe conditions. Our findings demonstrate that the inferred trajectories could model not only the COVID-19 progression but also the immune response.

DrivAER Identified Potential Transcriptional Programs That Differentiate The Severity of COVID-19
We performed transcriptional program analysis for the selected top lineage of each cellular trajectory using DrivAER (Figure 3 and Supplementary Table S2). To investigate whether SARS-CoV-2 genes directly regulate the trajectories, we first ran DrivAER using SARS-CoV-2 genes (only for H → S in macrophages). We did not find a significant correlation between the SARS-CoV-2 gene expression and the cellular trajectory (RS H→S = 0.0019, Supplementary Figure S3). This result suggested that the disease progression might not be directly regulated by the viral genes, rather it might be regulated by other biological mechanisms.
We next ran DrivAER using transcriptional program annotations from MSigDB [23] and TRRUST [24] and prioritized several hallmark pathways and TFs by their ∆RS values, which measure the differentiation of a selected TP between H → S and H → M . From the analysis of macrophages trajectories, we identified the xenobiotic metabolism pathway and the TF MITF (melanocyte inducing transcription factor) as the top transcriptional programs in healthy-to-severe pathogenesis on macrophages. The visualization of manifold gene expression of both top transcriptional programs showed marginally gradient patterns that follow their trajectory ( Figure 4A,B), but both manifold gene expressions did not show any strong linear patterns that indicate strong correlations between the manifold gene expressions and pseudotimes. We also found that expressions of some genes in the transcriptional programs were moderately correlated with the inferred pseudotime ( Figure 4C,D). We performed DrivAER on the sub-cell types of both T cells and macrophages (Supplementary  Table S3). Proliferating T cells were not used during the analysis because the cell type's healthy-to-moderate trajectory did not have enough number of cells. Our DrivAER analysis could not identify any transcriptional programs that showed stronger ∆RS for the sub-cell types than Macrophages or T cells (Supplementary Figure S4). toxic T cells (CD8 + T cells), resulting in infection clearance. In contrast, T cells' healthy-tosevere trajectory showed the subpopulation responded early in the infection, at a similar time point, and decreased in numbers as the infection progressed (Figure 2A). Immune cell exhaustion has been observed in severe COVID-19 patients, and the T cell dynamics reflect it. In addition, macrophage subpopulations in the healthy-to-severe trajectory had an increased number of FABP4 + (a sub-cell type with high-expression of fatty acid-binding protein 4) and FCN1 -SPP1 + (a sub-cell type with low-expression of ficolin-1 and high-expression of secreted phosphoprotein 1) throughout the trajectory ( Figure 2B). This reflects the recruitment of proinflammatory monocytes towards the lungs, which has been observed in severe conditions. Our findings demonstrate that the inferred trajectories could model not only the COVID-19 progression but also the immune response.

DrivAER Identified Potential Transcriptional Programs That Differentiate The Severity of COVID-19
We performed transcriptional program analysis for the selected top lineage of each cellular trajectory using DrivAER (Figure 3 and Supplementary Table S2). To investigate whether SARS-CoV-2 genes directly regulate the trajectories, we first ran DrivAER using SARS-CoV-2 genes (only for → in macrophages). We did not find a significant correlation between the SARS-CoV-2 gene expression and the cellular trajectory ( → = 0.0019, Supplementary Figure S3). This result suggested that the disease progression might not be directly regulated by the viral genes, rather it might be regulated by other biological mechanisms. We next ran DrivAER using transcriptional program annotations from MSigDB [23] and TRRUST [24] and prioritized several hallmark pathways and TFs by their values, which measure the differentiation of a selected TP between → and → . From the analysis of macrophages trajectories, we identified the xenobiotic metabolism pathway and the TF MITF (melanocyte inducing transcription factor) as the top transcriptional programs in healthy-to-severe pathogenesis on macrophages. The visualization of manifold gene expression of both top transcriptional programs showed marginally gradient patterns that follow their trajectory ( Figure 4A,B), but both manifold gene expressions did not show any strong linear patterns that indicate strong correlations between the manifold gene expressions and pseudotimes. We also found that expressions of some genes in the transcriptional programs were moderately correlated with the inferred pseudotime (Figure 4C,D). We performed DrivAER on the sub-cell types of both T cells and macrophages (Supplementary Table S3). Proliferating T cells were not used during the analysis because the cell type's healthy-to-moderate trajectory did not have enough number of cells. Our DrivAER analysis could not identify any transcriptional programs that showed stronger for the sub-cell types than Macrophages or T cells (Supplementary Figure S4).

Discussion
In this study, we utilized machine learning approaches to investigate the cellular trajectory in the severity of COVID-19. Using the slingshot algorithm, we first found that there were more diverse trajectories of → than → for both macrophages and T cells. Furthermore, our deep-learning algorithm DrivAER analysis found that the trajectories are not directly regulated by SARS-CoV-2 genes, but several transcriptional programs potentially drive the transcriptomic changes in COVID-19 pathogenesis and serve as the biomarkers of COVID-19 severity.
We found several pieces of evidence from previous studies that the identified transcriptional programs could be keys to understanding the undiscovered mechanism of the pathogenesis of COVID-19 and differentiation between → and → . For example, the previous studies about the xenobiotic metabolism pathway hint that increased cytochrome P450 expression in macrophages worsens disease severity. The term xenobiotic

Discussion
In this study, we utilized machine learning approaches to investigate the cellular trajectory in the severity of COVID-19. Using the slingshot algorithm, we first found that there were more diverse trajectories of H → S than H → M for both macrophages and T cells. Furthermore, our deep-learning algorithm DrivAER analysis found that the trajectories are not directly regulated by SARS-CoV-2 genes, but several transcriptional programs potentially drive the transcriptomic changes in COVID-19 pathogenesis and serve as the biomarkers of COVID-19 severity.
We found several pieces of evidence from previous studies that the identified transcriptional programs could be keys to understanding the undiscovered mechanism of the pathogenesis of COVID-19 and differentiation between H → M and H → S . For example, the previous studies about the xenobiotic metabolism pathway hint that increased cytochrome P450 expression in macrophages worsens disease severity. The term xeno-biotic refers to any chemical or substance that is exogenous to the system, specifically for humans in this case [26]. The cytochrome P450 protein family (CYPs) is the most important member of this pathway. The interaction between the immune system and CYPs during inflammation has previously been examined. Pro-inflammatory cytokines such as IL-6 and TNF-alpha have been shown to down-regulate CYP activity in the liver [27]. There are also CYPs expressed within alveolar lung macrophages [28]. However, our results show an increased activity of macrophage xenobiotic metabolism in patients with a severe symptom compared to healthy controls. The CYP activity has been reported to increase inflammation and inhibit macrophage phagocytic ability during sepsis [29]. This contributes to oxidative stress, which contributes to the cytokine storm observed in severe cases during COVID-19 infection [30]. We believe that initially, the SARS-CoV-2 virus triggers increased macrophage activity, and this increased xenobiotic metabolism reflects changes in an oxidative burst from macrophages after phagocytosis. In addition, as the infection progresses, the macrophages secrete various pro-inflammatory cytokines that recruit other pro-inflammatory monocytes to the infection site [31]. This creates a positive feedback loop that continues to increase the oxidative stress of the patient and ultimately creates a cytokine storm, resulting in significant immunopathology.
We also found that dysregulation of TF MITF may result in severe COVID-19 infections. MITF is a transcription factor involved in the development of cell lineage, growth, and survival [32]. It was initially discovered in melanocytes and has also been studied in the context of melanoma [33]. In the context of the COVID-19 pandemic, Bost et al.'s study of SARS-CoV-2 host-viral infection maps identified MITF as one of the up-regulated genes [34]. MITF has been identified as a suppressor of innate immunity [35]. The gene for MITF lies downstream of M-CSF (macrophage colony-stimulating factor), a cytokine known as a growth factor for differentiation and growth of monocytes and macrophages [36]. The cytokine GM-CSF (granulocyte-macrophage colony-stimulating factor) has been explored as a potential therapy or therapeutic target for COVID-19 hyper-inflammation [37]. A randomized interventional trial suggested that administering recombinant GM-CSF improved patient outcomes, but there were questions that needed to be addressed as well [38]. Our results suggest that dysregulation of MITF in macrophages worsens infection severity in patients, but the mechanism behind this is not understood especially with the current evidence of macrophage-related inflammation in COVID-19 infections.
In the T cell healthy-to-severe trajectory analysis, we found that increased T cell activity may be insufficient to control the severity of SARS-CoV-2 infection. The role of adaptive cell-mediated immunity, CD4 + and CD8 + T cells, in a viral infection, is well understood. T cell activity plays a major role in the clearance of a SARS-CoV-2 infection [39], and T cell dysfunction has been observed in the most severe COVID-19 cases [40,41]. Moreover, T cell overactivation and exhaustion contribute to the hyperinflammatory state observed in severe patients and enhance the immunopathology caused by the cytokine storm [42]. The G2M checkpoint pathway is involved in the cell cycle. It is where cells progress into mitosis unless DNA damage has occurred [43]. In addition, the TF E2F is also involved in the cell cycle by promoting cellular growth. Aberrant E2F pathway activation can result in inappropriate cellular entry into the S phase [44]. Our results indicate that T cells undergo increased cellular growth during a severe COVID-19 infection as seen in the ∆RS of the G2M checkpoint pathway and E2F pathway between healthy control and severe patients. We also observed increased mitosis as seen in the ∆RS of the mitotic spindle pathway between healthy control and severe patients [45]. This indicates that T cell growth and proliferation are still functional but insufficient to control the virus in a severe COVID-19 infection.
We also observed the preservation of complement pathway activation in the H → M trajectory of T cells ( Figure 3C). The complement cascade is another arm of the innate immune system. Proteins involved in the cascade can initiate opsonization by phagocytes or directly attack pathogens by forming a membrane attack complex [46]. In addition, complement acts as a link between the innate and adaptive immune systems to allow for a coordinated response during infection [47]. Complement's role in COVID-19 infection has been highlighted recently as well. Complement dysfunction has been associated with respiratory failure [48]. It also comprises one of the multiple factors associated with severe infections [49]. Damage from complement dysfunction has been observed in autoimmune infections where complement cascade targets the host's cells [50]. The complement system may worsen inflammation in disease as well [51]. As stated earlier, the hyperinflammatory state during a COVID-19 infection is a major factor in a severe infection. Complement increases the cell response via increased IFN-gamma activity [52,53]. IFN-gamma is classified as a pro-inflammatory cytokine [54]. Our results may support that preserved complement cascade in T cells results in a more favorable outcome for patients infected by SARS-CoV-2.
It is worth noting that the inflammatory response pathway appears to be stronger in H → M than H → S on T cells ( Figure 3C). However, our results did not actually indicate that moderate patients exhibit a greater inflammatory response than the severe patients. We observed that the severe patients exhibited a higher level of gene expression of inflammatory response pathway genes (Supplementary Figure S5), and we suspected the higher expression might indicate their inflammatory response signature to remain constant throughout the infection.
The transcription factor analysis of H → M trajectory of T cells showed that activation of GATA3 coincides with decreased cytokine secretion and a better infection outcome. GATA3 holds an important role in the development and function of Th2 cells [55]. GATA3 also contributes to the ability of Th2 cells to secrete cytokines such as IL-4, 5, and 13, which are required for a type 2 immune response [56]. The type-2 immune response is characterized as immunity against helminths and parasites [57]. However, it is the type-1 response that is required to fight against intracellular pathogens including SARS-CoV-2. The type-2 immune response has also been shown to negatively impact the course of infections by respiratory viruses due to their promotion of inflammatory cells such as eosinophils [58]. The GATA3 gene is highly expressed in T cells of moderate patients (Wilcox signed-rank test, p-value < 2.2 × 10 −16 , while it is less expressed in T cells of severe patients. This validates our results as GATA3 activity should be down-regulated during a viral infection. These patients with a severe infection may have an active Th1 response as a result of GATA3 downregulation. On the other hand, moderate patients expressing a higher level of GATA3 may be a result of a lighter infection load. It is also possible there is some unknown mechanism causing this as cytokine profiles of severe COVID-19 patients are skewed towards a Th2 response [59].
We acknowledge the limitations in our studies including but not limited to the small sample size and lack of experimental validation. BALF is normally collected during the bronchoalveolar washing, which is a diagnostic tool for uncommon condition in lower respiratory tract pathology. This would greatly limit the scale of the sample collection. Moreover, after we checked all the available COVID-19 BALF single-cell datasets, we found there are no other studies that have a comparable study design, severity definition, or sample size. Furthermore, due to the high infectivity and pathogenicity of SARS-CoV-2, only a limited number of labs would have the capability to conduct functional validations. This is a common problem in the COVID-19 research field. Thus, further validation for our results is warranted when additional data is released. Other limitations include the type of data required for this analysis. Our dataset did not contain any developmental data and the number of patients was low, possibly resulting in a lower statistical power than desired. The cellular trajectory is also one-directional and does not consider more complex cases. In the future, we believe this method could be applied to other datasets and diseases especially in diseases where patients exhibit a great deal of heterogeneity.

Conclusions
We used machine learning approaches to investigate the cellular status transition of macrophages and T cells in diverse COVID-19 severity. We identified macrophage-related functions (xenobiotic metabolism pathway and binding of MITF) that contribute more to the severe COVID-19 symptoms. On the other hand, the deficiency of certain T cell-related functions (complement pathway and binding of GATA3) will likely lead to severe infection. Our findings provide new insight into the disease pathogenesis and potential treatment of COVID-19.  Figure S3: The SARS-CoV-2 gene manifold of the H→S trajectory of macrophages. The gene manifold was performed by DCA. X-and y-axes indicate the first and second dimension of the gene manifold. A point indicates a cell, and its color indicates the cell's pseudotime. Figure S4: ∆RS distributions for each cell type. Figure S5: average expressions and percentages of expressing cells of inflammatory response pathway genes by disease status. Table S1: The summary of inferred cellular trajectories and their lineages on macrophages and T-cells. Selected lineages of each trajectory are marked by an asterisk. Table S2: The list of TPs, relevance scores (RS H→M and RS H→S ), and ∆RS obtained by DrivAER analysis on macrophages and T cells. Table S3: The list of TPs, relevance scores (RS H→M and RS H→S ), and ∆RS obtained by DrivAER analysis on sub-cell types. Funding: Z.Z. was partially supported by National Institutes of Health grants R01LM012806 and R01DE030122 and Chair Professorship for Precision Health funds. We thank the technical support from the Cancer Genomics Core funded by the Cancer Prevention and Research Institute of Texas (CPRIT RP180734 and RP170668). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Informed Consent Statement: Not applicable
Data Availability Statement: The raw scRNA-seq files supporting the conclusions of this article can be downloaded from the NCBI GEO database (GEO ID: GSE145926), and the pre-processed gene-expression data are available at https://covid19-balf.cells.ucsc.edu.