The GSNCASCR R package (version 0.1.0) compares gene co-expression networks in terms of their structural properties. The construction of co-expression networks, the graph spectral analysis, and main features of the package are described below:
We used both simulation experiments and analyses of biological data to evaluate the performance of GSNCASCR.
2.2. Simulation
To evaluate the statistical power of the GSNCASCR, GSCA, and GSNCA methods, we generated simulated datasets with scDesign2 (version 0.1.0).
Using scDesign2 and the mouse_sie_10x.rds dataset in this package as training data, we simulated two datasets: one that maintains gene co-expression and one without co-expression (
Figure 2). We extracted the highly correlated gene pairs in the training dataset and manually created gene sets with different numbers of co-expressed gene pairs. Gene sets with >40% and <20% co-expressed gene pairs were defined as positive and negative, respectively [
15].
Steps to generate the simulated data are illustrated in
Figure S2. (1) Based on the training dataset, scDesign2 was used to generate two single-cell datasets: one that maintains co-expression and another that does not. (2) We manually created pathways of varying sizes (20, 40, 60, 80, and 100 genes) by selecting different numbers (n) of highly correlated gene pairs from the initial training dataset. The remaining genes (pathway size minus the number of correlated genes) were randomly selected to complete the pathways. (3) We ran the GSNCASCR, GSNCA (version 1.42.0), and GSCA (version 1.1.1) software tools to compare datasets with and without maintained co-expression. Positive pathways exhibited more co-expression changes than negative pathways. (4) We used AUC values to evaluate the performance of the three tools.
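Step (2) can be sketched as follows. This is a minimal illustration in Python, not code from the GSNCASCR package; the function names `build_pathway` and `label_pathway` are hypothetical, and how the co-expressed fraction is computed for a given pathway is left to the caller because it is not specified here.

```python
import random


def build_pathway(correlated_pairs, background_genes, size, n_pairs, seed=0):
    """Return (genes, chosen_pairs) for one synthetic pathway of `size` genes."""
    rng = random.Random(seed)
    # Draw n_pairs highly correlated gene pairs from the training data.
    chosen_pairs = rng.sample(correlated_pairs, n_pairs)
    genes = []
    for pair in chosen_pairs:
        for gene in pair:
            if gene not in genes:
                genes.append(gene)
    if len(genes) > size:
        raise ValueError("correlated genes exceed the requested pathway size")
    # Fill the remaining slots (size minus correlated genes) at random.
    pool = [g for g in background_genes if g not in genes]
    genes.extend(rng.sample(pool, size - len(genes)))
    return genes, chosen_pairs


def label_pathway(fraction_coexpressed):
    """Apply the >40% / <20% cutoffs; values in between stay unlabeled."""
    if fraction_coexpressed > 0.40:
        return "positive"
    if fraction_coexpressed < 0.20:
        return "negative"
    return None
```

Pathways whose fraction falls between the two cutoffs are returned as `None`, which mirrors the grey area left between positive and negative sets.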
The
value, as described in Equation (3) of Materials and Methods, is utilized to quantify the differences between networks under two conditions. Permutations were obtained by shuffling cell labels. We observed that the
values from these permutations follow a normal distribution (Shapiro–Wilk test), as shown in
Figure S3. Consequently, we compared our observed results with the permutation-derived distribution to estimate
p-values.
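The permutation procedure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the package's implementation: `statistic` stands in for the network-difference value of Equation (3), which is not reproduced here, and the upper-tailed test (larger values meaning a larger difference) is an assumption.

```python
import math
import random
import statistics


def permutation_null(values, labels, statistic, n_perm=1000, seed=0):
    """Recompute `statistic` after shuffling the cell labels n_perm times."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(statistic(values, shuffled))
    return null


def normal_permutation_pvalue(observed, null_values):
    """Upper-tail p-value from a normal fit to the permutation values."""
    mu = statistics.mean(null_values)
    sd = statistics.stdev(null_values)
    if sd == 0:
        raise ValueError("permutation values have zero variance")
    z = (observed - mu) / sd
    # P(Z >= z) for a standard normal variable (assumed one-sided test).
    return 0.5 * math.erfc(z / math.sqrt(2))
```

Fitting a normal distribution to the permutation values allows p-values smaller than 1/n_perm to be estimated, which an empirical count of exceedances cannot provide.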
We used simulated data to compute true sensitivities and precision of the tools for detecting co-expression alteration pathways. Receiver operating characteristic (ROC) curves, using the simulated data (>40% and <20% gene pairs for positive and negative pathways, respectively), are shown in
Figure 3. GSNCASCR shows the highest area-under-the-curve (AUC) value, indicating the best performance among the three tools tested.
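For reference, the AUC can be computed directly from per-pathway scores as the probability that a positive pathway scores higher than a negative one. The sketch below is a generic Python illustration; using a score such as one minus the p-value is an assumption, not a detail taken from the analysis.

```python
def auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    if not scores_positive or not scores_negative:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))
```

An AUC of 1.0 means every positive pathway outranks every negative one, and 0.5 corresponds to random ranking.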
Average true positive rates (TPRs, sensitivities), false positive rates (FPRs), precision, and accuracy of the tools are given in
Table 1. We defined TPs as truly called differentially co-expressed pathways and FPs as the pathways called significant but not differentially co-expressed pathways. Similarly, true negatives (TNs) were defined as pathways that were not truly differentially co-expressed and were not called significant, and false negatives (FNs) were defined as pathways that were truly differentially co-expressed but were not called significant.
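These definitions translate into the reported metrics as in the following minimal Python sketch. The function name `classification_metrics` and the boolean input format are illustrative choices.

```python
def classification_metrics(truth, called):
    """Compute TPR, FPR, precision, and accuracy from boolean vectors.

    truth[i]  -- pathway i is truly differentially co-expressed
    called[i] -- pathway i was called significant by the tool
    """
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "tpr": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

A tool can reach a precision of 1.00 with a lower accuracy when it makes no false positive calls but misses some truly differentially co-expressed pathways.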
As seen in
Table 1, GSNCASCR achieved the highest identification accuracy (0.79) and precision (1.00). In comparison, GSCA identified the greatest number of truly differentially co-expressed pathways but also introduced the highest number of false positives (high false positive rate), which resulted in a low identification accuracy of 0.69. GSNCA introduced only a small number of false positives but identified the smallest number of truly differentially co-expressed pathways, which resulted in an identification accuracy of 0.73.
Adjusting the cutoff criteria used to define positive and negative pathways in simulated data produces different AUC values for the three software tools. More stringent cutoffs are expected to yield higher AUC values because they create a clearer distinction between positive and negative pathways. We experimented with various cutoffs and found that GSNCASCR outperformed the other two tools in approximately 90% of the cutoff combinations. AUC values using the criteria of greater than 40% for positive and less than 40% for negative pathways, without any grey area, are presented in
Supplementary Figure S4.
Due to the inherent noise in scRNA-seq data, it is essential to assess the impact of dropout and noise. Our simulated dataset is based on actual scRNA-seq data, which naturally includes dropout and noise. To further mimic these conditions, we introduced two types of data corruption: (a) increasing the number of zeros (dropouts) and (b) adding noise [
16,
17]. Dropouts were applied by setting low expression values to zero with a higher probability, varying the fraction of zeros from 0.1 to 0.7. For noise addition, we randomly increased expression values by 30–50% or decreased them by 20–40%, with probabilities ranging from 0.1 to 0.7.
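The two corruption schemes can be sketched as follows. This Python illustration makes several assumptions that are not stated above: "low expression" is approximated by a fixed count threshold (`low_threshold`), and increases and decreases are chosen with equal probability.

```python
import random


def add_dropout(counts, prob, low_threshold=2, seed=0):
    """Set low counts (<= low_threshold, an assumed cutoff) to zero with probability prob."""
    rng = random.Random(seed)
    return [0 if (c <= low_threshold and rng.random() < prob) else c
            for c in counts]


def add_noise(counts, prob, seed=0):
    """With probability prob, raise a value by 30-50% or lower it by 20-40%."""
    rng = random.Random(seed)
    noisy = []
    for c in counts:
        if rng.random() < prob:
            if rng.random() < 0.5:  # assumed equal chance of up or down
                c = c * (1.0 + rng.uniform(0.30, 0.50))
            else:
                c = c * (1.0 - rng.uniform(0.20, 0.40))
        noisy.append(c)
    return noisy
```

Sweeping `prob` from 0.1 to 0.7 in either function reproduces the range of corruption levels described above.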
AUC values decrease as dropout rates and noise levels increase (
Figure S5). Therefore, conducting quality control and removing low-quality cells are essential steps before analysis. Further improvements in performance can be achieved by screening and eliminating technical noise in scRNA-seq data [
18].
2.3. Biological Experiments
To examine performance on a real dataset, we first applied GSNCASCR to an scRNA-seq dataset from human peripheral blood mononuclear cells (PBMC) of seven hospitalized patients with SARS-CoV-2 and six healthy donors to identify biological pathways differentially regulated in COVID-19 patients [
19]. Gene sets were taken from the Hallmark pathway sets of the molecular signature database (MSigDB,
https://www.gsea-msigdb.org/gsea/index.jsp (accessed on 16 May 2023)), which contains a total of 50 pathways. We also used gene sets from the Gene Ontology (GO) biological process pathways in MSigDB. Pathways with <40 or >1000 genes were discarded, and the resulting dataset comprised 7000 genes and 1026 pathways for analysis [
2].
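The size filter can be sketched as below. This is a hypothetical Python helper, not part of the package; counting only genes present in the expression matrix before applying the 40 to 1000 gene limits is an assumption.

```python
def filter_gene_sets(gene_sets, expressed_genes, min_size=40, max_size=1000):
    """Keep gene sets whose measured genes number between min_size and max_size."""
    expressed = set(expressed_genes)
    kept = {}
    for name, genes in gene_sets.items():
        # Assumption: size is counted after restricting to measured genes.
        overlap = [g for g in genes if g in expressed]
        if min_size <= len(overlap) <= max_size:
            kept[name] = overlap
    return kept
```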
Approximately 80% of gene expressions follow a normal or log-normal distribution [
20]. Because different genes may follow different distributions, assuming a normal distribution is often the most practical choice: it fits most genes well and is required by CS-CORE. We used the same real dataset as the CS-CORE paper [
6]. Consequently, the dataset met the requirements for CS-CORE’s measurement model.
The top 20 pathways identified in CD4
+ T cells are provided in
Table 2. Pathways found by GSNCASCR were mainly immune related, including HALLMARK_INTERFERON_GAMMA_RESPONSE, HALLMARK_INTERFERON_ALPHA_RESPONSE, HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_COMPLEMENT, HALLMARK_IL2_STAT5_SIGNALING, and HALLMARK_IL6_JAK_STAT3_SIGNALING. The GO term collection contained many more pathways, and again most of the pathways were immune related, with top pathways of GOBP_DEFENSE_RESPONSE_TO_SYMBIONT, GOBP_CYTOPLASMIC_TRANSLATION, GOBP_REGULATION_OF_VIRAL_GENOME_REPLICATION, GOBP_POSITIVE_REGULATION_OF_IMMUNE_SYSTEM_PROCESS, GOBP_RESPONSE_TO_VIRUS, GOBP_AMIDE_BIOSYNTHETIC_PROCESS, GOBP_PEPTIDE_BIOSYNTHETIC_PROCESS, GOBP_PROTEIN_ACETYLATION, GOBP_NEGATIVE_REGULATION_OF_VIRAL_PROCESS, and GOBP_ANTIGEN_RECEPTOR_MEDIATED_SIGNALING_PATHWAY.
Co-expression networks for control, COVID-19, and differential conditions are shown in
Figure 4 (a PDF version of the figure is available in
Supplementary File S2). Additionally, network visualizations in ggraph format for control, COVID-19, and differential networks are available in
Supplementary File S3 for further examination.
In both B cells (
Table 3 and
Table S1) and CD4
+ T cells, HALLMARK_INTERFERON_GAMMA_RESPONSE, HALLMARK_INTERFERON_ALPHA_RESPONSE, and HALLMARK_TNFA_SIGNALING_VIA_NFKB were all among the top identified pathways. This was consistent with our understanding of COVID-19. Cytokines, such as interleukin-6 (IL-6), interleukin-1 (IL-1), interleukin-17 (IL-17), and tumor necrosis factor-alpha (TNF-α) play a significant role in lung damage in acute respiratory distress syndrome patients through impairment of respiratory epithelium. Cytokine storm is defined as acute overproduction and uncontrolled release of proinflammatory markers, locally and systemically [
21].
Multiple studies have highlighted dysregulation of complex networks of peripheral blood immune responses in COVID-19, using scRNA-seq analysis [
20,
22,
23]. Monocytes, dendritic cells, natural killer (NK) cells, T cells, and B cells are all reported to be related to disease severity, while a dysregulated interferon (IFN) response, which has a key role in the innate immune response, is associated with disease pathogenesis and severity. Rare loss-of-function mutations in IFNAR2 are associated with severe COVID-19 and many other viral infections. Administration of IFN might reduce the likelihood of critical illness in COVID-19, but it could not be determined whether such a treatment would be effective during disease progression. Several of these loci corresponded to previously documented associations with lung or autoimmune and inflammatory diseases [
24].
High levels of proinflammatory cytokines such as TNF-α and interleukins are produced by innate immune cells to fight SARS-CoV-2 infections. Cytokine-mediated inflammatory events are also linked to detrimental lung injury and respiratory failure, which can result in patients’ deaths. TNF-α is among the early cytokines produced to mediate proinflammatory responses and enhance immune cell infiltration in response to SARS-CoV-2 infections.
We then examined differentially co-expressed pathways in CD8
+ T cells, and the results are shown in
Table 4.
Surprisingly, in CD8
+ T cells, top pathways were not immune-related, although COVID-19 causes several immune-related complications, such as lymphocytopenia and cytokine storm. Our results are consistent with a study showing that SARS-CoV-2 infects human CD4
+ T helper cells, but not CD8
+ T cells, and is present in blood and bronchoalveolar lavage CD4
+ T helper cells of severe COVID-19 patients. Also, previous studies showed that the SARS-CoV-2 spike glycoprotein directly binds to the CD4 molecule, which in turn mediates the entry of SARS-CoV-2 into CD4
+ T helper cells, leading to impaired CD4
+ T cell functions and cell death. SARS-CoV-2-infected CD4
+ T helper cells express elevated IL-10, which is associated with viral persistence and disease severity. Thus, CD4-mediated SARS-CoV-2 infection of CD4
+ T helper cells may contribute to a poor immune response in COVID-19 patients [
21]. Similarly, with GO biological process terms, the top terms were related to telomeres, DNA replication, and protein synthesis and localization (
Table S2). In contrast, in CD4
+ T cells, most of the top terms were immune related (
Table S3), such as GOBP_DEFENSE_RESPONSE_TO_SYMBIONT, GOBP_CYTOPLASMIC_TRANSLATION, GOBP_REGULATION_OF_VIRAL_GENOME_REPLICATION, GOBP_POSITIVE_REGULATION_OF_IMMUNE_SYSTEM_PROCESS, GOBP_RESPONSE_TO_VIRUS, GOBP_AMIDE_BIOSYNTHETIC_PROCESS, GOBP_PEPTIDE_BIOSYNTHETIC_PROCESS, and GOBP_PROTEIN_ACETYLATION.
Our results also revealed the importance of identifying cell-type-specific co-expression, which is more enriched for biologically relevant pathways [
2], as most gene–gene correlations were driven by the cell-type specificity of gene expression. For example, two genes specifically expressed in one cell type were highly correlated when we analyzed all cell populations together.
We examined the importance of HALLMARK_INTERFERON_GAMMA_RESPONSE in COVID-19 infection. In network analysis of B cells, interferon-induced antiviral factor (
IFITM3) was the hub gene (
Table 5).
IFITM3 inhibits SARS-CoV-2 infection by preventing SARS-CoV-2 spike-protein-mediated virus entry and cell-to-cell fusion. Analysis of a Chinese COVID-19 patient cohort demonstrated that the rs12252 C genotype of
IFITM3 is associated with the SARS-CoV-2 infection risk in the studied cohort. These data suggest that individuals carrying the rs12252 C allele in the
IFITM3 gene may be vulnerable to SARS-CoV-2 infection and benefit from early medical intervention [
25].
The
IFITM3 rs6598045 G allele was significantly more common in deceased COVID-19 patients than in those who recovered. Highest mortality rates were observed in the Delta variant and with the lowest qPCR Ct values. COVID-19 mortality was associated with the
IFITM3 rs6598045 GG and AG in the Delta variant and the
IFITM3 rs6598045 AG in the Alpha variant. A statistically significant difference was observed in the qPCR Ct values between individuals with GG and AG genotypes and those with an AA genotype [
26].
IFITM proteins are directly involved in adaptive immunity, and they regulate CD4+ T helper cell differentiation [
27].
IFITM3 also directly engages and shuttles incoming virus particles to lysosomes [
28].
IFITM3 was also a hub gene in the differential network of CD4
+ T cells, ranking 12th out of 118 genes (
Table 6). The top-ranked hub gene was
BST2, which has been associated with COVID-19. SARS-CoV-2 levels were decreased in cells with deleted transmembrane
BST2 domains compared to the initial Vero cell line. Similar results were obtained for SARS-CoV-2 and avian influenza virus [
29]. Another study found that BST-2 restricts SARS-CoV-2 virion egress by tethering virions to the plasma membrane and also identified several SARS-CoV-2 proteins that are putative modulators of
BST2 function [
30].
BST2 is an antiviral protein that inhibits the release and spread of many viruses and is upregulated as part of the innate immune defense against infections [
31].
BST2 can respond to infection by inducing proinflammatory responses via activation of the NF-κB signaling pathway [
32].
Successful identification of hub genes illustrates the capability of GSNCASCR to prioritize disease-related genes for understanding disease pathophysiology and potential therapies.
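As an illustration of hub-gene prioritization in a differential network, the sketch below ranks genes by how much their co-expression with all other genes in the pathway changes between two conditions. This is a generic weighted-degree heuristic written in Python and is an assumption for illustration only; the hub measure actually used by GSNCASCR is the one defined in Materials and Methods.

```python
def rank_hub_genes(genes, corr_a, corr_b):
    """Rank genes by summed absolute change in correlation between conditions.

    corr_a and corr_b are square correlation matrices (lists of lists)
    whose rows and columns follow the order of `genes`.
    """
    n = len(genes)
    strengths = []
    for i in range(n):
        # Weighted degree of gene i in the differential network.
        s = sum(abs(corr_a[i][j] - corr_b[i][j]) for j in range(n) if j != i)
        strengths.append((s, genes[i]))
    # Largest change first; ties keep the input order.
    strengths.sort(key=lambda item: -item[0])
    return [gene for _, gene in strengths]
```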
DADA2 (deficiency of adenosine deaminase 2) is a vasculitis disease caused by autosomal-recessive loss-of-function mutations in the
ADA2 gene [
33]. The spectrum of disease manifestations includes vasculitis, vasculopathy, and inflammation. ADA2 protein is primarily secreted by stimulated monocytes and macrophages, and aberrant monocyte differentiation to macrophages is important in the pathogenesis of DADA2. We also applied GSNCASCR to an scRNA-seq dataset comprising monocytes, CD4
+, and CD8
+ T lymphocytes of DADA2 patients and the results are shown in
Table 7.
As expected, gene sets identified by GSNCASCR in monocytes of DADA2 patients were highly related to immune response, including IFN-γ and IFN-α responses, TNF-α signaling via NFκB, and other pathways, indicating activation of monocytes and general inflammation in DADA2. Our previous research also revealed that T lymphocytes were activated and potentially contributed to exaggerated inflammation via ligand–receptor interactions with monocytes [
34]. Consistently, upregulation of genes in the immune pathways such as
IFN-γ and
IFN-α, IL6 JAK STAT3 signaling, IL2 STAT5 signaling, and TNF-α signaling via NFκB were seen in CD4
+ T cells of DADA2 patients, as identified by GSNCASCR [
33]. GSNCASCR also showed that CD8
+ T cells in DADA2 upregulated stress pathways, including unfolded protein response, UV response, and inflammation (TNF-α signaling via NFκB and PI3K AKT MTOR signaling), suggesting T cell activation, cytotoxicity, and contribution to inflammation in the disease [
34,
35].
The results from GSNCA and GSCA applied to DADA2 and COVID-19 datasets are presented in
Supplementary File S4. While most findings aligned with those from GSNCASCR, some discoveries were not clearly identified by these two tools. For instance, GSCA and GSNCA also identified immune response pathways as differentially co-expressed in monocytes in DADA2, but GSNCA failed to do so in CD4+ and CD8+ T cells, and GSCA failed to do so in CD8+ T cells. We recommend using multiple software tools on real datasets to thoroughly assess both consistent and inconsistent results for biological interpretation.