Bioinformatics and Connectivity Map Analysis Suggest Viral Infection as a Critical Causative Factor of Hashimoto’s Thyroiditis

Hashimoto’s thyroiditis (HT) is a common autoimmune disease, and its prevalence is rapidly increasing. Both genetic and environmental risk factors contribute to the development of HT. Recently, viral infection has been suggested to act as a trigger of HT by eliciting the host immune response and subsequent autoreactivity. We analyzed the features of HT through bioinformatics analysis so as to identify the markers of HT development. We accessed public microarray data of HT patients from the Gene Expression Omnibus (GEO) and obtained differentially expressed genes (DEGs) under HT. Gene Ontology (GO) and KEGG-pathway-enrichment analyses were performed for functional clustering of our protein–protein interaction (PPI) network. Utilizing ranked gene lists, we performed a Gene Set Enrichment Analysis (GSEA) by using the clusterprofiler R package. By comparing the expression signatures of the huge perturbation database with the queried rank-ordered gene list, a connectivity map (CMap) analysis was performed to screen potential therapeutic targets and agents. The gene expression profile of the HT group was in line with the general characteristics of HT. Biological processes related to the immune response and viral infection pathways were obtained for the upregulated DEGs. The GSEA results revealed activation of autoimmune-disease-related pathways and several viral-infection pathways. Autoimmune-disease and viral-infection pathways were highly interconnected by common genes, while the HLA genes, which are shared by both, were significantly upregulated. The CMap analysis suggested that perturbagens, including SRRM1, NLK, and CCDC92, have the potential to reverse the HT expression profile. Several lines of evidence suggested that viral infection and the host immune response are activated during HT. Viral infection is suspected to act as a key trigger of HT by causing autoimmunity. SRRM1, an alternative splicing factor which responds to viral activity, might serve as potential marker of HT.


Introduction
Hashimoto's thyroiditis (HT) is one of the most prevalent autoimmune diseases (AIDs) [1] and a common cause of hypothyroidism in developed countries [2]. The pathologic features of HT include extensive inflammation in thyroid tissue caused by infiltrating CD4+ T lymphocytes [3] and the presence of autoantibodies, causing follicular cell damage and ultimately leading to disrupted thyroid function [4]. The clinical phenotype of HT can vary from asymptomatic to severe symptoms of hypothyroidism, including weight gain, menstrual disorders, and heat intolerance [5]. Although HT is an organ-specific disease, the disease might be related to other AID or systemic autoimmune disorders in many cases [6]. The proposed risk factors for HT include genetic susceptibility, sex (female), and senescence, as well as environmental triggers such as iodine uptake, drugs, chemicals, and pathways in the HT group, visualizing these through various R packages. Finally, we identified promising targets and compounds for HT via a CMap analysis.

Validation of GEO Data
Processed data from the GSE138198 human thyroid tissue microarray dataset was validated by checking the distribution in a boxplot and the heatmap clustering of samples. The boxplot revealed median-centered values, indicating that the data are well-normalized for all samples ( Figure 1A). A UMAP plot located each sample in a reduced dimension, using Euclidean distance, concluding that samples within groups are in close proximity (HT vs. TN), and, on the contrary, at long distance between the two groups ( Figure 1B). Likewise, the hierarchical correlation heatmap indicated that samples are clustered by groups ( Figure 1C).

Identification of DEGs Extracted from HT
The volcano plot displayed the distribution of DEGs by their fold change and p-value, as was distinguishable by color ( Figure 1D). After the validation of gene identifiers, a total of 838 upregulated DEGs and 595 downregulated DEGs were confirmed between the HT vs. TN groups. Lists of the 15 most significant up-or downregulated genes are presented in Table 1. KIF5B and PTH were confirmed as the most significantly up-and downregulated DEGs, respectively, in the HT group as compared to the TN group.

Functional Clustering Analysis of DEGs Reveals the Pathological Characteristics of HT
The full PPI network of up-and downregulated DEGs is presented in Supplementary  Figures S1 and S2. A further analysis of functional modules was conducted by using the full PPI network with regard to GO (BP) and KEGG terms (Figures 2 and 3). PPI modules were in line with the general characteristics of HT. Upregulated DEGs were grouped in three notable clusters whose gene lists overrepresented the immune response (viral infection), immune system regulation (autoimmune disease), and RNA translation (ribosome), with scores of 13.467, 8.214, and 6.857, respectively (Figure 2A-C). Meanwhile, downregulated DEGs were grouped into three remarkable clusters enriched in muscle contraction (cardiomyopathy), translation (ribosome), and oxidative phosphorylation (cellular respiration and its related diseases), with scores of 18.842, 11.091, and 7.60, respectively ( Figure 3A-C).   Results of BP and KEGG-enrichment analyses presented as dot plots. Each cluster represents a set of highly connected genes. (A-C) represent Cluster 1-3, respectively, which had the highest clustering scores.

BP GSEA Results Revealed a Strong Connection between HT and Immune Regulation
The BP GSEA results were presented in a dot plot, which clearly showed the significantly activated or inactivated biological processes in HT ( Figure 4A). As expected, immune-response-related BPs were upregulated in HT patients. In contrast, muscular processes were distinctively downregulated in HT patients. Subsequently, the Emapplot showed close interactions between the most significant BP terms based on overlapping genes ( Figure 4B). The BP term 'cytokine production' had solid interactions with other terms, including 'positive regulation of immune system process', 'regulation of immune response', 'hemopoiesis', and 'cellular response to cytokine'. These five essential BP terms and enriched genes were further plotted by using the Cnetplot function ( Figure 4C). Twentyseven genes, including CASP8, IL6, TNF, IL1B, and SOCS1, were common between these five BPs, which play key roles in cytokine and inflammatory signaling (Supplementary File S1). The Ridgeplot analysis illustrated that the overall distribution of component genes consists of each BP ( Figure 4D). Similarly, BP terms related to various immune responses and cytokine production exhibited increased activity. The term 'lymphocyte activation' showed predominant results in regard to both the p-value and NES, exhibiting enriched distributions of activated genes ( Figure 4E). The top BP GSEA results for the HT group are presented in Table 2. Supplementary Figure S3 presents the top five BPs or pathways and the distributions of associated genes.

GSEA Results Suggested a Link between Autoimmune Thyroiditis and Viral Infections
The association between HT and other disease pathways was investigated via the KEGG GSEA of HT transcriptional profiles from microarray data. The KEGG GSEA dot plot showed a predominance of viral-infection-related pathways (including human T-cell leukemia virus infection, human cytomegalovirus infection, Epstein-Barr virus infection, herpes simplex virus 1 infection, etc.) in the activated panel, with strong statistical significance ( Figure 5A). The Emapplot displayed a number of interconnected viral disease pathways, while the Epstein-Barr virus infection pathway was a hub in the network ( Figure 5B). Five representative KEGG pathways with their connected genes were plotted, suggesting that TNF and IL6 were common targets of all five pathways ( Figure 5C). The Ridgeplot showed a distribution of gene expression (activated) in several viral-infectious diseases' pathways, such as herpes simplex virus 1 infection ( Figure 5D). Upon a closer look, the comparison of GSEA plots between herpes simplex virus 1 infection and autoimmune thyroid disease revealed that both had a skewed distribution (activated) of genes with a high Normalized Enrichment Score (NES) ( Figure 5E). The top KEGG GSEA results of the HT group are presented in Table 3.

Common DEGs Involved in Two Intuitive Pathways Provide Insight into Disease Etiology
As our results repeatedly indicated a high correlation of HT with viral infectious disease, we mapped both pathways and colored DEGs (red to green) according to their relative expression levels, using the PathView R package ( Figure 6A,B). Common genes between the two pathways were plotted by using the Cnetplot function with modified arguments to investigate only pathways of interest. Of note, two seemingly independent pathways shared many upregulated MHC Class I and II genes (Figure 7). These 14 HLA genes consisted of 2 MHC Class I and 12 MHC Class II molecules. Due to the unillustrated intracellular signaling pathway in the autoimmune thyroid disease KEGG map, other immune-related intersecting genes were not presented as results ( Figure 6B).
(A) (B) Figure 6. KEGG pathway mapping of (A) herpes simplex virus 1 infection pathway and (B) autoimmune thyroid disease based on microarray data of the HT group. The plot was created by using PathView R package. The red-to-green color indicates the relative gene expression.

CMap Analysis Revealed Potential Markers and Drugs Based on the HT Gene Signature
The CLUE webtool deduced potential target genes and compounds for HT by comparing the gene signature against a gene-expression database based on perturbations. Arranged by their median CMAP score, 19 KD gene perturbations were assumed to restore the gene-expression profiles of the queried signature (HT), with CMap-score criteria below −90 ( Figure 8A). Six compound perturbations were expected to reverse the gene signature with the same CMap score as criteria ( Figure 8B). Among the 19 KD gene perturbagen lists, three genes (SRRM1, NLK, and CCDC92) actually exhibited a significantly increased expression (p < 0.05) in the HT group ( Figure 8C). The SRRM1 expression fold change, p-value, and transcriptional activity score (TAS) were most prominent among the three genes.

Discussion
A US population-based review study reported that AID prevalence is 5-7% and is gradually increasing [10,29]. It has been suggested that both genetic susceptibility and environmental factors act as triggers for the breakdown of tolerance and progress of the disease [6]. Genetic predisposition, usually associated with HLA Class II alleles and other immune mediators, is not sufficient on its own for the increasing prevalence of AIDs [30]; rather, environmental factors, including viral infections, are regarded as major contributors to the incidence of AID [21].
Accumulating evidence indicates such a connection between viral infections and the development of AIDs, with an example being the increased prevalence of AIDs following an influenza pandemic reported in a population-based observational study [16,31]. A recent review on SARS-CoV-2 suggested that this virus could also trigger autoimmune responses through molecular mimicry, working in similar ways with other viruses [32].
The most prevalent autoimmune diseases/conditions involved with COVID-19 infection are Guillain-Barre syndrome, immune thrombocytopenic purpura, Kawasaki disease and autoimmune thyroid disease [33]. Investigations on the involvement of SARS-CoV-2 infection in autoimmune thyroid disease development are currently underway [34]. A recent systematic review revealed general characteristics of COVID-19-induced subacute thyroiditis patients [35]. The authors found that subacute thyroiditis to be the most common clinical thyroidal syndrome associated with COVID-19, thus indicating the direct effect of viral infection on autoimmune disease [35]. In another systematic review, the authors pointed out that multifaceted effects of SARS-CoV-2 infection on thyroid functions are variable (thyrotoxicosis, hypothyroidism, and non-thyroidal illness syndrome) and difficult to predict [36]. In this situation, it is of great importance to identify biomarkers to elucidate the link between viral infections and the development of HT.
Our analysis of microarray data revealed characteristic features of HT. The functional cluster analysis of upregulated DEGs provided strong evidence of an association between viral infection and the pathophysiological traits of AID through the activation of innate immune responses (Figure 2) [37]. The GSEA results indicated immune activation in the HT group, with an upregulation of genes implicated in lymphocyte activation, regulatory immune responses, cytokine production, and other related pathways (Figures 4 and 5). Decisively, the GSEA of the herpes simplex virus 1 infection and autoimmune thyroid disease pathways showed similar plots and presented high-ranked significance (each 1st and 44th rank), with the KEGG analysis of DEGs demonstrating an intimate relationship between the two pathways in HT ( Figures 5E and 6A,B).
In our data, of all HLA Class I and II molecules investigated, the expression of 14 genes was significantly upregulated (Figure 7). Aberrant HLA II expression in nonantigen presenting cells, such as epithelial cells and cultured thyrocytes, has been described in a number of AIDs [38]. Similarly, overexpression of HLA Class I during the antiviral response was observed in tissue obtained via core needle biopsy from 46 HT patients [39]. Therefore, the increased expression of both HLA classes might be a general indicator of HT induced by the antiviral immune response.
Interestingly, ribosome-enriched clusters were noted among both up-or downregulated DEGs ( Figure 3). As cellular machinery that is hijacked by viruses, host ribosome factors, including ribosomal proteins (RPLs), play critical roles during viral infection [40]. While detailed functions of the various RPLs are under investigation, these are expected to impact viral replication and gene expression through interactions with viral proteins [40]. The result leads to the speculation of mixed reaction from host cellular defense mechanisms and viral activity. In line with our data, Wu et al. reported similar results from their analysis of blood samples from patients with systemic lupus erythematosus, another AID [41]. When they analyzed the PPI network consisting of DEGs, several RPL genes were implicated as key genes in a module, and the authors explained it as being the outcome of the host immune response to viral infection [41].
SRRM1 is an RNA-binding protein and splicing factor participating in the alternative RNA splicing process [42]. It was suggested that host SRRM proteins are modulated by HIV-1 to facilitate its replication and release [43]. Furthermore, the observed changes in alternative splicing could be either a direct consequence of viral manipulation, the innate immune response, or cellular damage [44]. Dysregulation of post-transcription processes during the antiviral immune response, which is modulated by Type I and III interferons, may lead to autoimmunity [45]. In our study, SRRM1 was suggested as a critical marker of HT, as indicated by the strictly negative CMap score (−99.08) and upregulated expression (log 2 FC = 1.424) in the HT group ( Figure 8C).
While the clinical phenotype of HT differs per case and is difficult to predict, it progresses slowly over months to years [46]. It is highly suggested that patients with autoimmune thyroid disorders should be monitored for the thyroid functions in consideration of the relationship with other systemic autoimmune diseases as well [6]. The predictive Thyroid Events Amsterdam (THEA) score takes into account the TSH, TPOAb levels, and familial background information to estimate the 5-year risk of overt hypothyroidism [47]. Based on our current findings, the evaluation of viral activity, as a key environmental factor in the context of HT, may improve prediction of further development of the disease. To this end, future studies should elucidate the relationship between SRRM1 (or the other alternative splicing factors) and the widely employed clinical HT biomarkers.

Data Resources and Processing
The human microarray dataset of GSE138198 [48] was accessed via the GEO database (https://www.ncbi.nlm.nih.gov/geo/) at the National Center of Biotechnology Information (NCBI) [24]. The GSE138198 dataset comprises a total of 36 human thyroid tissue microarray samples, including 13 HT tissue samples and 3 normal thyroid (TN) tissue samples. The microarray dataset is based on the GPL6244 (HuGene-1_0-st) Affymetrix Human Gene 1.0 ST Array (transcript (gene) version).

Screening of DEGs
GEO2R, an interactive web tool for the analysis of GEO datasets, was used to screen DEGs between HT and TN (http://www.ncbi.nlm.nih.gov/geo/geo2r/). GEO2R identified DEGs via GEO queries and the Limma R package from R/bioconductor [49,50]. Genes with an adjusted p-value < 0.05 and |log 2 (Fold Change)| > 1 were considered DEGs. Adjustment of the p-value was performed via the Benjamini and Hochberg method to reduce false discovery rate. DEGs were visualized with a color-differentiated (by significance and fold change) and labeled volcano plot, using ggplot R packages. A clustered correlation heatmap of all samples was created by using the pheatmap package.

Construction of PPI Network and Functional Subcluster Analysis
We used the Search Tool for the Retrieval of Interacting Genes (STRING, https:// string-db.org/) online database to construct a PPI of up-or downregulated DEGs [51]. PPI networks were further imported to Cytoscape software (version 3.91) [52] for a functional cluster analysis, which groups genes with similar functions, using the MCODE plugin to yield the top 3 clusters by score [53]. Sets of genes from these clusters were separately subjected to an overrepresentation analysis.

GO and KEGG Pathway Enrichment Analyses
GO and KEGG-pathway-enrichment analyses of DEGs were conducted by using the Database for Annotation, Visualization and Integrated Discovery (DAVID database, Accessed on 31th July) [54]. Lists of clustered genes were uploaded with the identifier set to 'official gene symbol'. The data of enriched terms and KEGG pathways were processed and further visualized as a bubble plot created by using the ggplot2 R package [55].

GSEA of GO and KEGG Pathway
GSEA of all ranked genes was performed by using the clusterprofiler R package and the genome-wide annotation package (OrgDb) of Bioconductor [56]. The package provides gseGO and gseKEGG functions for GSEA, using GO (biological process, molecular function, and cellular component) and KEGG annotations. The list of significant gene-set annotations was arranged in the order of normalized enriched score (NES).

Clinical HT Biomarker Identification Via Comparison of Gene Signatures in the CMap Database
We selected the top 150 up-and downregulated DEGs of HT groups and submitted these to the Connectivity Map for analysis by querying the list on the CLUE web tool (https://clue.io/query/, version 1. 1.1.43). The results present perturbagens, including KD (gene knock down), CP (compounds), OE (gene overexpression), and PCL (perturbagen lists), aligned by the calculated connectivity score (median tau score), which varies from 100 to −100.
A negative connectivity score indicates an inversed gene-expression signature between the query and perturbagen, thus implying potential as a promising target or drug [59]. Rank-ordered perturbagens with a CMap connectivity score (tau) < −90 were selected and regarded as significant candidates.

Conclusions
In conclusion, through the use of bioinformatics approaches for the analysis of microarray data, we provided evidence for the association between viral infection and immune activation in HT. We identified SRRM1, an mRNA splicing factor, as a key player in this association. Clinical validation of our current results remains to be performed. The history of viral infection should be seriously taken into account by clinicians during HT diagnosis. The host immune response to viral infection, especially with alternative mRNA splicing, may provide us helpful indicators of HT development and could be harnessed as diagnostic for therapeutic purposes. Further elucidating the relationship between viral infection and HT may improve therapeutic approaches against this disease.  Data Availability Statement: The datasets generated during and/or analyzed during the current study are available in the Gene Expression Omnibus (GEO) repository, https://www.ncbi.nlm.nih. gov/geo/query/acc.cgi?acc=gse138198.

Conflicts of Interest:
The authors have no conflict of interest to declare.