Structural Variability, Expression Profile, and Pharmacogenetic Properties of TMPRSS2 Gene as a Potential Target for COVID-19 Therapy

The human serine protease serine 2 TMPRSS2 is involved in the priming of proteins of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and represents a possible target for COVID-19 therapy. The TMPRSS2 gene may be co-expressed with SARS-CoV-2 cell receptor genes angiotensin-converting enzyme 2 (ACE2) and Basigin (BSG), but only TMPRSS2 demonstrates tissue-specific expression in alveolar cells according to single-cell RNA sequencing data. Our analysis of the structural variability of the TMPRSS2 gene based on genome-wide data from 76 human populations demonstrates that a functionally significant missense mutation in exon 6/7 in the TMPRSS2 gene is found in many human populations at relatively high frequencies, with region-specific distribution patterns. The frequency of the missense mutation encoded by rs12329760, which has previously been found to be associated with prostate cancer, ranged between 10% and 63% and was significantly higher in populations of Asian origin compared with European populations. In addition to single-nucleotide polymorphisms, two copy number variants were detected in the TMPRSS2 gene. A number of microRNAs have been predicted to regulate TMPRSS2 and BSG expression levels, but none of them is enriched in lung or respiratory tract cells. Several well-studied drugs can downregulate the expression of TMPRSS2 in human cells, including acetaminophen (paracetamol) and curcumin. Thus, the interactions of TMPRSS2 with SARS-CoV-2, together with its structural variability, gene–gene interactions, expression regulation profiles, and pharmacogenomic properties, characterize this gene as a potential target for COVID-19 therapy.


Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a pandemic of coronavirus disease  that has led to a global public health crisis.
Infection of human cells with viral particles occurs through the binding of viral spike proteins to the receptors of the host cell and their subsequent priming with proteases. Angiotensin-converting enzyme 2 (ACE2) is considered to be the classic receptor for SARS-CoV-2, but there is evidence that the virus can also use the Basigin (BSG) receptor (CD147) [1]. The priming of viral proteins is carried out by the TMPRSS2. No specific therapy has yet been developed for SARS-CoV-2. However, blockers of all three proteins have been shown to prevent cell infection [2].
In addition to protein blocking, there are other mechanisms that can alter the expression levels of specific proteins or the affinity of their interactions with viral particles. Possible causes of expression differentiation include alterations of protein structure due to genetic variants: single nucleotide variants (SNVs), indels, copy number variations (CNVs), variants affecting regulatory regions (expression quantitative trait loci; eQTLs), and epigenetic regulation: methylation, microRNAs (miRNAs).
The TMPRSS2 gene in humans encodes a transmembrane protein of the serine protease family. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. The TMPRSS2 protein is involved in prostate carcinogenesis via overexpression of Erythroblast Transformation Specific transcription factors (ETS), such as ERG and ETV1, through gene fusion. TMPRSS2-ERG gene fusion, which is present in 40-80% of prostate cancers in humans, is a molecular subtype that has been associated with predominantly poor prognosis [3,4].
The TMPRSS2 protease proteolytically cleaves and activates glycoproteins of many viruses, including spike proteins of human coronaviruses 229E (HCoV-229E) and EMC (HCoV-EMC), and the fusion glycoproteins of Sendai virus, human metapneumovirus, and human parainfluenza 1, 2, 3, 4a, and 4b viruses [2,[5][6][7][8][9][10]. Both the coronavirus responsible for 2003 SARS outbreak in Asia (SARS-CoV) and SARS-CoV-2 are activated by TMPRSS2 and can thus be inhibited by TMPRSS2 inhibitors [2]. Here, we report the genetic variability of the TMPRSS2 gene in 76 human populations of North Eurasia in comparison with worldwide populations, and analyze the data with respect to the expression and regulation of TMPRSS2, its interactions with SARS-CoV-2 receptors, and its pharmacogenetic properties.

Structural Variability Data
Allele frequency for worldwide populations were downloaded from the GnomAD database, which contains information on the frequencies of genomic variants from more than 120,000 exomes and 15,000 whole genomes [11]. These data were used to search for SNVs and indels in the TMPRSS2 gene. CNV data were obtained from the CNV Control Database [12].
Data on allele frequencies in 76 populations of North Eurasia were extracted from our own unpublished dataset of population genomics data obtained by genotyping using Illumina Infinium genome-wide microarrays. In brief, 1836 samples from 76 human populations were genotyped for 1,748,250 SNVs and indels using a Infinium Multi-Ethnic Global-8 kit. The populations represent various geographic regions of North Eurasia (Eastern Europe, Caucasus, Central Asia, Siberia, North-East Asia) and belong to various linguistic families (Indo-European, Altaic, Uralic, North Caucasian, Chukotko-Kamchatkan, Sino-Tibetan, Yeniseian). DNA samples were collected with informed consent and deposited in the DNA bank of the Research Institute for Medical Genetics, Tomsk National Medical Research Center, Tomsk, Russia, and the DNA bank of the Institute of Biochemistry and Genetics, Ufa Federal Research Centre of the Russian Academy of Sciences. The study was approved by the Ethical Committee of the Research Institute for Medical Genetics, Tomsk National Medical Research Center. Data on four missense mutations in the TMPRSS2 gene were extracted from the dataset. The CNV search was performed using a Markov model algorithm for high-resolution CNV detection in whole-genome implemented with the PennCNV tool [13]. To determine the possible functional impact of the detected SNVs, the Polymorphism Phenotyping v2 (PolyPhen-2) tool was used [14]. PolyPhen estimates the impact of a mutation on the stability and function of the protein using structural and evolutionary analyses with amino acid substitution. The tool evaluates the mutation as probably damaging, possibly damaging, benign, or of unknown significance, using quantitative prediction with a score.

Bioinformatics Analysis of Gene Expression, Mirna Interactions, and Pharmacogenomics
Protein-protein interaction analysis of SARS-CoV-2-interacting proteins was carried out using the GeneMANIA and STRING web resources [15,16]. Single-cell RNA sequencing (RNA-seq) data were downloaded from the PanglaoDB database, which contains more than 1300 single-cell sequencing samples [17]. Lung single-cell RNA-seq data were obtained from the Sequence Read Archive [18] and processed in the R software environment using the Seurat package [19].
Analysis of the interactions of miRNAs with target proteins was performed using information from two databases: miRTarBase, which contains information from more than 8000 referenced sources about experimentally confirmed miRNA-protein interactions [20]; and miRPathDB, which contains experimentally confirmed and predicted miRNA-protein interactions [21]. Data on the differential expression of miRNAs in various cell cultures were downloaded from the database of the FANTOM5 project [22]. The DrugBank database [23] was searched for drugs that could change the expression levels of proteins.

Protein-Protein Interaction Networks of Sars-Cov-2-Interacting Genes
The protein-protein interaction networks obtained with two different tools GeneMA-NIA and STRING (Figures 1 and 2) demonstrated that TMPRSS2 was co-expressed with other SARS-CoV-2-interacting genes, despite showing contradictory co-expression patterns. According to GeneMANIA, TMPRSS2 was co-expressed with BSG, whereas STRING indicated co-expression of ACE2 and TMPRSS2. Interestingly, BSG showed the maximum number of protein-protein interactions in both networks. BSG, or extracellular matrix metalloproteinase inducer, also known as cluster of differentiation 147 (CD147), encoded by the BSG gene, is a transmembrane glycoprotein of the immunoglobulin superfamily and a determinant of the Ok blood group system.
The BSG protein has an important role in targeting monocarboxylates (MCT) transporters SLC16A1, SLC16A3, SLC16A8, SLC16A11, and SLC16A12 to the plasma membrane by interaction with MCT molecules via its transmembrane and cytoplasmic domains. BSG is involved in spermatogenesis, embryo implantation, neural network formation, and tumor progression. It stimulates adjacent fibroblasts to produce matrix metalloproteinases. BSG seems to be a receptor for oligomannosidic glycans and, according to in vitro experiments, can promote outgrowth of astrocytic processes [24][25][26]. BSG is also involved in tumor development, plasmodium invasion, and viral infection [27][28][29][30][31].
Previous data on SARS indicate that BSG has a functional role in facilitating SARS-CoV invasion of host cells, and CD147-antagonistic peptide-9 has a high binding rate to HEK293 cells and an inhibitory effect on SARS-CoV [32]. Based on the similarity of SARS-CoV and SARS-CoV-2, the function of BSG in invasion of host cells by SARS-CoV-2 can be assumed. The exact role of BSG in COVID-19 is still unknown; however, it was recently shown that CD147 may bind to the spike protein of SARS-CoV-2 [1]. Preliminary data from a small sample of COVID-19 patients demonstrated that meplazumab, a humanized anti-CD147 antibody, efficiently improved the recovery of patients with SARS-CoV-2 pneumonia and showed a favorable safety profile [33].

Expression of ACE2, BSG, and TMPRSS2 in Single Cells
Expression profiles of SARS-CoV-2-interacting genes in various tissues demonstrated that ACE2 had high level of expression only in testicles in peritubular myoid cells ( Figure 3). The highest expression levels of BSG were found in germ cells, endothelia of various localizations, fibroblasts, and some other cell types ( Figure 4). TMPRSS2 showed high levels of expression in the prostate, intestines, and lungs ( Figure 5).   In addition, the expression levels of these proteins were analyzed in a single sample (SRS2769051) of proximal stromal lung cells (Figures 6-9). ACE2 had low levels of expression in pulmonary alveolar cells and fibroblasts. BSG was characterized by average levels of expression in fibroblasts and alveolar cells. Only the TMPRSS2 gene demonstrated tissue-specific expression in alveolar cells. Given the high specificity of the expression of TMPRSS2 in lung tissue, we further studied genomic and epigenomic properties that may affect its expression levels and the affinity of its interactions with viral particles.

Snv and Indel Variants of the Tmprss2 Gene
According to information in the GnomAD database, 1025 SNVs and indels of various frequencies, functional impact, and localization have been described in the TMPRSS2 gene. This list includes 332 missense variants, 17 frameshifts, 64 splice site variants, 14 stop codon mutations, and three in-frame indels. Among frequent variants (minor allele frequency >0.01), there were only 13 intronic polymorphisms, five synonymous variants, and two missense mutations (rs12329760 and rs75603675). Both missense variants had high frequencies (24.8% and 35.0% in GnomAD, respectively) ( Table 1). The variant rs12329760 is a mutation of C to T in the 589 position of the gene, which leads to a change from valine to methionine at amino acid position 197 (exon 7) of transmembrane protease serine 2 isoform 1, or at position 160 (exon 6) of isoform 2. This mutation is predicted by Poly-Phen-2 to be probably damaging, with a score of 0.997 (sensitivity: 0.41; specificity: 0.98). The T allele of the TMPRSS2 rs12329760 variant was positively associated with TMPRSS2-ERG fusion by translocation; it was also associated with an increased risk of prostate cancer in European and Indian populations [34,35].
The rs75603675 variant (C to A transition in position 23, Gly8Val) was not reported to be associated with prostate cancer or any other clinical condition. An interesting feature of both frequent missense variants was the difference in prevalence between European and Asian populations; rs12329760 was 15% more frequent in populations of East Asia (38%) than in European populations (23%). For rs75603675, the difference was even more significant: the minor allele reached 42% in European populations and about 1% in populations of East Asia. Regarding CNVs, the controlDB database contained only one deletion in the TMPRSS2 gene (one copy variant) with relatively low frequency (1.2%) ( Table 2). The structural variability of the TMPRSS2 gene in relation to COVID-19 has recently been investigated by many research groups. In particular, Paniri et al. used various bioinformatics approaches to predict the functional consequences of TMPRSS2 SNPs with respect to susceptibility to SARS-CoV-2, the functional effects of SNPs on splicing, and the influence of polymorphisms on miRNA function. According to Paniri et al., rs12329760 showed the highest scores in various analyses and was considered deleterious by three tools, indicating its negative functional impact. On the contrary, rs75603675 (G8D) was considered deleterious only by polyphen-2, whereas other tools such as Phyre2, GOR IV, and PSIPRED predicted that both variants would have functional effects on the secondary structure of the TMPRSS2 protein [36].

Frequency of Protein-Changing Allelic Variants of the Tmprss2 Gene in Populations of North Eurasia
In order to study the population differentiation in TMPRSS2 functional variants in more detail, we searched for TMPRSS2 allele frequencies in our own unpublished data from 76 populations of North Eurasia, based on 1836 samples genotyped using genome-wide microarrays. Four missense mutations and two CNVs in the TMPRSS2 gene were found in our dataset. We compared the frequency of TMPRSS2 missense mutations in the North Eurasian population (Table 3) with the worldwide data (Table 4). Three missense mutations (rs148125094, rs143597099, and rs201093031) were very rare variants, whereas rs12329760, which was previously shown to be associated with prostate cancer, was found with high frequency in all populations. Data on the second high-frequency missense variant in the TMPRSS2 gene according to the GnomAD database (rs75603675) were not available because of the absence of this SNP from the microarray used in our study. The minor allele of variant rs148125094 was found on only two chromosomes (total frequency 0.00054) in single heterozygous individuals from the Karelian and Abkhaz populations. The variant rs143597099 was present only in one heterozygote from the Veps population. The variant rs201093031 was found in North-East Asian Nivkh and Udege populations with a frequency of 7%, and in a single Tuvan individual from Siberia. The frequency of the probably damaging minor allele of the rs12329760 polymorphism ranged from 10% (in the Khvarshi population from Dagestan) to 63% (in Sagays Khakas). In general, the minor allele T had higher frequencies in Siberia and Central Asia (both around 35%), whereas the lowest frequencies of the damaging variant were found in North Caucasus (19%), Dagestan (22%), and Eastern Europe (29%). This distribution is consistent with the worldwide data, which demonstrates a much higher frequency of the minor allele in Asian populations (36-41%) than in Europeans (22-24%) ( Table 4).  In addition, we detected CNVs in two samples. In the first case, an increase in the number of copies covering the entire gene was found in a Karanogai individual. The second CNV, affecting exons 3-7, was found in a single Kumyk individual.
Thus, potentially functionally significant variants in the TMPRSS2 gene were found in many human populations with relatively high frequencies, demonstrating region-specific distribution patterns. Both variants-the probably damaging SNV and heterozygous deletion of the gene-may significantly affect the interactions of this human serine protease with viral spike proteins, thereby changing the efficacy of the priming of viral proteins by TMPRSS2. However, the roles of the TMPRSS2 gene and its variants in interactions with SARS-CoV-2 and in viral infectivity still need to be elucidated.

Eqtls
According to the GTEx Analysis V8 database, the TMPRSS2 gene contains 136 eQTLs (including 60 downregulating and 76 upregulating variants) that significantly alter its expression in lung tissues (Table 5). However, in general, these eQTLs have only minor effects on gene expression. The average slope of the regression line (the value that characterizes the strength of an eQTL's effect) was around 0.09 both for down-and upregulating variants. The strongest single variant could change the gene's expression by 13%.

Mirnas
According to the miRTarBase and miRPathDB databases, no experimentally proven miRNAs regulating TMPRSS2 have been detected. It is worth noting that the TMPRSS2 and BSG genes have the same predicted regulatory miRNAs.
The top 30 miRNAs predicted to regulate TMPRSS2 and BSG were analyzed for enrichment in various cell types using the FANTOM5 database. None of the top miRNAs was enriched in lung or respiratory tract cells, but three miRNAs showed slight expression in immune and endothelial cells (Table 6). According to the DrugBank database, nine drugs can reduce the level of expression of TMPRSS2. For five of them (acetaminophen/paracetamol, curcumin, cyclosporine, and ethinylestradiol), this effect has been clinically proved (Table 7). Information on the direction of the effect of estradiol is conflicting; in different experiments it has been found to either downregulate or upregulate TMPRSS2 expression. Two drugs from the list above (acetaminophen/paracetamol and curcumin) have also been considered as possible therapies for COVID-19 [37]. Acetaminophen is currently being discussed as a possible drug for the correction of fever in patients with COVID-19. The ability of this drug to reduce the level of expression of TMPRSS2 may be an additional argument in favor of its use, compared with non-steroidal anti-inflammatory drugs. Curcumin, a widely used food supplement, has the predicted ability to block the main protease (Mpro) of SARS-CoV-2 [38] and may be studied further in relation to COVID-19 therapy. However, only pentanal, which enhances the expression of ACE2, is described in DrugBank as a drug that can change the expression level of ACE2. According to this database, expression of BSG1 is affected by eight drugs, five of which can reduce the level of the protein. One of them, valproic acid, can also reduce the expression of TMPRSS2 (Table 8).

Conclusions
The TMPRSS2 protein plays a crucial part in the process of SARS-CoV-2 activation in human cells. The gene encoding this protease demonstrates a high level of genetic variability, as well as having many variants that may regulate its expression levels. Although very few of the potentially functionally significant variants of this gene are of relatively high frequency, population-specific patterns of TMPRSS2 variability may contribute to some extent to the different viral infectivity of SARS-CoV-2 in populations of various geographic origins.
TMPRSS2 is probably co-expressed with SARS-CoV-2 receptors (ACE2 and BSG), but only the TMPRSS2 protease demonstrates tissue-specific expression in alveolar cells, the target cell type of SARS-CoV-2. Thus, TMPRSS2 is potentially the most promising target for COVID-19 therapy, based on its specific expression in lung, its important role in the process of cell infection, and its interactions with other proteins involved in the infection process. Several well-studied drugs can downregulate the expression of TMPRSS2 in human cells, including acetaminophen and curcumin. Both deserve close attention as possible anti-COVID-19 drugs, owing to their confirmed effects on TMPRSS2 expression, as well the long history of their use, their known side effects, and their wide availability.  Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
The authors declare no conflict of interest.