Protective Role of a TMPRSS2 Variant on Severe COVID-19 Outcome in Young Males and Elderly Women

The protease encoded by the TMPRSS2 gene facilitates viral infections and has been implicated in the pathogenesis of SARS-CoV-2. We analyzed the TMPRSS2 sequence and correlated the protein variants with the clinical features of a cohort of 1177 patients affected by COVID-19 in Italy. Nine relatively common variants (allele frequency > 0.01) and six missense variants which may affect the protease activity according to PolyPhen-2 in HumVar-trained mode were identified. Among them, p.V197M (p.Val197Met) (rs12329760) emerges as a common variant that has a deleterious effect on the protease and a protective effect on the patients. Its role appears particularly relevant in two subgroups of patients—young males and elderly women—and among those affected by co-morbidities, where the variant frequency is higher among individuals who were mildly affected by the disease and did not need hospitalization or oxygen therapy than among those more severely affected, who required oxygen therapy, ventilation or intubation. This study provides useful information for the identification of patients at risk of developing a severe form of COVID-19, and encourages the usage of drugs affecting the expression of TMPRSS2 or inhibiting protein activity.

TMPRSS2 is expressed in the prostate glands predominantly, but also in type II pneumocytes in the lung, nasal goblet secretory cells, and small intestine [9][10][11][12]. It is regulated by androgenic hormones in vivo [1].
Genetic determinants of susceptibility and/or severity of COVID-19 have been sought in TMPRSS2 in predictive [13][14][15][16] as well as in Whole-Exome Sequencing (WES) studies [17,18]. An analysis extended to large cohorts confirms the protective role of a common polymorphism p.V197M (rs12329760) in TMPRSS2 and, in particular, its effect in young males and elderly women.

Materials and Methods
Whole-Exome Sequencing (WES) data derived from the GEN-COVID Multicenter Study [19] were analyzed. Odds ratio (OR) for different contingency tables were calculated using unconditional maximum likelihood estimation (Wald's method); confidence intervals (CI) at 95% were calculated using the normal approximation. Independence was tested using a Chi-squared test. All calculations were performed within the R environment for statistical computing, using the OR function from the epitools package. R scripts are provided in Supplementary file S1.
The sequence of human TMPS2_HUMAN was obtained from the UniProt database [20]. Structural templates were searched using the SWISS-MODEL server [21] and a model of the region spanning aa 187 to 526 was built using the structure of human hepsin [22]. A protein model was visualized using PyMol [23]. The solvent accessibility and the effect of mutations on protein stability were calculated for variants occurring in the region spanning aa 187 to 526 using SDM [24].
PolyPhen-2 [25], which is a program that can graduate the severity of missense variants [26], was used to identify mutations with a high impact on TMPS2_HUMAN.

Results
We analyzed the WES data of 1177 patients affected by COVID-19 and selected according to the following inclusion criteria: (i) endotracheal intubation; (ii) CPAP/BiPAP ventilation; (iii) oxygen therapy; (iv) hospitalized without oxygen support; and (v) not hospitalized. In this cohort, we identified 52 variants in TMPRSS2, nine of which are relatively common (allele frequency > 0.01). TMPRSS2 encodes two isoforms that differ for 37 aa at the amino-terminus and both can activate respiratory viruses [27]. In this paper, the numbering of the longest isoform is used to identify variants in coding regions. The frequencies of two common mutations, one missense and one synonymous, p.V197M (rs12329760) and p.G296G (rs2298659), correlate with COVID-19 severity ( Figure 1 and Table 1).  The percentage of carriers in severely (intubation + CPAP/BiPAP + oxygen) or mildly (no resp support + nh) affected patients differs significantly and is higher in the latter group (p.V197M p = 0.0289, OR = 0.7601, LC = 0.5941, HC = 0.9724; p.G296G p = 0.0039, OR = 0.6947, LC = 0.5422 HC = 0.8899; Table 1). The percentage of carriers in severely (intubation + CPAP/BiPAP + oxygen) or mildly (no resp support + nh) affected patients differs significantly and is higher in the latter group (p.V197M p = 0.0289, OR = 0.7601, LC = 0.5941, HC = 0.9724; p.G296G p = 0.0039, OR = 0.6947, LC = 0.5422 HC = 0.8899; Table 1).
Patients were divided in two categories according to the severity of the disease: mild (including not hospitalized and no respiratory assistance required) and severe (including oxygen therapy, ventilation, or intubation required).
Several lines of evidence support the hypothesis that p.V197M affects the stability and/or the function of the protease. It is predicted to be deleterious by SIFT [29] and PolyPhen-2 [25], in both predictive modes, HumDiv and HumVar (Supplementary file S2).
p.V197M does not occur in the catalytic domain but in the SRCR domain that is likely needed for protein-protein interaction [30,31]. The region spanning aa 187 to 526 can be built by homology modeling. The side chain of Valine197 is buried (Figure 2), and solvent accessibility (%) 15.4 and its substitution by Methionine affects the stability of the protein (pseudo∆∆G= −2.00 kcal/mol). On the other side, the synonymous variant p.G296G, which corresponds to NM_001135099:exon9:c.C888T, is not annotated as an eQTL and does not fall at the exonintron junction. We hypothesize that the missense mutation is protective and is often associated in cis with the synonymous variant, which has little effect per se.
The frequency of the other common variants does not correlate with the disease severity (Supplementary file S3). This is expected since most of them are synonymous or intron variants. The only missense mutation is p.G8V that is predicted to be benign by SIFT [29] and PolyPhen-2 [25] in both predictive modes, HumDiv and HumVar (Supplementary file S2).
It has been observed and it is commonly accepted that age, sex, and co-morbidities influence the severity of COVID-19. We divided patients according to sex and age (below and above median values) and the severity of the disease. The number of carriers of p.V197M (heterozygous + homozygous individuals) among mild and severe cases differs significantly in two sub-cohorts, those of young males (p = 0.0200, OR = 0.5804, LC = 0.3663, HC = 0.9197) and elderly women (p = 0.0347, OR = 0.5346, LC = 0.2977, HC = 0.9601) ( Table  2). To check whether slightly altering the age frame may flip the p-value to non-significant, we tested different ages and found that, provided that the boundary between young and On the other side, the synonymous variant p.G296G, which corresponds to NM_001135099: exon9:c.C888T, is not annotated as an eQTL and does not fall at the exon-intron junction. We hypothesize that the missense mutation is protective and is often associated in cis with the synonymous variant, which has little effect per se.
The frequency of the other common variants does not correlate with the disease severity (Supplementary file S3). This is expected since most of them are synonymous or intron variants. The only missense mutation is p.G8V that is predicted to be benign by SIFT [29] and PolyPhen-2 [25] in both predictive modes, HumDiv and HumVar (Supplementary file S2).
It has been observed and it is commonly accepted that age, sex, and co-morbidities influence the severity of COVID-19. We divided patients according to sex and age (below and above median values) and the severity of the disease. The number of carriers of p.V197M (heterozygous + homozygous individuals) among mild and severe cases differs significantly in two sub-cohorts, those of young males (p = 0.0200, OR = 0.5804, LC = 0.3663, HC = 0.9197) and elderly women (p = 0.0347, OR = 0.5346, LC = 0.2977, HC = 0.9601) ( Table 2). To check whether slightly altering the age frame may flip the p-value to nonsignificant, we tested different ages and found that, provided that the boundary between young and old patients falls between 52 and 66 years for males and between 54 and 70 years for females, the protective effect remains significant for young males and elderly women (Supplementary Figure S1). We restricted the two groups of patients and analyzed young males and elderly women who were affected by co-morbidities such as diabetes and/or hypertension and/or obesity or other diseases (young males affected by co-morbidities p = 0.0057, OR = 0.2969, LC = 0.1254, HC = 0.7027; elderly women affected by co-morbidities p = 0.0391, OR = 0.4667, LC = 0.2262, HC = 0.9627) ( Table 3). The results hold significance when we consider the first quartile of the male or the last quartile of the female population (very young males in Q1 affected by co-morbidities p = 0.0343, OR = 0.2121, LC = 0.0485, HC = 0.9281; very elderly women in Q4 affected by co-morbidities p = 0.0090, OR = 0.2455, LC = 0.0812, HC = 0.7424) ( Table 4). Data in the two sub-cohorts of elderly males and young women are shown in Supplementary file S4. The subdivision of patients was carried out starting from the complete data set reporting the age, sex, and co-morbidities of individual patients. It is provided in Supplementary file S5.
Patients were divided according to sex and age (above and below median age) and the severity of the disease.
Patients with co-morbidities were divided according to sex and age (above and below median age) and the severity of the disease.
Patients were divided according to sex and age (quartiles) and the severity of the disease. It is not surprising that the effects of a deleterious mutation in TMPRSS2 are seen in both sexes because the expression of the gene is only slightly higher in males [13]. The difference in the times of life can be explained because androgens and estrogens have opposite effects on gene expression, as proved by the data collected from the Expression Atlas [32] (Supplementary file S6). Young males who do not carry p.V197M are at risk because of high testosterone levels. Particularly relevant is the risk of wild-type elderly women who are not protected by estrogens. Androgenic hormones decline with age less rapidly than estradiol after menopause and this effect might explain the risk in elderly females and the protective role of the variant.
Other rare missense mutations are found in the Italian cohort in heterozygosity (Supplementary file S2). We will discuss the germline mutations that can affect the protein.
A few fall in the region spanning aa 187 to 526 that can be modeled by homology, thus precluding a comparative analysis based on structural effects. PolyPhen-2 [25] uses sequence conservation to predict deleterious effects and two databases for training and testing predictions. The HumDiv model is trained using Mendelian disease variants vs. divergence from close mammalian homologs of human proteins (≥95% sequence identity). HumVar is trained using all human variants associated with some disease (except cancer mutations) or loss of activity/function vs. common (minor allele frequency > 1%) human polymorphism with no reported association with a disease of other effects.
The HumVar-trained model, which is best suited for distinguishing mutations with drastic effects, predicts that only one rare variant, p.L128P (rs147711290), has a strong effect (probably damaging, D, supplementary file S2) on the protein, perhaps because it occurs in the transmembrane anchor where proline destabilizes the helix [33]. p.L128P, which is relatively frequent only among Ashkenazy Jews (0.2%), was found in the Italian cohort in a single patient on oxygen therapy. The HumVar-trained model predicts that p.S265I and p.F246I (rs150554820), which occur on the surface of SRCR-like domain (aa 187-279) (Figure 2), p.P78L (rs138651919) and p. Y82D (rs201679623), which are located in the poorly characterized cytoplasmic region, have a moderate effect (possibly damaging, P, supplementary file S2) that is not sufficient to protect the patients. The effects of p.S265I and p.F246I (rs150554820) were calculated using the model shown in Figure 2. Both mutations affect relatively buried amino acids, S265 and F246 (Solvent accessibility (%) 11.5 and 1.2, respectively), but have a very mild effect on protein stability (S265I and F246I pseudo∆∆G = −0.29 kcal/mol and −0.07 kcal/mol, respectively).
These variants are not enriched in the mildly affected patients but were observed in severely affected ones too, 5 and 6, respectively.

Discussion and Conclusions
From our analysis, based on the WES of a large cohort of Italian patients, a variant in TMPRSS2 emerges as a predictive factor to identify patients at risk of a severe course of COVID-19.
The allele frequency of p.V197M in our study is 0.18, in line with what was already reported (0.17 [13]). This value is well below the one reported in GnomAD for Eastern Asians (0.38), Finns (0.37), and Africans (0.29). Among Non-Finnish Europeans, the allele frequency of p.V197M is 0.23 with an apparent gradient from North to South, Northern Sweden 0.29, Estonia 0.31 (Genetic variation in the Estonian population), UK 0.21 (UK 10K study-Twins), and Spain 0.17 (Medical Genome Project healthy controls from Spanish population). It is highly suggestive correlating the low frequency of V197M with the high impact that the first wave of the epidemics had in Italy.
Other missense mutations are not enriched in the mildly affected patients, possibly because their effect on the protein is weak.
The protein encoded by TMPRSS2 belongs to a family of membrane proteases, some of which promote SARS-CoV-2 infection [4,5,11]. No common deleterious missense mutations are found in these genes but for a variant in TMPRSS4, p.P413L, which is frequent among Latinos and mixed Americans (0.1259) but not among Europeans (0.002).
The fact that a missense mutation with a destabilizing effect on the protein product, such as p.V197M, influences the course of COVID-19 implicitly suggests that TMPRSS2 can be targeted for therapies either reducing its expression or inhibiting its protein product.
Sexual hormones can be used for therapeutic purposes [34,35]. Quite interestingly, and in line with our results, Seeland et al. reported that: "their retrospective study of hormone therapy in female COVID-19 patients shows that the fatality risk for women > 50 years receiving estradiol therapy (user group) is reduced by more than 50%; the OR was 0.33, 95% CI (0.18, 0.62) and the hazard ratio (HR) was 0.29, 95% CI (0.11,0.76). For younger, pre-menopausal women (15-49 years), the risk of COVID-19 fatality is the same irrespective of estradiol treatment, probably because of higher endogenous estradiol levels" [36].
To obtain more targeted effects, specific inhibitors at the protein level could be considered since they could have a protective effect analogous to that exerted by the mutation. Preliminary clinical data concerning Camostat [37,38] and Nafamostat [39,40] have been published. Their identification was the result of reposition which is a useful approach to reduce the time and costs of drug development [41] and has been largely employed during the emergency posed by COVID-19 [42,43].
In vitro Camostat and Nafamostat bind and inhibit TMPS2_HUMAN with great affinity, IC50 6.2 nM and 0.27 nM, respectively, but are not specific [44] and are not devoid of side effects in the patients [45]. In silico docking experiments have been carried out to find other inhibitors of TMPS2_HUMAN [46], but in vitro validation of the hits has not yet been carried out.
The effect of V197M on TMPS2_HUMAN was predicted by several authors [13][14][15][16][17][18]. WES analysis conducted on a large cohort of patients proves that the variant has indeed a statistically significant protective role in COVID-19. We do hope that our study will not only help to identify at risk patients, especially among elderly women, but also encourage the development of drugs for their treatment.
During the revision of this paper, we became aware that an independent research group proved the protective role of TMPRSS2 variants in COVID-19 in the general population [47].  Figure S1: influence of the boundary age on the statistical significance of p.V197M protective effect. Institutional Review Board Statement: The GEN-COVID study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the University Hospital of Siena Ethical Review Board (protocol code 16929, dated 16 March 2020). This observational study has been inserted in www.clinicaltrial.gov on 16 September 2020 (NCT04549831).

Informed Consent Statement:
As part of GEN-COVID Multicenter Study written informed consent was obtained from all individuals who contributed samples and data. Detailed clinical and laboratory characteristics (data), specifically related to COVID-19, were collected for all subjects.

Data Availability Statement:
The data and samples referenced here are housed in the GEN-COVID Patient Registry and the GEN-COVID Biobank and are available for consultation. You may contact Alessandra Renieri (e-mail: alessandra.renieri@unisi.it).