The Impact of Population Variation in the Analysis of microRNA Target Sites

The impact of population variation in the analysis of regulatory interactions is an underdeveloped area. MicroRNA target recognition occurs via pairwise complementarity. Consequently, a number of computational prediction tools have been developed to identify potential target sites that can be further validated experimentally. However, as microRNA target predictions are done mostly considering a reference genome sequence, target sites showing variation among populations are neglected. Here, we studied the variation at microRNA target sites in human populations and quantified their impact in microRNA target prediction. We found that African populations carry a significant number of potential microRNA target sites that are not detectable in the current human reference genome sequence. Some of these targets are conserved in primates and only lost in Out-of-Africa populations. Indeed, we identified experimentally validated microRNA/transcript interactions that are not detected in standard microRNA target prediction programs, yet they have segregating target alleles abundant in non-European populations. In conclusion, we show that ignoring population diversity may leave out regulatory elements essential to understand disease and gene expression, particularly neglecting populations of African origin.


Introduction
The study of gene regulatory iterations is at the heart of biomedical research. Many disease-causing mutations affect regulatory motifs that eventually lead to a miss-regulation of fine-tuned biological processes. In the last years, microRNAs have emerged as an important type of regulatory molecules involved in virtually all biological networks [1,2]. Both the deletion and overexpression of microRNAs have been associated to human diseases (see for instance the compilation by [3]). Notably, the role of microRNAs in cancer is a particularly active research field [4,5]. Unlike other regulatory molecules (such as transcription factors or RNA binding proteins), the interaction between regulator and target is mediated by pairwise nucleotide complementary [1]. Specifically, a microRNA is partially complementary to motifs in RNA transcripts and, after binding, translational repression or even RNA degradation mechanisms are triggered [1,6]. Indeed, changes at microRNA target sites are now associated to major human diseases including cancer, neurodegeneration and metabolic disorders (see [7,8] and references therein), among others. The study of microRNA target sites is becoming more important as we start to understand the complexities behind microRNA regulatory networks.
From a computational point of view, the microRNA targeting mechanisms have a clear advantage: targets can be predicted by scanning genomic sequences for potential complementary sites [9][10][11]. However, microRNA target prediction algorithms have, in general, high rates of false positives [12,13]. Consequently, most research on microRNA targets complements a computational sieve of potential target sites plus experimental validations, usually by luciferase assays or similar (e.g., [14]). Therefore, the first step in microRNA target analysis is the prediction of binding sites from a primary sequence. In biomedical research, this primary sequence is often, if not always, the human reference genome sequence (currently hg38). However, a reference genome does not take into account existing variation within human populations and, therefore, any inference from such a sequence may be biased. Populations of African origin show a higher nucleotide diversity than other populations, probably as a consequence of a recent origin of Eurasian populations from a small group of African migrants (reviewed in [15]). There exist several projects aimed to account for all this nucleotide variability in human populations (i.e., [16]). Nevertheless, microRNA target prediction programs rely on reference genome sequences (as other regulatory interaction programs do), where this diversity is not truly represented.
Variation at gene regulatory sites, particularly at transcription factor binding sites, has been studied in the past (e.g., [17,18]). In microRNA target sites, some studies have shown that, as expected, functional target sites are often conserved in populations [19,20]. On the other hand, the creation of novel, potentially harmful, microRNA target sites, is selected against in populations [21,22]. In summary, the literature clearly shows evidence that nucleotide variation occurs at microRNA target sites. The question is, how much of this variation is neglected in the human reference genome sequence, and therefore not accounted for in standard microRNA target prediction programs? In other words, what is the impact of nucleotide variation among humans in microRNA target prediction research? This is the question we are tackling in this paper.

Results and Discussion
We compiled biallelic SNPs at predicted microRNA target sites (see Methods) for which one of the alleles is a target site and the other is not a target site. These are the target/near-target pairs as defined before [11,21]. Considering that populations of African origin host more variation, including ancestral alleles, than other populations, we hypothesize that ancestral target sites that are not present in the reference genome sequence were lost in Out-of-Africa populations. Thus, we first consider target sites whose allele in the reference genome sequence is not a target, yet the ancestral allele was a target site. The distribution of these alleles frequencies in five major groups (African, European, South Asian, Native American and East Asian) indicates that non-African populations tend to have the non-target allele whilst African populations show a wider variation, and a paucity of non-targets compared to the other populations ( Figure 1, top). Additionally, we consider target sites in the reference genome sequence that were not ancestral to human populations and, as expected, we found that African populations have lower frequencies of the target allele ( Figure 1, bottom). These results suggest that a significant loss and gain of microRNA target sites happened in Out-of-African populations.
To illustrate the impact of variation at target sites we selected SNPs with a very high degree of population differentiation (Fst > 0.7; see Methods) that are not present in the human genome reference sequence. We found 26 target sites with such a degree of variation across populations (Table 1) for highly expressed microRNAs. Many of these target sites are lost in Out-of-Africa populations, suggesting that losses rather than the gains of target sites are more likely to happen during differentiation between populations (see discussion in [22]). To illustrate the impact of variation at target sites we selected SNPs with a very high degree of population differentiation (Fst > 0.7; see Methods) that are not present in the human genome reference sequence. We found 26 target sites with such a degree of variation across populations (Table 1) for highly expressed microRNAs. Many of these target sites are lost in Out-of-Africa populations, suggesting that losses rather than the gains of target sites are more likely to happen during differentiation between populations (see discussion in [22]). Table 1. MicroRNA target sites, not in the reference genome sequence, with a very high differentiation among human populations.
We observed that MTAP have lost two target sites for two highly expressed microRNAs in non-African populations. These two target sites were the ancestral form in human populations and conserved as targets in other primates (Figure 2A). Genome-wide association studies have linked variants at MTAP with the abundance of naevi (moles) and also with melanoma incidence [23,24], although the association between MTAP and melanoma seems to be rather complex [23]. We speculate that the loss of two target sites at MTAP transcripts could be associated with an elevated expression of the gene in Out-of-Africa populations, perhaps by a relaxation on the selective pressures that maintained the target sites under purifying selection in African populations, where light exposure is Non-coding RNA 2019, 5, 42 4 of 8 higher, and an overproduction of MTAP may be linked to a higher incidence of skin cancer. Under this hypothesis, we expect the target allele frequency to be also negatively correlated with the latitude of sampling of non-African human populations. We compiled sampling information for human populations (see Methods) and build a linear model to predict the target allele frequency as a function of the latitude. We observed that populations from higher latitudes have lower target allele frequencies at both of the studied target sites (Figure 2B), although the statistical support was weak (rs7875199: R 2 (adjusted) = 0.18, p = 0.06; rs7868374: R 2 (adjusted) = 0.14, p = 0.09). We would like to emphasize that the association between latitude and MTAP target site allele frequencies is still speculative, but the fact that other similar associations has been found at microRNA target sites (e.g., [25]) suggest that other latitude depending microRNA function remain to be discovered.
Likewise, we found a similar pattern in two target sites at CYB5R4, a gene associated to gene-environment interactions [26,27]. In particular, CYB5R4 has been associated to the risk of diabetes, and variants at its 5 region have been found to be specific of African populations [26]. The association between population variants and disease-related genes such as MTAP and CYB5R4 remains highly speculative, yet it is very suggestive. In any case, the identified target sites at MTAP and CYB5R4 are not present in the human genome reference genome yet they are highly abundant in African populations and conserved in primates, indicating that target prediction programs often neglect an important part of human variability.
non-African populations. These two target sites were the ancestral form in human populations and conserved as targets in other primates (Figure 2A). Genome-wide association studies have linked variants at MTAP with the abundance of naevi (moles) and also with melanoma incidence [23,24], although the association between MTAP and melanoma seems to be rather complex [23]. We speculate that the loss of two target sites at MTAP transcripts could be associated with an elevated expression of the gene in Out-of-Africa populations, perhaps by a relaxation on the selective pressures that maintained the target sites under purifying selection in African populations, where light exposure is higher, and an overproduction of MTAP may be linked to a higher incidence of skin cancer. Under this hypothesis, we expect the target allele frequency to be also negatively correlated with the latitude of sampling of non-African human populations. We compiled sampling information for human populations (see Methods) and build a linear model to predict the target allele frequency as a function of the latitude. We observed that populations from higher latitudes have lower target allele frequencies at both of the studied target sites ( Figure 2B), although the statistical support was weak (rs7875199: R 2 (adjusted) = 0.18, p = 0.06; rs7868374: R 2 (adjusted) = 0.14, p = 0.09). We would like to emphasize that the association between latitude and MTAP target site allele frequencies is still speculative, but the fact that other similar associations has been found at microRNA target sites (e.g., [25]) suggest that other latitude depending microRNA function remain to be discovered.
Likewise, we found a similar pattern in two target sites at CYB5R4, a gene associated to gene-environment interactions [26,27]. In particular, CYB5R4 has been associated to the risk of diabetes, and variants at its 5′ region have been found to be specific of African populations [26]. The association between population variants and disease-related genes such as MTAP and CYB5R4 remains highly speculative, yet it is very suggestive. In any case, the identified target sites at MTAP and CYB5R4 are not present in the human genome reference genome yet they are highly abundant in African populations and conserved in primates, indicating that target prediction programs often neglect an important part of human variability. We identified microRNA/gene interactions that have been experimentally detected in high-throughput experiments with a significant population differentiation (PFst < 0.05, see Methods) that were not detected as targets in the reference genome sequence ( Table 2). For instance, miR-24-3p, a tumour-suppressor microRNA [28], has a polymorphic target site in Caspase 10 transcripts, whose product is an component of the apoptotic machinery. This target site is present in about 26% or African sampled alleles whilst it is not detected in most other populations. Indeed, neither TargetScan nor DIANA-microT [10] predict this interaction. This result further supports the Figure 2. MicroRNA target sites variation at transcript from the MTAP gene: (A) alignment of 3 UTR fragments from 10 primate species, including human, and the location of the target sites and the polymorphic nucleotides; (B) scatter-plot between the latitude (in absolute value) and the target allele frequency for target sites for miR-519a-5p (red) and let-7a-5p (blue). Straight lines are fitted linear models (see main text).
We identified microRNA/gene interactions that have been experimentally detected in high-throughput experiments with a significant population differentiation (P Fst < 0.05, see Methods) that were not detected as targets in the reference genome sequence (Table 2). For instance, miR-24-3p, a tumour-suppressor microRNA [28], has a polymorphic target site in Caspase 10 transcripts, whose product is an component of the apoptotic machinery. This target site is present in about 26% or African sampled alleles whilst it is not detected in most other populations. Indeed, neither TargetScan nor DIANA-microT [10] predict this interaction. This result further supports the view that reference genome sequences used to predict microRNA target sites often ignore population variability that may indicate functional diversity.
As this work is a computational evaluation of potential microRNA target sites, we have not validated any of the targets. However, we wanted to explore whether experimentally validated polymorphic target sites showed some population differentiation. Hence, we explored the 111 variants tested by PASSPORT-seq [29] of which 59 were listed in our database. The distribution of Fst values shows population differentiation for some targets (Figure 3). For instance, eight target sites (over 13% of the total) had an associated Fst value of 0.3 or more. As this work is a computational evaluation of potential microRNA target sites, we have not validated any of the targets. However, we wanted to explore whether experimentally validated polymorphic target sites showed some population differentiation. Hence, we explored the 111 variants tested by PASSPORT-seq [29] of which 59 were listed in our database. The distribution of Fst values shows population differentiation for some targets (Figure 3). For instance, eight target sites (over 13% of the total) had an associated Fst value of 0.3 or more. Last, we investigated polymorphic target sites identified by the popular software TargetScan. We downloaded the whole conserved and non-conserved target datasets and identified polymorphic target sites also present in our database PopTargs. We compared the target allele frequencies between populations and concluded that, for both conserved and nonconserved targets, the target allele was more common in European than in African populations (Table 3). Indeed, for nonconserved targets, the differences are larger.  Last, we investigated polymorphic target sites identified by the popular software TargetScan. We downloaded the whole conserved and non-conserved target datasets and identified polymorphic target sites also present in our database PopTargs. We compared the target allele frequencies between populations and concluded that, for both conserved and nonconserved targets, the target allele was more common in European than in African populations (Table 3). Indeed, for nonconserved targets, the differences are larger. In conclusion, here, we showed that the study of variation in human populations reveals the presence of microRNA target sites that are not detected with current methodological frameworks. We suggest that variation at target sites must be taken into account in biomedical studies as some populations (mostly of African origin) may be neglected. Further research is needed to quantify the impact of population variation in the function of microRNA target sites.

Methods
Polymorphic sites at predicted microRNA target sites were retrieved from the PopTargs database (version 2; Hatlen, Helmy and Marco, under review; https://poptargs.essex.ac.uk). Only target sites for highly expressed microRNAs were considered. Population frequencies, Fst values and ancestral alleles were also retrieved in PopTargs, as provided in [30]. We only consider segregating alleles with a frequency between 0.01 and 0.99. Sequence alignment of primate sequences was extracted using the UCSC table browser [31]. The geographical location (latitude) of the genome samples sequenced in the 1000 genomes project was obtained from the sample descriptions at https: //www.coriell.org/1/NHGRI/Collections/1000-Genomes-Collections/1000-Genomes-Project. When the specific location was ambiguous, we used the location of the capital city of the country of origin. For multiple collection points we computed the centroid middle point. Recently migrated populations were discarded (CHD, CEU, ASW, ACB, GIH, STU and ITU as described at IGSR: The International Genome Sample Resource (http://www.internationalgenome.org/category/population/)) were discarded from the analysis in Figure 2B. We extracted microRNA targets from the miRTarBase database [32], which were detected only in high-throughput experiments (PAR-CLIP and/or HITS-CLIP).