Untangling SNP Variations within CYP2D6 Gene in Croatian Roma

CYP2D6 is a highly polymorphic gene whose variations affect its enzyme activity. To assess whether the specific population history of Roma, characterized by constant migrations and endogamy, influenced the distribution of alleles and thus phenotypes, the CYP2D6 gene was sequenced using NGS (Next Generation Sequencing) method-targeted sequencing in three groups of Croatian Roma (N = 323) and results were compared to European and Asian populations. Identified single nucleotide polymorphisms (SNPs) were used to reconstruct haplotypes, which were translated into the star-allele nomenclature and later into phenotypes. A total of 43 polymorphic SNPs were identified. The three Roma groups differed significantly in the frequency of alleles of polymorphisms 6769 A > G, 6089 G > A, and 5264 A > G (p < 0.01), as well as in the prevalence of the five most represented star alleles: *1, *2, *4, *10, and *41 (p < 0.0001). Croatian Roma differ from the European and Asian populations in the accumulation of globally rare SNPs (6089 G > A, 4589 C > T, 4622 G > C, 7490 T > C). Our results also show that demographic history influences SNP variations in the Roma population. The three socio-culturally different Roma groups studied differ significantly in the distribution of star alleles, which confirms the importance of a separate study of different Roma groups.


Introduction
The CYP2D6 gene encodes the phase I drug-metabolizing homonymous enzyme and is located in tandem with pseudogenes CYP2D7P and CYP2D8P on chromosome 22q13.1, at the 3 end of the CYP2D cluster [1,2]. It contains nine exons consisting of 1461 codons and is highly polymorphic, with more than a hundred genetic variations and numerous subvariants that differ in single nucleotide polymorphisms (SNPs) or copy number variations (CNV); the latter resulting from CYP2D6 gene deletion or multiplication. In 1996, a group of international experts in pharmacogenetics decided to systematize allelic variants of CYP2D6 proposing a haplotype-based star (*) nomenclature system [3]. Since then, mostly due to the development of DNA sequencing technology, a tremendous amount of allelic and suballelic variants have been identified and classified by the Pharmacogene Variation Consortium [4].
The most important consequence of CYP2D6 genetic variations is their influence on the metabolizing activity of the CYP2D6 enzyme. These variations were broadly grouped into four different drug-metabolic phenotypes of the CYP2D6 enzyme; (1) poor metabolizer (PM-only null activity alleles detected), (2) intermediate metabolizer (IM-one normal activity allele with strict rules of endogamy, have led to the founder and the bottleneck effects that have caused a genetic structure of Roma to differ compared to other populations [35,37,38], which has been shown to affect ADME (Absorption, Distribution, Metabolism and Excretion) genes variations as well [39]. In Croatia, the Vlax Roma mostly inhabit the Baranja Region, in the eastern part of Croatia, and the Medjimurje Region, in the north of Croatia. The Balkan Roma mostly live in Zagreb area.
The main objective of this study was to estimate variation within the CYP2D6 gene among three socio-culturally and geographically distinct Croatian Roma groups (Balkan Roma, and Vlax Roma from the Baranja region and Medjimurje region), to find out whether their specific population history influenced the distribution of CYP2D6 alleles and consequently phenotypes.

Materials and Methods
We analyzed 323 DNA samples, all collected during field studies of the ongoing multidisciplinary anthropological, molecular-genetic, and epidemiological investigations of Roma groups in Croatia. Samples were collected during field research in Vlax Roma settlements in the Baranja region and Medjimurje region and in Balkan Roma settlements in the city of Zagreb, Croatia ( Figure 1). All respondents participated in the research voluntarily, and with the help of Roma volunteers, were introduced to the objectives, methods, and the anticipated contribution of the project. The protocol of the study was approved by the Scientific Board and the Ethics Committee of the Institute for Anthropological Research in Zagreb, Croatia. DNA was extracted from peripheral blood using the salting-out method [40]. The entire CYP2D6 gene (ENSG00000100197 chromosome 22: 42,126,499-42,130,881, GRCh38.p12) was sequenced with Illumina MiSeq V3 kit using the method Genotyping-in-Thousands by sequencing (GT-seq) in a commercial laboratory. GT-seq is a multiplexed amplicon sequencing method that allows for the simultaneous genotyping of hundreds of SNPs across thousands of individuals in a single library, making library preparation simple and cost-effective [41]. Raw reads were demultiplexed and sequencing adapter remnants were clipped from all reads (reads with final length < 100 bases were discarded). Primer sequences were removed, and sequence fragments were turned into forward-reverse primer orientation. Low-quality reads were discarded (Phred quality score of 20 over a window of 10 bases). Quality trimmed reads were aligned against all clusters using BWA version 0.7.12.
The FreeBayes v1.02-16 was used for variant discovery and genotyping of the samples. Variants were filtered using a set of GT-seq specific rules (minimum allele count must exceed eight reads, minimum allele frequency across all samples must exceed 10%).
Allele and genotype frequencies and Hardy-Weinberg equilibrium were calculated using VCFtools [42]. Differences in genotype and allele frequencies between Roma groups were tested using the Chi-square test and Fisher's exact test. The analyses were performed using the SPSS Statistics 21.0 statistical package for Windows (SPSS Inc., Chicago, IL, USA) and p-values were corrected for multiple testing using Bonferroni correction.
CYP2D6 gene sequencing identified 43 single nucleotide polymorphisms (SNPs), 28 of which were used for haplotype reconstruction. The 14 SNPs were excluded due to low level of heterozygosity (rs28371732, rs867985262, rs79596243, rs28578778, rs141009491, rs762158210, rs35742686, rs28371717, rs1349481801, rs376056664, rs189736703, rs28371703, rs1080992, and rs769811346). The low level of heterozygosity was considered for cases in which there was no minor allele in two Croatian Roma groups, while in the third Roma group it occurred with a frequency of two or less. The remaining insertion polymorphism (rs1269631565) was excluded because it cannot be phased by Phase software due to its indel nature. However, this polymorphism is not among star alleles' defining SNPs.
Haplotypes of the Croatian Roma groups were inferred using Phase ver. 2.1 [43,44]. Haplotypes were translated into the star allele nomenclature using the data provided at the PharmVar website [45]. Star diplotypes were translated to phenotypes and classified into three metabolizer categories-normal, intermediate, and poor according to the guidelines [46]. Software Arlequin 3.5 [47] was used to infer intra-population diversity indexes (haplotype and nucleotide diversity) and population pairwise F ST values, statistical significance assessed by generating 100,000 random samples. Possible departure from selective neutrality was tested using the Ewens-Watterson (EW) homozygosity test also implemented in Arlequin. Statistical significance was assessed by generating 10,000 random samples under the null conditions of no selection and constant population size.
The relations among haplotypes were shown in networks, which were calculated by the median-joining (MJ) procedure with default settings [48] using the program NETWORK 10.2.0.0 [49].
The inter-population comparisons of the Roma with other populations, based on frequencies of minor alleles, were performed using the data from the gnomAD database (v3.1.2) for East Asian, South Asian, and European (non-Finnish) populations. A comparison of the star allele distribution between the surveyed Roma groups and world population divided into major ethnic groups was performed using data from studies listed in Supplementary Table S2 in the work of Gaedigk (2017) [4]. Data were visualized in six graphs, each one representing a star allele found in more than 5% of the Roma in the Croatian Roma groups. Because the major ethnic groups consisted of data from multiple studies, the average star allele frequency for each major ethnicity group was calculated by weighting according to the sample size of each study.

Results
Sequencing of the CYP2D6 gene in DNA samples obtained from three Croatian Roma groups (323 persons in total) reveal 43 polymorphic positions, which are listed in Table 1. Due to low heterozygosity (absence of heterozygotes and recessive homozygotes), only 21 SNPs could be tested for HWE in all three Croatian Roma groups. The remaining 22 SNPs were tested for HWE in only one or two Croatian Roma groups. Most of the SNPs which were polymorphic in all three Roma groups were in Hardy-Weinberg equilibrium (HWE); after applying Bonferroni correction, exceptions were intron variants 8602 A > G (rs2004511) in total Roma sample, 6188 G > A (rs1081004) in all three Roma groups and in the total Roma sample and noncoding transcript exon 5264 A > G (rs29001678) in total Roma sample. A total of 21 SNPs were polymorphic in all three Roma groups (Table 1). Chi-square test results showed that four SNPs differed significantly in frequencies of genotypes between the three analyzed groups after Bonferroni's correction: 8565 dup (rs1269631565), 6769 A > G (rs1135824), 6089 G > A (rs368389952), and 5264 A > G (rs1081000). A significant difference in allele frequencies between the three Croatian Roma groups was found in the same SNPs (after Bonferroni's correction): 8565 dup (rs1269631565), 6769 A > G (rs1135824), 6089 G > A (rs368389952), and 5264 A > G (rs1081000).
Furthermore, we compared minor allele frequencies of 43 polymorphic SNPs from this study with their frequencies in East Asian, South Asian, and European populations. The lowest number of polymorphic SNPs was found in East Asian populations, where 18 out of 43 investigated SNPs were monomorphic and 13 SNPs had MAFs less than 1%. In South Asian populations five SNPs were monomorphic and six SNPs had MAFs less than 1%, while only two SNPs were monomorphic in Europeans, but 16 SNPs had MAFs less than 1% (Supplementary Table S1). SNPs found to be polymorphic in the investigated Roma groups had higher MAFs in South Asian than in the European populations. The distribution of allele frequencies of SNPs with MAFs > 0.05 in at least one of the populations is shown in Figure 2.  Table S2). Intra-population analysis based on all reconstructed haplotypes revealed the highest diversity in the population of Roma from Baranja and the lowest among Roma from Medjimurje (Table 2). Pairwise population F ST distances showed the highest difference between Balkan and Medjimurje Roma (F ST = 0.01249), followed by similar distances between Roma from Balkan and Baranja (F ST = 0.0028) and between Baranja and Medjimurje (F ST = 0.0027). The exact test of sample differentiation based on haplotype frequencies showed significant differences between samples. In order to exclude the possible influence of selection on CYP2D6 haplotype distribution, the Ewens-Watterson test of selective neutrality was performed.
All Roma groups have higher observed than expected homozygosity, but results were insignificant which rules out the influence of directional selection. A total of 89 haplotypes were translated into star nomenclature, which resulted in 10 star alleles (*) ( Table 3). The reference CYP2D6*1 allele was the most common in each of the three Roma groups individually, and in the entire Roma sample (33.1%). In addition to *1, the other most prevalent alleles (total sample prevalence larger than 5%) were star alleles *2, *4, *10, and *41-the five listed star alleles were found in 96.4% of the total Roma sample. Of the five remaining star alleles, two were not found in all three Roma groups, and the CYP2D6*65 allele was found in only one person from Baranja. Chi-square test results showed that all three Roma groups significantly differ amongst themselves in the prevalence of the five most prevalent CYP2D6 star alleles (χ 2 = 34.996, p < 0.0001) ( Table 3). Comparing the three groups pairwise, results showed that Medjimurje Roma differ significantly from Balkan (χ 2 = 24.759, p < 0.0001) and Baranja Roma (χ 2 = 22.329, p < 0.0001), while Balkan and Baranja Roma do not (p = ns). We also compared the distribution of the five most frequent star alleles in the three Roma groups by comparing the prevalence of one star allele vs. other four merged ones; the distribution of four of them differs significantly among the three Roma groups (*1 − χ 2 = 6.2788, *2 − χ 2 = 6.2276, *10 − χ 2 = 8.6023, *41 − χ 2 = 16.1097; all have p < 0.05) ( Table 3), while the distribution of *4 allele did not differ.
In addition, MJ networks were constructed to show the potential phylogenetic relationships among star alleles and their diversity in the studied Roma groups. All three MJ networks showed three clusters: one with highly predominant suballeles of *1, the other with suballeles of *2 and *41, and the last with suballeles of *10 and *4, suggesting that alleles *2 and *41, as well as the alleles *10 and *4, are phylogenetically close. Suballeles of *22 and *34 cluster together with suballeles of *1, suballeles of *35 cluster with suballeles of *2, while suballeles of *39 take an intermediate position between clusters in Roma groups from Baranja and Medjimurje. In Roma group from Balkans, the suballeles of *39 are placed among suballeles of *2. Suballele of *65, present in the Roma group from Baranja, is placed among suballeles of *10 ( Figure 3). Six newly found haplotypes could not be translated into star alleles but according to the position in the MJ networks their classification to the star allele nomenclature could be estimated (Figure 3). We also compared the frequencies of alleles *1, *2, *4, *10, *39, and *41, estimated in the Croatian Roma groups, with the population size-weighted prevalence of the same star alleles in worldwide populations, grouped according to the ethnicity (listed in Supplementary Table S2 in the Gaedigk at el. 2016 paper). Alleles *10 and *41, predominantly found in Asian populations, are present in Croatian Roma with a substantially increased prevalence compared to their European average (Figure 4). The prevalence of diplotypes and their predicted phenotypes in the three groups of Croatian Roma are shown in Table 4. We found a total of 28 diplotypes, of which 14 define normal, 13 define intermediate and one diplotype-*4/*4-defines poor metabolizers. The most prevalent diplotypes in the Croatian Roma groups are *1/*4 diplotype, which defines intermediate metabolizers, and *1/*2 diplotype, which defines normal metabolizers-these two diplotypes were found in 28.3% of the total sample. They are followed by normal metabolizing *1/*41 diplotype and intermediate metabolizing *2/*4 diplotype prevalences. These four diplotypes were found in 46.5% of all Roma. On the other hand, 10 diplotypes were found only once in the total sample.   Figure 5). Altogether, the normal metabolizing phenotype was found in 57.9% of examinees, the intermediate metabolizing phenotype in 39.3%, and the poor metabolizing phenotype in 2.8% of the Croatian Roma.

Discussion
The Roma population is an example of founder populations, with centuries of sociocultural isolation. Due to the complex history of migration combined with the cultural practice of endogamy, the Roma appear as a structured group of populations [35,38]. Indeed, studies of mitochondrial DNA (mtDNA) showed a clear divergence of the Vlax Roma from the Balkan and other Roma groups that reached Europe as part of the first migration wave [50,51]. Similar results were found with autosomal [52] and Y STR loci [53]. All current research thus points to substantial differences in the genetic make-up of diverse Roma groups [54,55]. Due to the high influence of demographic history on the gene pool of the Roma, we were interested in finding out how it reflected SNP variations within the CYP2D6 gene.
There were 43 polymorphic SNPs within the CYP2D6 gene in the whole sample of Croatian Roma, some of which were considered almost monomorphic in the worldwide sample, and identified only in the gnomAD database [56]. One of these SNPs is the intron variant rs368389952, found only in the Balkan Roma with a MAF of 9.18%. Its frequency reported in the gnomAD database is 0.3% in South Asian populations and in the Uygur minority it has a frequency of 1% [57]. So far it has not been reported in the PharmVar database. The second intron variant, rs566383351, is also considered almost monomorphic in the world populations, but we found this polymorphism in Baranja (13.68%), Balkan (15.82%), and Medjimurje (10.65%) Roma, which is similar to the results of Ahmed (2018) who found this SNP's minor allele in 14.59% of the Pakistani population [58]. This polymorphism has been reported in the PharmVar as a subvariant of star alleles CYP2D6*1, CYP2D6*35, and CYP2D6*41. It has also been reported in the Leiden Open Variation Database (LOVD) [59] as CYP2D6*35B with unknown effects. SNP rs374672076 is also considered monomorphic in the world population, but we have found it to be polymorphic in Balkan (8.67%) and Baranja (7.26%) Roma, which is again similar to Ahmed (2018), who found a frequency of 12.7% in the Pakistani population [58]. This SNP is an intron variant and it has been reported in the PharmVar as CYP2D6*139.001 with unknown function. SNP rs17002852, whose MAF in our total sample is <1%, is found among Baranja Roma with a frequency of 5.98%. The highest rs17002852 MAF of 2.56% is found among Middle Eastern populations (GnomAD). This synonymous variant has been reported in the PharmVar as subvariants CYP2D6*2.003, CYP2D6*2.007, CYP2D6*41.002, CYP2D6*131.001, and CYP2D6*149.001. Its association with tramadol response has been reported in the ClinVar [60].
For seven SNPs (rs4987144, rs28371730, rs28371725, rs16947, rs2267447, rs3892097, and rs1065852), we found similar frequencies in the Croatian Roma, South Asian, and European populations, which is not unexpected since the Roma originated in South Asia and began their migration to Europe more than a millennium ago.
A 89 distinct CYP2D6 haplotypes of Croatian Roma were translated into the pharmacogenetically relevant star alleles. The least number of haplotypes were found among Roma from Medjimurje, which is in line with previous findings indicating that they are the least diverse Roma group in Croatia [51,53]. In our study, the most common star allele in all three Roma groups was the CYP2D6*1. It is considered a reference allele and makes up most of the star alleles in the European and African populations [1,61,62]. Of the three Croatian Roma groups, the lowest frequency of the reference allele *1 was observed among Balkan Roma. As shown in Figure 4, the Balkan Roma also have the lowest frequency of *1 when compared to many world populations. Star allele *2 was the second most common allele in the Balkan and Medjimurje Roma. Balkan and Medjimurje Roma groups have a similar prevalence of *2 as the European and South Asian populations, with Medjimurje frequency being closer to European and South Asian values. This is not surprising since Naveen (2006) showed that the distribution of CYP2D6*2 is similar between the European and South Asian populations [63]. The prevalence of CYP2D6*2 is about 28% among Europeans, 12-29% in the Asian populations, and 16-20% in people of African ancestry [33]. Sistonen (2007) proposed that long-term selective pressure maintains a high frequency of haplotypes encoding a fully functional enzyme, causing a homogeneous geographic distribution of *1 and *2 alleles [62].
The non-function CYP2D6*4 allele, which is predominantly found in European populations (18%), had the highest frequency in the Balkan and Baranja Roma groups, even higher than in European populations. The prevalence of the *4 allele in the Medjimurje Roma group is lower than in the European population but still higher than in other world populations. Our results are in concordance with those for Hungarian Roma, Czech Roma, and South Asians [64,65]. CYP2D6*4 creates a deficient protein [66] and contributes to most of the poor metabolizers observed in European populations. As a result of poor metabolization, a large accumulation of enzyme substrates occurs [67].
CYP2D6*10 is a decreased-function allele predominantly found in East and South Asian populations, where its prevalence ranges from 9% to 44%. Its frequency in African populations is between 4-6%, among Europeans < 2% [4,33], and it is also present in the Croatian Roma (6%). According to its prevalence, the Roma from Medjimurje (10%) are closest to the South Asians, especially South Indians [63], while the Balkan Roma (3%) are closer to European populations. This allele is considered an intermediate metabolizer phenotype, and individuals homozygous for this allele are at risk for adverse events, although not as severe as in poor metabolizers [68].
Finally, a decreased-function allele CYP2D6*41 is found in Croatian Roma at 14% frequency. Its prevalence among African populations is 4-11.5%, in Asian populations 2-12%, and about 9% in European populations [33]. Roma from Medjimurje (6%) are closer to European and South Asian populations, while the Balkan (17%) and Baranja Roma (18%) are closer to Middle East populations, which show the highest frequencies of *41 allele in the world (Figure 4).
Distribution of the five most frequent star alleles (*1, *2, *4, *10, and *41), which accounted for over 95% of the variance in all three Croatian Roma groups, did not significantly differ between Roma from Balkan and Baranja, while Roma from Medjimurje significantly differed from both these groups. The results on Roma from Medjimurje are in line with previous findings that showed the highest level of isolation compared to other Roma groups [51]. Although this research is missing the determination of structural variants, such as gene duplications, which are important for the accurate determination of phenotypes [37], we assessed the metabolizing phenotype from diplotype data. Distributions of metabolizing phenotypes are similar among Roma groups, but Roma from Medjimurje, with the highest frequency of normal metabolizers, the lowest frequency of intermediate metabolizers, and none of the poor metabolizers, are the most distinct. The results of Medjimurje Roma are similar to Hungarian Roma, which also had no poor metabolizers [64].
Because of their socio-economic status, the Roma have less access to medical care and are at higher risk of diseases like diabetes, cardiovascular diseases, and other complex diseases [69]. Since CYP2D6 metabolizes many commonly used drugs [70], it is of great interest for research not only in the general population but also in isolated or minority populations. Studies on clinical effects of several antiarrhythmic drugs including metoprolol, timolol and propafenone, and antidepressants and antipsychotics have not been unanimous. It is assumed that poor/intermediate metabolizers are prone to adverse drug reactions. Furthermore, for antidepressants and antipsychotics, there is a risk of overexposure in poor/intermediate metabolizers and underexposure in normal metabolizers [68]. For several opioid drugs (codeine, oxycodone, and tramadol) used to treat pain, genotypes have been shown to affect their efficacy and safety [68]. Cancer research studies are not in agreement with the role of CYP2D6 in the development of cancer [71,72]. Still, this enzyme is involved in the metabolism of cytotoxic drugs such as tamoxifen, and it has been shown that both poor and ultrarapid CYP2D6 metabolizers of tamoxifen have a worse prognosis compared with normal metabolizers [73,74].
Pharmacogenomic research is an important tool in drug development and health system improvement that leads to personalized medicine. In populations that are un-likely to have access to personalized medicine, a population profiling like this one can be of interest for medical practitioners since the results of such research can provide the basis for the avoidance (or careful monitoring of the effects) of the administration of drugs that contain substances that are not properly metabolized among the members of that population.

Conclusions
Summarizing our results, we can say that demographic history, predominantly migrations (from India to Balkans and across Southeast European areas) and endogamy, has indeed influenced the distribution of variations within the CYP2D6 gene. It can be seen in the accumulation of globally rare variants which is the result of genetic drift that operates in isolated populations such as the Roma. Additionally, traces of their South Asian origin can be seen in the frequencies of polymorphic variants that are similar to Asian populations in many SNPs, as well as in elevated frequencies of star alleles *10 and *41. Given metabolizing phenotype estimates, Croatian Roma generally have low levels of poor metabolizers. The three socio-culturally different Roma groups studied differ significantly in the distribution of star alleles, which confirms the importance of studying different Roma groups separately.