Phased Haplotype Resolution of the SLC6A4 Promoter Using Long-Read Single Molecule Real-Time (SMRT) Sequencing

The SLC6A4 gene has been implicated in psychiatric disorder susceptibility and antidepressant response variability. The SLC6A4 promoter is defined by a variable number of homologous 20–24 bp repeats (5-HTTLPR), and long (L) and short (S) alleles are associated with higher and lower expression, respectively. However, this insertion/deletion variant is most informative when considered as a haplotype with the rs25531 and rs25532 variants. Therefore, we developed a long-read single molecule real-time (SMRT) sequencing method to interrogate the SLC6A4 promoter region. A total of 120 samples were subjected to SLC6A4 long-read SMRT sequencing, primarily selected based on available short-read sequencing data. Short-read genome sequencing from the 1000 Genomes (1KG) Project (~5X) and the Genetic Testing Reference Material Coordination Program (~45X), as well as high-depth short-read capture-based sequencing (~330X), could not identify the 5-HTTLPR short (S) allele, nor could short-read sequencing phase any identified variants. In contrast, long-read SMRT sequencing unambiguously identified the 5-HTTLPR short (S) allele (frequency of 0.467) and phased SLC6A4 promoter haplotypes. Additionally, discordant rs25531 genotypes were reviewed and determined to be short-read errors. Taken together, long-read SMRT sequencing is an innovative and robust method for phased resolution of the SLC6A4 promoter, which could enable more accurate pharmacogenetic testing for both research and clinical applications.


Introduction
Antidepressants are the third most commonly prescribed drug class in the United States, surpassed only by antihyperlipidemics and analgesics [1]. In addition to depression, selective serotonin reuptake inhibitors (SSRIs) are used to treat obsessive-compulsive disorder (OCD), panic disorder, posttraumatic stress disorder and anxiety disorders; however, efficacy among the different SSRIs has been estimated at only~65% across treated patients [2]. In an effort to improve antidepressant efficacy, pharmacogenetic studies have recently been undertaken to guide treatment based on common variants in genes involved in SSRI pharmacokinetics (e.g., CYP2D6 and CYP2C19) and pharmacodynamics (SLC6A4) [3].
In addition to the 5-HTTLPR L>S variant (rs4795541), two SLC6A4 promoter sequence variants (rs25531 (c.-1936A>G), rs25532 (c.-1760C>T)) have also been reported to influence 5-HTT expression; however, they are most informative when considered as a haplotype with 5-HTTLPR [20,21]. The rs25531 (A>G) variant is located in the sixth repeat unit and can occur on either the L or S 5-HTTLPR haplotype. Notably, when this variant is found on the L haplotype ("L G "), it is associated with lower gene expression, similar to the S allele [20,21]. The rs25532 (C>T) variant is located~150 bp from rs25531 in the µ element. The presence of the T allele at this position also is associated with lower gene expression on both the L and S haplotypes [21]. Given the potential clinical significance of SLC6A4 and the importance of phasing multiple variants across the polymorphic repetitive promoter region, we developed a novel long-read single molecule real-time (SMRT) sequencing method to comprehensively characterize SLC6A4 promoter haplotypes.

Samples and Publicly Available Data
A total of 120 independent Coriell samples were subjected to long-read SMRT sequencing at the polymorphic SLC6A4 promoter region, primarily selected based on the availability of orthogonal short-read sequencing data. Among these samples, low-depth short-read genome sequencing (WGS) data (~5X) was available for 32 samples from the 1000 Genomes (1KG) Project; high-depth short-read WGS data (~45X) were available for 68 samples from the Centers for Disease Control and Genetic Testing Reference Material Coordination Program (GeT-RM) pharmacogenetic testing program [22]; and high-depth short-read capture-based panel data (~330X) were available for 21 samples from an internal dataset. DNA samples were acquired from the Coriell Biorepository (Camden, NJ, USA), which are summarized in Supplemental Table S1.

PCR Amplification of the SLC6A4 Promoter
Four overlapping primer sets (with a common forward primer) were used to amplify the SLC6A4 promoter region (Table 1), which included universal oligonucleotide tags for subsequent barcoding and SMRT sequencing. Long-range PCR reactions were performed in 20 µL containing~100 ng of DNA, 1X SequalPrep TM Reaction buffer (Invitrogen, Carlsbad, CA, USA), 0.5 µM of forward and reverse primers, and 1.8 units of SequalPrep TM polymerase. Amplification conditions were identical for all primer sets and consisted of an initial denaturation step at 94 • C for 2 min followed by 10 amplification cycles (94 • C for 10 s, 63.6 • C for 30 s, and 68 • C for 90 s), another 20 amplification cycles (94 • C for 10 s, 63.6 • C for 30 s, and 68 • C for 90 s + 20 s/cycle), and a final extension at 72 • C for 5 min. These products were used as templates for barcoding prior to SMRT sequencing. Table 1. Primer sequences used to amplify the SLC6A4 promoter region.

Primer ID Primer Sequence * Amplicon Size (bp)
Forward primer * Universal oligonucleotide tag sequence nucleotides are underlined.

SLC6A4 Promoter Sample Barcoding and Pooling
First round amplification products were used as template for a subsequent PCR that incorporated forward and reverse barcodes using forward ( Table S2). The PCR conditions for this reaction were identical to the first round PCR described above. Agencourt ® AMPure ® XP beads were used to purify all barcoded PCR amplicons, which were quantified by Nanodrop 1000. Following purification and quantitation, sample library pools were generated with equal molecule quantities of PCR amplicons using the following formula to calculate the required volume of each amplicon [26]:

-[barcode]-ATGGGTTCCAGAGTCAATC-3 ) and reverse (-[barcode]-GAAAGGTCTGGAGTCTTGAT-3 ) primers (Supplemental
where M is the total mass of pooled PCR amplicons, n is the total number of samples, V(i) is the volume of each PCR amplicon, and C(i) is the concentration of each amplicon. A total of 4000 ng of pooled PCR amplicons was submitted for SMRT sequencing.

Single Molecule Real-Time (SMRT) Sequencing
SMRT sequencing was executed according to the P6-C4 Pacific Biosciences protocol with a movie collection time of 180 min on the Sequel instrument, as per the manufacturer's instructions and as previously described [26]. SMRT sequencing analysis included demultiplexing, alignment, quality score recalibration, and variant calling. Raw sequencing data in FASTQ format were demultiplexed using NGSutils, sequencing reads were aligned using BWA-MEM, and variant calling was performed using GATK [26].

Sanger Sequencing Confirmation
Selected samples were validated by orthogonal Sanger sequencing using the following primer set: FWD 5 -CTTTGCGTTTTCTGTTGCCCT-3 ; REV 5 -CCCAGCAGGAGCCTATTGTT-3 . These primers generated a 1026 bp amplicon using the same PCR conditions described above. PCR fragments were analyzed using the Applied Biosystems 3730xl DNA Analyzer, and chromatogram traces were evaluated using NG_011747.2 as a reference sequence and Chromas v2.6.6 (South Brisbane, Australia).

SLC6A4 Promoter Long-Read SMRT Sequencing
All 120 samples were subjected to long-read SMRT sequencing across the polymorphic SLC6A4 promoter, and a representative sample with both short-and long-read sequencing data available is illustrated in Figure 1. Long-read SMRT sequencing results of additional SLC6A4 diplotypes are illustrated in Figure 2. In contrast to the short-read sequencing datasets, the 5-HTTLPR short (S) allele (rs4795541) was detected by long-read SMRT sequencing in 79 samples, resulting in a MAF of 0.467 (112/240). In addition, the SLC6A4 rs25531 and rs25532 variants were detectable in all samples and their MAFs were 0.113 (27/240) and 0.088 (21/240), respectively. In addition, long-read SMRT sequencing detected two samples with extra-long (XL) 5-HTTLPR alleles that were not detected by short-read sequencing. Six additional SLC6A4 promoter sequence variants were also detected and phased by long-read SMRT sequencing (Supplemental Table S3), which were inconsistently detected among the short-read sequencing datasets.    In addition to being more accurate at this locus than short-read sequencing, long-read SMRT sequencing also unambiguously phased the polymorphic SLC6A4 promoter in all samples, including complex compound heterozygous diplotypes (Figure 2). The three clinically significant SLC6A4 promoter variants (5-HTTLPR (rs4795541), rs25531, rs25532) were phased and the identified diplotypes and haplotype frequencies are summarized in Supplemental Table S2 and Table 2, respectively.  In addition to being more accurate at this locus than short-read sequencing, long-read SMRT sequencing also unambiguously phased the polymorphic SLC6A4 promoter in all samples, including complex compound heterozygous diplotypes (Figure 2). The three clinically significant SLC6A4 promoter variants (5-HTTLPR (rs4795541), rs25531, rs25532) were phased and the identified diplotypes and haplotype frequencies are summarized in Supplemental Table S2 and  Table 2, respectively.

Discussion
Given the emerging clinical significance of SLC6A4 in interindividual SSRI response variability and the repetitive architecture of the homologous SLC6A4 promoter, we developed a novel long-read SMRT sequencing assay using the PacBio platform to comprehensively characterize SLC6A4 promoter region haplotypes. Our innovative method enabled the phased resolution of complex SLC6A4 promoter diplotypes, which was not possible using short-read WGS data (~5X and~45X) or high-depth capture-based short-read sequencing data (~330X). This novel pharmacogenetic application of long-read SMRT sequencing adds to previously reported methods that employed the PacBio platform to interrogate clinically relevant low-complexity homologous regions of the human genome [26,27].
The SLC6A4 gene encodes 5-HTT, which plays a key role in the central nervous system by regulating serotonergic signaling via transport of 5-HT from the synaptic cleft back into the pre-synaptic terminal for re-utilization. As such, SLC6A4 has previously been associated with a range of behavioral and psychiatric disorders including depression, OCD, anxiety and schizophrenia [4]. Consistent with the role of 5-HTT in psychiatric phenotypes, variant SLC6A4 alleles have also been implicated in interindividual SSRI response [5][6][7][8]. The principal SLC6A4 variant associated with antidepressant efficacy is the 5-HTTLPR insertion/deletion polymorphism (rs4795541), which is located in the SLC6A4 promoter and regulates gene expression. However, this region is challenging to interrogate given that the promoter is a GC rich variable number tandem repeat (VNTR) comprised of 11 to 24 homologous units.
The commonly studied 5-HTTLPR long (L) and short (S) alleles have variable frequencies across ancestral populations. Specifically, the 5-HTTLPR short (S) allele has a higher frequency in Asian populations (55-82%) [28], and a frequency in Europeans and Africans of 35-50% [28] and 22-25% [29,30], respectively. Although our study was not intended to measure population frequencies of the 5-HTTLPR L>S insertion/deletion polymorphism, long-read SMRT sequencing of 120 reference material samples detected a minor allele (S) frequency of 46.7%, which is consistent with a general population frequency.
In addition to 5-HTTLPR, the neighboring rs25531 (c.-1936A>G) variant has also been associated with reduced SLC6A4 expression. Importantly, individuals who carry the 5-HTTLPR long (L) allele also have reduced SLC6A4 expression when they concurrently harbor rs25531 c.-1936G [20,21], underscoring the importance of phasing SLC6A4 variants into promoter haplotypes. Of note, the overall MAF of rs25531 in the gnomAD database is 0.1706 [31]; however, it is annotated as a low complexity region with a low genotype quality metric, and is only covered in fewer than 50% of individuals in gnomAD v2.1.1.
In addition to 5-HTTLPR and rs25531, rs25532 (c.-1760C>T) has also emerged as an independent determinant of SLC6A4 expression. Notably, the rs25532 c.-1760T allele has been correlated with a 15-80% reduction in SLC6A4 gene expression depending on the presence of other variants within the promoter haplotype [21]. Although the independent influence of rs25532 on SLC6A4 expression is significant, it is most informative when taken into consideration as a haplotype with 5-HTTLPR and rs25531 [32]. The overall MAF of rs25532 in the gnomAD database is 0.0683 [31]; however, it is also annotated as a low complexity region. Interestingly, our analysis of 120 samples by long-read SMRT sequencing did not identify any individuals with the rs25532 c.-1760T allele in cis with the rs25531 c.-1936G allele, which is consistent with previous studies that suggested rs25532 c.-1760T is in linkage disequilibrium with rs25531 c.-1936A [21].
Short-read sequencing is not effective at accurately interrogating the SLC6A4 promoter, particularly across the VNTR that includes the 5-HTTLPR insertion/deletion (L>S) polymorphism. This overarching limitation of short-read sequencing has previously been acknowledged, as low complexity regions and tandem repeats in the human genome are notoriously challenging for short-read platforms [33]. However, the ongoing improvements in long-read sequencing chemistry and throughput are increasingly enabling more accurate interrogation of these difficult regions for both research and clinical applications [34]. Our SLC6A4 long-read SMRT sequencing results further support the integration of third-generation sequencing platforms for more accurate and comprehensive characterization of low-complexity pharmacogenetic regions.
In addition to interrogating homologous regions, another advantage of long-read SMRT sequencing is the ability to phase variants and define diplotypes over kilobases of genomic sequence. We previously leveraged long-read SMRT sequencing for full-length characterization of the CYP2D6 gene [26], which led to the unambiguous detection of phased diplotypes and inferred phenotypes. Variant phasing is critical for more informed classification and interpretation in medical genetics, and defining full-gene haplotypes is an integral component of pharmacogenetics. The star (*) allele nomenclature system was developed over 20 years ago as a mechanism to characterize and report pharmacogenetic haplotypes, which is now centralized and administered by the NIH-funded Pharmacogene Variation (PharmVar) Consortium (https://www.pharmvar.org/) [35]. Although star (*) alleles have not yet been defined for SLC6A4, our long-read SMRT sequencing assay efficiently phased clinically relevant variants across the polymorphic homologous SLC6A4 promoter, including the rare extra-long (XL) allele.

Conclusions
In conclusion, SLC6A4 long-read SMRT sequencing is a reliable and validated third-generation sequencing technique that can accurately interrogate the low-complexity homologous SLC6A4 promoter region. This novel method adds to previously reported long-read SMRT sequencing applications in pharmacogenomics, underscoring its utility for high-throughput variant detection and haplotype phasing. The capability to detect and phase the 5-HTTLPR L>S (rs4795541), rs25531 (c.-1936A>G), and rs25532 (c.-1760C>T) variants in the polymorphic SLC6A4 promoter indicates that this method likely will have utility for both research and clinical testing applications.  Figure S1: Representative Sanger sequencing results from selected SLC6A4promoter reference material samples.