Long-Range PCR-Based NGS Applications to Diagnose Mendelian Retinal Diseases

The purpose of this study was to develop a flexible, cost-efficient, next-generation sequencing (NGS) protocol for genetic testing. Long-range polymerase chain reaction (PCR) amplicons of up to 20 kb in size were designed to amplify entire genomic regions for a panel (n = 35) of inherited retinal disease (IRD)-associated loci. Amplicons were pooled and sequenced by NGS. The analysis was applied to 227 probands diagnosed with IRD: (A) 108 previously molecularly diagnosed, (B) 94 without previous genetic testing, and (C) 25 undiagnosed after whole-exome sequencing (WES). The method was validated with 100% sensitivity on cohort A. Long-range PCR-based sequencing revealed likely causative variant(s) in 51% and 24% of proband from cohorts B and C, respectively. Breakpoints of 3 copy number variants (CNVs) could be characterized. Long-range PCR libraries spike-in extended coverage of WES. Read phasing confirmed compound heterozygosity in 5 probands. The proposed sequencing protocol provided deep coverage of the entire gene, including intronic and promoter regions. Our method can be used (i) as a first-tier assay to reduce genetic testing costs, (ii) to elucidate missing heritability cases, (iii) to characterize breakpoints of CNVs at nucleotide resolution, (iv) to extend WES data to non-coding regions by spiking-in long-range PCR libraries, and (v) to help with phasing of candidate variants.


Introduction
Inherited retinal diseases (IRDs) are a group of disorders affecting the retina and its function. They are characterized by high phenotypic variability and genetic heterogeneity, including different modes of inheritance [1]. These disorders may present as either isolated or syndromic, progressive or stationary. Also, IRDs may affect the entire retina (panretinal) or may be restricted to the macula. Moreover, they are often classified based on the primarily affected photoreceptor type. Panretinal forms include retinitis pigmentosa (RP), cone-rod dystrophy (CRD), cone dystrophy (COD), and others, whereas macular dystrophies (MDs) include Stargardt disease (STGD), Best disease (BEST), and others [1,2].
Missing heritability is defined as the portion of the phenotypic trait that could not (yet) be explained by genotype data [15]. Since inherited diseases are, by definition, characterized by complete or near-complete heritability, missing heritability in these diseases indicates cases in which genetic testing could not identify a molecular diagnosis [15].
Recent efforts to reduce missing heritability in IRDs (and improve diagnostic output) have focused on non-coding regions of the genome, and on the detection of structural or copy number variants (CNVs) [16][17][18][19]. These studies were performed using customized microarray probes [17], customized capturing probes [16,18], or genome sequencing [19].
Here, we present a flexible and cost-effective method to comprehensively sequence loci of interest. The method relies on high-throughput sequencing of pooled long-range (LR) polymerase chain reaction (PCR) fragments of up to 20 kb in size. We designed primers to sequence 35 genomic loci that have been associated with IRDs, particularly MDs and X-linked RP, including ABCA4, PRPH2, BEST1, CRB1, CNGB3, RP1, and RPGR.
As per design, WES only provides coverage for the coding portions of the locus. The custom capture probes result in coverage for the majority of the locus, including introns. However, gaps accounting for 15-20% of the locus are present. Finally, the LR PCR Figure 1. ABCA4 coverage comparison of different assays. Screenshot of the Alamut Visual software showing the coverage results for the ABCA4 locus from different next-generation sequencing assays. On the top, the software illustrates the relative exon locations. Below the gene structure, coverage plots for a typical whole-exome sequencing assay (top), custom capture probes assay (middle), and, finally, the long-range polymerase chain reaction method (bottom). The primers for loci that require multiple PCRs were designed in a way that amplicons of adjacent regions overlap with each other. As an example, Figure 1 shows the coverage obtained for the ABCA4 locus by different NGS assays: WES, custom capture probes, and LR PCRs.
As per design, WES only provides coverage for the coding portions of the locus. The custom capture probes result in coverage for the majority of the locus, including introns. However, gaps accounting for 15-20% of the locus are present. Finally, the LR PCR method can provide uniform coverage over the entire locus. Overlaps between neighboring LR amplicons are revealed by an increase in coverage. Similar results were obtained for the other loci included in the panel ( Figure S1 in the Supplementary Materials).
In order to verify sensitivity of the method, 108 IRD patients were selected, for whom a molecular diagnosis had been previously established, corresponding to variants in a locus included in the panel. LR PCRs for the respective loci were performed and sequenced. The sequencing of the validation cohort resulted in 100% sensitivity to the previously identified putative pathogenic variants (Table A1 in the Appendix A). We found no evidence for allele dropout (ADO). We envisioned the method to be useful in the following scenarios.

First-Tier Assay for Probands without Previous Genetic Testing
A subpanel of the 35 loci was selected and sequenced for 94 non-syndromic IRD cases composed of probands diagnosed with MD (92/94), CRD (1/94), and autosomal dominant COD (1/94) that had not previously undergone genetic testing. Locus selection was based on the provided clinical diagnosis and family history. This strategy resulted in the identification of a molecular diagnosis in 48/94 probands (51.1%) ( Table A2 in the Appendix A).
If candidate loci sequencing did not reveal a molecular diagnosis, WES was performed. Second-tier WES analysis revealed likely pathogenic variants in 15 additional probands, resulting in a total of 63/94 molecularly diagnosed probands (Table A2 in the Appendix A). Thirty-one probands (33.0%) were still lacking a genetic diagnosis after first-tier LR PCR-based candidate loci sequencing and second-tier WES. A list of rare variants (gnomAD MAF < 1%) found in these samples is provided (Table S1 in the Supplementary Materials).

Tackling Missing Heritability
The missing heritability cohort included 25 probands, in whom WES was previously performed and a single likely pathogenic variant in a recessively inherited locus was identified. LR PCRs of the relevant loci (i.e., those carrying a single likely pathogenic variant) were performed to assess the presence of a second likely pathogenic variant.

Copy Number Variant (CNV) Characterization
CNV analysis on capture data allowed for the identification of likely pathogenic exon-spanning deletions in ABCA4, KCNV2, RP1, and RS1.
A heterozygous ABCA4 deletion spanning exons 20 through 22 and parts of the adjacent introns was identified in two unrelated STGD patients and confirmed by multiplex ligation-dependent probe amplification (MLPA; Figure A1 and Table A2). A LR PCR product generated by using forward and reverse primers upstream of exon 19 and downstream of exon 23, respectively, was sequenced on a MiSeq instrument (Illumina, San Diego, CA, USA). Results highlighted similar breakpoints as in a recently published recurrent deletion [18]; however, mismatches and a longer polyadenine stretch seem to be present around the breakpoints. The breakpoint region was also sequenced by targeted locus amplification (TLA; Cergentis, Utrecht, The Netherlands), which appeared to confirm the 3' breakpoint at NC_000001.10:g.94511701. The 5' breakpoint probably lies within an A-rich sequence at NC_000001.10:g.94507655-94507699. Interestingly, the breakpoints are flanked by an AluY and an AluSx element on the 5 -(NC_000001.10:g.94507370-94507684) and 3 -side (NC_000001.10:g.94511717-94512041), respectively. Finally, a 10 bp microhomology sequence directly follows the 5 -side AluY element (NC_000001.10:g.94507690-94507699) and precedes the 3 -side AluSx element (NC_000001.10:g.94511701-94511710). However, unambiguous breakpoint identification was not possible due to repetitive flanking sequences (Table A2 and Figure A1).
A heterozygous deletion spanning the entire KCNV2 locus was identified in a patient diagnosed with COD (Table A1 in the Appendix A). To narrow down the breakpoint region of the deletion, microarray analysis was performed with an 850K single nucleotide polymorphism (SNP) chip. Based on these results, primers were designed for LR and the resulting amplicon was sequenced. Sequence alignment revealed a deletion of 70,036 bp in size and a 3 bp microhomology sequence flanking the breakpoints (Table A1 and Figure A2).
Finally, a hemizygous deletion involving exon 2 of RS1 and a homozygous deletion spanning exons 2-4 of RP1 were identified in a X-linked retinoschisis (XLRS) patient and a RP patient, respectively. The breakpoints of these deletions could be determined directly with the validated LR PCRs (Table S2 in the Supplementary Materials). The RP1 deletion is 11,116 bp in length with a 3 bp microhomology region flanking the breakpoints and has been described previously [27] ( Table A1 in the Appendix A). The RS1 deletion is 1005 bp in length and features a 4 bp microhomology around the breakpoints (Table A1 and Figure A3).

Exome Spike-In
In order to include the intronic regions of ABCA4, custom-capture probes were added to the TruSeq Exome kit capture probes before the hybridization step. The resulting data provided low coverage of the intronic regions of the gene ( Figure S2 in the Supplementary Materials). However, achieving the appropriate dilutions and proportions to ensure that the ABCA4 capture probes do not outnumber WES probes was challenging.
Similarly, LR PCR libraries spike-in was performed for the ABCA4, RPGR, and PCARE (C2orf71), to either increase coverage of poorly captured regions, such as RPGR's exon ORF15 (open reading frame 15) and PCARE exon 1, or to cover non-captured regions, such as introns. The method provided reliable data for variant detection over the entire loci, including exon ORF15 ( Figure S2 in the Supplementary Materials). Moreover, it proved to be technically less challenging compared to custom probes spike-in. Finally, the extra target regions can be personalized for each individual sample.

Read Phasing
In case of compound heterozygosity for recessive variants, chromosomal phasing for the candidate variants should be performed. Segregation analysis can achieve this, but only if family members are available for testing. Additional family members of 66 probands were available for segregation analysis in this study (66/227, 29.1%).
Being able to perform chromosome phasing directly on an individual's sample would greatly help with interpretation of genetic testing results. By visually inspecting reads resulting from LR PCR sequencing (read phasing), it was possible to ascertain compound heterozygosity in 5 cases (Tables A1-A3 in the Appendix A). Figure 2 shows a graphic representation of the concept (a), along with the simplest specific example (patient ID S220, Table A2) (b).

Discussion
We have presented a targeted sequencing approach based on LR PCR products that is flexible, versatile, and cost-effective. Primers for 124 LR PCRs to cover 35 IRD-associated loci have been established. The method has been validated with 108 molecularly diagnosed IRD cases. All previously identified variants could be verified, corresponding to 100% sensitivity. The target regions to be sequenced may be personalized according to the clinical phenotype and family history, so that costs can be reduced compared to standard genetic testing.
Although the method provides the most complete coverage of target loci (Figure 1, Figure S1 in the Supplementary Materials), several intronic regions could not be amplified (20 kb out of 1815 kb, 0.6%). Therefore, few loci (CRB1, EFEMP1, IMPG1, CNGB3, and TIMP3) have minor gaps in coverage. Moreover, whilst the method is very sensitive to deletions that lie within the amplicons, it is not able to detect deletions spanning primer-binding regions. In this case, ADO would occur, and the assay would not show any sign of the CNV other than lower PCR output and the fact that affected regions display exclusively homozygous-appearing variants.
It is not possible to eliminate the risk of ADO in any PCR; however, we did not observe any such events. Homozygous-appearing regions in non-consanguineous families should warrant caution. A final limitation of the method is the need for high molecular weight DNA as a template.
Since the sequencing targets are selected based on clinical phenotype and family history, this method is highly dependent on precise clinical assessment. Close collaboration between the molecular and the clinical diagnostics teams is therefore vital.
The method was used as a first-tier assay to sequence a subpanel of the 35 loci, selected based on clinical diagnosis, in 94 IRD patients (mostly MD) that did not undergo genetic testing previously. Additionally, LR PCR-based sequencing was used as a second-tier assay to discover "second hits" in 25 IRD patients of the missing heritability cohort. These strategies resulted in the identification of likely pathogenic variants in 69 patients of the 119 (58.0%). Published deep-intronic variants in ABCA4 contributed to the diagnosis of 4 probands (3.4%).
Variants in ABCA4, PRPH2, and BEST1 alone explained retinal disease in 60.7% of the MD cohort, similar to previously published results (57%) [13]. Even more striking are the results for the STGD sub-cohort, where variants in ABCA4 and PRPH2 were found to be the likely cause of disease in 76.8% of probands ( Figure 3). For this reason, we envision our method to be particularly applicable to diseases with low genetic heterogeneity (such as STGD), and as part of a tiered genetic testing strategy to reduce the number of more costly assays (such as WES) for more genetically heterogeneous diseases (such as unspecified MD and Leber congenital amaurosis). In our study, use of a tiered strategy composed of LR PCR sequencing of ABCA4, PRPH2, and BEST1, followed by WES for the remaining undiagnosed samples, in the MD cohort would have saved 34% in material costs. A similar system was previously found to be more sensitive and less expensive than standard WES analysis in one of the largest IRD cohorts yet reported [12].
Furthermore, the protocol can be useful in analyzing challenging regions that are typically not covered by other methods, such as RPGR's exon ORF15. It has been shown previously that NGS of a LR PCR over this region provides good coverage and we have developed a secondary data analysis pipeline to improve sensitivity and specificity [28,29]. This strategy permitted the identification of a novel 20 bp insertion (NM_001034853.1:c.2819_2838 dup, Table 3, Table A1) that was not detected by standard analysis pipelines.
As demonstrated by the examples of the ABCA4, RP1, RS1, and KCNV2 deletions, the protocol can be adapted easily to characterize the breakpoints of identified CNVs. Even though unambiguous breakpoint characterization was not possible for the ABCA4 deletion, sequencing revealed the deletion to be flanked by two Alu elements and a 10 bp microhomology sequence. The two flanking Alu elements are characterized by variants that increase their homologies, which is suggestive of a gene conversion event [30]. Since the likelihood of these events is indirectly correlated with the distance of the elements, it may be secondary to the deletion event [30].
As a proof of concept, we showed that finalized LR libraries can be spiked into an exome library to either enhance coverage of poorly captured exonic regions (such as RPGR's exon ORF15) or to obtain coverage of otherwise uncaptured regions (such as ABCA4's introns) ( Figure S2 in the Supplementary Materials).
Finally, the method can facilitate chromosomal phasing. It allows for segregation analysis of multiple variants on the same locus in a single experiment, when samples from family members are available. Moreover, when family members are not available for testing, read phasing based on the LR PCR sequencing data of the index patient might confirm or exclude compound heterozygosity. However, this depends on the density of heterozygous informative variants in the region of interest (Figure 2).
The method may also be beneficial for other Mendelian diseases, such as CFTR-related diseases, Tay-Sachs disease, Marfan syndrome, and tuberous sclerosis.

Patients and Family Members
Unrelated probands (N = 227) were referred to us for genetic testing from different eye clinics with a clinical diagnosis and information about family history. Samples (N = 166) from 66 families were available for segregation analysis. Written informed consent was obtained from all probands and family members included in this study, which was conducted in accordance with the 2013 Declaration of Helsinki.
The probands included in this study were assigned to one of three groups: (i) validation cohort (molecular diagnosis previously established by using WES, n = 108), (ii) missing heritability cohort (no molecular diagnosis established with previous WES analysis, n = 25), (iii) probands without previous genetic testing (n = 94).

Genomic DNA
The majority of genomic DNA (gDNA) samples (n = 204) were extracted in duplicate with the automated chemagic MSM I system according to the manufacturer's specifications (PerkinElmer Chemagen Technologie GmbH, Baesweiler, Germany). The remaining gDNA samples (n = 23) were extracted at external labs and sent to us. Genomic DNA integrity and concentration were evaluated on a Nanodrop instrument (Life Technologies, Carlsbad, CA, USA). Aliquots of gDNA were diluted to 10 ng/µL with ddH 2 O for use in PCRs.

Long-Range PCR primers
Primers for LR PCR were designed on NCBI's Primer-Blast to be 28 bp (range [26][27][28][29][30] long and to have a melting temperature (Tm) of 67 • C (range 65-68 • C) under default settings [31]. The size of PCR products depended on the target locus and ranged from 4.  Table S2 in the Supplementary Materials.

Long-Range PCR
The LR PCR products used in this study were generated using Takara's long and accurate (LA) Taq polymerases (Takara Bio, Kusatsu, Japan) according to the following PCR mixture: 30 µL total reaction volume, 11.5 µL of ddH 2

PCR Quality Control and Pooling
The expected size of amplicons was confirmed by electrophoresis on 0.6% agarose gels (run at 60 V) and 1 µL of the reaction mixture was used to measure the amplicon concentration with the Qubit dsDNA High-Sensitivity Assay kit (ThermoFisher, Waltham, MA, USA).
All amplicons that would be sequenced with the same index sequence (either all PCRs of a specific proband or non-overlapping amplicons of different probands) were diluted to 10 ng/µL in 1× Tris-EDTA (TE) buffer (Integrated DNA Technologies, Coralville, IA, USA).
The volume of each diluted PCR added to the pool was proportional to its size, for a final pool of at least 130 µL in total volume.
Successful fragmentation was checked by running 1 µL of the sheared pool with a Bioanalyzer High-Sensitivity DNA kit on a Bioanalyzer 2100 instrument (Agilent Technologies, Santa Clara, CA, USA).
The validated pools were then processed with the TruSeq DNA Nano Low-Throughput Library Prep kit and TruSeq DNA Single Index Set A and B, according to the manufacturer's protocol (Illumina, San Diego, CA, USA).
Molarities of the libraries were calculated and diluted to a 4 nM working concentration. The different libraries that were to be sequenced together (with different indexes) were pooled proportionally to their total genomic target size. Subsequently, these pools were denatured with 0.2N NaOH and loaded into a MiSeq Reagent Kit V2 (300 cycles) cartridge, according to the manufacturer's recommendations (Illumina, San Diego, CA, USA). Pairedend sequencing (2× 151 cycles) was performed on a MiSeq instrument (Illumina, San Diego, CA, USA).
A detailed step-by-step protocol is provided in the Supplementary Materials (Text S1).

ABCA4 Capture Sequencing
Custom-capture probes (xGen Lockdown Probe Pools) were designed for the ABCA4 locus by IDT (Integrated DNA Technologies, Coralville, IA, USA). Genomic DNA libraries were constructed according to the ThruPLEX DNA-seq 96D kit protocol (Takara Bio, Kusatsu, Japan). Libraries were pooled equimolarly for a total of 1600 ng of DNA. Subsequently, the ABCA4 capture probes were used according to the manufacturer's protocol to enrich the ABCA4 locus (Integrated DNA Technologies, Coralville, IA, USA). The resulting libraries were sequenced on a MiSeq according to manufacturer's recommendations (Illumina, San Diego, CA, USA).

Exome Spike-In Applications
To assess the feasibility of spike-in to WES experiments to enhance coverage of regions of interest, we tested two alternatives: by adding the custom xGen Lockdown Probe pool for the ABCA4 locus to the TruSeq Exome kit, capturing probes mix before hybridization, or by adding the processed LR PCR libraries just before final library denaturation.
An aliquot of the xGen Lockdown Probes was diluted to a concentration of 85.5 aM and a total of 250 amol were added to the TruSeq Exome capture mix for spike-in. Conversely, for the LR libraries spike-in, the amount of library to be added was calculated based on the ratio of the total LR library target regions and the WES target region.

Sequencing Data Analysis
Illumina sequencing data was aligned to the human reference genome hg19 with Burrows-Wheeler Aligner (BWA), and variant calling was performed by Genome Analysis Toolkit (GATK) [32,33]. The resulting Variant Call Format files (VCFs) were annotated with Alamut Batch v1.8 (SophiaGenetics, Lausanne, Switzerland) using a gene list corresponding either to the loci present in the RetNet database (n = 276) for WES data, or to the target loci for LR PCRs data. CNVs analysis on target-capture sequencing data (ABCA4 capture and WES results) was performed using panelcn.mops [34] and the SeqNext module of Sequence Pilot v5.2 (JSI medical systems, Ettenheim, Germany).
LR PCR sequencing data of RPGR's exon ORF15 was assembled de novo using SPAdes and the resulting contig was aligned to the reference sequence, as previously described [29,35].

CNV Validation and Breakpoints Characterization
MLPA (MRC Holland, Amsterdam, The Netherlands) was used to validate candidate CNVs identified by panelcn.mops and/or SeqNext in ABCA4, BEST1, GUCY2D, and RPGRIP1 and to screen for CNVs in ABCA4, BEST1, and PRPH2.
Deletions encompassing exons 20 through 22 of ABCA4, exon 2 of RS1, exons 2 through 4 of RP1, and the entire KCNV2 gene were identified by panelcn.mops and SeqNext or by comparison of coverage plots. These deletions were verified and characterized by LR PCR. The primers used are listed in Table S3 in the Supplementary Materials. Briefly, primers were designed based on the estimated breakpoints location, LR PCR was performed, and the amplicon was sequenced as described above (see amplicons pool library construction and sequencing). In addition, breakpoints were confirmed by Sanger sequencing (primers in Table S4 in the Supplementary Materials).
To narrow down the breakpoint region of the KCNV2 deletion, an Infinium CytoSNP-850K BeadChip array analysis was performed according to the manufacturer's protocol (Illumina, San Diego, CA, USA).
The ABCA4 deletion was further characterized and refined by TLA (Cergentis, Utrecht, The Netherlands).

Chromosomal Phasing
Segregation analysis was performed if samples from family members were available, either by Sanger sequencing (primers on Table S4 in the Supplementary Materials) or by LR PCR sequencing, as described above.
Alternatively, when no family members were available for testing, read phasing was performed. For this, reads in the Binary Alignment Map (BAM) file were visually inspected for informative heterozygous variants between the two putative compound heterozygous candidate variants (Figure 2).

Data Availability Statement:
The data presented in this study are available in the article and the supplementary materials. Raw data are not publicly available due to data protection regulations.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A
Reported here are the genetic findings for the three cohorts discussed in the main text: Validation cohort (n = 108), no previous genetic testing cohort (n = 94), and missing heritability cohort (n = 25). Table A1 summarizes the likely pathogenic variants identified in the validation cohort. Similarly, Table A2 highlights the likely pathogenic variants found in molecularly diagnosed patients included in the cohort without previous genetic testing. Finally, Table A3 reports the genetic findings in molecularly diagnosed patients of the missing heritability cohort.        The blue line under the sequence shows the part of the junction region that originates from the proximal side of the deletion. The green line highlights the 4 bp-long homologous sequence that is present on both sides of the deleted region. The orange line marks the part of the sequence coming from the distal side. (d) Electropherogram of the Sanger sequencing results of the junction fragment for the index patient. The blue line above the sequence shows the part of the junction region that originates from the proximal side of the deletion, which can also be seen in the NGS results (c). The green line shows the 4 bp-long homologous sequence that is present on both sides of the deleted region (also marked in (c)). The orange line highlights the part of the sequence coming from the distal side.