The Potential of HTS Approaches for Accurate Genotyping in Grapevine (Vitis vinifera L.)

The main challenge associated with genotyping based on conventional length polymorphisms is the cross-laboratory standardization of allele sizes. This step requires the inclusion of standards and manual sizing to avoid false results. Capillary electrophoresis (CE) approaches limit the information to the length polymorphism and do not allow the determination of a complete marker sequence. As an alternative, high-throughput sequencing (HTS) offers complete information regarding marker sequences and their flanking regions. In this work, we investigated the suitability of a semi-quantitative sequencing approach for microsatellite genotyping using Illumina paired-end technology. Twelve microsatellite loci that are well established for grapevine CE typing were analysed on 96 grapevine samples from six different countries. We redesigned primers to the length of the amplicon for short sequencing (~100 bp). The primer pair was flanked with a 10 bp overhang for the introduction of barcodes on both sides of the amplicon to enable high multiplexing. The highest data peaks were determined as simple sequence repeat (SSR) alleles and compared with the CE dataset based on 12 reference samples. The comparison showed that HTS SSR genotyping can successfully replace the CE system in further experiments. We believe that, with next-generation sequencing, genotyping can be improved in terms of its speed, accuracy, and price.


Introduction
Molecular marker technologies have changed plant genetics research enormously since their introduction in the 1980s and have provided researchers with a tool that is able to analyse an unlimited number of markers independent of environmental influences. Since then, there has been an impressive improvement in technology, which has moved from single to highly multiplexed analysis that includes southern hybridization probing [1]; random and specific PCR amplification methods [2,3]; quantitative PCR approaches [4]; microarrays [5]; and, more recently, next-generation (NGS) sequencing for plant genotype determination [6]. This progress has led to numerous publications describing dense genetic maps [7], finding quantitative trait loci (QTL) of great agronomic interest [8] and completely genotyped germplasm resources [9], to name a few.
The characterization of plant varieties or germplasm resources, such as grapevine, Vitis vinifera L., is a requirement driven by related economic interests, seed certification, plant variety rights, and scientific knowledge. Molecular marker methods for variety identification have undoubted advantages, including microsatellites, which have proven to be powerful tools for the identity,

DNA Extraction
The grapevine samples were obtained from different countries (Table 1), and DNA was extracted from fresh young leaves at the Biotechnical Faculty, University of Ljubljana, Slovenia. For this purpose, the modified cetyl trimethylammonium bromide (CTAB) method [28] was used. After measuring the concentrations (Amersham Biosciences DyNAQuant 200), the DNA samples were stored in a TE Buffer (Invitrogen™, Carlsbard, CA, USA) at −20 • C.

PCR Amplification
The grapevine cultivars were genotyped using SSR amplicon sequencing. The primers were redesigned to the length of the amplicon for short sequencing (~100 bp) ( Table 2) and amplified according to the established protocol. For this purpose, two rounds of PCR amplification were performed ( Figure 1) according to the protocols of Gohl et al. [29] and Vartia et al. [30]. The modified protocol consisted of amplification with locus-specific primers (forward and reverse) adapted to contain a universal primer sequence ( Figure 1; Table 3), and the incorporation of two barcodes by two barcoded universal primers into both ends of the resulting amplicons. A total of 12 forward and 8 reverse DNA barcodes enabled the recovery of 96 unique individuals (Supplementary Material, Figure S1).
Genes 2020, 11, x FOR PEER REVIEW 3 of 18 product length below 150 bp ( Table 2). The primers were designed using the Primer3 software [26]. A subset of HTS data was compared with the CE data of cultivars from the French collection (Table  1), obtained in a previous study of grapevine SSR genotyping, performed at National Research Institute for Agriculture, Food and Environment (INRAE), France [27].

DNA Extraction
The grapevine samples were obtained from different countries (Table 1), and DNA was extracted from fresh young leaves at the Biotechnical Faculty, University of Ljubljana, Slovenia. For this purpose, the modified cetyl trimethylammonium bromide (CTAB) method [28] was used. After measuring the concentrations (Amersham Biosciences DyNAQuant 200), the DNA samples were stored in a TE Buffer (Invitrogen™, CA, USA) at −20 °C.

PCR Amplification
The grapevine cultivars were genotyped using SSR amplicon sequencing. The primers were redesigned to the length of the amplicon for short sequencing (~100 bp) ( Table 2) and amplified according to the established protocol. For this purpose, two rounds of PCR amplification were performed ( Figure 1) according to the protocols of Gohl et al. [29] and Vartia et al. [30]. The modified protocol consisted of amplification with locus-specific primers (forward and reverse) adapted to contain a universal primer sequence ( Figure 1; Table 3), and the incorporation of two barcodes by two barcoded universal primers into both ends of the resulting amplicons. A total of 12 forward and 8 reverse DNA barcodes enabled the recovery of 96 unique individuals (Supplementary Material, Figure S1). Amplifying begins with locus-specific amplification (step 1) using locus-specific forward (F) and reverse (R) primers extended with universal tails ( Table 2); tail 1 (for F primer) is AATTAACCCT, tail 2 (for R primer) is CAGTCGGGCG. In step 2, the loci are pooled by sample and re-amplified to integrate the barcoding primers (BC-F, BC-R) listed in Supplementary Material, Figure S1. Amplifying begins with locus-specific amplification (step 1) using locus-specific forward (F) and reverse (R) primers extended with universal tails (Table 2); tail 1 (for F primer) is AATTAACCCT, tail 2 (for R primer) is CAGTCGGGCG. In step 2, the loci are pooled by sample and re-amplified to integrate the barcoding primers (BC-F, BC-R) listed in Supplementary Material, Figure S1.

PCR for Locus-Specific Amplification
Primary PCR amplification was performed in a final volume of 10 µL containing 5 µL of 5X Q5 Hot Start HiFi buffer, 0.3 µL of 10 mM dNTPs, 5 µL of Q5 Enhancer, 0.1 µL of Q5 Hot Start HiFi Polymerase, 0.25 µL (10 µM) of each locus-specific primer (forward and reverse), and 20 ng of DNA. The cycling conditions were as follows: initial denaturation at 95 • C for 5 min, followed by 35 cycles of 98 • C for 10 s, 65 • C for 20 s, and 72 • C for 10 s. A final extension was performed at 72 • C for 2 min, and then the reaction was cooled down to 4 • C.

PCR for Barcode Integration
We performed the second dual barcoding PCR in a volume of 10 µL containing 5 µL of primary PCR at a 1:100 dilution, 3 µL of 5 µM oligo for each index/barcode, 1.5 µL of 10x KAPA HiFi buffer, 0.3 µL of 10 mM dNTPs, and 0.08 µL of KAPA HiFi Polymerase. The following cycling conditions allowed the efficient incorporation of barcodes to PCR amplicons: initial denaturation at 95 • C for 5 min, followed by 25 cycles of 98 • C for 30 s, 45 • C for 30 s, and 72 • C for 1 min. A final extension was performed at 72 • C for 8 min, and the reaction was cooled down to 4 • C.

Pooling and Sequencing
After the second dual indexing PCR, the amplification products were checked using agarose gel electrophoresis across all loci and diluted appropriately to minimise the amplification rate differences between samples. Two microliters of each PCR product (across all loci and all specimens) were pooled together and cleaned using the Illustra GFX PCR and a gel band purification kit (GE Healthcare, Chicago, IL, USA), following the recommended procedures to remove shorter oligonucleotides. The cleaned sample was eluted in 25 µL, analysed with a highly accurate DNA electrophoresis Bioanalyzer Genes 2020, 11, 917 7 of 16 2100 system using a DNA 1000 kit (Agilent, Santa Clara, CA, USA), diluted to the final concentration of 20 ng/µL, and submitted for the Illumina 150 bp paired-end sequencing at GATC Biotech (Ebersberg, Germany). The project was designed to obtain approximately 5 M paired-end reads per DNA library. The reads were delivered as two FASTQ non-interleaved files.

Bioinformatics Analysis
Reference loci sequences were acquired through the Grape genome browser (12X coverage) (http://www.cns.fr/externe/GenomeBrowser/Vitis/) and adapted to shorter lengths ( Table 2). The raw sequencing reads were mapped to the reference sequences using the "Map Reads to reference" tool implemented in CLC Genomics Workbench 20 (Version 20.0.3) (Qiagen, Hilden, Germany) to obtain the sequencing statistics per locus.
We used two different approaches to assign amplicon sequences to each cultivar and locus. The first approach consisted of mapping the raw sequencing data against the Pinot Noir genomic reference sequences. In the second approach, we demultiplexed the sequencing data by the cultivarand locus-specific sequences present in the amplicon sequences. Briefly, the pair-end sequencing data were demultiplexed in two steps using the fastq-multx tool [34]. In the first step, the sequencing reads were demultiplexed based on the cultivar-specific barcodes introduced into amplicons in the second PCR step and, thus, sorted into the corresponding cultivar samples. After this, Cutadapt ver. 1.18 [35] was used to trim the cultivar-specific barcode sequences from the 3 and 5 ends of the reads.
In the second step, demultiplexing based on primer sequences, which are considered as locus-specific barcode sequences, was performed for each cultivar, and reads with locus-specific sequences on both ends of the reads were kept, thus retaining only full-length sequences. With this procedure, we filtered out incomplete amplicons and kept the reads that fully defined the microsatellite region. The filtered FASTQ files were converted to FASTA files and analysed using (1) the MISA Perl script [36] for the presence of perfect as well as compound microsatellites and (2) the Infoseq tool [37] to obtain the number of sequences with the same length.
The results were analysed with bash tools using the following procedure. The sizes of the microsatellites (no. of repeats or length of alleles) were reported for each read or amplicon sequence, and the number of unique values (sizes) were reported in a table-wise manner. The number of sequencing reads with obtained SSR sizes (MISA output) and the number of sequencing reads with obtained lengths (Infoseq output) were further used as an input for SONiCS [38], a tool that enables stutter noise correction and the determination of true alleles. The tool was run in Monte Carlo mode, with 5000 simulation repetitions. Analyses with SONiCS were applied for only a subset of data (12 French cultivars), for which we were able to make a comparison on the previously reported CE data [27].

Sequencing Analysis
The Illumina paired-end sequencing yielded 24,360,664 reads with an average size of 151 nt, yielding a total of 3,678,460,264 (3.68 Gb) bp of data. Theoretically, the even distribution over 12 loci should be approximately 306.5 Mb. The mapping of the reads to the reference alleles (Table 4) showed that the majority of the reads were of high quality, as 22 M of reads (90.7%) were assigned to 12 loci. However, the distribution of the reads across the loci was not uniform, with an acceptable range between 0.79 M for locus VVIq52 and 3.6 M for locus VMC1b11. This is most likely the consequence of competition among loci in the PCR during the library preparation. The approach of using reference microsatellite sequences and further demultiplexing sequences based on mapping results did not prove to be the method of choice in our example. Microsatellite repeats can be similar between loci, which leads to incorrect mapping, especially for long alleles. Therefore, we chose a demultiplexing approach based on filtering out those sequences that contained correct locus-specific primer-to-primer information and were considered for the final genotyping. The final number of obtained reads was slightly lower than the number of mapped reads (19.4 M, 79.8%); however, they represented high-quality data that were confirmed twice by sequencing (the paired-end approach). Similarly, the demultiplexing approach yielded from 0.7 M (VVIq52) to 2.9 M (VMC1b11) full-length amplicons per locus (Table 4). Using the mapping approach, we obtained a slightly higher number of sequences for most loci; this was likely mainly due to the inclusion of sequences that did not cover the entire microsatellite sequences.
The minimum length of the amplicons demultiplexed by the locus ranged from 73 nt (VVIq 52 and VVIv37) to 99 nt (VVMD25), and the maximum length ranged from 85 nt (VVIq52) to 131 nt (VVMD25), corresponding to the allele lengths shown in the Supplementary Material, Figure S2.

Comparison of CE and HTS Approaches
The results of the comparisons between the HTS and CE methods for microsatellite analyses are presented in the Supplementary Material, Table S1. In examining the HTS approach, the sequences were analysed according to the number of microsatellite repeats (MISA script) and the full lengths of the sequenced amplicons (Infoseq script). The SSR lengths obtained by the MISA script and the amplicon lengths obtained by Infoseq were first analysed with SONiCS. During the visual inspection of the results, we found some allele calling errors when using automated SONiCS analyses, and thus we concluded that the approach using solely SONiCS was not appropriate for the determination of true alleles.
In the past, some other bioinformatics tools have been developed for retrieving SSRs from HTS data, such as LobSTR [39], RepeatSeq [40], STRViper [41], STR-FM [42], PSR [43], rAmpSeq [44], and STRScan [45]. We decided to use the software SONiCS, as it performs simulations of PCR reactions to correct allele calling due to the stutter bands, which are amplified at most grapevine SSR loci used in this study. SONiCS uses the length and depth of the sequenced alleles as input data, and, after each simulation, the log likelihood is calculated to estimate the probability of generating the observed data (input data) from the assumed simulated results. SONiCS selects the alleles for which the model has the highest likelihood. In 144 comparisons (12 loci × 12 cultivars) between MISA or Infoseq and the CE approach, SONiCS showed a 58% success rate in genotyping using MISA data, as 75 alleles were correctly called and 8 alleles differed only by 1 bp. When calling genotypes based on sequence length (Infoseq), SONiCS performed better compared to the approach using the MISA data, as it showed a 77% success rate in genotyping, as 102 alleles were correctly called and 9 alleles differed only by 1 bp.
However, due to missing some longer alleles with lower read counts, we continued to call alleles from the Infoseq output data by visual determination. The CE approach served as a standard. The comparison of the differences for the two alleles (per locus per sample) revealed some discrepancies between the HTS and CE methods, as shown in the Supplementary Material, Table S1. When comparing the MISA data with the CE data for 144 data points (12 loci × 12 cultivars), we obtained 75 alleles that showed the same difference between the alleles within the locus and 8 that differed only by 1 bp. Comparing the Infoseq data with the CE data for 144 data points, we obtained 102 alleles that showed the same difference between the alleles within the locus and 9 that differed only by 1 bp. The reported differences could be due to the development of new primers for HTS analyses that could lead to new null alleles, so that, in some cases, the homozygosity was higher than the expected heterozygosity for the HTS approach (Richter110, locus VVMD25), and, conversely, in some cases the homozygosity was higher than the expected heterozygosity for the CE approach (e.g., Merlot, locus VrZAG79).
The clustering of cultivars based on simple-matching dissimilarity coefficients was performed for the CE and HTS allelic data and resulted in two trees ( Figure 2), with bipartition complexities of 0.94 and 0.91. The value for the consensus tree was 0.52, and the obtained distance between the trees was 0.82. Certain clusters supported with high bootstrapping values (e.g., a cluster of Muscat cultivars and cluster of Pinot Noir-Chardonnay) appeared equally in both approaches, and the Richter 110 rootstock was the most different from other V. vinifera cultivars in both approaches ( Figure 2).

The HTS Approach Creates a Bias in Calling True Alleles for Some Loci
The number of read counts of full-length sequences (alleles) for 12 cultivars over 12 loci are presented as histograms (Supplementary material, Figure S2), with the corresponding alleles determined (Supplementary material, Table S1; columns K and L). We observed that some loci are more problematic for the HTS approach than others; e.g., for the loci VVIq52, VVIb01, and VVMD24, we did not observe any discrepancies in the intra-allelic length comparison between different approaches (Supplementary Material, Table S1), whereas for locus VVMD27, for example, 6 out of 12 comparisons resulted in inconsistencies (Supplementary Material, Table S1). In locus VVIb01, the alleles were short (from 87 to 97 bp), and were similarly so in locus VVIq52 (from 75 to 83 bp) and VVMD 24 (from 97 to 108 bp), while in locus VVMD27 the allele lengths were from 110 to 125 bp and certain long-sized alleles could be overlooked due to their poor sequence coverage (Figure 3, Furmint, allele 125 bp). A similar problem was observed for the locus VVMD25 (Figure 3, Mourverde, allele

The HTS Approach Creates a Bias in Calling True Alleles for Some Loci
The number of read counts of full-length sequences (alleles) for 12 cultivars over 12 loci are presented as histograms (Supplementary material, Figure S2), with the corresponding alleles determined (Supplementary Material, Table S1; columns K and L). We observed that some loci are more problematic for the HTS approach than others; e.g., for the loci VVIq52, VVIb01, and VVMD24, we did not observe any discrepancies in the intra-allelic length comparison between different approaches (Supplementary Material, Table S1), whereas for locus VVMD27, for example, 6 out of 12 comparisons resulted in inconsistencies (Supplementary Material, Table S1). In locus VVIb01, the alleles were short (from 87 to 97 bp), and were similarly so in locus VVIq52 (from 75 to 83 bp) and VVMD 24 (from 97 to 108 bp), while in locus VVMD27 the allele lengths were from 110 to 125 bp and certain long-sized alleles could be overlooked due to their poor sequence coverage (Figure 3, Furmint, allele 125 bp). A similar problem was observed for the locus VVMD25 (Figure 3, Mourverde, allele 131 bp).  In locus VrZag79, in many cases (for cultivars Muscat Blanc a Petits Grains, Muscat d'Alexandrie, Mourvedre, Furmint, Cabernet franc, etc.) a three-allelic profile or high debris (reads of 83 and 89 bp) appeared. Figure 4 shows the Mourverde cultivar for locus VrZag79 with a tri-allelic profile (83, 89, and 97 bp). The three-allelic profiles discovered for this locus were previously observed in studies when extracting DNA from leaves. The presence of a third allele in leaf tissue indicates a periclinal chimera [46]. The locus VVMD7 showed, in some cases, a very intensive amplification of stuttering bands ( Figure 5), which can hinder the calling of true alleles. Small and unexpected mutations associated with locus VVMD7 were also reported earlier [46][47][48][49][50] and may, in some cases, be a consequence of the impaired allele calling. In locus VrZag79, in many cases (for cultivars Muscat Blanc a Petits Grains, Muscat d'Alexandrie, Mourvedre, Furmint, Cabernet franc, etc.) a three-allelic profile or high debris (reads of 83 and 89 bp) appeared. Figure 4 shows the Mourverde cultivar for locus VrZag79 with a tri-allelic profile (83, 89, and 97 bp). The three-allelic profiles discovered for this locus were previously observed in studies when extracting DNA from leaves. The presence of a third allele in leaf tissue indicates a periclinal chimera [46].  In locus VrZag79, in many cases (for cultivars Muscat Blanc a Petits Grains, Muscat d'Alexandrie, Mourvedre, Furmint, Cabernet franc, etc.) a three-allelic profile or high debris (reads of 83 and 89 bp) appeared. Figure 4 shows the Mourverde cultivar for locus VrZag79 with a tri-allelic profile (83, 89, and 97 bp). The three-allelic profiles discovered for this locus were previously observed in studies when extracting DNA from leaves. The presence of a third allele in leaf tissue indicates a periclinal chimera [46]. The locus VVMD7 showed, in some cases, a very intensive amplification of stuttering bands ( Figure 5), which can hinder the calling of true alleles. Small and unexpected mutations associated with locus VVMD7 were also reported earlier [46][47][48][49][50] and may, in some cases, be a consequence of the impaired allele calling. The locus VVMD7 showed, in some cases, a very intensive amplification of stuttering bands ( Figure 5), which can hinder the calling of true alleles. Small and unexpected mutations associated with locus VVMD7 were also reported earlier [46][47][48][49][50] and may, in some cases, be a consequence of the impaired allele calling. The locus VVMD7 showed, in some cases, a very intensive amplification of stuttering bands ( Figure 5), which can hinder the calling of true alleles. Small and unexpected mutations associated with locus VVMD7 were also reported earlier [46][47][48][49][50] and may, in some cases, be a consequence of the impaired allele calling.

Analyses of 96 V. vinifera Samples
The sequencing analyses (i.e., the number of reads for the sequenced amplicons) for 96 different V. vinifera cultivars over 12 loci are presented in the Supplementary Material, Table S2. In the analysed data set, we included five counterparts from French and Slovenian collections (Chardonnay, Merlot, Pinot Noir, Cabernet Sauvignon, and Sultanine), and the comparison over 12 loci yielded 55 exact matches and 5 discrepancies (Supplementary data, Table S2); three out of five were different for only two bp for the compared alleles and two were within the locus VVMD27, which was previously confirmed as one of the loci with triallelic profiles (chimerism) that showed a high intra-clonal variability [51,52]. Discrepancies were found in the Merlot and Pinot Noir cultivars, with previously reported intra-clonal genetic variation [46,51,52]. Studies have previously reported polymorphisms identified by microsatellite markers, which indicate the presence of trialellic loci, referred to in grapevines as chimeras [46,49], caused by mutations in the cells of the meristem layers L1 and L2 [53].

HTS Genotyping Economy
HTS systems offer extremely cost-effective sequencing generation for large amounts of data. Therefore, HTS systems are already used in genotyping projects that employ different strategies to find polymorphisms, such as genotyping by sequencing [54], capturing strategies [55][56][57], or the shotgun sequencing of entire genomes [58]. Microsatellites are multiallelic markers, which makes them ideal for the management of plant germplasm. In our project, we investigated the possibility of using a sequence counting approach for genotyping microsatellite alleles.
There are also economic reasons behind switching from capillary-based systems to HTS platforms. The first important reason is the price of a capillary-based instrument, which is higher than for medium-throughput NGS systems. The price of the instrument is worth considering, especially for those laboratories that are considering either replacing their capillary systems or buying new ones. The second reason is the operating costs. The sequencing cost of our project was 531 € (VAT excluded), and we have produced more than 12 million sequences. Our data contained 1152 data points (96 cultivars by 12 loci), which means 0.46 € per data point. However, the sequencing coverage was extremely high (10,000× on average). We believe that we were able to reduce the coverage by at least five times, which is 0.09 € per data point. The running costs for capillary instruments are higher than 1 € per sample (data point), and genotyping providers usually charge 2.5-3 € per sample. Therefore, the economic situation speaks in favour of HTS typing.

Conclusions
The remarkable advances in high-throughput sequencing technologies have significantly increased their application in genetic diversity studies, population structure analyses, and conservation genetics. The HTS approach has the advantage of the large-scale genotyping of individuals at multiple loci simultaneously using an amplicon barcoding system that allows large-scale analysis, generating a large amount of data in less time and at a surprisingly lower cost [59,60]. The HTS approach showed significant advantages over the fragment length variation-based approach using conventional capillary and gel electrophoresis [21,30,59,61]. Studies [21,59] reported that HTS technology increased the number of detected alleles compared to the electrophoresis-based method, overcoming the effect of microsatellite length homoplasy, resolving the hidden variations, and maximizing the genetic information obtained. While homoplasy was reported in certain previous studies, it was not detected in any of the loci we investigated. Homoplasy is more likely to be detected in less closely related genotypes.
According to our observations, the limitation of HTS-SSR genotyping is in the automation of allele retrieval, which is crucial for HTS approaches with high multiplexing and large amounts of data. Due to the high degree of mismatching observed for some microsatellite loci when using SONiCS bioinformatics tools for retrieving SSRs from HTS data, we recommend that other tools should be investigated and/or improvements made to the existing tool (e.g., the normalization of the read counts according to the amplicon length and sequencing depth of the libraries) to reduce the distortion obtained from the amplification and sequencing process.
The HTS-SSR approach has huge potential in terms of its speed and cost effectiveness. As our study is one of the first studies of this kind presented for plants, an additional optimization and validation process should be performed before the routine use of HTS genotyping instead of the CE approach, especially as we have shown that not all loci are equally suitable for the sequencing approach.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/8/917/s1: Figure S1: (a) Forward and reverse barcodes with universal tail (in bold); (b) Barcoding system in a 96-well plate; 12 forward barcodes and 8 reverse barcodes enabling the barcoding of 96 samples. Figure S2: Histograms generated from the number of read counts of full-length sequences (alleles) obtained with the Infoseq approach for twelve V. vinifera cultivars at twelve different loci. Table S1: Comparison of three different approaches to determine the genotypes of 12 different V. vinifera cultivars at 12 different loci, i.e., by capillary electrophoresis (CE), by the calling length of SSR (MISA), and by the calling allele lengths (Infoseq). The genotype data obtained by capillary electrophoresis are publicly available [27]; SONiCS was used to call genotypes from the data obtained by MISA, and visual determination of genotypes was done to call the alleles obtained by Infoseq.