Development of SSR Databases Available for Both NGS and Capillary Electrophoresis in Apple, Pear and Tea

Developing new varieties in fruit and tea breeding programs is very costly and labor-intensive. Thus, establishing a variety discrimination system is important for protecting breeders’ rights and producers’ profits. Simple sequence repeat (SSR) databases that can be utilized for both next-generation sequencing (SSR-GBS) and polymerase chain reaction–capillary electrophoresis (PCR-CE) would be very useful in variety discrimination. In the present study, SSRs with tri-, tetra- and pentanucleotide repeats were examined in apple, pear and tea. Out of 37 SSRs that showed clear results in PCR-CE, 27 were suitable for SSR-GBS. Among the remaining markers, there was allele dropout for some markers that caused differences between the results of PCR-CE and SSR-GBS. For the selected 27 markers, the alleles detected by SSR-GBS were comparable to those detected by PCR-CE. Furthermore, we developed a computational pipeline for automated genotyping using SSR-GBS by setting a value “α” for each marker, a criterion whether a genotype is homozygous or heterozygous based on allele frequency. The set of 27 markers contains 10, 8 and 9 SSRs for apple, pear and tea, respectively, that are useful for both PCR-CE and SSR-GBS and suitable for automation. The databases help researchers discriminate varieties in various ways depending on sample size, markers and methods.


Introduction
Fruit and tea breeding requires tremendous cost and time to release varieties because of their long juvenile stage and large plant size [1]. Once elite genotypes are selected and popularized by producers and consumers, the genotypes are able to be duplicated easily by clonal propagation. For this reason, it is difficult to protect breeders' rights because breeders cannot control illegal spreads of scions. The period of protection for perennial plants is long, 25 years, as determined by the International Union for the Protection of New Varieties of Plants. Unlike vegetable and flower breeding programs, which are mainly supported by private companies, fruit and tea breeding programs have been mainly supported by governments, universities and individuals interested in plant breeding [2][3][4]. In Japan, because some elite new varieties registered by public research organizations were transported to other countries without authorization, Japanese organizations and breeders have started to register varieties in other countries and have engaged in developing variety identification methods that must be robust and reliable [5]. To further develop varieties for fruit and tea, it is important to protect breeders' rights and continue organized breeding programs.

Genotyping by PCR-CE
Prior to determining the SSR genotypes in each variety, SSR positions were confirmed using BLAST+ [29] against the 'Golden Delicious' double-haploid (GDDH13) apple genome [30], the Japanese pear 'Nijisseiki' genome [31] and a draft genome of C. sinensis var. sinensis 'Shuchazao' (CSS-SCZ) [32] (Table S1). All of the markers except Hi15h12 in apple and TM350 in tea were matched to specific mapped regions of reference genome sequences within repeat-containing regions. Hi15h12 was mapped to fictive chromosome (0) of GDDH13, and TM350 was mapped to contig805 of CSS-SCZ. The remaining markers were mapped to different chromosome regions within each genome, which is beneficial for variety discrimination.
The markers identified for each species were successfully genotyped by PCR-CE in all of the corresponding varieties (43 apple, 29 pear and 44 tea) without any missing  (Table 1 and S2-S4). The SSR fragments were clear, and signals for each allele were constant for each marker. The average observed heterozygosity, expected heterozygosity and polymorphism information content were, respectively, 0.62, 0.59 and 0.52 in apple, 0.46, 0.41 and 0.35 in pear and 0.56, 0.52 and 0.46 in tea ( Table 2). In each of the three variety collections, all of the varieties used in this study were successfully differentiated based on genotypes by using those marker sets. The unweighted pair group method with arithmetic mean (UPGMA) dendrograms were created in each variety collection, showing some cultivars that contain the genetic structure of different species are genetically distant from the main varieties used in this study ( Figure S1). For example, JM varieties developed from crossing between M. prunifolia Borkh and M. pumila Miller in apple formed a subgroup which is distant from M. domestica varieties. 'Cha Chukanbohon No. 3', which is C. sinensis var. assamica, 'Cha Chukanbohon No. 6', which is a chance seedling of C. taliensis and their relatives formed a subcluster that is distant from the main tea varieties used in this study (C. sinensis var. sinensis).

Genotyping by SSR-GBS
Illumina short-read sequencing data from 9.5 M reads, 7.7 M reads and 7.7 M reads were obtained from the 2 replicates of 43 varieties of apple (average of 110.2 K reads), 29 varieties of pear (average of 132.9 K reads) and 44 varieties of tea (average of 133.0 K reads), respectively. After combining the reads from forward and reverse sequences, 4.2 M, 3.8 M and 3.7 M reads were obtained, respectively. The average number of reads per variety was 49.0 K reads for apple, 66.5 K for pear and 51.4 K for tea. The reads were then demultiplexed based on the primer sequences. The average number of reads per marker ranged from 14 (TM336) to 25,406 (TsuGNH194), with an average of 3866 (Table 3). All 12 SSR markers in apple had an average of more than 50 reads, as did 9 of 10 SSRs in pear and 14 of 15 SSRs in tea. Because the markers that had a small number of reads had lower reliability, they were excluded from further analyses. The numbers of reads in the replicated analysis were similar to those in the original analysis. For example, in tea, the number of reads was highest in MSE0354 (10,510 and 8783) and lowest in TM336 (14 and 12) for the 2 replicated analyses. There were several markers that had an insufficient number of reads in a specific variety even though the overall average number of reads was high. The numbers of reads obtained for TsuGNH164 in 'Oushuu' and CsFM1595 in 'Cha Chukanbohon No. 6' were quite low, making it impossible to determine the genotypes of these markers in these varieties. Table 3. Summary of each marker tested for SSR-GBS. Average depth indicates the number of the reads obtained after demultiplexing based on forward and reverse primer sequences in each marker. Replicated analyses with a slight change in the multiplex PCR set were conducted for validation. Using the pipeline described in Materials and Methods, we calculated the allele frequencies for each marker in each variety. For each marker, we retrieved allele frequencies for only the four most common alleles, which provided enough information to call the genotypes and understand the details of the PCR products ( Figure 1). Then, each genotype was estimated manually based on allele frequency. In most cases, the allele frequency of the first allele was >0.8 when the genotype was homozygous and around 0.4 when the genotype was heterozygous. The distribution of allele frequencies fluctuated depending on the marker owing to background arising during amplification and sequencing. For CH01b09b, which has dinucleotide repeats, genotyping errors were found in 'Jonathan' and 'Sensyu' because of the presence of stutter bands. We used tri-, tetra-and pentanucleotide repeats for all markers except for CH01b09b and Mdo.chr1.18, so there were few stutter bands for the other markers. The detected sizes of some alleles in SSR-GBS were obviously different from those detected in PCR-CE ( Figure 2). In particular, NZmsEB119405 in apple, and MSE0354, TM043, TM350 and TM464 in tea, showed allele dropout (null alleles) in SSR-GBS, meaning that some alleles detected in PCR-CE were undetected in SSR-GBS ( Figure 2).
Except for the cases of allele dropout, the alleles detected by SSR-GBS corresponded to those detected by PCR-CE. There were some small differences between digital sizes in SSR-GBS and the artificially determined sizes in PCR-CE, which ranged from −4 bp to 3 bp, but this difference could be corrected by adding or subtracting the indicated value for each marker (Table 3). Following this analysis, 10 SSRs for apple, 8 for pear and 9 for tea were selected as suitable markers to use in SSR-GBS for variety discrimination (Tables S5-S7). All of the varieties used in this study could be successfully differentiated based on genotypes determined by those marker sets.

MSE0354
Tea 10,510 8783 1 Allele dropout was observed in some varieties

Automated Genotyping in SSR-GBS and Repeatability
To develop an accurate automated genotyping system, we set the value of a new variable, α, for each marker ( Table 4). The value of α is used as a criterion whether a genotype is homozygous or heterozygous. If the first (most common) allele frequency is more than α, the genotype is assumed to be homozygous. If the first allele frequency is less than or equal to α, the genotype is called heterozygous for the first and second alleles. The range of α was largest (estimated as 0.44) for TsuGNH207 and TsuGNH250, and it was smallest for Mdo.chr1.18 (estimated as 0.01). The α for Hi15h12 showed low values (0.34-0.52). This marker produced some slippage reads, resulting in alleles from PCR-CE that were 1 bp larger than the target alleles in SSR-GBS. The α values for TsuGNH179 and TM533 were high (0.89-0.97 and 0.87-0.97, respectively). For these markers, the frequency of the second allele was sometimes less than that of the slippage and stutter band from the first allele, probably due to sequence variation in the primer binding site. Because of this fluctuation, we set the optimum α value for each marker based on the average of the minimum and maximum values, which ranged from 0.43 to 0.98 and averaged 0.75. Table 4. Optimum range of variable α and correlation of first-allele frequency between the two replicates for each marker. The range of α for each marker was determined based on the criterion that the genotypes identified by SSR-GBS and PCR-CE were exactly the same for each of the varieties. To estimate the repeatability of the analyses, we conducted replicated analyses with a slight change in the multiplex PCR set. In the first replicate, we used all of the primers for each species in a single multiplex PCR, whereas in the second replicate, the primers for each species were split into two sets. Otherwise, the steps were the same. The correlations of the first allele frequency among 2 replicated analyses were quite high (0.98-1.00) regardless of the marker, suggesting that this method has high repeatability when the same procedure and conditions are used.

Discussion
We developed databases suitable for both PCR-CE and SSR-GBS containing 43 varieties with 10 SSRs for apple, 29 varieties with 9 SSRs for pear, and 44 varieties with 8 SSRs for tea. We were able to distinguish all the varieties that covered most of the varieties planted in Japan; even the number of the markers were around ten. The databases were available to protect breeders' rights of those varieties. It can also help researchers discriminate varieties in various ways depending on sample size and number of markers. The cost of SSR-GBS is lower than that of PCR-CE when the number of varieties tested is large [33]. The allele scores for PCR-CE and SSR-GBS differed somewhat (Table 3, Figure 2) because the allele sizes fluctuated depending on the type of fluorescent dyes and amplified nucleotide composition. By using the varieties in this study as controls, researchers can adjust the scores of PCR-CE to those of SSR-GBS and vice versa.
SSR-GBS has mainly been used in studies of population genetics, and few studies related to variety discrimination have been conducted. Here, we were able to select SSRs that were suitable for variety discrimination using SSR-GBS. Out of 37 SSRs tested in this study, 33 were successfully amplified (all except for TsuGNH164, TsuGNH184, CsFM1595 and TM336), but only 27 were suitable for SSR-GBS. The rate of successful PCR amplification was similar to those in other studies using SSR-GBS [25,34,35]. Out of the 27 SSRs suitable for automated genotyping, 19 had a range of α values of more than 0.2, suggesting that those markers had high reliability for SSR-GBS. Although the digital allele size determined by SSR-GBS is unambiguous, marker allele frequencies may fluctuate because of differences in enzymes, thermal cyclers and other laboratory equipment. The markers that have a greater range of α have equal amounts of amplification of the first and second alleles when heterozygous and have small numbers of nonspecific PCR products. Thus, those markers could be used as anchoring markers to connect databases. One of the merits of SSR-GBS is that it can detect length homoplasy (the occurrence of nonhomologous fragments of the same size) because it reveals SNPs [25]. However, we did not use SNPs to create the databases because the objective of this study was to develop databases available for both NGS and capillary electrophoresis that do not focus on SNPs but rather on the sizes of the PCR products.
Applying multiplex PCR and/or mixing several PCR products is necessary for NGS because the NGS platform needs sequence diversity in order to produce high-quality reads. On the other hand, applying different enzymes suitable for multiplex PCR at the same annealing temperature produced the differences between the results of SSR-GBS and PCR-CE. In this study, some allele dropouts were observed in SSR-GBS, as has been observed in other studies [33,36]. Each primer used in this study has an optimal annealing temperature, but it was not applied in these experiments, because the cost of genotyping and labor for the experimental procedure would increase if we subdivided the marker set for multiplex PCR. It is possible that some primers did not amplify alleles that had somewhat different sequences in the primer binding site. In fact, allele dropout was observed in varieties that were genetically distant from the main varieties in the database, i.e., JM varieties, which were developed from crossing between M. prunifolia Borkh and M. pumila Miller in apple; 'Oushuu', one of whose ancestors is Chinese pear; and 'Cha Chukanbohon No. 3', which is C. sinensis var. assamica introduced from India ( Figure S1).
Although the accuracy of SSR-GBS, which suffers from allele dropout, is lower than PCR-CE, there are merits of SSR-GBS such as ease of automation and digital allele length, which does not vary among laboratories. We arranged the pipeline for SSR-GBS using the pipeline of Tibihika et al. [25] as a reference. Our pipeline is useful for complete automation of genotyping after the user sets the value of α for each marker. By using fastq reads, primer set sequences and α, the genotype score is easily obtained. The pipeline is simple and is composed of four files with the following functions: (1) combining the pairs of fastq files, (2) demultiplexing based on markers, (3) counting alleles and (4) determining genotypes. This pipeline works in a basic Linux system and is useful not only for variety discrimination but also for marker-assisted selection (MAS). Because the important molecular markers for fruit breeding programs were developed using SSRs and indels, the SSR-GBS pipeline can be applied to practical MAS with a slight change. Specifically, by examining four major alleles, the extent of slippage and stutter bands can be estimated because those increase the frequencies of third and fourth alleles. Even though the third and fourth alleles are not necessary for the determination of genotypes in diploid species, output of that information helps to reduce genotyping errors due to artifacts and helps to confirm whether the result of automatic genotyping is accurate.
In conclusion, the databases of apple, pear and tea developed here can provide basic information for PCR-CE and SSR-GBS. These databases and the methods developed in this study are useful for variety discrimination and protecting breeders' rights. We also identified markers that have high reliability and are suitable for the automation of SSR-GBS. These highly reliable markers can help to connect databases and facilitate automated genotyping using a simple pipeline.

Plant Materials
We used a total of 116 varieties consisting of 43 varieties of apple, 29 varieties of pear and 44 varieties of tea ( Table 2). The varieties of apple and pear were preserved at the Institute of Fruit Tree and Tea Science, NARO, in Morioka and Tsukuba, respectively. The tea varieties were preserved at the Institute of Fruit Tree and Tea Science, NARO, Makurazaki, Green Tea Laboratory; Saitama Prefectural Agriculture and Forestry Research Center; and Tea Branch Facility, Miyazaki Agricultural Research Institute. Genomic DNA was extracted from young leaves with a DNeasy Plant Mini Kit (Qiagen, Germany). To understand an overview of genetic relationships among varieties, UPGMA dendrograms were created using R package "pvclust" [37]. The confidence levels for branches of the dendrograms were determined by calculating approximately unbiased (AU) p-values using multiscale bootstrap resampling based on 10,000 replications.

SSR Genotyping for PCR-CE
The SSR markers used in this study are listed in Table S1 and Table 2. For apple, 12 SSRs that were mainly tri-and tetranucleotide SSRs and that showed clear fragments in PCR-CE in prior analyses were used for genotyping [22,[38][39][40][41]. The 10 SSRs for pear were similar to those developed in the study by Yamamoto et al. [42], which were composed of tetra-and pentanucleotide repeats. For tea, 15 SSRs previously developed and composed of tetra-and pentanucleotide repeats were used in this experiment [43,44]. BLAST+ was used to estimate the positions of the SSRs in each reference genome [29]. The recent reference genomes were mainly composed of pseudomolecules that covered most of the genes. Chromosome positions of markers were mapped easily to design and to select optimal marker sets in genetic studies. By using reference genomes, researchers can also conduct high through-put genotyping [17,18]

SSR Genotyping Using SSR-GBS
SSR-GBS was performed based on the methods described in previous reports with slight changes in sequence data analysis [25,34]. When constructing each library for SSR-GBS, two-step PCR was applied to add the Illumina adapter sequences that are required for Illumina flow-cell binding (Figure 3). The first PCR was performed using specific primers extended with the part of the Illumina adapters P5 (ACACTCTTTCCCTACAC-GACGCTCTTCCGATCT) and P7 (GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC) as the forward and reverse primers, respectively. In the first replicate, the previously reported primers, consisting of 12-, 10-and 15-primer sets for apple, pear and tea, respectively, were used in a single multiplex reaction. For the second replicate of this experiment, we divided each primer set into 2 subsets, i.e., 6 primers × 2 for apple, 5 primers × 2 for pear and 7 and 8 primers for tea. PCR amplification was performed in a 10-µL reaction containing 5 µL QIAGEN Multiplex PCR Kit (QIAGEN), 0.2 µL primer solution (10 µM each of forward and reverse), 3.8 µL H 2 O, and 1 µL genomic DNA. The PCR was performed using the following steps: an initial denaturation at 95 • C for 15 min followed by 25 cycles of denaturation at 95 • C for 30 s, primer annealing at 55 • C for 1 min, and extension at 72 • C for 30 s and, finally, followed by a final extension at 72 • C of 10 min. The second PCR was performed using long primers P5 (AATGATACGGCGACCACCGAGATCTA-CAC[Index]ACACTCTTTCCCTACACGACG) and P7 (CAAGCAGAAGACGGCATACGA-GAT[Index]GTGACTGGAGTTCAGACGTGT) in which Illumina flow-cell binding sites and two different indexes of 8 bp were contained. The primers used for this experiment are listed in Table S1. PCR amplification was performed in a 10-µL total volume consisting of 5 µL 2× Green GoTaq G2 Hot Start Master Mix (0.4 mM each dNTP, Taq DNA polymerase and 4 mM MgCl 2 , pH 8.5; Promega), 1 µL P5 and P7 primers (1 µM) 3 µL H 2 O and 1 µL of the first PCR products. PCR reactions included the following steps: an initial denaturation of 95 • C for 5 min followed by 15 cycles of 94 • C for 30 s, 60 • C for 1 min and 72 • C for 30 s. This was followed by a final extension of 10 min. All of the second PCR products were mixed equally in a single tube and purified using AMPure XP beads (Beckman Coulter, Inc., Bree, CA, USA) following the Agencourt AMPure XP PCR Purification protocol. The library was sequenced using PE 300-bp sequencing on an Illumina MiSeq platform (Illumina, Inc., San Diego, CA, USA).
The reads from Illumina MiSeq were demultiplexed to each variety based on the index sequences using the Illumina system, and the paired fastq files of each variety were obtained. We used flash2 [45] to merge the paired fastq files with parameter "-M 300 -× 0.5 -allow-outies" (described in the script '"1.combine.sh"). The merged reads were then demultiplexed based on primer sequences by using the script "2.demultiplex.sh". The number of reads of each allele was counted for each variety by using the script "3.CountSSR.sh". (These scripts are provided in Files S1, S2 and S3, respectively.) We took only the four most common alleles, and the allele frequencies of the four were calculated. Only the first and second alleles were used to determine the genotypes of varieties, but the third and fourth alleles helped in detecting the presence of a stutter band and the accuracy of the called genotype.
To construct a database useful for both SSR-GBS and PCR-CE, we thoroughly compared the genotypes obtained by SSR-GBS with those obtained by PCR-CE and vice versa. First, the markers that showed fewer than 50 average reads in SSR-GBS were discarded. The correspondence of each allele of each marker obtained by SSR-GBS to the alleles obtained by PCR-CE was confirmed manually to identify cases of allele dropping in SSR-GBS. If allele dropping was found with a marker, that marker was not included in the database for SSR-GBS. Figure 3. Summary of SSR-GBS procedure used in this study. Green indicates the SSR being amplified, with SSR primer annealing sequences (red) on each end. A different set of indexes (orange) is added to each sample to allow the sequences from that sample to be distinguished after multiplexing. The sequences of P5 flow cell binding, P5 sequence primer, P7 flow cell binding and P7 sequence primer were AATGATACGGCGACCACCGAGATCTACAC, ACACTCTTTCCCTACACGACG, CAAGCAGAAGACGGCATACGAGAT and GTGACTGGAGTTCAGACGTGT, respectively.
The reads from Illumina MiSeq were demultiplexed to each variety based on the index sequences using the Illumina system, and the paired fastq files of each variety were obtained. We used flash2 [45] to merge the paired fastq files with parameter "-M 300 -x 0.5 --allowouties" (described in the script '"1.combine.sh"). The merged reads were then demultiplexed based on primer sequences by using the script "2.demultiplex.sh". The number of reads of each allele was counted for each variety by using the script "3.CountSSR.sh". (These scripts . Summary of SSR-GBS procedure used in this study. Green indicates the SSR being amplified, with SSR primer annealing sequences (red) on each end. A different set of indexes (orange) is added to each sample to allow the sequences from that sample to be distinguished after multiplexing. The sequences of P5 flow cell binding, P5 sequence primer, P7 flow cell binding and P7 sequence primer were AATGATACGGCGACCACCGAGATCTACAC, ACACTCTTTCCCTACACGACG, CAAGCAGAAGACGGCATACGAGAT and GTGACTGGAGTTCAGACGTGT, respectively.

Automated Genotyping in SSR-GBS and Repeatability
In previous studies of SSR-GBS, allele calling was performed based on whether the frequency of the first allele exceeded 0.9 or on manual evaluation [27]. However, each marker has different features such as stutter bands and base variations in the primer binding site, making it difficult to set a single value for all marker sets. Here we set the variable α, which was determined separately for each marker. If the frequency of the first (most common) allele exceeded α, we called the genotype a homozygote for the first allele; otherwise, we called it a heterozygote for the first and second alleles (described in the script '4.genotyping.sh' provided in File S4). The range of α for each marker was determined based on the criterion that the genotypes identified by SSR-GBS and PCR-CE were exactly the same for each of the varieties. In other words, the α values were set so that the genotyping results obtained by SSR-GBS in an automated system would match those previously obtained using PCR-CE, in which genotyping was done by eye. As an example of file structure, the files used to designate the apple cultivars and primers in this pipeline are provided in Files S5 and S6, respectively.