SPInDel Analysis of the Non-Coding Regions of cpDNA as a More Useful Tool for the Identification of Rye (Poaceae: Secale) Species

Secale is a small but very diverse genus from the tribe Triticeae (family Poaceae), which includes annual, perennial, self-pollinating and open-pollinating, cultivated, weedy and wild species of various phenotypes. Despite its high economic importance, classification of this genus, comprising 3–8 species, is inconsistent. This has resulted in significantly reduced progress in the breeding of rye which could be enriched with functional traits derived from wild rye species. Our previous research has suggested the utility of non-coding sequences of chloroplast and mitochondrial DNA in studies on closely related species of the genus Secale. Here we applied the SPInDel (Species Identification by Insertions/Deletions) approach, which targets hypervariable genomic regions containing multiple insertions/deletions (indels) and exhibiting extensive length variability. We analysed a total of 140 and 210 non-coding sequences from cpDNA and mtDNA, respectively. The resulting data highlight regions which may represent useful molecular markers with respect to closely related species of the genus Secale, however, we found the chloroplast genome to be more informative. These molecular markers include non-coding regions of chloroplast DNA: atpB-rbcL and trnT-trnL and non-coding regions of mitochondrial DNA: nad1B-nad1C and rrn5/rrn18. Our results demonstrate the utility of the SPInDel concept for the characterisation of Secale species.


Introduction
Rye (Secale cereale L.) is commonly grown in the fields of Northern and Eastern Europe. S. cereale L. is highly tolerant to diverse environmental stresses, as drought and frost [1,2]. It contains genes conferring biotic stress resistance against various diseases (e.g., stem rust, leaf rust, and yellow rust [3] or powdery mildew [4]. In view of that the wild and weedy forms may crossbreed with cultivated rye [5], these taxa, along with the landraces, constitute gene pools for desirable genes. They can be regarded as genetic resource reservoirs for new niches and future breeding programs of wheat, triticale and other crops [6]. In spite of the high economic value of rye, its huge genome size (2n = 2 × = 14, 1Cx~7.9 Gbp) [7] (the largest among the diploid species of Poaceae family) and highly repetitive genome 92% [8] have prevented the utilization of its genome information for crop improvement. Hence understanding genetic structuring of the genus Secale and distribution of genetic diversity within the genus is extremely important.

Analysis of Variable-Length cpDNA Sequences
A total of 3031 bp from 140 sequences were analyzed. The species and characteristic of each target region is described in Table S1.
The potential use of SPInDel profiles for species identification purposes requires the existence of 'species-specific SPInDel profiles': those that are only found in one species within a taxonomic group and allow their unequivocal identification. The mean number of species-specific profiles (Nsp) was 2.5 (from 1 to 4), while the mean number of species with shared profiles (N(species)sh) was 3.25 (varied from 2 to 4) ( Table 1), i.e., almost all regions in this group had specific profile. If all profiles were specific, Nsp will be equal to the number of different profiles (N). If some profiles are shared, then N > Nsp. A profile can be shared between two or more species, therefore the number of species with shared profile can be higher than the number of species-shared profiles.
The frequency of species-specific profiles is 1.00, indicating that all species have a unique SPInDel profile. The mean frequency of species-specific profiles (f sp ) was 0.83, ranging from 0.72 to 0.91 (Table 1). This result suggests that all cpDNA regions had almost unique combination of fragments lengths.
If we consider species analysis, the mean number of species-specific profiles (Nsp) was 11.69 (from 4 to 16), and the mean number of species with shared profiles (N(species)sh) was 0 (Tables S2-S5), i.e., no species in this group had shared profile. All analyzed sequences were species specific, whether or not they represented a subspecies group, therefore the mean frequency of species-specific profiles (f sp ) was 1.  The four non-coding cpDNA regions from 35 accessions were concatenated. The maximum frequency of species-specific profiles of the concentration (f sp = 0.35) was reached with the use of hypervariable regions trnT(UGU)-trnL5'exon ( Figure 2A). The average number of pairwise differences for the concatenated regions was 7.73, a value close to the maximum that can be obtained with four hypervariable regions. A total of 455 pairwise comparisons yielded differences in the four hypervariable regions, while 384 cases (84%) had four differences. Only 116 cases were different by less than two hypervariable regions ( Figure 2B). Figure 2C shows the 'region by region' analysis for the concatenated for cpDNA. The regions trnL(UAA) intron vs. trnD[tRNA-Asp(GUC)]-trnT [tRNA-Thr(GGU)] had the highest average pairwise differences, with p = 0.73.

Analysis of Variable-Length mtDNA Sequences
A total of 4.171 bp from 210 sequences were analyzed. The species and characteristic of each target region is described in Table S1.
The mean number of species-specific profiles (Nsp) was 2.67 (from 1 to 7), while the mean number of species with shared profiles (N(species)sh) was 2.17 (varied from 1 to 4) ( Table 1). The mean frequency of species-specific profiles (f sp ) was 0.73, ranging from 0.45 to 0.89 (Table 1). i.e., almost all regions in this group had specific profile.
If we consider species analysis, the mean number of species-specific profiles (Nsp) was 17.54 (from 6 to 24), and the mean number of species with shared profiles (N(species)sh) was 0 (Tables S6-S11), i.e., no species in this group had shared profile. All analyzed sequences were species specific, whether or not they represented a subspecies group, therefore the mean frequency of species-specific profiles (f sp ) was 1.
The six non-coding mtDNA regions from 35 accessions were concatenated. The maximum frequency of species-specific profiles of the concentration for mtDNA (f sp = 0.23) was reached with the use of hypervariable regions rps12-1/nad3-2, rps12-2/nad3-1 and rrn5/rrn18-1 ( Figure 3A). The average number of pairwise differences for the concatenated regions was 6.61, a value close to the maximum that can be obtained with five hypervariable regions. A total of 595 pairwise comparisons yielded differences in the five hypervariable regions, while 455 cases (76%) had three differences. Only 122 cases were different by less than two hypervariable regions ( Figure 3B). Figure 3C shows the 'region by region' analysis for the concatenated for mtDNA. The regions nad4L-orf 25 vs. rps12-1/nad3-2 and nad4/1-2 vs. nad4L-orf25 had the highest average pairwise differences, with p equal, respectively, 0.72 and 0.71.

Discussion
The cpDNA and mtDNA regions are widely used as markers in phylogenetic and phylogeographic studies [46,47]. Additionally, for most species, including rye, the data is incomplete. So far, S. cereale chloroplast genome sequence has been published [48], while rye mtDNA has not been fully sequenced.
The cpDNA appears to have much more stable structure than the mitochondrial genome of plants in case of intramolecular rearrangement. The plastid genome substitution rate is 3-4 times higher than that of plant mtDNA [49]. In cpDNA noncoding regions most of the variability observed concerns insertion-deletion (indel) mutations. As stated by other authors, however, they should be treated with caution as they may indicate heteroplasmy [50]. Nevertheless, indels were analyzed in the following studies due to the fact that they were found to be common and often phylogenetically informative [51][52][53].
The cpDNA shows a much more stable structure in case of intramolecular rearrangement than the mitochondrial genome of plants. The rate of plastid genome substitution is 3-4 times higher than that of plant mtDNA [49]. Most of the variability observed in cpDNA noncoding regions concerns insertion-deletion (indel) mutations, but, as stated by other authors, they should be treated with caution as they may indicate heteroplasmy [50]. However, indels were analyzed in the following studies due to the fact that they were found to be common and often phylogenetically informative [51][52][53].
The multiple insertions and deletions (indels), which are present in the target genomic regions is generally regarded as a problem, as it introduces ambiguities in sequence alignments. However, in a few recent works a high level of species discrimination is attainable in all taxa of life by considering the length of hypervariable regions defined by indel variants has been shown [54]. Each species is tagged with a numeric profile of fragment lengths.
Indels are less prone to back and recurrent mutations, and thanks to that the probability of misclassifications is greatly reduced. Genomic regions with have multiple indels was used for species-identification procedures with high efficiency not only in plants chloroplast trnL(UAA) intron [55], but also in bacteria [56], fungi [57] and animals [58,59]. Unfortunately, there is no such information on the species of rye.
For indel detection several software tools are available [60,61]. In the present study a multifunctional computational workbench (named SPInDel for SPecies Identification by Insertions/Deletions) was used. The SPInDel approach relies on the analysis of multiple loci, which presents a clear advantage over methods targeting a single locus, therefore the occurrence of intraspecific or intraindividual variability in hypervariable regions does not pose serious problems [62][63][64]. A correct identification is still possible based on the information from the remaining loci even in cases where one (or more) SPInDel hypervariable region(s) have a different from the reference length.
In our previous studies the phylogenetic indels context were not analysed because algorithms describing substitution models are not able to model indel-type changes and they are removed by tree-creating programs. However, basic data on length and number of indels were obtained [38]. The number of identified indels in cpDNA was not very high compared to the results presented by other authors [65]. While the number of identified indels in mtDNA was comparable to the results of other authors in mtDNA of plants [66,67].
The occurrence of deletions and insertions as well as numerous nucleotide substitutions is a common phenomenon in the atpB-rbcL region. The majority of non-coding regions rich in these base pairs show a low number of functions [68]. SPinDel analysis of this regions showed the highest frequency of species-specific profiles (f sp = 0.91) ( Table 1), which means that almost all species have a unique profile.
The region trnT(UGU)-trnL(UAA)5 exon shows a high frequency of insertions or deletions, depending on the species, which makes it possible to use them as genetic markers [38].
The mean f sp of the species analysed for this region (f sp = 0.83) was very close to the mean f sp for the atpB-rbcL region ( Table 1).
One of analysed regions, the chloroplast trnL(UAA) intron, is known for its potential as species-specific marker due to low intra-and higher inter-specific genetic variation [32,55]. Our results show that trnL does not represent the most variable non-coding region of chloroplast DNA (Table 1).
We detected maximum diversity values in the intraspecific data sets for all target regions ( Figure 2C). This is the first analysis performed using the SPInDel program, indicating the usefulness of the analyzed regions in Secale species.
Sequences from these cpDNA regions are often used in combination with other sequences in order to obtain additional data and a better resolution for phylogenetic studies [37,[69][70][71][72]. Our results also showed that the regions trnL(UAA) intron vs. trnD[tRNA-Asp(GUC)]-trnT[tRNA-Thr(GGU)] had the highest average pairwise differences ( Figure 2B). Moreover, the trnD[tRNA-Asp(GUC)]-trnT[tRNA-Thr(GGU)] region in our previous studies showed the greatest variability of rye species compared to the entire pool of analyzed cpDNA regions, which predisposed it to study closely related species.
Indels polymorphisms have a sufficiently rapid evolutionary rate of accumulation despite the low intra-specific diversity in cpDNA genes, that allows for discrimination between closely related taxa [73]. The frequency of species-specific SPInDel cpDNA profiles reached the maximal possible value (f sp = 1) in the intraspecific level (Tables S2-S5) and 0.91 for the atpB-rbcL region ( Table 1). The real potential of the SPInDel concept was revealed by the concatenation of the cpDNA target regions ( Figure 2, Table 1). The results of this study confirmed this, because they focus on the high variability of the studied regions of the chloroplast genome in the majority of taxa ( Table 1). The trnT(UGU)-trnL(UAA)5 exon and trnD[tRNA-Asp(GUC)]-trnT[tRNA-Thr(GGU)] regions were used to the greatest extent for the analysis of closely related species species, which gave the best results in combination.These regions should be a useful tool as molecular markers for the study of closely related species, especially at the interspecies level of the genus Secale.
The mitochondrial genomes of plants are characterized by the presence of a relatively large number of group II introns compared to fungal and baterial mtDNA [74,75]. Group I introns, which are located in the coxI gene, are also contained in several plant genera, including Peperomia and Marchantia [76]. However, there is no correlation between phylogenesis and the presence of this intron, indicating that it was introduced by horizontal gene transfer. Fungal species were probably the donor.
In our previous study, a total of 45 indels in mtDNA have been identified [38]. It is comparable to the results of other authors in plant mtDNA. Rye mtDNA sequence data are not available. Only the winter wheat (Triticum aestivum cv. Chinese Yumai) mtDNA sequence was was published and was found to be very similar to the spring wheat sequence (T. aestivum cv. Chinese Spring) [77]. We determined 0 to 40 (in nad4/1-2) indels with sizes ranging from 1 bp to 5 bp in the rps12-nad3(2) and nad4/1-2 regions, respectively [38].
Mitochondrial nad1B-nad1C region, which is located in exon b and c [78], has highly conserved nature whithin this group of introns [38]. Nevertheless, the nad1 intron region may serve as a useful molecular marker in population studies. SPinDel analysis of this regions showed the highest frequency of species-specific profiles (f sp = 0.89) ( Table 1), meaning that almost all species have a unique profile.
The region located within subunit 4 of the nad4 gene is considered to be a slowly evolving mitochondrial marker. Its evolution occurs 23 times slower than that of the ITS rDNA sequence [79]. Thus, it could be a useful insightful tool in deliberating on phylogenetic relationships. The research has shown that among all tested mtDNA sequences nad4/1-2 region proved to be the most informative [38]. Unfortunately, our analysis showed that frequency of insertions or deletions is not very high (f sp = 0.72) ( Table 1).
Another analyzed intergenic region, nad4L-orf25 shows the lowest frequency of species-specific profiles (f sp = 0.45) among all analyzed regions, both mtDNA and cpDNA (Table 1). This confirms our earlier research [38] and the data on sugar beet [80].
Two different combinations of primers were used to amplify the intergenic sequences of the rps12-nad3 region: the first amplified only the intergenic region, the second-the intergenic region and the nad3 gene. These regions were described as variable regions, while the first region proved conserved in Secale species and subspecies [38]. Similarly, SPinDel analysis of this regions showed great differences between them: frequency of species-specific profiles was lower in the rps12-1/nad3-2 region than in the rps12-2/nad3-1 (f sp = 0.67and f sp = 81, respectively) ( Table 1).
The mean f sp of the species analysed for this the rrn5-rrn18 region (f sp = 0.86) was very close to the mean f sp for the nad1B-nad1C region (Table 1). It was confirmed that sequences of this regions proved to be the most informative among all tested mtDNA sequences.
The frequency of species-specific SPInDel cpDNA profiles reached the maximal possible value (f sp = 1) in the intraspecific level (Tables S6-S11) and 0.91 for the atpB-rbcL region ( Table 1).
The concatenated analysis showed slightly lower frequency of species-specific profiles of the concentration for mtDNA (f sp = 0.23), than for cpDNA (f sp = 0.35) (Figure 2A). The maximum frequency was reached with the use of 3 regions rps12-1/nad3-2, rps12-2/nad3-1 and rrn5/rrn18-1 ( Figure 3A). The average number of pairwise differences for the concatenated regions was also lower compared to cpDNA (6.61 and 7.73, respectively). The concatenated mtDNA analysis of other values were also lower than in cpDNA ( Figure 2B,C), despite the greater number of analyzed regions.

Materials and Methods
The plant material consisted of 35 accessions of the genus Secale, 13 cultivated and non-cultivated species and subspecies of rye. They were obtained from several world collections (Center for Biological Diversity Conservation in Powsin-Warsaw, Poland; United States Department of Agriculture-Agricultural Research Service in Beltsville, Maryland, USA; Nordic Genetic Resource Center in Alnarp, Sweden). The list of species, as well as the accession numbers for each sample, is given in Table S1.

DNA Extraction, PCR Amplification, and DNA Sequencing
For the amplification of the cpDNA 6 non-coding (intron) regions and mtDNA 6 non-coding (intron) regions, genomic DNA was isolated and amplified as described in previous study [38].
The sequences analysed in this paper have been deposited in the NCBI Genbank nucleotide sequence database with the accession numbers MH893827-MH894176 [38] (Table S1).

SPInDel Analyses
The analysis for cpDNA and mtDNA regions were performed independently. The sequence alignments of each family for the four different cpDNA regions (atpB-rbcL intergenic spacer, trnT(UGU)-trnL5 exon intergenic spacer, trnL(UAA) intron intergenic spacer and trnD[tRNA-Asp(GUC)]-trnT[tRNA-Thr(GGU)]) as well as six different mtDNA regions (nad1 exon B-nad1 exon C intron intergenic spacer, nad4/1-2 intergenic spacer, nad4L-orf25 intergenic spacer, rps12-1/nad3-2 intergenic spacer, rps12-2/nad3-1 intergenic spacer and rrn5/rrn18-1 intergenic spacer) were submitted to the SPInDel workbench in order to perform diverse calculations [81]. 35 species were selected for the assessment of intra-species diversity (Table S1). The alignments were analysed in the SPInDel workbench using the same conserved regions defined previously for the family of each species. The SPInDel concept is based on the combination of sequence lengths from different genomic regions. Therefore, in order to perform the diverse statistical analyses available on the SPInDel workbench, the alignments were concatenated of different cpDNA and mtDNA regions. The concatenated alignments were exported to the SPInDel workbench and analysed as previously described using the conserved regions defined for the individual regions. In these analyses, we have excluded the hypervariable regions defined by the peripheral conserved region of adjacent targets since they are not close to each other in the cpDNA and mtDNA. Therefore, the obtained profiles are only composed of the hypervariable regions inside each target region.

Conclusions
The results obtained in this study clearly indicated disproportions in the available information regarding various non-coding cpDNA and mtDNA regions used in phylogenetic studies, and some of them-due to high variability-can be successfully used in the analyses of closely related species.
The results indicate regions that may be useful molecular markers in studies on closely related species of the genus Secale. These include the non-coding regions of chloroplast DNA: atpB-rbcL and trnT (UGU)-trnL(UAA)5 exon and non-coding regions of mitochondrial DNA: nad1exonB-nad1exonC and rrn5/rrn18-1.
In general, our method was able to unambiguously discriminate closely related species in well-supported monophyletic clades.
In summary, the SPInDel approach can be used for the identification of Secale species, however the chloroplast genome is more informative. These results suggest that this method can be used for taxonomic classification of Secale species as long as appropriate conserved regions are selected.

Conflicts of Interest:
The authors declare no conflict of interest.