agronomy Population Structure and Genetic Diversity Analysis in Sugarcane ( Saccharum spp. hybrids) and Six Related Saccharum Species

: Sugarcane ( Saccharum spp. hybrids) is one of the most important commercial crops for sugar, ethanol, and other byproducts production; therefore, it is of great signiﬁcance to carry out genetic research. Assessing the genetic population structure and diversity plays a vital role in managing genetic resources and gene mapping. In this study, we assessed the genetic diversity and population structure among 196 Saccharum accessions, including 34 S. ofﬁcinarum , 69 S. spontaneum , 17 S. robustum , 25 S. barberi , 13 S. sinense , 2 S. edule , and 36 Saccharum spp. hybrids. A total of 624 polymorphic SSR alleles were ampliﬁed by PCR with 22 pairs of ﬂuorescence-labeled highly polymorphic SSR primers and identiﬁed on a capillary electrophoresis (CE) detection system including 109 new alleles. Three approaches (model-based clustering, principal component analysis, and phylogenetic analysis) were conducted for population structure and genetic diversity analyses. The results showed that the 196 accessions could be grouped into either three (Q) or eight (q) sub-populations. Phylogenetic analysis indicated that most accessions from each species merged. The species S. barberi and S. sinense formed one group. The species S. robustum , S. barberi , S. spontaneum, S. edule , and sugarcane hybrids merged into the second group. The S. ofﬁcinarum accessions formed the third group located between the other two groups. Two-way chi-square tests derived a total of 24 species-speciﬁc or species-associated SSR alleles, including four alleles each for S. ofﬁcinarum , S. spontaneum , S. barberi , and S. sinense , ﬁve alleles for S. robustum . and three alleles for Saccharum spp. hybrids. These species-speciﬁc or species-associated SSR alleles will have a wide application value in sugarcane breeding and species identiﬁcation. The overall results provide useful information for future genetic study of the Saccharum genus and efﬁcient utilization of sugarcane germplasm resources in sugarcane breeding.


Introduction
Sugarcane is one of the largest commercial sugar crops [1]. It is not only a raw sugar production resource (80% sucrose), but also an important green-fuel resource (40% ethanol) in world agriculture, providing a vital role of economic growth and food security for the tropical and subtropical regions of the world [2]. Taxonomically, sugarcane is placed under the grass family Poaceae, subfamily Panicoideae, tribe Andropogoneae, subtribe Sacharinae, and genus Saccharum and shares genetic intimacy with Sorghum and other grasses. Sugarcane has gone through an extensive and complex process of domestication and hybridization [3]. The Saccharum genus contains six main species: the two wild species are S. spontaneum (2n = 40-128, x = 8) and S. robustum (2n = 60-80), and the four cultivated species are S.  Table S1). The leaf samples were placed on ice during collection and were kept in a −20 • C freezer until DNA extraction [31].

DNA Extraction
Genomic DNA was extracted from leaf tissues using a modified cetyltrimethylammonium bromide (CTAB) method as previously described by Pan (2006) [31]. The quality and concentration of DNA were measured using UV absorbance assay with either NanoDrop 1000 (Thermo Fisher Scientific, Waltham, MA, USA) or a Synergy™ H1 Multi-Mode Reader (BioTek, Winooski, VT, USA). The quality of DNA was further checked by 0.8% agarose gel electrophoresis [32].

SSR Markers and PCR Reactions
Twenty-two pairs of polymorphic SSR primers [31,33] were used in this study. All forward primers were labeled with fluorescence dyes, 6-carboxy-fluorescein (6-FAM), NED, or VIC. PCR reactions were conducted according to Chen [33]. The DNA fragments were identified by the capillary electrophoresis (CE) ABI 3730XL DNA Analyzer (Applied Biosystems Inc., Foster City, CA, USA) to generate GeneScan files following the manufacturer's instructions [34,35].

Marker Scoring
The GeneScan files were processed with the GeneMarker™ software (v. 2.80) (Soft-Genetics LLC, State College, PA, USA, www.softgenetics.com, accessed on 27 December 2021) to reveal capillary electropherograms of SSR-DNA fragments with sizes calibrated against the GS500 DNA size standards (Applied Biosystems, Inc., Foster City, CA, USA). SSR alleles were manually assigned to unique, true "Plus-adenine" DNA fingerprints that gave quantifiable fluorescence values. Irregular peaks and stutters peaks were not scored according to Pan (2006) [31]. Data were scored manually in a binary format into a data matrix file, with the presence of a band scored as "1" and its absence scored as "0" and a binary format data were recorded manually as the presence of a band scored as "1" or "A" and its absence scored as "0" or "C" [36]. [37] was used to infer the population structure. To identify the number of populations (K), the capturing of the major structure in the data, we set up at a burn-in period of 10,000 Markov Chain Monte Carlo iterations and 100,000 run length, with an admixture model following Hardy-Weinberg equilibrium and correlated allele frequencies as well as independent loci for each run. Ten independent runs were performed for each simulated value of K, ranging from 1 to 11. For each simulated K, the statistical value ∆K was calculated using the formula described by Evanno et al. (2005) [38]. The optimal K was determined using Structure Harvester [39].

Structure and Genetic Diversity Analysis
Phylogenetic relationships and principal component analysis (PCA) were generated by TASSEL 5.2.13 to analyze genetic relationships among accessions and to determine the optimal number of clusters in the study. The number of principal components (PC) was chosen according to the optimum subpopulation determined in STRUCTURE 2, and a PCA plot was drawn using R package ggplot2 by the data from TASSEL 5 [40]. Genetic diversity also was assessed, and phylogenetic trees were drawn using MEGA 7 [41] based on the Maximum Likelihood tree method with the parameters as Shi et al. (2017) [42]. During the drawing of the phylogeny trees, the population structure and the cluster information were imported for the combined analysis of genetic diversity. For the sub-tree of each Q/q population, the shape of 'Node/Subtree Marker' and the 'Branch Line' were drawn with the same color as in the figure of the bar plot of the population clusters from the STRUCTURE analysis. The phylogenetic tree is based on the genetic distance among the six species, S. officinarum, S. spontaneum, S. robustum, S. barberi, S. sinense, and S. edule, the cultivated sugarcane (Saccharum spp. hybrids) were calculated by using the neighbor-joining method and were visualized by software MEGA 7. Compared with other methods, the neighborjoining method has a higher complexity and reliability in calculating genetic distance values. In addition, the number of PC was chosen according to the optimum subpopulation determined in STRUCTURE combined structured populations (Q-cluster), and a PCA plot was drawn using R package ggplot2 [43].

Species-Specific and Species-Associated SSR Allele
Four methods, single-marker regression (SMR), t-test, and the chi-square test or the percentage good-to-fit testing were used to identify species-specific and species-associated SSR alleles for S. officinarum (so), S. spontaneum (ssp), S. robustum (sr), S. barberi (sb), S. sinense (ssi), and Saccharum spp. hybrids (Hyb). S. edule (se) was not included because it had only two accessions included in this study. When we identified a species-specific or species-associated SSR allele, we gave a score of '9 for the accessions of the species and a score of '1 for all accessions in other species. In that way, we had six data sets for sugarcane hybrids and five Saccharum species. The marker data were recorded "A" as present and "G" as absent corresponding to "1" and "0". SMR was conducted using TASSEL 5. The t-test was performed for every single marker using visual basic codes in Microsoft Excel 2016.
Meanwhile, when the chi-square test cannot detect any species-specific allele at a significant level, we identified some associated alleles for the target species. If an allele had a >70% value in both %n9(A) and %n1(G) values, it was recognized as a species-associated allele. At the same time, the SMR and t-test can identify some species-associated alleles.

Phylogenetic Analysis
Phylogenetic cluster analysis from MEGA 7 divided the 196 accessions into four groups, namely, Cluster I, II, III, and IV ( Figure 2). Although there was no clear demarcation in the clustering pattern, this result was consistent with the model-based population structure (Q populations). We lined out all clusters by several curves which were marked by colors using as Q (1.2.3) at K = 3 in Figure 1. Cluster I (red curve) was further subdivided into three sub-clusters. Sub-cluster i included all S. spontaneum accessions and sub-clusters ii and iii contained 19 (76.0%) of S. barberi accessions. Except for sub-cluster-ii (green dotted curve), which was marked as Q2 (green), everything else in Cluster I matched with Q1 (red). Cluster II included all S. officinarum accessions, which were mentioned as a high Q1-likelihood part of the Q3 (blue) population. Almost all S. sinense accessions belonged to Cluster III (green curve), which formed the Q2 (green) population with those accessions from sub-cluster ii (green dotted curve). Cluster IV (blue curve) was comprised of S. robustum and S. officinarum accessions, which were further divided into two sub-clusters (iv and v). These two sub-clusters were 97.4% matched with the low Q1-likelihood part of the Q3 (blue) population.
population structure (Q populations). We lined out all clusters by several curves which were marked by colors using as Q (1.2.3) at K = 3 in Figure 1. Cluster I (red curve) was further subdivided into three sub-clusters. Sub-cluster i included all S. spontaneum accessions and sub-clusters ii and iii contained 19 (76.0%) of S. barberi accessions. Except for sub-cluster-ii (green dotted curve), which was marked as Q2 (green), everything else in Cluster I matched with Q1 (red). Cluster II included all S. officinarum accessions, which were mentioned as a high Q1-likelihood part of the Q3 (blue) population. Almost all S. sinense accessions belonged to Cluster III (green curve), which formed the Q2 (green) population with those accessions from sub-cluster ii (green dotted curve). Cluster IV (blue curve) was comprised of S. robustum and S. officinarum accessions, which were further divided into two sub-clusters (iv and v). These two sub-clusters were 97.4% matched with the low Q1-likelihood part of the Q3 (blue) population.  The result from phylogenetic analysis (Figure 3) was highly consistent with the phylogenetic analysis of accessions in Figure 1. The S. barberi and S. sinense accessions were placed in the same branch apart from other Saccharum species. The two wild Saccharum species, S. spontaneum and S. robustum were placed in the same sub-branch and away from the cultivated species S. officinarum. The sugarcane accessions were gathered in a branch in-between S. officinarum and the two wild species, due to their hybrid origin from these three species.

Principal Component Analysis
The distribution of all 196 accessions over the two PCA dimensions was also clustered based on K = 3 and K = 8 ( Figure 4). Not only did these 196 accessions scatter in accordance with their taxonomic classification, but the PCA plot was consistent with the results of structure and phylogenetic analysis as well. five sub-clades are marked as i, ii, iii, iv, and v by dotted curves, with corresponding colors (red, green, and blue) of K3 gene pools (Q1, Q2, and Q3).
The result from phylogenetic analysis (Figure 3) was highly consistent with the phylogenetic analysis of accessions in Figure 1. The S. barberi and S. sinense accessions were placed in the same branch apart from other Saccharum species. The two wild Saccharum species, S. spontaneum and S. robustum were placed in the same sub-branch and away from the cultivated species S. officinarum. The sugarcane accessions were gathered in a branch in-between S. officinarum and the two wild species, due to their hybrid origin from these three species.

Principal Component Analysis
The distribution of all 196 accessions over the two PCA dimensions was also clustered based on K = 3 and K = 8 (Figure 4). Not only did these 196 accessions scatter in accordance with their taxonomic classification, but the PCA plot was consistent with the results of structure and phylogenetic analysis as well.

Species-Specific and Species-Associated SSR Alleles
The 4-way chi-square test by formula (1) showed that no allele had a p-value of > 0.01, suggesting we did not identify any strong species-specific SSR alleles (Supplementary  Table S2). However, when we used the 2-way chi-square test by formula (2), eight alleles had >0.01 p-values, among which, 569CS_222 was specific to Hyb; 334BS_144 was specific to sb; 278CS_140, 31CUQ_136, 569CS_204, and 597CS_138 were specific to ssi; 703BS_223 and 703BS_225 were specific to sr (Supplementary Table S2). In addition, 16 alleles were identified to be species-associated based on a high percentage of %n9(A)/n9, or %n1(G)/n1, or both, and a high percentage of %(n9(A)/n9 + %n1(G)/n1)/2, or 100% in one of %n9(A)/n9 and %n1(G)/n1 (Supplementary Table S2). The 24 alleles have a high LOD (−log (p-value)) in SMR and t-test (Supplementary Table S2), indicating that the 24 alleles were species-specific or species-associated alleles.

Species-Specific and Species-Associated SSR Alleles
The 4-way chi-square test by Formula (1) showed that no allele had a p-value of > 0.01, suggesting we did not identify any strong species-specific SSR alleles (Supplementary Table S2). However, when we used the 2-way chi-square test by Formula (2), eight alleles had >0.01 p-values, among which, 569CS_222 was specific to Hyb; 334BS_144 was specific to sb; 278CS_140, 31CUQ_136, 569CS_204, and 597CS_138 were specific to ssi; 703BS_223 and 703BS_225 were specific to sr (Supplementary Table S2). In addition, 16 alleles were identified to be species-associated based on a high percentage of %n9(A)/n9, or %n1(G)/n1, or both, and a high percentage of %(n9(A)/n9 + %n1(G)/n1)/2, or 100% in one of %n9(A)/n9 and %n1(G)/n1 (Supplementary Table S2). The 24 alleles have a high LOD (−log (p-value)) in SMR and t-test (Supplementary Table S2), indicating that the 24 alleles were speciesspecific or species-associated alleles.

Phylogenetic Analysis of the Sugarcane Accessions Using Species-Specific or Species-Associated SSR Alleles
Phylogenetic analyses in 196 sugarcane accessions were performed by seven sets of marker alleles: (1) all 24 selected alleles, (2) three Hyb-associated alleles, (3) four sbassociated alleles, (4) four so-associated alleles, (5) five sr-associated alleles, (6) four ssiassociated alleles, and (7) four ssp-associated alleles (Supplementary Table S2). Seven phylogenetic trees were drawn from each of the seven allele sets. The results showed that these 24 alleles could create a phylogenetic tree that grouped the sugarcane accessions by species (Figure 5B), which was similar to the phylogenetic tree with all 624 alleles ( Figure 5A). According to Supplementary Figure S2, these species-specific or speciesassociated alleles can group accessions of a target species away from other species, although these makers cannot completely distinguish all accessions of certain species from those accessions in other species.

Profile of the SSR Alleles
The 196 accessions included in the present study represent six Saccharum species and Saccharum spp. hybrids with different characteristic agro-economic traits [30,45]. These accessions were selected from the sugarcane germplasm collections in Mainland China and USA. Some are the progenitors of sugarcane cultivars that have been exploited

Profile of the SSR Alleles
The 196 accessions included in the present study represent six Saccharum species and Saccharum spp. hybrids with different characteristic agro-economic traits [30,45]. These accessions were selected from the sugarcane germplasm collections in Mainland China and USA. Some are the progenitors of sugarcane cultivars that have been exploited extensively in sugarcane breeding programs [46]. Compared with the previous report [3], the outcome of this study was greatly improved in analytical methods with 87 new accessions and 457 new alleles. Especially in the structure analysis, this study showed a higher level of consistency among the various analyses compared with all previous studies, which greatly improved the value of the developed markers and germplasm [47].
SSRs have increasingly become the marker of choice for such genetic analyses as linkage [18] and gene mapping [48], segregation [49], population structure [46,50,51], markerassisted selection [51], genetic relationships [52], marker-assisted backcrosses [44], and population genetics and phylogeny [33]. Being highly polymorphic and more economic, SSRs are still preferred markers for species with complex genomes such as sugarcane [50]. 624 polymorphic SSR alleles were amplified with 22 pairs of SSR primers, averaging 28.3 alleles per primer pair. This level of polymorphism was higher than most genetic diversity studies reported earlier, for example, 205 alleles with 13.67 per locus [14], 209 alleles with 5.8 per locus [30], and 261 alleles with 7.35 per locus [53]. Moreover, according to the distribution difference of the SSRs between species, the markers "SMC1751CL", "CIR43", "SMC486CG", "SMC278CS", "SMC24DUQ", "SMC22DUQ" and "SMC18SA" were high identification and diversity SSRs (Supplementary Figure S1). Such efficient genetic markers can not only offer an accurate description of individual accession within a population or species, but more importantly, have great potential for further population structure analysis, and provide high-quality genetic marker information for breeding, GWAS, and evolution related studies in the future [21].

Population Structure and Phylogenetic Analysis
The population structure analyses divided the 196 accessions into 3 and 8 gene pools based on the peak of ∆K at K = 3 and 8. The ∆K showed a higher score at K = 2, 3, 5, or 7 in earlier studies using different sugarcane panels [21,46,50,54]. In Figure 2, most accessions in Q1 (K = 3) belong to two species, S. spontaneum or S. barberi, except for "MUNTOKJAVA" (S. officinarum) and "H78-4153" (Saccharum spp. hybrids). A similar phenomenon was observed not only in other "Q populations", but also in the "q populations" when K = 8. For example, 35 of 36 Saccharum spp. hybrids belonged to q1; 94.5% accessions of q2 and q5 were S. barberi; all S. officinarum accessions belonged to q4; all accessions in q7 and q8 are S. spontaneum. Although the 22 pairs of SSR primers are highly efficient, it is still necessary to argue that population structure analysis alone is based on probability and cannot provide the immaculate divisions among all clusters [17]. Therefore, neither K = 3 nor K = 8 can perfectly cluster or segregate all 196 accessions, and the forgoing conclusions must be combined with application scenarios in the future.
Based on the population structure analysis by STRUCTURE (Figure 1), the three gene pool systems were generally fit and delivered good results based on the species using genetic distance analysis by MEGA to create phylogenetic trees (Figures 2 and 3). However, these analyses could not distinguish each species. There were five sub-clades/clusters (Lowercase roman) in the phylogenetic tree under the four main clades/clusters (Uppercase roman). All these clades could cluster the accessions by species in this study almost perfectly and match the gene pool dividing by K = 8. It is worth noting that the four main clades were created based on the same genetic distance, while the five sub-clades were divided according to the standard of each main clade, not the unified standard of the whole phylogenetic tree. In other words, although the phylogenetic analysis showed a very clear pattern of species segregations, the immediate species-based division of phylogenetic trees still cannot be achieved by a unified genetic distance but must be accomplished by re-division under the main clades. This fact also proved that the two clustering modes (K = 3 or K = 8) in structure analysis were not contradictory but complementary to each other [17]. The phylogenetic analysis for species relationship (Figure 4) is consistent with those of all accessions (Figure 3), and the differences may be caused by the number, diversity, and variation of the accessions within each species [33].
As shown in the PCA diagram, all 196 accessions displayed an obvious clustering tendency according to their species or Q (q) populations (K = 3 or K = 8), which was highly consistent with the phylogenetic trees. However, there also were a lot of mixture areas over all the clusters, which may help describe the relationship between species or Q (q) populations. PCA diagram is the most visualized clustering analysis, but due to the bias of the dimension reduction, the discrimination between each cluster was weakened [55], so was the precision of the generated 2D diagram [56]. The most typical case is that there would be many scatter overlaps or approximate overlaps in the graph, which result in the decay of the distance between each cluster area. For this reason, it is hard to see independent distribution patterns with clear boundaries. The same phenomenon also appears in this study, therefore when using these data we should also combine the results of other cluster analyses to draw a comprehensive conclusion [57].

Species-Specific and Species-Associated SSR Alleles
Due to the abundant level of polymorphism, SSR markers have received great attention in searching for species-specific alleles. In this study, 24 out of the 624 SSR alleles were found to be species-specific or species-associated and useful in identifying or differentiating different sugarcane species. Only three to five specific alleles were needed to identify most accessions of a particular species. This result will undoubtedly improve the efficiency of using SSR markers in identifying species in the future. However, it is still not possible to precisely describe the relationship between the accessions within or outside each species due to the very small genetic distances among them. Therefore, these markers must be used flexibly considering the application scenario.

Summary
Genome-wide association studies have been widely conducted to identify molecular markers associated with many valuable traits in many crops [58]. Due to the complex relationship and genetic background between genotypes in breeding populations, it is necessary to separate true marker-trait associations from the marker-trait association due to population structure [59]. The methods of population structure analysis include modelbased clustering, principal component analysis, kinship analysis, and phylogenetic analysis. If one applies just one method to interpret population structure, the conclusion can be ineffective or biased due to the failure of capturing the entire complexity of the population, especially for a genetically complex polyploidy crop like sugarcane [60]. Therefore, model-based clustering, principal component analysis, and phylogenetic analysis were conducted in this study to comprehensively analyze and discuss the population structure of 196 sugarcane accessions based on 624 SSR alleles amplified by 22 pairs of SSR primers. The overall results will provide a valuable reference for sugarcane germplasm management and genetic improvement through breeding.

Conclusions
Three (Q) and eight (q) well-differentiated populations were postulated from this study among the 196 accessions of sugarcane and six related Saccharum species based on the binary data of SSR alleles. A total of 624 alleles were amplified by PCR with 22 pairs of highly polymorphic SSR primers, of which 109 alleles were new and 24 alleles were species-specific or species-associated alleles. Almost all accessions could be appropriately grouped to a specific cluster according to their species. The subsequent PCA and phylogenetic analyses also drew similar conclusions, which further verified the reliability of the results. Due to the importance of population structure analysis in diversity analysis, the information generated in this research will help improve future sugarcane breeding programs towards good sugar recycling, paper and plywood, winemaking, and other related industrial areas. Moreover, the molecular data from this study will provide an important means for designing introgressive hybridization, making impactful cross combinations, and developing other related breeding strategies.