Development of High-Resolution Simple Sequence Repeat Markers through Expression Proﬁling of Genes Associated with Pod Maturity of Soybean

Featured Application: Use of high-resolution molecular markers for studying genetic diversity of soybean pods. Abstract: In soybeans ( Glycine max L.), the time required to attain maturity is a quantitative trait controlled by multiple genes and quantitative trait loci (QTL), which enable soybean cultivars to adapt to various regions with diverse day lengths. In this study, depending on the days to maturity, 100 soybean varieties were classiﬁed into eight maturity groups numbered from 0 to VII. The maturity groups were further sorted into three maturity ecotypes: early, middle, and late maturity. The analysis of 55,589 soybean genes revealed a total of 1147 related to the growth and development of soybean pods, including 211 genes with simple sequence repeats (SSRs). We further identiﬁed 42 SSR markers that ampliﬁed over two alleles in three di ﬀ erent ecotypes, including six genes that were up- or downregulated in pods of more than one ecotype. The agglomerative hierarchical tree constructed for the newly identiﬁed SSR markers had three clusters. Clusters B-I, B-II, and B-III were found to be strongly related with the early, middle, and late maturity ecotypes, respectively. Therefore, the newly identiﬁed set of SSR markers can serve as an e ﬀ ective high-resolution tool for the genotyping and QTL mapping of soybean pod maturity.


Introduction
The soybean (Glycine max L.) is a major legume crop that is widely grown for its edible bean. The reproductive period of the soybean has different stages (R1 to R8), from flowering to maturity, and is important in determining the yield, seed quality, and tolerance to various environmental stresses [1]. The time for attaining maturity is a quantitative trait in soybeans that is controlled by multiple genes and quantitative trait loci (QTL), which enable soybean cultivars to adapt to various regions with diverse day lengths [2,3].
Based on adaptability, soybean varieties are categorized into different maturity groupings (MGs), which facilitate the identification of the optimally adapted soybean cultivar for a particular region, as well as the most suitable time for sowing in that region. Scott and Aldrich [4] delineated the hypothetically optimum MG zones across the United States based on photoperiod sensitivity. The MGs range from MG 000, for the extremely early-maturing varieties, to MG 10, for the latest-maturing varieties [5]. Recently, Mourtzinis and Conley [6] re-delineated the optimum MG zones across the United States using the yield data from 2005 to 2015. Ha et al. [7] described a maturity grouping (MG 0

Experimental Field Tests
We conducted experimental field tests at Suwon (37 • 16 N, 127 • 01 E), Republic of Korea, in 2017 and 2018 to investigate the DTM of the 100 soybean varieties ( Table 1). The experimental plots were completely randomized, with three replicates. The soybean seeds were sown on Jun 12, 2017, and Jun 14, 2018, in 4 m (row length) and 70 cm (row width) plots with 15 cm spacing in the experimental fields, which were treated with a basic granular fertilizer N-P 2 O 5 -K 2 O (30-30-34 kg/ha) before sowing. This design allowed for 27 plants per row, with 4 rows per replicate of each variety. The DTM of the soybeans was defined as the number of days from sowing to full maturity (when 95% of the pods reached a mature pod color, stage R8).

Analysis and Selection of Genes
We downloaded the annotated information of all genes of the soybean cultivar 'Williams 82 from the SoyBase database [18]. The information included (1) the Arabidopsis thaliana (AT) protein (TAIR10) IDs matched via the BLASTP of Gmax2.0 primary proteins; (2) the Biological Processes, Molecular Functions, and Cellular Components assigned to each Gmax2.0 primary protein using the Gene Ontology (GO) terms associated with the Top TAIR10 BLASTP Hit; (3) the EuKaryotic Orthologous Groups (KOG) of the primary proteins described and assigned by The KOG Browser [19]; (4) the Protein Families (PF) of the studied proteins described and assigned by the Pfam [20]; (5) the PANTHER (PTHR) of the primary proteins described and assigned by The PANTHER (protein annotation through evolutionary relationship) classification system [21]; (6) the first versions of the Soybean Metabolic Pathway (SoyCyc) of the primary proteins described and assigned using the Soybean Metabolic Pathway Database [22]; (7) the seventh versions of the Soybean Metabolic Pathway (SoyCyc7) of the primary proteins described and assigned using PMN: Plant Metabolic Network [23]; and (8) the soycyc_enzymes of the Soybean Metabolic Pathway (SoyCyc7-rxn) for the primary proteins determined after downloading from the TAIR database [24].
We analyzed the SSRs in gene sequences using GRAMENE's SSR tools [25] and then selected genes that harbored more than five repeats of SSR motifs. The markers were manually designed to Appl. Sci. 2020, 10, 6363 3 of 17 amplify specific SSR regions, using the following parameters: a primer size ranging from 18 to 28, with 23 as the optimum; a product size ranging from 80 to 500 bp; an annealing temperature ranging from 48 to 60 • C, with an optimum of 56 • C; a GC content ranging from 45% to 55%, with 50% as the optimum. To select markers capable of detecting genetic diversity, we amplified alleles from three different maturity ecotypes (Figure 1) using the developed SSR markers; these included early (MG I, EM: OT89-05), middle (MG IV, MM: IT213198), and late (MG VI, LM: IT213177) maturity ecotypes. amplify specific SSR regions, using the following parameters: a primer size ranging from 18 to 28, with 23 as the optimum; a product size ranging from 80 to 500 bp; an annealing temperature ranging from 48 to 60 °C, with an optimum of 56 °C; a GC content ranging from 45% to 55%, with 50% as the optimum. To select markers capable of detecting genetic diversity, we amplified alleles from three different maturity ecotypes ( Figure 1) using the developed SSR markers; these included early (MG I, EM: OT89-05), middle (MG IV, MM: IT213198), and late (MG VI, LM: IT213177) maturity ecotypes.

Analysis of Genetic Polymorphisms
Genomic DNA was extracted from the young leaves of 100 varieties using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol. The PCR amplification

Analysis of Genetic Polymorphisms
Genomic DNA was extracted from the young leaves of 100 varieties using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer's protocol. The PCR amplification mixture contained 10 ng of template DNA, 0.2 µmol L −1 of each forward and reverse primer, and the 5x GoTaq DNA polymerase master mix (Promega, USA) in a final volume of 20 µL. The reactions were performed in the Applied Biosystems ProFlex PCR system (Thermo Fisher Scientific, Waltham, MA, USA) with the following conditions: 94 • C for 2 min, followed by 35 cycles at 94 • C for 1 min, the specific annealing temperature for each primer pair (Table S1) for 1 min, extension at 72 • C for 1 min, and a final elongation step at 72 • C for 5 min.
The alleles amplified by each SSR marker were confirmed on 1% (w/v) agarose gels and by capillary electrophoresis using the QIAxcel Advanced system (Qiagen). The measure of the allelic diversity at a given locus and the polymorphism information content (PIC) of each SSR marker was calculated using the formula reported by Botstein et al. as presented in Equation (1) [26]: where P j is the frequency of the j-th allele of the marker.

Profiling of Gene Expression
We conducted the expression profiling of genes harboring the selected SSRs using tissues (leaf, stem, and pod) of the three maturity ecotypes to further select genes closely associated with the growth and development of the soybean pod. For profiling gene expression according to reproductive stages, soybean plants of three distinct stages, based on maturity ecotypes, were collected from stages R6 (full seed, LM ecotype) to R8 (EM ecotype), as shown in Figure 1. The leaves, stems, and pods were then simultaneously collected. The tissue samples were immediately placed in liquid nitrogen and stored at −80 • C. All the samples were collected with three biological replicates. We also manually designed primers, to profile the expression of the genes, using the following parameters: a primer size ranging from 20 to 25, with 23 as the optimum; a product size ranging from 80 to 300 bp; an annealing temperature ranging from 52 to 58 • C, with an optimum of 56 • C; a GC content ranging from 45% to 55%, with 50% as the optimum. For use as a template in reverse transcription-quantitative polymerase chain reaction (RT-qPCR)-based expression profiling, cDNA was synthesized from the total RNA of each tissue of the three ecotypes using the High-Capacity cDNA Reverse Transcription Kit (Thermo Fisher Scientific). RT-qPCR was performed after 2 min at 95 • C for polymerase activation, and cycling between 95 • C (15 s) and the respective annealing temperature (15 s) for each primer pair (Table S2), followed by 40 cycles using the QuantStudio 5 Real-Time PCR System (Thermo Fisher Scientific) and the QuantiNova SYBR Green RT-PCR kit (Qiagen). The RT-qPCR products were subjected to melt-curve analysis (60-95 • C). Two reference genes were also used, TUB4 (Glyma.03G124400.1) and UKN2 (Glyma.06G038500.1), to improve the accuracy of normalization for expression profiling. For relative expression profiling, we modified the formula reported by Livak and Schmittgen [27]: for the normalization (∆Cq) of the quantification cycle (Cq) values for each gene using TUB4, ∆Cq = Cq (a target sample) − Cq (TUB4); the ∆∆Cq calculation was based on the ∆Cq of UKN2, so ∆∆Cq = ∆Cq (a target sample) − ∆Cq (UKN2); the relative expression ratio (R) = (2 −∆∆Cq of the target sample)/(mean of 2 −∆∆Cq of all samples).

Agglomerative Clustering
The presence and absence of alleles amplified by each SSR marker were expressed as a nucleotide code. For example, the presence and absence of one of the alleles amplified from each variety were denoted by the nucleotide codes A and T, and those from the other allele, by G and C. The sequences were then aligned using the multiple sequence alignment software ClustalX [28] and BioEdit 7.2 [29]. To assess the clustering resolution of the selected markers, we used an agglomerative clustering Neighbor-Joining Method [30]. Clustering was visualized using MEGA version 4.0 [31]. We selected a set of six SSR markers (Table 2), which were positively associated with the QTLs of pod maturity listed in the SoyBase database [18], to assess the resolution of the newly developed markers.

Statistical Analysis
Analysis of variance (ANOVA) was performed with R version 3.6.1, based on the package agricolae, using three biological replicates and three technical replicates. Multiple comparisons for the gene expression profiling and DTM determination of varieties were performed with Duncan's multiple range test and least significance difference (LSD), respectively. p < 0.05 was considered significant.

Development and Selection of SSR Markers
A total of 55,589 genes were downloaded from the SoyBase database; among them, 49,638 were annotated. We analyzed the gene annotations and identified 1147 genes related to the growth and development of soybean pods; these were classified into four groups: flower, hormone synthesis, seed, and senescence (Table 3). There were 168, 372, 490, and 117 genes in the flower, hormone synthesis, seed, and senescence categories, respectively. The genes related to senescence were excluded from further categorization, while the remaining 1030 genes were re-sorted into 12 sub-groups: enzyme, pollen, and time control in the flower group; abscisic acid (ABA), auxin/indole acetic acid (AUX/IAA), ethylene, gibberellic acid (GA), and jasmonic acid (JA) in the hormone synthesis group; and embryo, metabolite, maturation, and storage protein in the seed group. The largest sub-group was seed-embryo, with 279 genes; conversely, seed maturation included only five genes ( Table 3).
The analysis of the SSR regions in the 1147 genes revealed 211 genes with SSR regions. Hence, 211 SSR markers were then developed for amplifying these regions of the genes (Table S1). We then investigated the polymorphism of 211 SSR markers on the three maturity ecotypes and found 42 SSR markers that amplified over two alleles (Figure 2).
The analysis of the SSR regions in the 1147 genes revealed 211 genes with SSR regions. Hence, 211 SSR markers were then developed for amplifying these regions of the genes (Table S1). We then investigated the polymorphism of 211 SSR markers on the three maturity ecotypes and found 42 SSR markers that amplified over two alleles (Figure 2). Among the genes corresponding to the 42 SSR markers, 18 were either upregulated (relative expression ≥ 2.0) or downregulated (≤ 0.5) in more than one tissue of the three ecotypes (Table 3). Of these, six genes-Glyma.03g160900.1, Glyma.03g201600.1, Glyma.07g057900.3, Glyma.19g158400.1, Glyma.19g164800.1, and Glyma.19g198600.1-were regulated in the pods of more than one ecotype (Figure 3). For the advanced stage (R6 to R8), five of the genes were downregulated in pods, while Glyma.19g198600.1 was upregulated ( Figure 3).
Additionally, we investigated the transcriptional products of the six genes to analyze the effects of the SSRs on their translation. Most of the SSRs, except Glyma.19g164800.1, were predicted to lie within the 5′-or 3′-untranslated regions (UTRs) of the associated genes ( Figure 4). The SSR motif (GAA) of Glyma.19g164800.1, encoding glutamic acid, was repeated 12 times in the protein-coding sequence (CDS) of the gene and resided between two cupin_1 domains (Figure 4). Even when some, or all, of the SSR sequences were deleted in the CDS of this gene, the modified CDSs were aligned Among the genes corresponding to the 42 SSR markers, 18 were either upregulated (relative expression ≥ 2.0) or downregulated (≤ 0.5) in more than one tissue of the three ecotypes (Table 3). Of these, six genes-Glyma.03g160900.1, Glyma.03g201600.1, Glyma.07g057900.3, Glyma.19g158400.1, Glyma.19g164800.1, and Glyma.19g198600.1-were regulated in the pods of more than one ecotype (Figure 3). For the advanced stage (R6 to R8), five of the genes were downregulated in pods, while Glyma.19g198600.1 was upregulated (Figure 3).    Additionally, we investigated the transcriptional products of the six genes to analyze the effects of the SSRs on their translation. Most of the SSRs, except Glyma.19g164800.1, were predicted to lie within the 5 -or 3 -untranslated regions (UTRs) of the associated genes ( Figure 4). The SSR motif (GAA) of Glyma.19g164800.1, encoding glutamic acid, was repeated 12 times in the protein-coding sequence (CDS) of the gene and resided between two cupin_1 domains ( Figure 4). Even when some, or all, of the SSR sequences were deleted in the CDS of this gene, the modified CDSs were aligned with a soybean glycinin subunit G7 (GY7), using the same parameters. To further probe the positions of the SSRs with respect to the domain organization of the gene, we used ab initio prediction and found that the GY7-matched signal peptide of Glyma.19g164800.1 ranged from 29 to 91 bp (total, 63 bp), translating into 21 amino acids. Moreover, deleting the SSRs did not disrupt the sequences of the cupin_1 domains ( Figure 4). We then selected the SSR markers derived from the six selected genes for analyzing genetic polymorphisms ( Table 4) that were up-or downregulated in the pods of the three ecotype varieties.  Figure 3. Expression profiling of genes selected for analyzing genetic polymorphisms of the three soybean ecotypes: EM, early maturity ecotype OT89-05; MM, middle maturity ecotype IT213198; LM, late maturity ecotype IT213177. Different letters above the error bars indicate significant differences at p ≤ 0.05 according to Duncan`s multiple range test.

Analysis of Relationship Using the SSR Markers
The number of alleles amplified by the selected markers varied from one to five, with each showing a PIC value > 0.5 (Table 5). However, the SSR markers identified in this study exhibited PIC values lower than those of the previously established markers [26][27][28][29][30][31] (Table 5), although the sizes of the alleles amplified by the previous and new SSR markers were similar (Table 5). The tree generated for the previously established six SSR markers [26][27][28][29][30][31] had three large agglomerative clusters, with clusters A-I, A-II, and A-III comprising 45, 31, and 24 varieties, respectively. Although the EM ecotype in cluster A-I and the MM ecotype in clusters A-II and A-III were the largest ecotypes, their distribution was under 50% in each cluster (Figure 5a). Another tree that was constructed based on the six newly developed SSR markers also had three clusters: B-I, B-II, and B-III, which included 31, 34, and 35 varieties, respectively. In cluster B-I, 18 (58.1%) varieties belonged to the EM ecotype, while 24 (70.6%) varieties in cluster B-II and 25 (71.4%) varieties in cluster B-III belonged to the MM and LM ecotypes, respectively (Figure 5b). newly developed SSR markers. In each cluster, the green circle represents the EM ecotype, the yellow circle represents the MM ecotype, and the red circle represents the LM ecotype. Figure 5. Agglomerative clustering trees constructed using the (a) previously established and (b) newly developed SSR markers. In each cluster, the green circle represents the EM ecotype, the yellow circle represents the MM ecotype, and the red circle represents the LM ecotype.

Large-Scale Genic SSR Marker Development in the Soybean
Geneticists have identified major genes and QTLs controlling pod maturity [32][33][34][35][36][37][38] to examine the genetic characteristics of soybeans. The SSR marker, a type of effective genetic marker with various applications, is one of the most widely used molecular markers for plant genotyping due to its high polymorphism [39]. Although many soybean SSR markers have been used for genotyping [40][41][42] and QTL mapping [32][33][34][35][36][37], some of the existing methods for SSR genotyping exhibit low sensitivity, accuracy, and efficiency, yet have high costs [15]. To overcome these disadvantages of SSR markers, over 1000 genic SSR markers have been developed for soybean genetic research [43,44]. Genic SSR markers are used as an effective tool to identify genetic resources, since SSRs are located in transcribed genes that are annotated with putative functions based on biological information and homology search [45], and as a selection tool for specific traits in breeding populations of diverse crops [46][47][48][49]. As shown in Supplementary Table S1, we developed 211 genic SSR markers. Among the 211 SSR repeat motifs, tri-nucleotide repeats (TNRs) were found to be the most abundant, accounting for 75.4% (159), followed by di-nucleotide repeats (DNRs) at 16.6%, and tetra-nucleotide repeats (TtNRs) at 8.0%. Meanwhile, repeats longer than five nucleotides were not found. These results were similar to previously published data, with TNRs reported as the most frequent (54-78%), followed by DNRs (17.1-40.4%) and TtNRs (3-6%) among cereal species [50].

Expression Profiling Shows That Six Genes Were Linked to the Growth and Development of the Soybean Pod
To develop genic SSR markers associated with the pod maturity of soybeans, we preferentially selected six genes found to be regulated in the soybean pod. LATE FLOWERING (LATE) encodes a C 2 H 2 -type zinc-finger transcriptional regulator that acts as a floral repressor and regulates the expression of genes that control flowering in the leaf vasculature of Arabidopsis [51]. Glyma.03g160900.1, annotated as a LATE, showed the highest expression levels in the leaves of LM ecotypes; however, it was decreased in the pods of all three ecotypes.
Late embryogenesis abundant protein (LEA, PFAM:PF03168) genes are upregulated in maturing seeds and vegetative tissues by salinity, dehydration, cold or freezing stress, and ABA treatment [52][53][54]. Glyma.03g201600.1 and Glyma.19g198600.1 were annotated as LEAs. The expression of Glyma.03g201600.1 was increased in the pods of the MM and LM ecotypes, whereas that of Glyma.19g198600.1, encoding syntaxin-24 (SYP24), decreased only in the pods of the LM ecotype. These results demonstrate that although different isoforms belong to the same gene family in soybeans, the genes are differentially regulated with stage advancement [55] and by tissue [56]. The cupin_1 domain represents the conserved barrel domain of the cupin superfamily, which contains 11S and 7S plant seed storage proteins. Plant seed storage proteins are the major nitrogen sources for the developing plant [57]. A glycinin subunit Gy7, which has two cupin_1 domains, has been reported as upregulated in soybean pods during seed maturation [58]. Our results, however, showed that Glyma.19G164800.1, the corresponding gene of Gy7, was upregulated in the pod of only the LM variety. Non-specific lipid transfer protein (nsLTP) genes, including Glyma.19g158400.1, have been isolated and characterized extensively in soybeans [59]. These genes reportedly play important roles in plant growth and development, including seed development [60]. Glyma.19g158400.1 was upregulated in the pods of all three ecotypes, although the highest expression was observed in the pods of the LM ecotype, indicating that the gene is downregulated with stage advancement.
Moreover, JA is a key player in the control of senescence [61]. In Arabidopsis, JA is associated with the timing of senescence processes in both somatic tissues and reproductive organs [62,63]. We found that Glyma.07g057900.3, which encodes a jasmonic acid-amido synthetase (JAR1), was downregulated in the pods of the EM and MM ecotypes. Taken together, these results imply that all six genes are related to the growth and development of soybean pods. From these results, we predicted that the newly developed SSR markers derived from the six genes could serve for the high-resolution detection and characterization of genetic diversity based on the DTM of soybean varieties.

Allelic Variations for the New SSR Markers Are Sufficient for Generating a Cluster and Distinguishing the Soybean Varieties by Ecotype Group
Several studies have reported that the 5 -UTR of a gene influences the efficiency of translation by regulating its secondary structure [64][65][66], suggesting that it plays an important role in expression regulation. The 5 -and 3 -UTRs can also regulate mRNA stability [67,68]. Out of the six SSR markers described in this study, five were predicted to lie within the 5 -or 3 -UTR. The sixth SSR, Glyma.19g164800.1, was predicted to be in the CDS regions. Moreover, our preliminary SSR deletion results indicate that the molecular markers do not affect the signal peptide or the main domains on the CDS of this gene. However, further studies are required to establish whether the SSRs affect the mRNA stability and regulation of gene expression.
Genic SSR markers have been reported to not only be less polymorphic than non-genic markers [69] but to also have 3.5 alleles per locus, with a higher PIC of 0.824 among the mulberry species [70]. Similarly, previous studies reported that the genic SSR markers for jute (2.7 alleles and a PIC of 0.34) [71] and flax (2.3 alleles and a PIC of 0.35) [72] show high polymorphism and are expected to be of use for the characterization of germplasm, as well as for variety identification and marker-assisted breeding. Therefore, although the average number of alleles and PIC values of the six newly developed genic SSR markers were low (3.1 alleles and a PIC of 0.61) compared to those of the previously established six SSR markers (3.8 alleles and a PIC of 0.70), the allelic variations exhibited by the new SSR markers were sufficient for generating clustering and distinguishing the soybean varieties by ecotype group. Finally, these alleles can serve as a marker for the characterization of populations, varieties, and germplasm, as previously reported in wheat [73], jatropha [74], and oil palm [75].

Agglomerative Clustering Shows That the Six New Genic SSR Markers Can Serve for the Genotyping and Fine QTL Mapping of Soybean Pod Maturity
We used agglomerative hierarchical clustering, a common bottom-up clustering method that uses the neighbor-joining method, for creating the phylogenetic trees [30]. This method provides a snapshot of the data that can facilitate more detailed analysis, while rapidly producing well-scaled informative networks for several hundred taxa [76]. The agglomerative clusters constructed for the six new genic SSR markers showed a strong relationship with each maturity ecotype. The EM, MM, and LM ecotypes were found to be dominant in the B-I, B-II, and B-III clusters, respectively. However, our results showed that the new set of six SSR markers based on the maturity of the soybean pod had high sensitivity for detecting genetic diversity. Thus, the six new SSR markers can serve as effective and high-resolution tools for the genotyping and fine QTL mapping of soybean pod maturity.

Conclusions
There is increasing demand for the development of new and sensitive molecular markers for the genetic characterization of crops. High-resolution markers are basic and powerful tools for the identification of target genes or loci that are highly associated with crucial agronomic traits. These markers can also serve as a guide for the genetic improvement of crop varieties in breeding programs. In this study, we demonstrate an efficient approach for designing SSR markers that detect genes associated with specific traits with improved accuracy compared to previously reported markers. The method reported here could prove valuable in designing high-resolution markers for the genetic improvement of crops.
Supplementary Materials: The following are available online at http://www.mdpi.com/2076-3417/10/18/6363/s1. Table S1: Development of SSR markers using the genes related to the growth and development of the soybean pod. Table S2: Primers used in qRT-PCR for profiling the expression of genes related to the growth and development of the soybean pod. Funding: This study was funded by the Rural Development Administration (RDA, Republic of Korea), grant number PJ012548032020.

Conflicts of Interest:
The authors declare no conflict of interest.