Genetic Diversity of Korean Wild Soybean Core Collections and Genome-Wide Association Study for Days to Flowering

The utilization of wild soybean germplasms in breeding programs increases genetic diversity, and they contain the rare alleles of traits of interest. Understanding the genetic diversity of wild germplasms is essential for determining effective strategies that can improve the economic traits of soybeans. Undesirable traits make it challenging to cultivate wild soybeans. This study aimed to construct a core subset of 1467 wild soybean accessions of the total population and analyze their genetic diversity to understand their genetic variations. Genome-wild association studies were conducted to detect the genetic loci underlying the time to flowering for a core subset collection, and they revealed the allelic variation in E genes for predicting maturity using the available resequencing data of wild soybean. Based on principal component and cluster analyses, 408 wild soybean accessions in the core collection covered the total population and were explained by 3 clusters representing the collection regions, namely, Korea, China, and Japan. Most of the wild soybean collections in this study had the E1e2E3 genotype according to association mapping and a resequencing analysis. Korean wild soybean core collections can provide helpful genetic resources to identify new flowering and maturity genes near the E gene loci and genetic materials for developing new cultivars, facilitating the introgression of genes of interest from wild soybean.


Introduction
Soybean (Glycine max [L.] Merr.) is one of the most economically important crops worldwide, whose seeds consist of 40% protein, 20% oil, and 15% soluble carbohydrates. Soybean in Western countries is mainly used to produce vegetable oils for humans and high-protein meals for livestock, whereas it has traditionally been used as foods, such as soymilk, tofu, soy sprout, fermented soy foods, and soy sauce, in Asian countries [1,2]. Cultivated soybean was domesticated~5000 years ago from its wild progenitor (Glycine soja Sieb. and Zucc.) [3,4]. The genetic diversity of cultivated soybeans revealed a narrow genetic base due to the human selection of elite lines. Using wild soybean germplasms in breeding programs increases genetic diversity because of the preservation of rare alleles of traits of interest. Understanding the genetic diversity of wild germplasms is essential for determining effective strategies to improve the economic traits of soybeans. Although wild soybeans have favorable genes, growing large populations of this plant is challenging Plants 2023, 12, 1305 2 of 13 due to undesirable traits, such as lodging, viny growth habits, hard seed coats, and pod shattering [5,6].
Constructing a core subset is one of the strategies to overcome the handling of a large population of wild soybean accessions. A core subset can represent an entire population by maximizing its genetic and phenotypic variations [7]. The advantage of a core subset collection is an increased efficiency in overcoming population size, cost, and labor drawbacks compared with the evaluation of the total population. Therefore, core subsets from various crop species have been constructed based on morphological characteristics, phenotypic assessments, and genotypic information [8][9][10][11][12][13][14][15][16]. However, phenotypic observations are mainly affected by environmental factors. The utilization of molecular markers to construct a core subset directly reflects the genetic diversity of the total population [17,18]. As soybean is a short-day and photoperiod-sensitive crop, it is crucial to grow it in suitable target regions due to the narrow growth range of latitudes to achieve maximum production. Thus, soybean flowering is determined by specific night and day lengths in the transition from vegetative to reproductive growth. Soybeans are grown under different maturity group designations to match the day and night lengths at different latitudes. The optimal region for growing soybeans is one of the critical considerations for yield potential. Farmers need to consider the maturity of soybeans to determine seed production and quality at specific latitudes. There are 13 maturity groups (MGs) in the US, ranging from MG 000 to MG X, based on their latitudinal adaptation [19,20].
Korea is the center of origin for wild soybean with a long history of soybean cultivation [35]. There are~17,000 G. max and~3700 G. soja accessions in Korea [36]. In this study, we used 1467 wild soybean accessions as the total population. This study aimed to construct a core subset of 1467 wild soybean accessions and analyze their genetic diversity using 180 K Axiom ® SoyaSNP to understand their genetic variations [37]. Additionally, genome-wide association studies (GWASs) were conducted to detect the genetic loci underlying time to flowering for a core subset collection, and they revealed the allelic variation in E genes (E1, E2, and E3) using the available resequencing data of wild soybeans.

Construction of the Core Collection
In total, 170,233 SNPs from 1467 wild soybean collections were used to construct the core collection. The core collections included 408 wild soybeans, accounting for 27.8% of the total population. The core subset of the wild soybean collection represented 99.0% of the genetic diversity of the 1467 wild soybean collections. A principal component analysis (PCA) showed that the distribution of the core subset of collections covered the total population based on the first PC (PC1) and the second PC (PC2) (Figure 1). The core subset

Genetic Diversity and Population Structure
The 180 K single-nucleotide polymorphism (SNP) markers were used to evaluate the genetic diversity indices of the wild soybean collection. The genetic diversity index ranged from 0.022 to 0.500. The mean of the genetic diversity index was 0.272 ( Figure 2A). Approximately 97.8% of the analyzed SNP markers showed a heterozygosity value of <0.09 ( Figure 2B). The minor allele frequencies ranged from 0.011 to 0.555, and the mean value was 0.197 ( Figure 2C). The polymorphism information content (PIC) indicated the allelic diversity of the evaluated SNP markers ( Figure 2D). The maximum and minimum PIC values were 0.375 and 0.022, respectively, with an average of 0.222. According to the analysis of the genetic diversity indices in this study, 119 SNPs across 20 chromosomes showed the highest diversity among the 180 K SNP genotypes.

Genetic Diversity and Population Structure
The 180 K single-nucleotide polymorphism (SNP) markers were used to evaluate the genetic diversity indices of the wild soybean collection. The genetic diversity index ranged from 0.022 to 0.500. The mean of the genetic diversity index was 0.272 ( Figure 2A). Approximately 97.8% of the analyzed SNP markers showed a heterozygosity value of <0.09 ( Figure 2B). The minor allele frequencies ranged from 0.011 to 0.555, and the mean value was 0.197 ( Figure 2C). The polymorphism information content (PIC) indicated the allelic diversity of the evaluated SNP markers ( Figure 2D). The maximum and minimum PIC values were 0.375 and 0.022, respectively, with an average of 0.222. According to the analysis of the genetic diversity indices in this study, 119 SNPs across 20 chromosomes showed the highest diversity among the 180 K SNP genotypes. (PCA) showed that the distribution of the core subset of collections covered the total population based on the first PC (PC1) and the second PC (PC2) (Figure 1). The core subset of collections evenly covered the total population of PC1 and PC2. The variances of PC1 and PC2 were 19.37% and 10.09%, respectively. A wild core subset of collections consisting of 408 wild soybeans was constructed and used for subsequent analyses.

Genetic Diversity and Population Structure
The 180 K single-nucleotide polymorphism (SNP) markers were used to evaluate the genetic diversity indices of the wild soybean collection. The genetic diversity index ranged from 0.022 to 0.500. The mean of the genetic diversity index was 0.272 ( Figure 2A). Approximately 97.8% of the analyzed SNP markers showed a heterozygosity value of <0.09 ( Figure 2B). The minor allele frequencies ranged from 0.011 to 0.555, and the mean value was 0.197 ( Figure 2C). The polymorphism information content (PIC) indicated the allelic diversity of the evaluated SNP markers ( Figure 2D). The maximum and minimum PIC values were 0.375 and 0.022, respectively, with an average of 0.222. According to the analysis of the genetic diversity indices in this study, 119 SNPs across 20 chromosomes showed the highest diversity among the 180 K SNP genotypes. Population clustering was analyzed using fastSTRUCTURE with the wild soybean collection regions. Admixture plots of K values from 2 to 5 were constructed ( Figure 3A). The marginal likelihood increased with an increase in the K value. Additionally, PCA and the unweighted pair group method with an arithmetic mean (UPGMA) phylogenetic tree analysis were shown to determine the optimal number of clusters, supporting the fastSTRUCTURE results ( Figure 3B,C). Therefore, 408 wild soybean collections in the core subset could be divided into 3 clusters (K = 3) (Figure 3). Cluster 1 comprised 333 wild soybean accessions, consisting of 327 Korean, 3 Russian, and 3 unknown wild soybean accessions ( Figure 3D). In contrast, Cluster 2 comprised 37 Chinese wild soybean accessions, and Cluster 3 consisted of 35 Japanese and 3 Korean wild soybean collections. Clusters 1, 2, and 3 of the core collections were mainly collected from Korea, China, and Japan, respectively ( Figure 3D). Manhattan plots of the F st estimation between each cluster are shown in Figure S1 and Table S1. The F st estimations between each cluster were 0.114, 0.126, and 0.049 between Cluster 1-Cluster 2, Cluster 1-Cluster 3, and Cluster 2-Cluster 3, respectively (Table S1).

Figure 2.
The genetic diversity analyses of core subset of wild soybean collections. Genetic diversity index with 180 K SNP genotype (A), heterozygosity with 180 K SNP genotype (B), minor allele frequency with 180 K SNP genotype (C), and the polymorphism information content (PIC) (D) of 142,177 SNPs across 408 wild soybean collections with green cotyledon and 3 cultivars.
Population clustering was analyzed using fastSTRUCTURE with the wild soybean collection regions. Admixture plots of K values from 2 to 5 were constructed ( Figure 3A).
The marginal likelihood increased with an increase in the K value. Additionally, PCA and the unweighted pair group method with an arithmetic mean (UPGMA) phylogenetic tree analysis were shown to determine the optimal number of clusters, supporting the fastSTRUCTURE results ( Figure 3B,C). Therefore, 408 wild soybean collections in the core subset could be divided into 3 clusters (K = 3) (Figure 3). Cluster 1 comprised 333 wild soybean accessions, consisting of 327 Korean, 3 Russian, and 3 unknown wild soybean accessions ( Figure 3D). In contrast, Cluster 2 comprised 37 Chinese wild soybean accessions, and Cluster 3 consisted of 35 Japanese and 3 Korean wild soybean collections. Clusters 1, 2, and 3 of the core collections were mainly collected from Korea, China, and Japan, respectively ( Figure 3D). Manhattan plots of the Fst estimation between each cluster are shown in Figure S1 and Table S1. The Fst estimations between each cluster were 0.114, 0.126, and 0.049 between Cluster 1-Cluster 2, Cluster 1-Cluster 3, and Cluster 2-Cluster 3, respectively (Table S1). The phenotypic distribution of the 408 wild soybean accessions for the days to flowering was shown in 5 different environments based on 3 clusters (Figure 4). The Chinese wild soybean accessions had significantly earlier flowering than the Korean and Japanese

Phenotypic Distribution and GWAS for Days to Flowering with a Core Population
The phenotypic distribution of the 408 wild soybean accessions for the days to flowering was measured in 5 different environments ( Figure 5

Phenotypic Distribution and GWAS for Days to Flowering with a Core Population
The phenotypic distribution of the 408 wild soybean accessions for the days to flowering was measured in 5 different environments ( Figure 5 Table 1. Among the SNPs in Table 1, the SNP on chromosome 6 (AX-90416460), the SNP on chromosome 10 (AX-90408467) at Gwangju in 2016 ( Figure  4D), and the means of five environments ( Figure 4F) were significantly associated with the days to flowering. A haplotype analysis revealed that the SNPs on chromosome 10 across three environments, namely, Suwon in 2013, Gwangju in 2016, and the means of five environments, were in the E2 locus.   Table 1. Among the SNPs in Table 1, the SNP on chromosome 6 (AX-90416460), the SNP on chromosome 10 (AX-90408467) at Gwangju in 2016 ( Figure 4D), and the means of five environments ( Figure 4F) were significantly associated with the days to flowering. A haplotype analysis revealed that the SNPs on chromosome 10 across three environments, namely, Suwon in 2013, Gwangju in 2016, and the means of five environments, were in the E2 locus.

Allelic Variation in E Genes in Publicly Available Genome Sequencing Data
Our previous study revealed the genome sequences of wild soybeans, which are available in the National Center for Biotechnology Information [38]. Overall, the resequencing Plants 2023, 12, 1305 7 of 13 data of 334 wild soybean accessions from a core population in this study were used (available online). The allelic variations in E1 (Glyma.06G207800), E2 (Glyma.10G221500), and E3 (Glyma.19G224200) were analyzed using the genome sequencing data of wild soybeans to reveal the allelic variation in E genes (E1, E2, and E3) for predicting maturity in the wild soybean core collections ( Figure 5).
The SNP variants at C20,207,605T causing a silent mutation are shown as a blue line on the E1 gene in Figure 6A. Ten collections had thymine at position 20,207,605 (Table S2)

Discussion
The current study used 1467 wild soybean collections from the National Agrobiodiversity Center of South Korea to construct 408 Korean wild core collections based on 180 K Axiom ® SoyaSNP genotypic information. Although wild soybean germplasms increase genetic diversity and have favorable genes [3,4], it is challenging to grow wild soybean because of its undesirable traits, such as lodging, viny growth habits, hard seed coats, and pod shattering. Constructing a core subset is one of the strategies to overcome the handling of a large population of wild soybean accessions. In this study, the 180 K SNP markers were analyzed to directly reflect the genetic diversity of the total population, in contrast to phenotypic observations that are primarily influenced by environmental factors. Our previous study revealed 430 Korean core collections of G. max from 2872 collections based on the 180K SNP genotype [37][38][39][40]. Additionally, we recently provided resequencing data on the core collection of G. max and wild soybeans [38]. The Korean core collection of G. max and wild soybeans as diverse genetic resources can be used for genetic and genomic analyses in soybean breeding programs.
Studies have shown that the proportion of the core collection from the total population generally ranges from 5% to 20% and 99.0% of the genetic diversity of the total population [39,41]. This study revealed that 408 wild soybeans as the core collection can be useful genetic materials instead of growing 1467 soybean collections in wild soybean research. The core collection was evenly distributed in the total population of PC1 and PC2 ( Figure 1). Additionally, phylogenetics and PCA revealed that three clusters of core collections mainly represented the collection regions Korea, China, and Japan. The extent of genetic differentiation between each cluster is shown using the F st statistic (Table S1). The Chinese wild soybean collections (Cluster 1) in this study showed a higher level of population differentiation from the Korean (Cluster 2) and Japanese wild soybean collections (Cluster 3). The F st value between Cluster 2 and Cluster 3 was the smallest, suggesting that they were similar to each other. Therefore, the 408 Korean wild soybean accessions of the core subset in this study are helpful genetic germplasms for genetic studies and soybean breeding programs and for exploring candidate genes related to traits through GWASs.
GWAS analyses were conducted with 180 K Axiom ® SoyaSNP to identify the loci in the wild soybean for the days to flowering of the 408 core collections. Although we did not detect significantly overlapped SNPs in five different environments, two SNPs on chromosome 6 (AX-90416460), chromosome 10 (AX-90408467) at Gwangju in 2016 ( Figure 5D), and the mean of five environments ( Figure 5F) were significantly associated with the days to flowering. The SNPs on chromosome 6 were near the E1 locus at Gwanju in 2016, and most of the wild soybeans contained a functional E1 gene, except for two wild soybean accessions with deletions ( Figure 6A). The SNPs on chromosome 10 across three environments (Suwon in 2013, Gwangju in 2016, and the mean of five environments) were in the E2 locus. All 388 wild soybean collections contained a missense mutation at Plants 2023, 12, 1305 9 of 13 A45,305,285G (I220V) on the E2 gene, as determined via an analysis of the resequencing data of 388 wild soybean collections, although isoleucine and valine are nonpolar amino acids, meaning that the function of these 2 amino acids is similar. Thus, further research is required to identify the expression level of the E2 gene in wild soybean accessions.
Maturity E genes are highly associated with MGs and the days to maturity in soybeans. Zhai et al. [42] identified 180 soybean cultivars with E1, E2, E3, and E4 genotypes. The earliest-maturing lines had e1, e2, e3, and e4 genotypes, and the later-maturing lines had E1, E2, E3, and E4 genotypes in three northern and southern regions in China. Tsubokura et al. [43] used molecular markers to determine the genotypes of maturity genes and revealed that early-flowering lines had two or three recessive alleles and that late-flowering lines had E1, E2, E3, and E4 genotypes. Lagewisch et al. [34] proposed the E gene maturity model with an allelic variation in the E1, E2, and E3 genes to adapt to different MGs in the US and Canada. Thus, we analyzed the resequencing data of the wild soybean collections to identify the allelic variation in E1, E2, and E3 and predict the days to maturity in this study. However, most of the wild soybeans in the core collection had functional E1 and mutant alleles of the E2 gene at position A45,305,285G (I220V). The resequencing of the wild soybean collections revealed a small portion of alternative SNPs at each SNP position in the E3 gene.
We assumed that other maturity genes were likely to cause the days to flowering phenotype in wild soybean accessions. Recent studies have revealed that the stepwise selection of two homologous pseudo-response regulator genes, Tof11 and Tof12, has played an essential role in soybean growth at high latitudes during domestication [33,44,45]. Additionally, Tof5 encoded a homolog of Arabidopsis thaliana FRUIT-FULL that promotes flowering in soybeans [46]. They reported that Tof5 for flowering was under parallel selection during soybean domestication. We suggest that these wild soybean collections can be used as genetic materials to identify new maturity or flowering genes for further studies.
In conclusion, 408 wild soybean collections from 1467 accessions were constructed as core wild soybean collections from 180 K SNP genotypic data. These wild soybean collections can be used as genetic materials to identify new maturity or flowering genes and favorable traits for further studies. This core collection can provide helpful genetic resources for developing new cultivars, thereby facilitating the introgression of genes of interest from wild soybeans as part of a breeding program.

Wild Soybean Collections
In our previous study, 1467 wild soybean collections were obtained from the National Agrobiodiversity Center and Yeongnam University, Republic of Korea, to construct wild soybean core collections [40]. Among these, 1343 wild soybean accessions were collected from South Korea, 58 from China, 56 from Japan, and 6 from Russia, and 3 were unknown wild soybean accessions.

DNA Isolation and Genotyping with Axiom ® 180k SoyaSNP Array
To isolate the genomic DNA, we collected young trifoliate leaves from each wild soybean collection. Next, the leaf samples were ground into a fine power under liquid nitrogen with a mortar and pestle, and genomic DNA was extracted using the cetyltrimethylammonium bromide (CTAB) method with a minor modification [47]. The quality of the genomic DNA from each wild soybean accession was evaluated using 1.5% agarose gel electrophoresis. Next, 30 µL of 100 ng/µL genomic DNA from the 1467 wild soybean accessions was genotyped using the Axiom ® 180k SoyaSNP array [37]. The SNPs were scored based on the Axiom ® Genotyping Solution Data Analysis User Guide [48]. In total, 170,233 SNPs were genotyped from 180,961 SNPs using the Axiom ® 180k SoyaSNP array.

Construction of a Core Collection
A total of 170,233 SNPs of the 1467 wild soybean accessions were analyzed using GenoCore [49] to construct a core collection. GenoCore was used based on two parameters, coverage, and delta. The coverage provided a percentage value for a core subset representing the total population, and delta represented an increasing coverage ratio. To construct the core subset of the collection with GenoCore, the parameters were set as 99.0% of the coverage and 0.001% of the delta value.

Core Subset for Days to Flowering
Phenotypic observations of the days to flowering in five environments were obtained via field evaluations. A core collection of wild soybeans was tested in an experimental field at the National Institute of Crop Science (Suwon; 37 •

Population Structure of a Core Subset including 408 Wild Soybean Accessions
With PowerMarker 3.25 [51], we obtained diversity indices, such as minor allele frequencies, the genetic diversity index, PIC, and heterozygosity, using 408 wild soybean collections with 142,177 SNP markers. PCA was conducted with the 1467 wild soybean accessions and a core subset using the R package SNPRelate [52]. The PCA plot showed PC1 against PC2. For the phylogenetic tree in this study, an UPGMA tree was constructed with all wild soybean accessions using the calculation of a modified Euclidean distance between each pair of wild soybean collections, as determined via TASSEL [53]. The admixture plots from K = 1 to K = 10 were analyzed using fastSTRUCTURE [54]. Genetic differentiation (F st ) between each cluster was calculated using VCFtools [55].

GWAS for Days to Flowering
To conduct a GWAS, TASSEL and GAPIT were used in this study. A total of 142,177 SNPs were finalized for further analyses after the filtration of SNPs with 20% missing data and a minor allele frequency (0.01) and to construct PCA and the kinship coefficient matrix [56,57]. We used a compressed mixed linear model (CMLM) to generate Manhattan plots [57,58]. The significance threshold of −log 10 (p) was determined using Bonferroni's correction and false discovery rates.

Resequencing Data of 334 Wild Soybean Core Collections
Large datasets, including SNP and INDEL variants of 855 soybean accessions, were available from the figshare repository (https://figshare.com/projects/Soybean_haplotype_ map_project/76110 (accessed on 1 December 2022)) [38]. For further analyses, the SNPs and INDEL of 334 wild soybean collections were extracted from 855 resequenced accessions for allelic variation in E1, E2, and E3 genes. In total, 334 wild soybean collections belonged to the core subset in this study.

Statistical Analysis
Statistical analyses were conducted using SAS v9.4 (SAS Institute, Cary, NC, USA, 2013). Mean differences among the genotypic groups were analyzed using Fisher's least significant difference test at a p value of 0.05 using PROC GLM.