A Genome-Wide Association Study of Protein, Oil, and Amino Acid Content in Wild Soybean (Glycine soja)

Soybean (Glycine max L.) is a globally important source of plant proteins, oils, and amino acids for both humans and livestock. Wild soybean (Glycine soja Sieb. and Zucc.), the ancestor of cultivated soybean, could be a useful genetic source for increasing these components in soybean crops. In this study, 96,432 single-nucleotide polymorphisms (SNPs) across 203 wild soybean accessions from the 180K Axiom® Soya SNP array were investigated using an association analysis. Protein and oil content exhibited a highly significant negative correlation, while the 17 amino acids exhibited a highly significant positive correlation with each other. A genome-wide association study (GWAS) was conducted on the protein, oil, and amino acid content using the 203 wild soybean accessions. A total of 44 significant SNPs were associated with protein, oil, and amino acid content. Glyma.11g015500 and Glyma.20g050300, which contained SNPs detected from the GWAS, were selected as novel candidate genes for the protein and oil content, respectively. In addition, Glyma.01g053200 and Glyma.03g239700 were selected as novel candidate genes for nine of the amino acids (Ala, Asp, Glu, Gly, Leu, Lys, Pro, Ser, and Thr). The identification of the SNP markers related to protein, oil, and amino acid content reported in the present study is expected to help improve the quality of selective breeding programs for soybeans.


Introduction
Due to its high protein and oil content, soybean (Glycine max L.) is one of the world's most important crops, accounting for the largest proportion of protein consumption, livestock feed, and oil seed production (http://soystats.com, accessed on 1 November 2022). Soybean protein contains all of the essential amino acids including isoleucine (Ile), histidine (His), leucine (Leu), lysine (Lys), methionine (Met), phenylalanine (Phe), threonine (Thr), tryptophan (Trp), and valine (Val), making it a nutritionally valuable crop [1]. In Korea, soybean is used in various products, including tofu, soymilk, soybean sprouts, and soybean paste. Therefore, the gene-based improvement of protein, oil, and amino acid content is a very important goal in soybean breeding. However, the domestication bottleneck and selective breeding have led to a significant reduction in the genetic diversity of modern soybean cultivars, which has hindered breeding progress [2].

Phenotypic Variation and Correlation Analysis
In 2015 and 2016, seeds harvested from the 203 wild soybean accessions were used to determine the protein, oil, and amino acid contents. The protein content ranged from 38.61% to 52.61% (average 47.88%) in 2015 and from 38.83% to 53.21% (average 47.92%) in 2016 (Table 1). The oil content ranged from 4.29% to 12.86% (average 7.11%) in 2015 and from 4.66% to 12.69% (average 7.51%) in 2016. The average coefficient of variation for the oil content was 18.5%, considerably higher than that for the protein content (5.1%). Based on the skewness and kurtosis of the data, the protein content exhibited a negatively skewed distribution, while the oil content was positively skewed (Table 1 and Figure 1). The broad-sense heritability (h²) for the protein and oil content was 0.78 and 0.83, respectively.
Table 1 abbreviations: Env., environment; Min., minimum; Max., maximum; SD, standard deviation; CV, coefficient of variation; Skew, skewness; Kur., kurtosis; h², broad-sense heritability.
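The coefficient of variation and skewness reported for protein and oil can be illustrated with a minimal sketch. It assumes the CV is the sample standard deviation divided by the mean and that skewness is the moment-based coefficient; the text does not state which estimators were used, and the values below are hypothetical, not data from the study.

```python
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation / mean * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100


def skewness(values):
    """Return the moment-based skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical protein contents (%), chosen only to show a long left tail.
protein = [38.6, 45.2, 47.1, 48.0, 48.9, 49.5, 50.3, 52.6]

print(f"CV = {coefficient_of_variation(protein):.1f}%")
print(f"skewness = {skewness(protein):.2f}")  # negative value: left-skewed
```

A negative skewness, as reported for protein, means that a few accessions with unusually low values pull the left tail out; a positive skewness, as reported for oil, means the opposite.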
The contents of 17 amino acids, including Met and Cys, varied widely among the 203 wild soybean accessions (Table 2). Of these 17 amino acids, Glu was the most abundant in the seeds, with an average content of 7019 mg, while Cys had the lowest average content at 348 mg (Table 2 and Figure 2). Conversely, Cys had the largest coefficient of variation, averaging 14%, while Thr had the smallest, averaging 4.9%. Arginine (Arg) had the highest h² (0.71), while Met had the lowest (0.35).
Figure 2. Distribution of protein and oil content in 203 wild soybean accessions. Asp, aspartic acid; Thr, threonine; Ser, serine; Glu, glutamic acid; Pro, proline; Gly, glycine; Ala, alanine; Cys, cysteine; Val, valine; Met, methionine; Iso, isoleucine; Leu, leucine; Tyr, tyrosine; Phe, phenylalanine; His, histidine; Lys, lysine; Arg, arginine.
The average correlations between the protein, oil, and amino acid contents in 2015 and 2016 are presented in Figure 3. The p-values of the correlation coefficients for all seed components were significant at less than 0.001. The correlation between protein and oil content was negative at r = −0.64, while the correlations among the amino acid contents were all positive. Among the amino acids, Glu and Leu showed the highest correlation at r = 0.96, while Val and Met showed the lowest correlation at r = 0.34. In addition, each amino acid content had a positive correlation with protein content and a negative correlation with oil content.
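As a rough illustration of why these correlations are significant at p < 0.001, the sketch below computes a Pearson coefficient and the t statistic used to test it against zero. It assumes n = 203 accessions per correlation; the reported r values are averages over two years, so this is an approximation and not the authors' exact calculation.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """Return the t statistic for H0: r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Protein vs. oil: r = -0.64 from the text, n = 203 assumed.
print(f"t = {t_statistic(-0.64, 203):.2f}")
# Weakest amino acid pair (Val vs. Met): r = 0.34 from the text.
print(f"t = {t_statistic(0.34, 203):.2f}")
```

With about 200 degrees of freedom, a two-sided p-value of 0.001 corresponds to |t| of roughly 3.3, so both the strongest and the weakest correlations reported here clear that bar under the stated assumption.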

Genome-Wide Association Study for Protein, Oil, and Amino Acid Content
A GWAS was conducted using a linear mixed model for the protein, oil, and 17 amino acid contents. Based on a significance threshold of −log10(P) ≥ 6.29, thirteen SNPs across six chromosomes and thirty SNPs across four chromosomes were detected for protein and oil, respectively (Figure 4). Within each associated genomic region, the SNP with the lowest p-value was selected as the causal SNP for the target trait. The selected SNP markers are summarized in Table 3. For the protein content, six SNP markers were identified on chromosomes three (AX-90486230), eleven (AX-90422214), twelve (AX-90436656), thirteen (AX-90336510), fourteen (AX-90450715), and fifteen (AX-90368184), while seven SNP markers were identified for the oil content on chromosomes twelve (AX-90408186 and AX-90513548), thirteen (AX-90440743), fourteen (AX-90525501), and twenty (AX-90387626, AX-90339137, and AX-90513791).
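The significance threshold used here follows from the Bonferroni rule P = α/n described in the methods, with n = 96,432 SNPs. The short sketch below reproduces the arithmetic; the function name is illustrative and not part of any GWAS package.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Return the per-test p-value cutoff and its -log10 value."""
    p_cutoff = alpha / n_tests
    return p_cutoff, -math.log10(p_cutoff)


N_SNPS = 96_432  # number of SNPs retained after quality control

for label, alpha in (("suggestive", 1.0), ("significant", 0.05)):
    p_cutoff, neg_log_p = bonferroni_threshold(alpha, N_SNPS)
    print(f"{label}: P = {p_cutoff:.2e}, -log10(P) = {neg_log_p:.2f}")
```

Rounded to two decimals, the two cutoffs come out at 4.98 and 6.29, matching the suggestive and significance thresholds quoted in the text.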
The allelic effect of the SNP marker was estimated for each trait, and the highest differences per trait are displayed in Figure 5. The protein-associated SNP marker AX-90422214 on chromosome 11 had the alleles G/A, and the average protein content for individuals with GG alleles was 42.84 g, which was 5.35 g lower than the average protein content for individuals with AA alleles (48.19 g). In addition, the oil-associated SNP marker AX-90513548 on chromosome 12 had the alleles C/T, and the average oil content for the individuals with CC alleles was 11.11 g, which was 3.98 g more than the average oil content for individuals with TT alleles (7.13 g).
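The allelic effect described above is the difference between the mean phenotypes of the two homozygous genotype classes. A minimal sketch, using hypothetical genotypes and protein values (not data from the study), is:

```python
from collections import defaultdict
from statistics import mean


def allelic_effect(genotypes, phenotypes, class_a, class_b):
    """Return mean(phenotype | class_a) - mean(phenotype | class_b)."""
    groups = defaultdict(list)
    for genotype, value in zip(genotypes, phenotypes):
        groups[genotype].append(value)
    return mean(groups[class_a]) - mean(groups[class_b])


# Hypothetical accessions genotyped at a G/A SNP, with protein content.
genotypes = ["GG", "AA", "AA", "GG", "AA"]
protein = [42.5, 48.0, 48.4, 43.1, 48.2]

effect = allelic_effect(genotypes, protein, "AA", "GG")
print(f"AA minus GG: {effect:.2f}")
```

Applied to the group means reported for AX-90422214, the same subtraction gives 48.19 − 42.84 = 5.35, and for AX-90513548 it gives 11.11 − 7.13 = 3.98.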
At a suggestive threshold of −log10(P) ≥ 4.98, nineteen SNPs across five chromosomes were associated with Ala, six SNPs across three chromosomes with Arg, twenty SNPs across five chromosomes with Asp, two SNPs across two chromosomes with Cys, fourteen SNPs across three chromosomes with Glu, thirteen SNPs across five chromosomes with Gly, two SNPs across two chromosomes with His, fifteen SNPs across four chromosomes with Leu, forty-six SNPs across twelve chromosomes with Lys, three SNPs across two chromosomes with Phe, twenty-four SNPs across eight chromosomes with Pro, sixteen SNPs across seven chromosomes with Ser, twelve SNPs across four chromosomes with Thr, and three SNPs on one chromosome with Val (Figure S1 and Table 4). Only one SNP was detected for each of Iso and Tyr, and no suggestive or significant SNPs were detected for Met. AX-90332294 on chromosome one and AX-90522787 on chromosome three were individually or simultaneously associated with eleven of the seventeen amino acids (all except Arg, Cys, His, Phe, Val, and Met). In addition, the AX-90397199 marker, which was located in almost the same position as the AX-90522787 marker on chromosome three, was associated with Leu and Lys.
The allelic effects of SNP markers on amino acid contents are presented in Figure 6. Glu, which exhibited the largest difference in allelic effects, was associated with SNP marker AX-90332294 on chromosome one, which had the alleles G/T; the average Glu content for individuals with GG alleles was 8829 mg, which was 1033 mg higher than the average Glu content for individuals with TT alleles (7797 mg/g). On the other hand, Cys, which had the smallest allelic effect, was associated with SNP marker AX-90358159 on chromosome seven, which had the alleles C/T; the average Cys content for individuals with CC alleles was 540 mg, which was 104 mg higher than the average Cys content for individuals with TT alleles (437 mg/g).

Figure 6. Phenotypic differences between lines carrying different SNP alleles associated with amino acid contents. Asp, aspartic acid; Thr, threonine; Ser, serine; Glu, glutamic acid; Pro, proline; Gly, glycine; Ala, alanine; Cys, cysteine; Val, valine; Leu, leucine; Phe, phenylalanine; His, histidine; Lys, lysine; Arg, arginine.

Discussion
The wild soybean accessions used in this study were collected from Korea, China, Japan, and Russia and harbor substantial genetic diversity [24], which can be utilized for soybean improvement by applying a GWAS to identify useful alleles. Soybeans contain not only essential amino acids but also a large amount of unsaturated fatty acids; thus, they are widely consumed for health purposes. As a result, many studies have been conducted on QTLs involved in regulating protein and oil content. The present study analyzed 203 wild soybean accessions grown for two years for their content of protein, oil, and 17 amino acids. The average protein and oil content for wild soybean was 47.84% and 7.33%, respectively. This is consistent with several previous studies that have reported a higher protein content and lower oil content in wild soybean than the 40% and 20% widely reported for protein and oil, respectively, in cultivated soybean [6][7][8]. Globally, soybean accounts for the largest proportion of oilseed production at 61% (http://soystats.com/, accessed on 1 November 2022). These results indicate that the oil content increased during the domestication of cultivated soybean from wild soybean, whereas the protein content shifted lower. This can be explained by the negative correlation between protein and oil content [11,12,21], which is also observed in Figure 3. Soybean seeds contain both constituent amino acids, which make up the proteins, and free amino acids. In this study, the constituent amino acids were analyzed; Glu had the highest content, and the correlations among the amino acids were significantly positive. Chotekajorn et al. (2021) [25] analyzed the free amino acids from 316 wild soybean accessions and found that Arg was the most abundant, while most of the amino acids were positively correlated with each other, similar to the results of this study.
In the GWAS results, five and six genes containing detected SNP markers were found for the protein and oil content, respectively. It has been widely reported by many studies that major candidate genes associated with protein contents are present on chromosomes 15 and 20 [13,15,17,19,20]. Kim et al. (2016) conducted fine-mapping of the protein and oil content using a backcross population with the high-protein line PI 407788A as the donor parent and Williams 82 as the recurrent parent and found that QTLs were located between BARCSOYSSR_15_0161 and BARCSOYSSR_15_0194 on chromosome 15 [19]. In addition, Fliege et al. (2022) recently conducted fine-mapping and the RNAi transformation of the protein content using a backcross population with the high-protein wild soybean line PI 468916 as the donor parent and A81-356022 as the recurrent parent in the initial stages of a large-scale QTL analysis for soy protein and oil content [20]. Their research revealed that the protein content is regulated by a CCT domain protein polymorphism in the Glyma.20G85100 gene on chromosome 20 [20]. In this study, AX-90368184 on chromosome 15 and AX-90513791 on chromosome 20, which were associated with protein and oil content, respectively, were detected at positions similar to the aforementioned major QTLs. The genes on the reference genome where the SNPs are located are Glyma.15g055200 (F-box and associated interaction domain-containing protein) and Glyma.20g087700 (protein of unknown function), which differ from the aforementioned genes. However, it is clear that the major QTLs are located on chromosomes 15 and 20.
These differences can be ascribed to the analysis of different accessions and the fact that the protein and oil content is not regulated by a single gene. The DNA binding with one finger (DOF) family of plant-specific transcription factors (TFs) is known to regulate seed protein accumulation and mobilization [26]. OBP3, the annotation of the Glyma.11g015500 gene detected on chromosome 11, is a member of the DOF family. It has been reported that OBP3 regulates phytochrome and cryptochrome signaling in Arabidopsis thaliana [27] and plays an important role in growth and development [26]. However, the function of OBP3 in soybean is unknown. The involvement of the DOF family in protein accumulation suggests that Glyma.11g015500 may be a strong candidate gene for regulating the protein content.
The Glyma.20g050300 gene, which encodes a zinc-binding alcohol dehydrogenase family protein, was detected on chromosome 20 and associated with oil content. Soybean alcohol dehydrogenase has been found to be active in anaerobic reactions and seed respiration, including in response to flooding stress [28,29]. On the other hand, alcohol dehydrogenase was included among the fatty acid synthesis-related proteins identified in the comparative proteomics of the high-fat soybean cultivar JY73 in a previous study [30]. Therefore, the oil content may be indirectly affected by alcohol dehydrogenase depending on the condition of the seed; thus, Glyma.20g050300 may be a candidate gene for the regulation of the oil content.
Interestingly, in the GWAS results for the content of the seventeen amino acids, markers AX-90332294 and AX-90522787, located on chromosomes one and three, respectively, were detected for nine amino acids. In particular, AX-90332294 exhibited the largest SNP variance for each amino acid (Figure 6). For these amino acids, Glyma.01g053200 and Glyma.03g239700, which were annotated as a prefoldin chaperone subunit family protein and an aspartyl protease/7S seed globulin precursor, respectively, were identified. Chaperones are known to act as proteolytic enzymes in eukaryotes by inducing proteases that aid in the structural folding of protein complexes or the degradation of proteins [31,32]. Prefoldins are a family of chaperone proteins, which are heterohexameric proteins composed of two α subunits and four β subunits [31,33]. Protein complexes are ultimately built from amino acids, so Glyma.01g053200 may be related to the content of amino acids. In addition, it is known that β-conglycinin (7S) and glycinin (11S) account for more than 70% of the total soybean storage proteins [23,34]. The fact that the 7S and 11S proteins make up a significant portion of soybean storage proteins may not be directly related to the presence of SNPs in structural genes. Rather, the expression and accumulation of these proteins are known to be controlled by regulatory elements, such as promoters and enhancers, that govern the transcription and translation of the corresponding genes. However, genetic variation, including SNPs in structural genes encoding 7S and 11S proteins, can affect expression levels or protein function, which in turn can affect soybean protein composition and nutritional value. Thus, while the presence of SNPs in structural genes may not be directly related to the abundance of storage proteins in soybeans, genetic variation in these genes may, in a broader sense, have important implications for the quality and utilization of soybean proteins.
In this study, Glyma.03g239700, Glyma.19g164800, and Glyma.19g164900 are associated with precursors or subunits of 7S and 11S storage proteins. In addition, amino acid synthesis involves several complex processes [35] and can be regulated by shikimate dehydrogenase (Glyma.03g242400), chorismate mutase (Glyma.06g061700), arogenate dehydratase (Glyma.12g072500), asparagine (Glyma.20g025400), and aminotransferase (Glyma.15g012300). Gene expression patterns for the candidate genes, obtained from https://www.soybase.org/soyseq/ (accessed on 1 November 2022), are shown in Table S1. The candidate genes were expressed in various tissues and at various stages of seed development; in particular, Glyma.03g239700 was strongly expressed during seed development. This information suggests that the candidate genes detected in the present study may directly or indirectly affect amino acid contents.

Plant Materials and Field Management
A total of 203 wild soybean accessions were used for analysis [24]. Briefly, they were cultivated and harvested in an experimental field at Chonnam National University (Gwangju, 36°17′ N, 126°39′ E, Republic of Korea) in 2015 and 2016. Each accession was planted in a single hill plot measuring 1 × 1 m in two replicates in the experimental field. A compound fertilizer with a ratio of nitrogen, phosphorus pentoxide, and potassium oxide of 8:8:9 was applied at 40 kg per 1000 m².

Analysis of Protein, Oil, and Amino Acid Content
Seed samples were dried in a drying oven at 40 °C for 7 days, finely ground, and weighed into 3 g portions. The protein and oil contents were measured using the Kjeldahl method and the ether extraction method, respectively, following Kim et al. (2023) [36]. The composition and content of the amino acids in the soybean seeds were analyzed using an amino acid analyzer (S433-H, SYKAM GmbH, Munich, Germany). Briefly, in the pretreatment process, 0.1 g of the sample was weighed into an 18 mL test tube, and 5 mL of 6 N hydrochloric acid (HCl) was added. The tube was sealed under reduced pressure (nitrogen gas filling) and then hydrolyzed in a heating block set at 110 °C for 24 h. After hydrolysis was completed, the acid was removed via rotary evaporation at 50 °C. Using 10 mL of a sodium dilution buffer (pH 3.45-10.85), 1 mL of the sample was then filtered through a 0.2 µm membrane filter in a cation separation column (LCA K06/Na, 4.6 × 150 mm), with a flow rate for the buffer solution of 0.45 mL/min, a flow rate for the reagent of 0.25 mL/min, and a column temperature of 57-74 °C. The amino acids were identified using a fluorescence spectrophotometer at wavelengths of 440 and 570 nm.

DNA Extraction and SNP Genotyping
Leaf tissue was collected from young trifoliate leaves at the V2 seedling stage and ground to a fine powder in a mortar using liquid nitrogen. Genomic DNA was extracted from the ground leaf tissue using a DNeasy Plant Mini Kit (QIAGEN, Valencia, CA, USA) according to the manufacturer's instructions. The quantity and quality of the extracted DNA were assessed using a Nano-MD UV-Vis spectrophotometer (Scinco, Seoul, Republic of Korea). All 203 wild soybean accessions were genotyped using the 180K Axiom Soya SNP array (Affymetrix, CA, USA) [37]. SNPs genotyped in fewer than 95% of the samples and SNPs with a minor allele frequency (MAF) of less than 5% were removed as low quality, and duplicates in the raw data were removed using R software [38]. Missing genotypes were imputed using BEAGLE software 3.0 [39], resulting in a total of 96,432 SNPs used in the analysis.
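The two quality filters described above (call rate of at least 95% and MAF of at least 5%) can be sketched as follows. This is an illustration only, not the authors' R code: it assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele with None for a missing call, and all function names are hypothetical.

```python
def call_rate(calls):
    """Return the fraction of samples with a non-missing genotype call."""
    return sum(c is not None for c in calls) / len(calls)


def minor_allele_frequency(calls):
    """Return the MAF from calls coded as 0/1/2 alternate-allele copies."""
    observed = [c for c in calls if c is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))
    return min(alt_freq, 1 - alt_freq)


def passes_qc(calls, min_call_rate=0.95, min_maf=0.05):
    """Return True if a SNP meets both the call-rate and MAF thresholds."""
    return (call_rate(calls) >= min_call_rate
            and minor_allele_frequency(calls) >= min_maf)


# Hypothetical SNPs across 20 samples.
common_snp = [0] * 10 + [2] * 10          # MAF 0.5, no missing calls
rare_snp = [0] * 20                       # monomorphic, MAF 0
sparse_snp = [0] * 8 + [2] * 8 + [None] * 4  # call rate 0.8

for name, calls in (("common", common_snp), ("rare", rare_snp),
                    ("sparse", sparse_snp)):
    print(name, passes_qc(calls))
```

Filtering before imputation, as the text describes, keeps poorly genotyped markers from being filled in with low-confidence estimates.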

Genome-Wide Association Study and Statistical Analysis
The GWAS method was the same as that of Kim et al. (2023) [24]. Briefly, a linear mixed model using the restricted maximum likelihood (REML) algorithm was used, taking the population structure and similarity matrix into account. In addition, association analysis was performed using merged phenotypes to increase resolution between environments, and the −log10(P) thresholds in the Manhattan plots were calculated using the Bonferroni method (P = α/n). With 96,432 SNPs used in this study, the Bonferroni-corrected thresholds for the p-values were 1.04 × 10⁻⁵ (α = 1) and 5.19 × 10⁻⁷ (α = 0.05), with equivalent −log10(P) values of 4.98 for the suggestive threshold and 6.29 for the significance threshold [40]. The soybean reference genome, Glyma.Wm82.a2.v1, from https://www.soybase.org (accessed on 1 November 2022), was used as the gene model for candidate gene identification. Phenotypic data for each trait were subjected to descriptive statistics and correlation analysis using Microsoft Excel 2016. Broad-sense heritability (h²) and the GWAS were calculated using QTLmax V2 software [41], and the following formula was used to calculate h²:

Conclusions
In this study, a GWAS analysis was conducted for protein, oil, and amino acid content using 203 wild soybean accessions. SNP markers for protein and oil content were detected at novel loci along with loci on chromosomes 15 and 20, which have been reported as major locations by several previous studies. In addition, candidate genes Glyma.01g053200 and Glyma.03g239700 were also found to have the potential to affect the content of nine amino acids. These findings could thus prove useful for soybean breeding programs.