Article

Genome-Wide Association Study Reveals Key Genetic Loci Controlling Oil Content in Soybean Seeds

1 Soybean Research Institute of Heilongjiang Academy of Agriculture Sciences, Harbin 150086, China
2 College of Life Science, Northeast Agricultural University, Harbin 150030, China
3 Soybean Research Institute of Jilin Academy of Agriculture Sciences (Northeast Agricultural Research Center of China), Changchun 130033, China
4 Plant Production Department, Faculty of Agriculture Saba Basha, Alexandria University, Alexandria 21531, Egypt
5 Institute of Biotechnology of Heilongjiang Academy of Agricultural Sciences, Harbin 150023, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Agronomy 2025, 15(8), 1889; https://doi.org/10.3390/agronomy15081889
Submission received: 26 June 2025 / Revised: 1 August 2025 / Accepted: 4 August 2025 / Published: 5 August 2025

Abstract

Seed oil is a key trait of soybean, a crop of substantial economic significance that contributes roughly 60% of global oilseed production. This study used genome-wide association mapping to identify genetic loci associated with oil content in soybean seeds. A panel of 341 soybean accessions, primarily sourced from Northeast China, was assessed for seed oil content in Heilongjiang Province with three replications in each of two growing seasons (2021 and 2023) and genotyped by whole-genome resequencing, yielding 1,048,576 high-quality SNP markers. Phenotypic analysis indicated notable variation in oil content, ranging from 11.00% to 21.77%, with an average increase of 1.73% to 2.28% between 2021 and 2023 across germplasm from all regions of origin. A genome-wide association study (GWAS) revealed 119 significant single-nucleotide polymorphism (SNP) loci associated with oil content, with a prominent cluster of 77 SNPs located on chromosome 8. Candidate gene analysis identified four key genes potentially implicated in oil content regulation, selected based on proximity to significant SNPs (≤10 kb) and functional annotation related to lipid metabolism and signal transduction. Notably, Glyma.08G123500, encoding a receptor-like kinase involved in signal transduction, contained multiple significant SNPs with PROVEAN scores ranging from deleterious (−1.633) to neutral (0.933), indicating varied predicted effects on protein function. Additional candidate genes include Glyma.08G110000 (hydroxycinnamoyl-CoA transferase), Glyma.08G117400 (PPR repeat protein), and Glyma.08G117600 (WD40 repeat protein), each showing distinct expression patterns and functional roles. Some SNP clusters were associated with increased oil content, while others correlated with decreased oil content, indicating complex genetic regulation of this trait. The findings provide molecular markers with potential for marker-assisted selection (MAS) in breeding programs aimed at increasing soybean oil content and enhance our understanding of the genetic architecture governing this critical agricultural trait.

1. Introduction

Soybean (Glycine max) accounts for 60% of global oilseed production and provides over 25% of the world’s protein for humans and animals [1,2]. Originating from wild soybean (Glycine soja) in central China about 5000 years ago, it has since become widespread worldwide [3,4]. Modern soybean seeds generally contain 17% oil, 35% protein (both essential and non-essential amino acids), 31% carbohydrates (soluble and insoluble), 13% moisture, and 4% ash [5,6]. Depending on variety and growing conditions, oil content ranges from 8.3% to 27.9%, while protein concentration varies from 34.1% to 56.8% [3]. Soybean oil primarily consists of fatty acids (FAs), triacylglycerols (TAGs), and tocopherols, with five main fatty acids determining oil quality: stearic acid (C18:0), oleic acid (C18:1), linoleic acid (C18:2), linolenic acid (C18:3), and palmitic acid (C16:0) [7,8,9]. As a major cooking oil with health benefits, its synthesis pathways remain a significant focus of research [10,11].
High-throughput sequencing and genome-wide association studies (GWASs) have accelerated the identification of quantitative trait loci (QTLs) and genes linked to agronomic traits across various crops [12,13]. The oil content in soybeans is quantitatively inherited and regulated by multiple genes [14]. The GWAS methodology employs natural populations with extensive recombination events, thereby improving the accuracy of phenotype association compared to biparental populations [15,16,17]. This approach has been used to identify genetic regions associated with numerous plant traits including oil content, protein content, yield, quality, and stress responses [18,19,20]. To date, genomic analyses have identified over 300 QTLs associated with soybean seed oil content, as documented in the SoyBase database (https://www.soybase.org).
Recent molecular breeding efforts have utilized GWASs to identify oil synthesis genes and develop useful markers. Notable studies include the genotyping of 320 soybean accessions, which revealed 29 oil-related QTLs, 24 of them novel, and a GWAS of 278 soybean accessions that identified three significant QTLs, with SNP ss715637321 on chromosome 20 coinciding with a previously documented large-effect oil QTL [21]. Additional studies have identified the GmST05 gene affecting seed size through a GWAS of over 1800 soybean samples, discovered a SEIPIN homologous gene on chromosome 9 regulating fatty acid synthesis from an analysis of 547 soybean samples, and identified five oil-related QTLs through QTL mapping in a recombinant inbred line population [22,23,24,25,26].
Triacylglycerol (TAG) synthesis, the predominant form of seed oil accumulation, is primarily governed by the diacylglycerol acyltransferase (DGAT) gene, with overexpression studies showing enhanced oil accumulation [27,28]. Several genes influence soybean oil content and synthesis, including GmOLEO1 (enhancing oil accumulation through TAG synthesis), GmSWEET39 (positively influencing total oil content), and transcription factors like WRI1 (regulating key genes in glycolysis and fatty acid synthesis) [9,29,30,31,32]. Additional transcription factors such as LEC1, LEC2, MYB, ABI3, and bZIP also play essential roles in seed oil accumulation [33,34,35,36]. The regulation of oil biosynthesis in soybean involves complex transcriptional and epigenetic mechanisms that coordinate fatty acid synthesis during seed development. The WRINKLED1 (WRI1) transcription factor, a member of the AP2/ERF family, serves as a master regulator by directly activating genes encoding acetyl-CoA carboxylase, fatty acid synthase subunits, and other key enzymes in the fatty acid biosynthesis pathway [37,38]. In soybean, multiple WRI1 homologs exhibit tissue-specific expression patterns during seed filling, with GmWRI1a and GmWRI1b showing distinct temporal regulation [39,40]. Additionally, MYB transcription factors, including MYB89 and MYB96, modulate fatty acid desaturase gene expression and coordinate oil accumulation with seed maturation processes [41,42]. Epigenetic regulation through DNA methylation and histone modifications further fine-tunes the expression of oil biosynthesis genes, with dynamic methylation changes observed in gene promoter regions during seed development [43,44]. These multilayered regulatory networks suggest that genetic variants affecting transcriptional or epigenetic control mechanisms significantly influence oil content variation, highlighting the importance of comprehensive functional annotation in association studies.
Despite extensive GWAS research on seed oil, much work remains to fully understand the molecular basis of seed oil content and composition. In our study, a natural population of 341 germplasm resources was evaluated for soybean oil content over two years using phenotypic data and whole-genome resequencing. The genome-wide association study identified significant SNP regions linked to soybean oil content, while subsequent QTL mapping helped pinpoint candidate genes. Analyzing these genes advances our understanding of the genetic factors affecting soybean oil content. These findings offer valuable support for developing high-oil soybean varieties through biotechnological breeding methods, ultimately boosting soybean oil production and improving molecular design breeding strategies.

2. Materials and Methods

2.1. Experimental Materials and Field Trials

The experimental material for this study was an association panel of 341 soybean germplasm accessions selected for their wide range of oil content phenotypes, distant genetic relationships, and diverse geographical origins, making them well-suited for studying the genetic basis of oil content variation. A total of 340 accessions originated from China, with 336 from Northeast China and 4 from Northwest China, and 1 accession originated from the United States. Jilin Province contributed 153 accessions (44.87%), Heilongjiang Province 126 (36.95%), and Liaoning Province 47 (13.78%). The remaining accessions were sourced from other regions in smaller numbers.
All 341 materials were planted in both 2021 and 2023 growing seasons from mid-May to early October at the National Modern Agricultural Experimental Farm of Heilongjiang Academy of Agricultural Sciences, located in Minzhu Town, Daowai District, Harbin City, Heilongjiang Province. Climate conditions of the two years are presented in Table S1. The climatic conditions differed substantially between the 2021 and 2023 growing seasons, which likely contributed to the observed variations in soybean oil content. In 2021, the growing season was characterized by relatively moderate temperatures and variable precipitation patterns. Monthly mean temperatures ranged from 7.6 °C (October) to 25.6 °C (July), with total precipitation of 357.3 mm distributed unevenly across months. The season experienced lower relative humidity (45–62%) and moderate sunshine hours (180–265 h monthly). Notably, August showed high precipitation (106.4 mm) with 15 rainy days, potentially affecting pod filling stages. In contrast, 2023 exhibited warmer temperatures and higher precipitation, particularly during critical reproductive phases. Mean temperatures were consistently higher, ranging from 8.5 °C (October) to 23.0 °C (July), representing approximately 1–2 °C increases compared to 2021. Total precipitation reached 532.7 mm, with July showing exceptionally high rainfall (221.6 mm, 22 rainy days) during flowering and early pod development. The 2023 season also featured higher relative humidity (52–75%) and increased wind speeds. These contrasting environmental conditions—cooler, drier conditions in 2021 versus warmer, wetter conditions in 2023—provided an excellent natural experiment for evaluating the stability of genetic associations across diverse climatic environments, strengthening the robustness of the identified GWAS associations.
A randomized complete block design was used, consisting of single-row plots with a row length of 3 m, row spacing of 65 cm, and in-row plant spacing of 5 cm. Each accession was replicated three times in each growing season. Field management adhered to prescribed protocols. At maturity, seeds were harvested and naturally sun-dried for phenotyping of oil content.

2.2. Determination of Soybean Seed Oil Content

After the field harvest, the oil content of dried soybean seeds gathered during the 2021 and 2023 growing seasons was quantitatively evaluated using an NIRS DS 2500 (FOSS, Denmark) near-infrared analyzer. This non-destructive spectroscopic method enables swift and precise assessment of seed composition parameters by analyzing specific wavelength absorption patterns. Each biological sample underwent triplicate measurements to guarantee analytical precision and reliability. The calculated arithmetic mean was then used as the definitive phenotypic data point for all subsequent statistical analyses. To validate the accuracy of NIRS measurements, a subset of samples (n = 30) was cross-validated using the traditional Soxhlet extraction method, confirming the reliability of the spectroscopic approach for oil content determination.

2.3. Genome-Wide Association Analysis

The 341 soybean varieties were sequenced on the HiSeq 2000 high-throughput sequencing platform at an average sequencing depth of 10×. The sequencing data were processed using the Genome Analysis Toolkit (GATK) [45], and the filtered data were aligned to the soybean reference genome (v. Wm82.a2) using BWA software (Version: 0.6.1-r104) [46] with default parameters. SNP markers were collected and sorted using GATK, then recalibrated using a Gaussian mixture model to remove outliers. SNPs with a missing rate ≥ 10% or a minor allele frequency ≤ 5% were filtered out using PLINK [47]. Linkage disequilibrium (LD) analysis was performed using PopLDdecay software (v3.42) [48], with the LD coefficient r² as the metric. The LD values were calculated using the LD synthesis method in the R package SNPRelate (R version 4.4.3), and highly linked SNPs were pruned. The LD decay plot was adjusted using the LOWESS tool in R, with a smoothing parameter of 0.01. Population structure was analyzed using fastSTRUCTURE v2.3.1 software [49] and principal component analysis (PCA) in the R package SNPRelate (R version 4.4.3). Kinship analysis was performed using GAPIT software (v3) [50]. Significant SNP loci were identified based on 341 varieties and 1,048,576 SNP markers, using the MLM (mixed linear model) in GAPIT software (v3) for genome-wide association analysis. The genome-wide significance threshold was set at p < 10⁻⁶, which was determined using Bonferroni correction for multiple testing to control the family-wise error rate across all tested SNP markers, ensuring stringent control of false positive associations.
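The marker filter described above can be illustrated with a minimal sketch. It is not the PLINK implementation: the genotype encoding (alternate-allele counts 0, 1, 2, with `None` for a missing call), the function names, and the reading that a SNP is dropped when either criterion is met are assumptions made for illustration.

```python
from typing import List, Optional

# Assumed encoding: alternate-allele count (0, 1, or 2); None marks a missing call.
Genotype = Optional[int]


def missing_rate(calls: List[Genotype]) -> float:
    """Fraction of accessions without a genotype call at this SNP."""
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(calls: List[Genotype]) -> float:
    """Frequency of the less common allele among called diploid genotypes."""
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def passes_qc(calls: List[Genotype],
              max_missing: float = 0.10,
              min_maf: float = 0.05) -> bool:
    """Keep a SNP only if its missing rate is below 10% and its MAF is above 5%."""
    return (missing_rate(calls) < max_missing
            and minor_allele_frequency(calls) > min_maf)


# Hypothetical SNP typed in 20 accessions: two homozygous alternate, none missing.
example_calls: List[Genotype] = [0] * 18 + [2] * 2
keep_example = passes_qc(example_calls)  # MAF = 4/40 = 0.10, so the SNP is kept
```

The strict inequalities mirror the stated removal criteria (missing rate ≥ 10%, MAF ≤ 5%), so a SNP sitting exactly on either boundary is removed.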

2.4. Identification and Validation of Candidate Genes

In this study, the linkage disequilibrium (LD) decay distance was considered, and genomic regions extending 200 kb both upstream and downstream of significant SNP loci were systematically analyzed to identify genes associated with oil content in the soybean reference genome Williams 82 (http://www.soybase.org/). The selected window size is based on established linkage disequilibrium patterns in soybean, aiming to encompass both cis-regulatory elements and proximal genes that affect the trait of interest. A thorough functional annotation of the identified candidate genes was conducted by integrating various bioinformatic resources. These included The Arabidopsis Information Resource (TAIR), Gene Ontology (GO), Protein Families Database (PFAM), Protein Analysis Through Evolutionary Relationships (PANTHER), and KOG (eukaryotic orthologous groups) annotation. The multi-database approach facilitated a comprehensive characterization of gene function, evolutionary conservation, protein domain structure, and metabolic pathway involvement, thereby aiding in the identification of biologically plausible candidates that regulate soybean seed oil accumulation.
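The window-based search described above can be sketched as a simple interval query. The `Gene` record, the function name, and the gene coordinates below are hypothetical and serve only to illustrate the ±200 kb rule; they are not taken from the Williams 82 annotation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    gene_id: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_near_snp(chrom: str, pos: int, genes: List[Gene],
                   flank_bp: int = 200_000) -> List[Gene]:
    """Return genes on the same chromosome that overlap pos +/- flank_bp."""
    window_start = max(1, pos - flank_bp)
    window_end = pos + flank_bp
    return [g for g in genes
            if g.chrom == chrom
            and g.start <= window_end
            and g.end >= window_start]


# Hypothetical gene models, for illustration only.
toy_genes = [
    Gene("geneA", "Chr08", 8_900_000, 8_905_000),
    Gene("geneB", "Chr08", 9_300_000, 9_310_000),
    Gene("geneC", "Chr11", 9_000_000, 9_004_000),
]

# A SNP position reported in the Results (9,054,741 bp on chromosome 8).
hits = genes_near_snp("Chr08", 9_054_741, toy_genes)
```

Only `geneA` overlaps the 8,854,741–9,254,741 bp window in this toy set; `geneB` starts beyond the window and `geneC` lies on another chromosome.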

3. Results

3.1. Statistical Analysis of Phenotypic Data

Field trials were conducted over two years (2021 and 2023) to evaluate the oil content of 341 soybean germplasm accessions, and the natural population exhibited notable phenotypic variation in seed oil content in both years. The mean oil content increased from 18.43% in 2021 (SD = 2.12; CV = 11.48%) to 20.37% in 2023 (SD = 1.75; CV = 8.61%). The two-year average was 19.35% (SD = 1.94; CV = 10.01%), as depicted in Figure 1A–C. Density plots showed a unimodal, approximately normal distribution in each data set, although the 2023 distribution was more compact than that of 2021 (Figure 1A–C). A one-way ANOVA revealed significant differences in oil content among the three groups (2021, 2023, and Average; 341 accessions were evaluated each year), as shown in the violin plot (Figure 1D), and Tukey’s HSD test was used for pairwise comparisons among these groups. A statistically significant positive correlation (r = 0.84, p < 0.001) was observed between oil content values recorded in 2021 and 2023 (Figure 1E), indicating a high level of consistency in genotype performance across the years. Ridge density plots (Figure 1F) illustrate the differences in oil content distributions between the years, clearly showing the rise in central tendency and the reduction in variability from 2021 to 2023.
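For readers who want to check the dispersion statistics, the coefficient of variation is simply the standard deviation expressed as a percentage of the mean. The sketch below uses only the Python standard library; the function names are illustrative, and the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
import statistics
from typing import Sequence


def cv_from_summary(mean: float, sd: float) -> float:
    """Coefficient of variation (%) from a reported mean and standard deviation."""
    return sd / mean * 100.0


def cv_from_values(values: Sequence[float]) -> float:
    """Coefficient of variation (%) from raw measurements (sample SD assumed)."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


# Summary statistics quoted in the text above.
cv_2021 = cv_from_summary(18.43, 2.12)
cv_2023 = cv_from_summary(20.37, 1.75)
```

Because the published means and SDs are themselves rounded to two decimals, values recomputed this way can differ from the reported CVs in the second decimal place.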

3.2. Soybean Oil Content Across Northern China’s Growing Regions (2021–2023)

Variations in soybean oil content from 2021 to 2023 showed consistent and significant improvements across germplasms from all five main soybean-growing regions in Northern China (Figure 2A,B). The germplasm collection displayed strong regional differences, with most accessions coming from three northern provinces: Jilin (153 accessions, representing 44.87% of the total), Heilongjiang (126 accessions, 36.95%), and Liaoning (47 accessions, 13.78%). In contrast, Inner Mongolia and Xinjiang contributed only a small number of accessions (Figure 2A). This phenotypic regional distribution of data indicated specific patterns of oil content improvement across the evaluated areas.
The violin plots clearly show a steady upward trend in oil content distribution among germplasms from all regional origins, with 2023 data showing higher average values and more tightly clustered dispersion compared to the more spread-out patterns seen in 2021 (Figure 2B). Visualizations of both absolute and percentage changes reveal a consistent positive trend across all germplasm origins, indicating widespread improvement in oil content across the genetic diversity of soybean resources in Northern China (Figure 2D).
Germplasm from Heilongjiang showed a significant average absolute increase of 2.03% in oil content, corresponding to an 11.65% relative gain (Figure 2B,C). This gain was present in 98.33% of the 126 assessed accessions. Materials from Inner Mongolia showed an average absolute increase of 1.75%, corresponding to a 9.38% relative gain, and all examined samples (100%) showed increased oil content (Figure 2C). Accessions originating from Jilin showed an average absolute increase of 1.73%, representing a 9.56% relative improvement, and 96.5% of the 153 accessions displayed increased oil accumulation. Germplasm from Liaoning showed an average absolute increase of 2.08%, corresponding to a 12.58% relative gain, with the increase observed in all 47 evaluated accessions. Although their representation within the collection was limited, materials from Xinjiang showed the largest gain, with a mean absolute increase of 2.28%, corresponding to an 18.78% relative gain. This increase was observed in 100% of the accessions in this subset.
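The three quantities reported for each region (absolute increase, relative gain, and share of accessions that improved) can be derived from paired per-accession values as sketched below. The function name and the toy numbers are hypothetical, and computing the relative gain from the group means is an assumption; the text does not say whether per-accession ratios were averaged instead.

```python
from typing import List, Tuple


def gain_summary(oil_2021: List[float],
                 oil_2023: List[float]) -> Tuple[float, float, float]:
    """Return (absolute gain, relative gain in %, % of accessions improved).

    Inputs are paired oil contents (%) for the same accessions in both years.
    """
    if len(oil_2021) != len(oil_2023) or not oil_2021:
        raise ValueError("need two non-empty lists of equal length")
    n = len(oil_2021)
    mean_2021 = sum(oil_2021) / n
    mean_2023 = sum(oil_2023) / n
    absolute = mean_2023 - mean_2021              # percentage points
    relative = absolute / mean_2021 * 100.0       # relative to the 2021 mean
    improved = sum(b > a for a, b in zip(oil_2021, oil_2023)) / n * 100.0
    return absolute, relative, improved


# Hypothetical oil contents (%) for three accessions, for illustration only.
abs_gain, rel_gain, pct_improved = gain_summary([18.0, 20.0, 22.0],
                                                [20.0, 22.0, 21.0])
```

Keeping the absolute gain (in percentage points of oil content) separate from the relative gain (in percent of the 2021 value) avoids confusing the two kinds of percentage quoted in the text.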

3.3. Genome-Wide Association Analysis Results

Genotyping was conducted using resequencing data from the 341 soybean germplasm accessions, resulting in 1,048,576 high-quality SNP sites across the 20 soybean chromosomes. The linkage disequilibrium (LD) decay plot is shown in Figure 3A. The r² value (y-axis) decreases to 50% of its maximum within 1.8 kb (x-axis), indicating substantial genetic diversity in the examined soybean population. The rapid LD decay suggests that the mapping resolution is high, facilitating more accurate identification of causal variants associated with specific traits. Population structure analysis was performed using fastSTRUCTURE with K values ranging from 1 to 10. The optimal number of subpopulations was determined using cross-validation error analysis (Figure S1), which revealed the lowest error at K = 2, indicating the presence of two major ancestral groups within the germplasm panel. However, the gradual increase in cross-validation error beyond K = 2 and the absence of sharp transitions suggest a complex population structure with continuous genetic variation rather than discrete subgroups. Furthermore, the three-dimensional principal component analysis (PCA) plot (Figure 3B) shows that the first three principal components (PC1, PC2, and PC3) span a range of −500 to 500. The red dots represent the 341 soybean accessions. Their continuous distribution without clear clusters indicates the absence of distinct subpopulations and implies a continuous pattern of genetic variation potentially linked to geographical origins or ecological adaptations. The scree plot from the PCA shows the variance attributed to each principal component (Figure 3C). The first component (PC1) accounts for approximately 6% of the total genetic variance, while PC2 and PC3 account for progressively smaller proportions (approximately 4–5% and 3–4%, respectively). The gradual decline in explained variance, as opposed to sharp decreases, is characteristic of populations exhibiting complex genetic architecture, where variation is distributed across multiple dimensions rather than concentrated in a few major axes. Figure 3D displays a kinship matrix heatmap accompanied by hierarchical clustering. The predominantly yellow hue, with red patches along the diagonal, indicates low relatedness among the 341 soybean accessions, thus affirming the substantial genetic diversity within the mapping population. The results collectively demonstrate that the soybean population used in this work has substantial genetic diversity with limited population structure, making it appropriate for association mapping to identify genomic regions associated with oil content.
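The LD decay distance quoted above (the distance at which r² falls to half of its maximum) can be read off a smoothed decay curve as in the following sketch. The function name and the curve values are hypothetical; the sketch assumes the curve is already smoothed and sorted by increasing distance, and it does not reproduce the PopLDdecay procedure.

```python
from typing import Optional, Sequence


def half_decay_distance(distances_kb: Sequence[float],
                        r2_values: Sequence[float]) -> Optional[float]:
    """First distance (kb) at which r^2 is at or below half of its maximum.

    Assumes points are sorted by increasing distance; returns None if the
    curve never drops to half of its maximum within the sampled range.
    """
    if len(distances_kb) != len(r2_values) or not r2_values:
        raise ValueError("need two non-empty sequences of equal length")
    threshold = max(r2_values) / 2.0
    for distance, r2 in zip(distances_kb, r2_values):
        if r2 <= threshold:
            return distance
    return None


# Hypothetical smoothed decay curve, for illustration only.
toy_distances = [0.1, 0.5, 1.0, 2.0, 5.0]
toy_r2 = [0.80, 0.60, 0.45, 0.38, 0.20]
decay_kb = half_decay_distance(toy_distances, toy_r2)
```

With coarse distance bins the result is only as precise as the bin spacing, so a finer grid or interpolation between neighbouring points would be needed to resolve a value such as 1.8 kb.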

3.4. Genome-Wide Association Analysis and Candidate Gene

Genomic DNA from 341 soybean varieties was sequenced using the HiSeq 2000 platform at an average depth of 10×. Genome-wide association analysis for oil content identified 119 significant SNP loci exceeding the threshold of −log₁₀(p) = 6.2. Q–Q plots confirmed the appropriateness of the MLM model across both years (2021 and 2023). Manhattan plots revealed a continuous linear distribution of significant SNPs, suggesting that the identified markers were reliable and indicating that soybean seed oil content is likely controlled by multiple minor-effect polygenes (Figure 4). In 2021, 51 significant loci were detected, with 48 clustered on chromosome 8 and the remainder on chromosomes 11 (2) and 20 (1). Chromosome 8 contained two distinct SNP groups with opposing effects: 27 loci (8,038,113–9,503,562 bp; average p-value: 8.85 × 10⁻⁸) associated with reduced oil content, and 21 loci (8,439,205–9,501,912 bp; average p-value: 4.78 × 10⁻⁸) associated with increased oil content. The 2023 data yielded 30 significant associations distributed across chromosomes 1 (7), 8 (20), and 13 (3). Analysis of two-year average phenotypic data identified 34 significant SNP loci across chromosomes 2, 4, 5, 7, 8, and 11, with chromosome 8 containing the majority (29) of these associations (Table 1).
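How a marker count translates into a −log10(p) cutoff under Bonferroni correction can be shown with a few lines of arithmetic. The function names and the α levels tried below are assumptions chosen for illustration; the sketch is not a reconstruction of the exact threshold used in this study.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def neg_log10(p: float) -> float:
    """Convert a p-value to the -log10 scale used on Manhattan plots."""
    return -math.log10(p)


def is_significant(p: float, cutoff_neg_log10: float) -> bool:
    """True if the association exceeds the given -log10(p) cutoff."""
    return neg_log10(p) > cutoff_neg_log10


N_MARKERS = 1_048_576  # number of SNP markers reported in the text

# Two commonly used conventions (assumed here): alpha = 0.05 and alpha = 1.
cutoff_alpha_005 = neg_log10(bonferroni_threshold(0.05, N_MARKERS))
cutoff_alpha_1 = neg_log10(bonferroni_threshold(1.0, N_MARKERS))

# A p-value from the Results checked against the reported cutoff of 6.2.
passes_reported_cutoff = is_significant(8.85e-8, 6.2)
```

A looser α, or a correction based on the effective number of independent markers rather than the raw count, gives a lower cutoff, which is one reason thresholds differ between GWAS reports.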
The presence of SNPs with opposing effects within overlapping genomic regions on chromosome 8 (8,038,113–9,503,562 bp and 8,439,205–9,501,912 bp) has several possible explanations that warrant further investigation. First, the opposing effects may result from distinct haplotype blocks within this region, where different combinations of linked alleles produce contrasting phenotypic outcomes depending on the allelic configuration carried by each accession. Second, they may indicate epistatic interactions between genes within this chromosomal region, where the phenotypic effect of one locus is modified by allelic variants at nearby loci. Such epistatic relationships are common in quantitative traits and could explain how closely linked SNPs produce contrasting effects on oil accumulation. The overlapping physical positions of these SNP groups (the second interval, about 1.06 Mb long, lies entirely within the first) suggest that they may share a regulatory domain or involve functionally related genes that interact to fine-tune oil biosynthesis pathways. Third, the opposing effects may reflect multiple functional variants that affect different aspects of lipid metabolism within the same genomic region. For instance, some variants may influence fatty acid biosynthesis while others affect oil storage or mobilization, resulting in either a net positive or a net negative effect on total seed oil content, depending on the combination of alleles present. The consistent identification of significant loci on chromosome 8 across multiple environments highlights this region as particularly important for soybean oil content regulation and provides promising targets for marker-assisted selection in breeding programs aimed at modifying seed oil profiles. However, the complex pattern of opposing effects within this region suggests that breeding strategies should consider haplotype-based selection rather than individual SNP markers to capture the full genetic architecture and avoid unintended consequences from epistatic interactions.
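The relationship between the two chromosome 8 intervals can be checked with basic interval arithmetic. The sketch below uses the coordinates quoted in the text; the function names are illustrative, and coordinates are treated as 1-based and inclusive, which is an assumption.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start_bp, end_bp), 1-based inclusive (assumed)


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two intervals (0 if they are disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def is_nested(inner: Interval, outer: Interval) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


reduced_oil_cluster: Interval = (8_038_113, 9_503_562)    # 27 loci, reduced oil
increased_oil_cluster: Interval = (8_439_205, 9_501_912)  # 21 loci, increased oil

shared = overlap_bp(reduced_oil_cluster, increased_oil_cluster)    # 1,062,708 bp
nested = is_nested(increased_oil_cluster, reduced_oil_cluster)     # True
start_offset = increased_oil_cluster[0] - reduced_oil_cluster[0]   # 401,092 bp
```

The increased-oil cluster sits entirely inside the reduced-oil cluster, beginning about 0.4 Mb downstream of its start, so the two groups cannot be separated by physical position alone.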

3.5. Candidate Gene Screening for Association Analysis

Soybean oil content was significantly associated with 29 SNP loci on chromosome 8 that were examined for candidate genes. Specifically, 19 SNP sites were strongly associated with soybean oil content in 2021, and 2 in 2023 (Table 2). Using the average of the 2021 and 2023 data, the GWAS identified eight SNP loci that showed a strong association with soybean oil content. Multiple candidate genes involved in regulating oil content were identified by further examination of the SNP variants (Table 2).
The gene with the most significant associations in both the annual and combined analyses was Glyma.08G123500. SNPs in this gene appear to alter the function of the encoded protein. With p-values ranging from 3.88 × 10⁻⁹ to 1.84 × 10⁻⁷, significant SNPs predicted to cause amino acid substitutions, such as L497P, F496L, K417E, D360E, and N344K, were repeatedly identified. In Glyma.08G117400, a significant SNP at position 9,054,741 bp results in a predicted G228C amino acid change (Table 2). In both the 2023 and combined analyses, this variant was highly significant (p = 4.94 × 10⁻¹² and p = 7.46 × 10⁻¹¹, respectively). The most statistically significant association in the 2023 and combined analyses was an SNP at position 9,074,920 bp in Glyma.08G117600, resulting in a predicted A233V substitution (p = 8.71 × 10⁻¹⁴ and p = 5.44 × 10⁻¹³, respectively). An SNP at 8,439,205 bp in Glyma.08G110000, predicted to cause a T247P substitution, was strongly associated with oil content in 2021 (p = 2.51 × 10⁻⁷).
PROVEAN functional impact scores ranged from −1.633 to 1.450, with potentially deleterious variants including E503G (−1.633) and F460E (−1.092) in Glyma.08G123500, while variants in Glyma.08G117400 (G228C) and Glyma.08G117600 (A233V) showed positive scores (1.450 and 1.233, respectively) but maintained strong temporal consistency between datasets. These notable SNPs have minor allele frequencies (MAFs) between 0.05 and 0.08, indicating that they are relatively rare variants in the population. The consistent detection of these SNPs across various contexts, especially in Glyma.08G123500, Glyma.08G117400, and Glyma.08G117600, suggests that they have stable effects on oil content and could potentially be used as markers in breeding programs to aid selection.
Based on the comprehensive genetic variation analysis of four soybean genes on chromosome 8, this study reveals diverse functional roles and varying impacts of genetic variants on protein function. Glyma.08G110000, encoding a hydroxycinnamoyl coenzyme A-quinate transferase involved in secondary metabolism (GO:0016747), shows high expression in reproductive tissues and contains a single neutral variant (T247P, PROVEAN score: 0.031). Glyma.08G117400, a PPR repeat protein involved in RNA metabolism and organellar function (GO:0005515), demonstrates consistent variants across years with neutral predicted effects (PROVEAN score: 1.450) and broad tissue expression. Glyma.08G117600, a WD40 repeat scaffolding protein with protein binding activity (GO:0005515), exhibits the most significant association (p = 5.44 × 10⁻¹³) and high expression in seeds and pods, despite carrying apparently neutral variants (A233V, PROVEAN score: 1.233). Most notably, Glyma.08G123500, a receptor-like kinase with protein kinase activity (GO:0004672) and signal transduction functions, contains 15+ variants with PROVEAN scores ranging from deleterious (−1.633) to neutral (0.933), and is most highly expressed in metabolically active root and nodule tissues (7.45 and 5.34 FPKM, respectively). This complex genetic architecture suggests Glyma.08G123500 as the most promising candidate for functional validation studies.

4. Discussion

4.1. Phenotypic Variation and Environmental Effects

The significant increase in mean oil content from 18.43% in 2021 to 20.37% in 2023, along with a reduced coefficient of variation (11.48% to 8.61%), suggests that environmental conditions in 2023 were more favorable for oil accumulation across the soybean germplasm collection. This improvement was remarkably consistent across all geographical regions, with enhancement phenotypes observed in 96.5–100% of accessions depending on origin. The strong positive correlation (r = 0.84, p < 0.001) between years demonstrates the stability of genetic effects despite environmental variation, supporting the reliability of our phenotypic data for association mapping. The regional analysis revealed interesting patterns, with Xinjiang germplasm showing the highest relative improvement (18.78%) despite limited representation in the collection. The consistent improvement across Northeastern China’s major soybean-producing provinces (Jilin, Heilongjiang, and Liaoning) indicates that the favorable environmental conditions in 2023 were widespread across the region [51,52].

4.2. Population Structure and Genetic Diversity

A natural population of 341 soybean varieties was utilized to assess the oil content of soybean seeds in 2021 and 2023, and a GWAS analysis was conducted using 1,048,576 SNP markers acquired through resequencing. The identification of 119 significant SNP loci across multiple chromosomes confirms that soybean seed oil content is controlled by a polygenic architecture, consistent with earlier quantitative genetic studies. The concentration of significant associations on chromosome 8 (especially in regions 8.0–9.5 Mb and around 42.0 Mb) across various environments strongly indicates the presence of major QTLs in these areas. The consistency of these associations over the years (2021, 2023, and combined analysis) offers strong evidence for the stability of genetic effects on oil content. Notably, the detection of SNP clusters with opposing effects within the same region on chromosome 8 suggests either multiple linked genes with different functions or allelic variants of the same genes affecting oil metabolism. This complexity is common in metabolic traits and likely reflects the intricate regulatory networks that control lipid biosynthesis and accumulation in seeds [53,54].

4.3. Candidate Gene Analysis and Functional Implications

The identification of Glyma.08G123500, Glyma.08G117400, and Glyma.08G117600 as consistently associated genes across multiple analyses provides strong evidence for their potential role in regulating oil content. Functional domain analysis reveals that Glyma.08G123500 (also known as Glyma.08g13040) encodes a multi-domain protein featuring an NB-ARC domain (a nucleotide-binding adaptor shared by APAF-1, R proteins, and CED-4), a protein tyrosine and serine/threonine kinase domain, and leucine-rich repeats (LRRs). These characteristics are typical of NBS-LRR resistance proteins, which primarily function in plant immune responses and stress signaling pathways [55,56]. The detection of multiple amino-acid-changing variants within Glyma.08G123500, including potentially deleterious mutations (E503G with a PROVEAN score of −1.633 and F460E with a score of −1.092), suggests that variants affecting the NB-ARC nucleotide-binding activity or kinase function could substantially impact protein signaling capacity. Although NBS-LRR proteins are primarily recognized for their role in pathogen resistance, emerging evidence indicates that some family members may also regulate metabolic processes and stress responses, potentially influencing seed development and oil accumulation through coordinated stress signaling with lipid biosynthesis during seed filling [57]. This finding aligns with recent studies that demonstrate the multifunctional nature of NBS-LRR proteins, extending beyond disease resistance. For instance, Wu et al. [58] reported that certain NBS-LRR genes in Arabidopsis are involved in regulating developmental processes. Specifically in soybeans, previous research has suggested that stress response pathways and lipid metabolism may be interconnected during seed development [31]. 
The amino acid substitutions identified (L497P, F496L, K417E, D360E, and N344K in Glyma.08G123500; G228C in Glyma.08G117400; A233V in Glyma.08G117600) represent non-conservative changes that could markedly alter protein structure and function. Structural analysis suggests that these substitutions may affect critical functional domains, potentially altering protein–protein interactions or enzymatic activity [26]. The consistent detection of these variants across different environmental conditions and years reinforces their biological significance, suggesting that they represent stable genetic factors rather than spurious associations [11]. The relatively low minor allele frequencies (0.05–0.08) of these variants indicate that they represent rare but functionally significant alleles within the germplasm panel. This pattern is consistent with previous observations that variants with large phenotypic effects often occur at low frequencies in natural populations due to balancing selection or recent evolutionary origin [59]. Such rare variants with substantial effects could be particularly valuable for breeding programs seeking to modify oil content, as they may provide novel allelic diversity not commonly available in elite breeding lines [43]. The functional implications of these findings extend beyond the effects of individual genes. The clustering of significant associations within a relatively narrow chromosomal region (approximately 1.5 Mb on chromosome 8) suggests the presence of a complex genetic architecture involving multiple linked genes or regulatory elements that collectively influence oil content [20]. This pattern may reflect the presence of ancient chromosomal inversions or other structural variants that maintain beneficial allelic combinations in linkage disequilibrium [60].
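To make the minor allele frequency range (0.05–0.08) concrete, the sketch below computes MAF for a single biallelic SNP from diploid genotype calls. It is an illustration under stated assumptions: the genotype encoding (two-letter strings, `None` for missing calls) and the example counts are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF of one biallelic SNP.

    `genotypes` is a list of diploid calls such as "AA", "AG", or "GG";
    missing calls are given as None and are skipped.
    """
    allele_counts = Counter()
    for call in genotypes:
        if call is not None:
            allele_counts.update(call)  # counts each allele character
    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return 0.0  # monomorphic site or no data
    return min(allele_counts.values()) / total


# Hypothetical panel of 341 inbred accessions: 20 carry the rare allele.
calls = ["GG"] * 20 + ["AA"] * 321
maf = minor_allele_frequency(calls)
```

In this hypothetical panel, 20 homozygous carriers out of 341 accessions give a MAF of about 0.059, which shows that variants in the reported range are carried by only a few dozen lines and that their effect estimates rest on correspondingly small groups.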

4.4. Quantitative Comparison with Prior GWASs

Comparative analysis with previous soybean oil content GWASs [61,62] revealed that 47 of our 119 significant SNPs (39.5%) were located within previously reported QTL regions, including overlaps with Seed oil 11-g2 [63] and other major oil content loci [49,50], while 72 SNPs (60.5%) represent novel associations not captured in prior studies. Specifically, within the critical Glyma.08G123500 locus, 15 of the 18 identified SNPs (83.3%) represent previously uncharacterized variants, with only 3 showing positional overlap with variants reported in previous association studies [62], though none of the overlapping variants showed identical allelic effects or population frequencies.
Our Northeast China population (n = 341) exhibited a distinct genetic structure compared to previous studies [62,64], with an average genetic diversity (π) of 2.1 × 10⁻³ and population differentiation (FST) values of 0.23–0.31 relative to major U.S. and Brazilian soybean collections used in prior GWAS analyses [65]. Environmental specificity analysis demonstrated that 31 of the 119 significant associations (26.1%) showed consistent detection across both 2021 and 2023 growing seasons under Northeast China climatic conditions.
The integration of PROVEAN functional impact analysis identified eight potentially deleterious variants (PROVEAN score < −1.0) within candidate genes that were not functionally characterized in previous studies, providing novel insights into the molecular mechanisms underlying oil content variation. Comparative marker density analysis showed that our study achieved 3.1-fold higher resolution (1 SNP per 456 bp) compared to previous array-based GWASs (1 SNP per 1400–2100 bp) [61,62,63], enabling detection of fine-scale associations and improved candidate gene localization.
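The proportions and the marker density ratio quoted in this section follow from simple arithmetic, which the sketch below checks. All counts and densities are the values stated in the text; the variable names are illustrative only.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


TOTAL_SNPS = 119

known_pct = percent(47, TOTAL_SNPS)        # SNPs inside reported QTL regions
novel_pct = percent(72, TOTAL_SNPS)        # SNPs outside reported QTL regions
locus_novel_pct = percent(15, 18)          # uncharacterized SNPs in Glyma.08G123500
both_years_pct = percent(31, TOTAL_SNPS)   # associations detected in 2021 and 2023

# Marker spacing in bp per SNP: this study versus array-based studies.
fold_vs_dense_arrays = 1400 / 456
fold_vs_sparse_arrays = 2100 / 456
```

Under this reading, the stated 3.1-fold gain corresponds to the denser end of the array range (1 SNP per 1400 bp); against the sparser end (1 SNP per 2100 bp) the ratio is closer to 4.6-fold.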

4.5. Breeding and Selection Implications

The identification of stable genetic markers associated with oil content provides valuable tools for marker-assisted selection in soybean breeding programs. The consistent detection of chromosome 8 associations across environments makes these regions particularly attractive targets for breeding efforts aimed at modifying seed oil profiles. The relatively small effect sizes of individual SNPs (ranging from −1.58 to +3.35 percentage points) suggest that pyramiding multiple favorable alleles will be necessary to achieve substantial improvements in oil content. The regional variation in oil content enhancement observed in our study also suggests that different germplasm sources contribute unique favorable alleles. The superior performance of Xinjiang germplasm, despite its limited representation, suggests that expanding genetic diversity through the inclusion of germplasm from diverse geographical origins could provide additional genetic variation for improving oil content.
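The pyramiding argument can be illustrated with a purely additive toy model, in which the expected shift in oil content is the sum of the effects of the alleles a line carries. This is a sketch under strong assumptions: the SNP names and effect values are hypothetical (chosen within the reported range of −1.58 to +3.35 percentage points), and the model ignores epistasis and the linkage disequilibrium among clustered SNPs on chromosome 8, so real gains are likely to be smaller than the simple sum.

```python
def predicted_shift(carried_snps, effects):
    """Sum the additive effects (percentage points) of the alleles a line carries."""
    return sum(effects[snp] for snp in carried_snps)


# Hypothetical per-allele effects on oil content (percentage points).
effects = {"snp_A": 0.8, "snp_B": 1.2, "snp_C": -0.5}

# Line 1 stacks two favorable alleles; line 2 also carries an unfavorable one.
line_1 = ["snp_A", "snp_B"]
line_2 = ["snp_A", "snp_B", "snp_C"]

shift_1 = predicted_shift(line_1, effects)
shift_2 = predicted_shift(line_2, effects)
```

The comparison between the two hypothetical lines shows why selection needs to track unfavorable alleles as well: a single linked allele with a negative effect offsets part of the gain from the stacked favorable ones.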

4.6. Study Limitations and Future Directions

In contrast to classical linkage analysis, GWASs can use existing natural populations as subjects, thereby saving time and resources [59]. Nonetheless, the approach has several inherent limitations that must be considered when interpreting association results. Population structure remains a primary concern, as cryptic relatedness and subpopulation stratification can lead to spurious associations and inflated Type I error rates [66,67]. Linkage disequilibrium patterns, influenced by demographic history, selection pressure, recombination rates, and population bottlenecks, can confound the precise localization of causal variants, particularly in regions of extended LD where multiple linked variants show similar association signals [68]. Additionally, the “missing heritability” problem persists, where identified SNPs explain only a fraction of the phenotypic variance, potentially due to rare variants with large effects, structural variations not captured by SNP arrays, or epistatic interactions between loci that are not adequately modeled in single-marker association tests [69]. The resolution for candidate gene identification is further limited by the density of markers relative to local LD structure, and the assumption that the causal variant is in linkage disequilibrium with the tested marker does not hold in all genomic regions [70]. To address these methodological challenges, advanced statistical approaches such as the Fixed and Random Model Circulating Probability Unification (FarmCPU) algorithm have been developed, which iteratively optimize both fixed and random effects to better control population structure and kinship while maintaining statistical power for detecting true associations [71].
FarmCPU and similar multi-locus models can reduce false positive rates by accounting for the confounding effects of population stratification more effectively than traditional single-locus approaches, though they require careful parameter tuning and validation across different populations and traits [71,72]. Increasing the population size can further reduce false positives and improve the precision of the results. We employed a mixed linear model (MLM) that simultaneously incorporates population structure (Q) and individual kinship (K) as covariates (Q + K), thereby reducing the false positive rate [73]. GWASs based on mixed linear models are now widely applied in both plant and animal systems [74]. GWASs are increasingly used in agricultural research; for instance, the genetic basis of many seed-related traits has been characterized in soybean, with multiple candidate genes predicted using GWASs [15,60,75]. In an earlier QTL mapping study, a total of 92 QTLs influencing quality traits were identified across three environments, 14 of which were associated with crude oil content [76]. Additionally, 32 candidate genes associated with soybean quality were identified. To date, 393 QTLs associated with soybean seed oil content have been cataloged on the SoyBase website (http://www.soybase.org), spanning all 20 chromosomes [77]. This collection comprises 298 QTLs derived from RIL populations and 95 QTLs from GWAS populations [78].
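For reference, the Q + K mixed linear model described above is commonly written in the following general form. The notation is an assumption of this sketch and is not taken from the article; in particular, the scaling of the kinship matrix (for example, a factor of 2) differs between implementations.

```latex
\begin{equation}
  \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Q}\mathbf{v}
             + \mathbf{Z}\mathbf{u} + \mathbf{e},
  \qquad
  \mathbf{u} \sim \mathcal{N}\!\left(\mathbf{0},\, \mathbf{K}\sigma_g^{2}\right),
  \quad
  \mathbf{e} \sim \mathcal{N}\!\left(\mathbf{0},\, \mathbf{I}\sigma_e^{2}\right)
\end{equation}
```

Here y is the vector of oil content phenotypes, X holds the genotypes of the tested SNP with fixed effect β, Q holds the population structure covariates with fixed effects v, Z is the incidence matrix linking observations to accessions, u is the vector of random polygenic effects whose covariance is proportional to the kinship matrix K, and e is the residual. The Q term absorbs subpopulation stratification, while the K term accounts for finer-scale relatedness among accessions.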
Future research should focus on experimental validation of Glyma.08G123500 through CRISPR/Cas9-mediated functional studies and biochemical characterization of the identified protein variants, particularly the deleterious E503G and F460E substitutions, to confirm their causal role in oil content variation. The unexpected identification of an NBS-LRR resistance protein as a major regulator of seed oil content underscores the intricate relationships between plant immune responses and metabolic processes, suggesting that stress signaling pathways coordinate with lipid biosynthesis during seed development. These findings provide valuable molecular markers for marker-assisted selection in soybean breeding programs, opening new avenues for understanding the genetic architecture underlying economically important seed quality traits. This has broader implications for crop improvement strategies that leverage the pleiotropic effects of stress response genes on metabolic phenotypes.

5. Conclusions

Our analysis identified 119 significant SNP loci associated with oil content, with chromosome 8 emerging as a key region containing 77 significant associations across multiple environments. The candidate gene Glyma.08G123500 showed strong and consistent associations, with several SNPs causing amino acid substitutions that may alter protein function and affect oil biosynthesis. Additional candidates, Glyma.08G117400 and Glyma.08G117600, also warrant further investigation. The consistent overlap between our findings and previously reported QTLs confirms our mapping approach and emphasizes the role of chromosome 8 in regulating soybean oil content. Notably, we identified SNP clusters with opposing effects on oil content, indicating the complex genetic regulation of this trait. The significant increase in oil content across all Northern Chinese growing regions from 2021 to 2023 demonstrates the potential for continued improvement through targeted breeding. The SNP markers identified in this study offer valuable resources for marker-assisted selection to expedite the development of high-oil soybean varieties.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/agronomy15081889/s1: Figure S1: Cross-validation error analysis; Table S1: Climate conditions.

Author Contributions

Conceptualization, X.W., M.Z., F.L. and F.Z.; data curation, X.L., S.F.L. and B.Z.; formal analysis, M.Z., F.L., X.L., C.Z., K.Z. and S.F.L.; funding acquisition, C.Z., H.Q. and B.Z.; investigation, M.Z., C.Z., R.Y. and H.R.; methodology, X.W., X.L., C.Z., S.F.L., H.R. and B.Z.; project administration, F.Z., H.Q. and B.Z.; resources, R.Y. and H.Q.; software, M.Z., F.L., X.L., F.Z., K.Z., R.Y. and H.Q.; supervision, X.W., F.L., C.Z., F.Z. and H.Q.; validation, M.Z., K.Z. and R.Y.; visualization, F.L. and H.R.; writing—original draft, X.W., S.F.L. and H.R.; writing—review and editing, F.Z., S.F.L., H.R. and B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Biological Breeding-National Science and Technology Major Project (2023ZD04032); the Agricultural Science and Technology Innovation Leaping Project in Heilongjiang Province (Grant No. CX23ZD04); and the 2024 Science and Technology Support Project of the Inner Mongolia Innovation Center of Biological Breeding Technology, Biotechnology-Based Breeding of High-Quality Soybeans and Application Demonstration (2024NSZC04).

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author(s).

Acknowledgments

Special thanks to the Soybean intellect design breeding laboratory of Heilongjiang Academy of Agricultural Sciences for providing platform support. We also thank the Soybean Germplasm Resources Team of the Institute of Crop Sciences, Chinese Academy of Agricultural Sciences for providing soybean germplasm resources.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Wang, S.; Liu, S.; Wang, J.; Yokosho, K.; Zhou, B.; Yu, Y.-C.; Liu, Z.; Frommer, W.B.; Ma, J.F.; Chen, L.-Q. Simultaneous changes in seed size, oil content and protein content driven by selection of SWEET homologues during soybean domestication. Natl. Sci. Rev. 2020, 7, 1776–1786.
  2. Hooker, J.C.; Smith, M.; Zapata, G.; Charette, M.; Luckert, D.; Mohr, R.M.; Daba, K.A.; Warkentin, T.D.; Hadinezhad, M.; Barlow, B. Differential gene expression provides leads to environmentally regulated soybean seed protein content. Front. Plant Sci. 2023, 14, 1260393.
  3. Wilson, R.F. Seed composition. Soybeans Improv. Prod. Uses 2004, 16, 621–677.
  4. Carter, T.E., Jr.; Nelson, R.L.; Sneller, C.H.; Cui, Z. Genetic diversity in soybean. Soybeans Improv. Prod. Uses 2004, 16, 303–416.
  5. Liu, K. Chemistry and Nutritional Value of Soybean Components. In Soybeans; Springer: Boston, MA, USA, 1997; pp. 25–113.
  6. Zhang, B.; Zhao, K.; Ren, H.; Lamlom, S.F.; Liu, X.; Wang, X.; Zhang, F.; Yuan, R.; Wang, J. Comparative study of isoflavone synthesis genes in two wild soybean varieties using transcriptomic analysis. Agriculture 2023, 13, 1164.
  7. Liu, A.; Cheng, S.-S.; Yung, W.-S.; Li, M.-W.; Lam, H.-M. Genetic regulations of the oil and protein contents in soybean seeds and strategies for improvement. Adv. Bot. Res. 2022, 102, 259–293.
  8. Li, H.; Peng, Z.; Yang, X.; Wang, W.; Fu, J.; Wang, J.; Han, Y.; Chai, Y.; Guo, T.; Yang, N. Genome-wide association study dissects the genetic architecture of oil biosynthesis in maize kernels. Nat. Genet. 2013, 45, 43–50.
  9. Miao, L.; Yang, S.; Zhang, K.; He, J.; Wu, C.; Ren, Y.; Gai, J.; Li, Y. Natural variation and selection in GmSWEET39 affect soybean seed oil content. New Phytol. 2020, 225, 1651–1666.
  10. Clemente, T.E.; Cahoon, E.B. Soybean oil: Genetic approaches for modification of functionality and total content. Plant Physiol. 2009, 151, 1030–1040.
  11. Lee, J.D.; Bilyeu, K.D.; Pantalone, V.R.; Gillen, A.M.; So, Y.S.; Shannon, J.G. Environmental stability of oleic acid concentration in seed oil for soybean lines with FAD2-1A and FAD2-1B mutant genes. Crop Sci. 2012, 52, 1290–1297.
  12. Yang, W.; Guo, Z.; Huang, C.; Duan, L.; Chen, G.; Jiang, N.; Fang, W.; Feng, H.; Xie, W.; Lian, X. Combining high-throughput phenotyping and genome-wide association studies to reveal natural genetic variation in rice. Nat. Commun. 2014, 5, 5087.
  13. Khan, M.R.; Rehman, N.; Inam, S.; Naeem, M.K.; Muhammad, A.; Uzair, M.; Riaz, A.; Rehman, O.U.; Muqaddas, F.; Murtaza, M. Implementation of novel genomic and biotechnological interventions for accelerated breeding of crops. In Plant Speed Breeding and High-Throughput Technologies; CRC Press: Boca Raton, FL, USA, 2024; pp. 53–81.
  14. Duan, Z.; Li, Q.; Wang, H.; He, X.; Zhang, M. Genetic regulatory networks of soybean seed size, oil and protein contents. Front. Plant Sci. 2023, 14, 1160418.
  15. Yu, J.; Zhu, C.; Xuan, W.; An, H.; Tian, Y.; Wang, B.; Chi, W.; Chen, G.; Ge, Y.; Li, J. Genome-wide association studies identify OsWRKY53 as a key regulator of salt tolerance in rice. Nat. Commun. 2023, 14, 3550.
  16. Liang, Q.; Chen, L.; Yang, X.; Yang, H.; Liu, S.; Kou, K.; Fan, L.; Zhang, Z.; Duan, Z.; Yuan, Y. Natural variation of Dt2 determines branching in soybean. Nat. Commun. 2022, 13, 6429.
  17. Zhang, F.; Xu, J.; Wang, W.; Liu, X.; He, D.; Zhang, B.; Liu, B.; Lamlom, S.F.; Abdelghany, A.M.; Hong, H. Genetic architecture of shade tolerance in soybean (Glycine max L. Merr.) revealed by genome-wide association study. Crop Sci. 2025, 65, e70107.
  18. Hwang, E.-Y.; Song, Q.; Jia, G.; Specht, J.E.; Hyten, D.L.; Costa, J.; Cregan, P.B. A genome-wide association study of seed protein and oil content in soybean. BMC Genom. 2014, 15, 1.
  19. Cao, Y.; Li, S.; Wang, Z.; Chang, F.; Kong, J.; Gai, J.; Zhao, T. Identification of major quantitative trait loci for seed oil content in soybeans by combining linkage and genome-wide association mapping. Front. Plant Sci. 2017, 8, 1222.
  20. Zeng, A.; Chen, P.; Korth, K.; Hancock, F.; Pereira, A.; Brye, K.; Wu, C.; Shi, A. Genome-wide association study (GWAS) of salt tolerance in worldwide soybean germplasm lines. Mol. Breed. 2017, 37, 30.
  21. Jin, H.; Yang, X.; Zhao, H.; Song, X.; Tsvetkov, Y.D.; Wu, Y.; Gao, Q.; Zhang, R.; Zhang, J. Genetic analysis of protein content and oil content in soybean by genome-wide association study. Front. Plant Sci. 2023, 14, 1182771.
  22. Goettel, W.; Zhang, H.; Li, Y.; Qiao, Z.; Jiang, H.; Hou, D.; Song, Q.; Pantalone, V.R.; Song, B.-H.; Yu, D. POWR1 is a domestication gene pleiotropically regulating seed quality and yield in soybean. Nat. Commun. 2022, 13, 3051.
  23. Duan, Z.; Zhang, M.; Zhang, Z.; Liang, S.; Fan, L.; Yang, X.; Yuan, Y.; Pan, Y.; Zhou, G.; Liu, S. Natural allelic variation of GmST05 controlling seed size and quality in soybean. Plant Biotechnol. J. 2022, 20, 1807–1818.
  24. Qi, Z.; Guo, C.; Li, H.; Qiu, H.; Li, H.; Jong, C.; Yu, G.; Zhang, Y.; Hu, L.; Wu, X. Natural variation in Fatty Acid 9 is a determinant of fatty acid and protein content. Plant Biotechnol. J. 2024, 22, 759–773.
  25. Bing, L.; Peng, J.; Wu, Y.; Hu, Q.; Huang, W.; Yuan, Z.; Tang, X.; Cao, D.; Xue, Y.; Luan, X. Identification of an important QTL for seed oil content in soybean. Mol. Breed. 2023, 43, 43.
  26. Gibellini, F.; Smith, T.K. The Kennedy pathway—De novo synthesis of phosphatidylethanolamine and phosphatidylcholine. IUBMB Life 2010, 62, 414–428.
  27. Cao, J.; Li, J.-L.; Li, D.; Tobin, J.F.; Gimeno, R.E. Molecular identification of microsomal acyl-CoA: Glycerol-3-phosphate acyltransferase, a key enzyme in de novo triacylglycerol synthesis. Proc. Natl. Acad. Sci. USA 2006, 103, 19695–19700.
  28. Zhang, D.; Zhang, H.; Hu, Z.; Chu, S.; Yu, K.; Lv, L.; Yang, Y.; Zhang, X.; Chen, X.; Kan, G. Artificial selection on GmOLEO1 contributes to the increase in seed oil during soybean domestication. PLoS Genet. 2019, 15, e1008267.
  29. Liu, J.; Hao, W.; Liu, J.; Fan, S.; Zhao, W.; Deng, L.; Wang, X.; Hu, Z.; Hua, W.; Wang, H. A novel chimeric mitochondrial gene confers cytoplasmic effects on seed oil content in polyploid rapeseed (Brassica napus). Mol. Plant 2019, 12, 582–596.
  30. Baud, S.; Wuilleme, S.; To, A.; Rochat, C.; Lepiniec, L. Role of WRINKLED1 in the transcriptional regulation of glycolytic and fatty acid biosynthetic genes in Arabidopsis. Plant J. 2009, 60, 933–947.
  31. Baud, S.; Mendoza, M.S.; To, A.; Harscoët, E.; Lepiniec, L.; Dubreucq, B. WRINKLED1 specifies the regulatory action of LEAFY COTYLEDON2 towards fatty acid metabolism during seed maturation in Arabidopsis. Plant J. 2007, 50, 825–838.
  32. To, A.; Joubès, J.; Barthole, G.; Lécureuil, A.; Scagnelli, A.; Jasinski, S.; Lepiniec, L.; Baud, S. WRINKLED transcription factors orchestrate tissue-specific regulation of fatty acid biosynthesis in Arabidopsis. Plant Cell 2012, 24, 5007–5023.
  33. Pelletier, J.M.; Kwong, R.W.; Park, S.; Le, B.H.; Baden, R.; Cagliari, A.; Hashimoto, M.; Munoz, M.D.; Fischer, R.L.; Goldberg, R.B. LEC1 sequentially regulates the transcription of genes involved in diverse developmental processes during seed development. Proc. Natl. Acad. Sci. USA 2017, 114, E6710–E6719.
  34. Manan, S.; Ahmad, M.Z.; Zhang, G.; Chen, B.; Haq, B.U.; Yang, J.; Zhao, J. Soybean LEC2 regulates subsets of genes involved in controlling the biosynthesis and catabolism of seed storage substances and seed development. Front. Plant Sci. 2017, 8, 1604.
  35. Lee, H.G.; Kim, H.; Suh, M.C.; Kim, H.U.; Seo, P.J. The MYB96 transcription factor regulates triacylglycerol accumulation by activating DGAT1 and PDAT1 expression in Arabidopsis seeds. Plant Cell Physiol. 2018, 59, 1432–1442.
  36. Song, Q.-X.; Li, Q.-T.; Liu, Y.-F.; Zhang, F.-X.; Ma, B.; Zhang, W.-K.; Man, W.-Q.; Du, W.-G.; Wang, G.-D.; Chen, S.-Y. Soybean GmbZIP123 gene enhances lipid content in the seeds of transgenic Arabidopsis plants. J. Exp. Bot. 2013, 64, 4329–4341.
  37. Cernac, A.; Benning, C. WRINKLED1 encodes an AP2/EREB domain protein involved in the control of storage compound biosynthesis in Arabidopsis. Plant J. 2004, 40, 575–585.
  38. Ruuska, S.A.; Girke, T.; Benning, C.; Ohlrogge, J.B. Contrapuntal networks of gene expression during Arabidopsis seed filling. Plant Cell 2002, 14, 1191–1206.
  39. De Vries, B.D.; Fehr, W.R.; Welke, G.A.; Dewey, R.E. Molecular analysis of mutant alleles for elevated palmitate concentration in soybean. Crop Sci. 2011, 51, 2554–2560.
  40. Ma, W.; Kong, Q.; Mantyla, J.J.; Yang, Y.; Ohlrogge, J.B.; Benning, C. 14-3-3 protein mediates plant seed oil biosynthesis through interaction with AtWRI1. Plant J. 2016, 88, 228–235.
  41. Li, D.; Jin, C.; Duan, S.; Zhu, Y.; Qi, S.; Liu, K.; Gao, C.; Ma, H.; Zhang, M.; Liao, Y. MYB89 transcription factor represses seed oil accumulation. Plant Physiol. 2017, 173, 1211–1225.
  42. Chawla, R.; Poonia, A.; Samantara, K.; Mohapatra, S.R.; Naik, S.B.; Ashwath, M.; Djalovic, I.G.; Prasad, P.V. Green revolution to genome revolution: Driving better resilient crops against environmental instability. Front. Genet. 2023, 14, 1204585.
  43. Zhang, H.; Lang, Z.; Zhu, J.-K. Dynamics and function of DNA methylation in plants. Nat. Rev. Mol. Cell Biol. 2018, 19, 489–506.
  44. Salami, M.; Heidari, B.; Batley, J.; Wang, J.; Tan, X.-L.; Richards, C.; Tan, H. Integration of genome-wide association studies, metabolomics, and transcriptomics reveals phenolic acid-and flavonoid-associated genes and their regulatory elements under drought stress in rapeseed flowers. Front. Plant Sci. 2024, 14, 1249142.
  45. McKenna, A.; Hanna, M.; Banks, E.; Sivachenko, A.; Cibulskis, K.; Kernytsky, A.; Garimella, K.; Altshuler, D.; Gabriel, S.; Daly, M. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010, 20, 1297–1303.
  46. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009, 25, 1754–1760.
  47. Purcell, S.; Neale, B.; Todd-Brown, K.; Thomas, L.; Ferreira, M.A.; Bender, D.; Maller, J.; Sklar, P.; De Bakker, P.I.; Daly, M.J. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007, 81, 559–575.
  48. Zhang, C.; Dong, S.-S.; Xu, J.-Y.; He, W.-M.; Yang, T.-L. PopLDdecay: A fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics 2019, 35, 1786–1788.
  49. Raj, A.; Stephens, M.; Pritchard, J.K. fastSTRUCTURE: Variational inference of population structure in large SNP data sets. Genetics 2014, 197, 573–589.
  50. Lipka, A.E.; Tian, F.; Wang, Q.; Peiffer, J.; Li, M.; Bradbury, P.J.; Gore, M.A.; Buckler, E.S.; Zhang, Z. GAPIT: Genome association and prediction integrated tool. Bioinformatics 2012, 28, 2397–2399.
  51. Xin, M.; Zhang, Z.; Han, Y.; Feng, L.; Lei, Y.; Li, X.; Wu, F.; Wang, J.; Wang, Z.; Li, Y. Soybean phenological changes in response to climate warming in three northeastern provinces of China. Field Crops Res. 2023, 302, 109082.
  52. Song, W.; Sun, S.; Wu, T.; Yang, R.; Tian, S.; Xu, C.; Jiang, B.; Yuan, S.; Hou, W.; Wu, C. Geographic distributions and the regionalization of soybean seed compositions across China. Food Res. Int. 2023, 164, 112364.
  53. Bu, M.; Fan, W.; Li, R.; He, B.; Cui, P. Lipid metabolism and improvement in oilseed crops: Recent advances in multi-omics studies. Metabolites 2023, 13, 1170.
  54. Wei, W.; Wang, L.F.; Tao, J.J.; Zhang, W.K.; Chen, S.Y.; Song, Q.; Zhang, J.S. The comprehensive regulatory network in seed oil biosynthesis. J. Integr. Plant Biol. 2025, 67, 649–668.
  55. Jones, J.D.; Dangl, J.L. The plant immune system. Nature 2006, 444, 323–329.
  56. Mackay, T.F. Epistasis and quantitative traits: Using model organisms to study gene–gene interactions. Nat. Rev. Genet. 2014, 15, 22–33.
  57. Cesari, S.; Thilliez, G.; Ribot, C.; Chalvon, V.; Michel, C.; Jauneau, A.; Rivas, S.; Alaux, L.; Kanzaki, H.; Okuyama, Y. The rice resistance protein pair RGA4/RGA5 recognizes the Magnaporthe oryzae effectors AVR-Pia and AVR1-CO39 by direct binding. Plant Cell 2013, 25, 1463–1481.
  58. Wu, J.; Zhu, J.; Wang, L.; Wang, S. Genome-wide association study identifies NBS-LRR-encoding genes related with anthracnose and common bacterial blight in the common bean. Front. Plant Sci. 2017, 8, 1398.
  59. Huang, C.; Nie, X.; Shen, C.; You, C.; Li, W.; Zhao, W.; Zhang, X.; Lin, Z. Population structure and genetic basis of the agronomic traits of upland cotton in China revealed by a genome-wide association study using high-density SNPs. Plant Biotechnol. J. 2017, 15, 1374–1386.
  60. Wang, P.; Di, Q.; Liu, X.-Y. Genome-wide association study identifies candidate genes related to oleic acid content of soybean seed. BMC Plant Biol. 2020, 20, 399.
  61. Vuong, T.D.; Florez-Palacios, L.; Mozzoni, L.; Clubb, M.; Quigley, C.; Song, Q.; Kadam, S.; Yuan, Y.; Chang, T.F.; Mian, M.A.R.; et al. Genomic analysis and characterization of new loci associated with seed protein and oil content in soybeans. Plant Genome 2023, 16, e20400.
  62. Zhang, J.; Wang, X.; Lu, Y.; Bhusal, S.J.; Song, Q.; Cregan, P.B.; Yen, Y.; Brown, M.; Jiang, G.L. Genome-wide scan for seed composition provides insights into soybean quality improvement and the impacts of domestication and breeding. Mol. Plant 2018, 11, 460–472.
  63. Serson, W.R.; Gishini, M.F.S.; Stupar, R.M.; Stec, A.O.; Armstrong, P.R.; Hildebrand, D. Identification and Candidate Gene Evaluation of a Large Fast Neutron-Induced Deletion Associated with a High-Oil Phenotype in Soybean Seeds. Genes 2024, 15, 892.
  64. Leamy, L.J.; Zhang, H.; Li, C.; Chen, C.Y.; Song, B.-H. A genome-wide association study of seed composition traits in wild soybean (Glycine soja). BMC Genom. 2017, 18, 18.
  65. Rolling, W.R. A study of Phytophthora sojae Resistance in Soybean (Glycine max [L. Merr]) using Genome-Wide Association Analyses and Genomic Prediction; The Ohio State University: Columbus, OH, USA, 2020.
  66. Ye, J.; Niu, X.; Yang, Y.; Wang, S.; Xu, Q.; Yuan, X.; Yu, H.; Wang, Y.; Wang, S.; Feng, Y. Divergent Hd1, Ghd7, and DTH7 alleles control heading date and yield potential of japonica rice in Northeast China. Front. Plant Sci. 2018, 9, 35.
  67. Jiang, D.; Zhong, S.; McPeek, M.S. Retrospective binary-trait association test elucidates genetic architecture of Crohn disease. Am. J. Hum. Genet. 2016, 98, 243–255.
  68. Wang, J.; Tang, Y.; Zhang, Z. Performing genome-wide association studies with multiple models using GAPIT. In Genome-Wide Association Studies; Springer: Berlin/Heidelberg, Germany, 2022; pp. 199–217.
  69. Zuo, Z.; Li, M.; Liu, D.; Li, Q.; Huang, B.; Ye, G.; Wang, J.; Tang, Y.; Zhang, Z. GWAS Procedures for Gene Mapping in Diverse Populations With Complex Structures. Bio-Protocol 2025, 15, e5284.
  70. Tibbs Cortes, L.; Zhang, Z.; Yu, J. Status and prospects of genome-wide association studies in plants. Plant Genome 2021, 14, e20077.
  71. Liu, X.; Huang, M.; Fan, B.; Buckler, E.S.; Zhang, Z. Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLoS Genet. 2016, 12, e1005767.
  72. Neupane, B. Systematic Comparison of GWAS Methods in Wheat: Balancing Statistical Power, False Positive Control, and Computational Efficiency. Preprints 2025.
  73. Yu, J.; Pressoir, G.; Briggs, W.H.; Vroh Bi, I.; Yamasaki, M.; Doebley, J.F.; McMullen, M.D.; Gaut, B.S.; Nielsen, D.M.; Holland, J.B. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006, 38, 203–208.
  74. Miao, C.; Yang, J.; Schnable, J.C. Optimising the identification of causal variants across varying genetic architectures in crops. Plant Biotechnol. J. 2019, 17, 893–905.
  75. Yang, J.; Ren, Y.; Zhang, D.; Chen, X.; Huang, J.; Xu, Y.; Aucapiña, C.B.; Zhang, Y.; Miao, Y. Transcriptome-based WGCNA analysis reveals regulated metabolite fluxes between floral color and scent in Narcissus tazetta flower. Int. J. Mol. Sci. 2021, 22, 8249.
  76. Ge, Z.; Liu, X.; Liu, B.; Abe, J.; Ma, F.; Kong, F. QTL mapping of soybean seed protein and oil trait. Soybean Sci. 2011, 30, 5.
  77. Yang, Y. Fine Mapping and Candidate Gene Identification of a Soybean Seed Protein and Oil Qtl from a Wild Soybean Accession and Linkage Analysis for Whole Plant Biomass, Carbon, Nitrogen, and Seed Composition Using a RIL Mapping Population. Master’s Thesis, University of Missouri-Columbia, Columbia, MO, USA, 2021.
  78. Gillenwater, J.H. QTL Mapping of Seed Composition Traits and Assessment of Yield in Two Soybean Populations; North Carolina State University: Raleigh, NC, USA, 2020.
Figure 1. Distribution and comparison of seed oil content (%) across two growing seasons (2021 and 2023) along with their average in a natural soybean population. (A–C) Histograms overlaid with density curves demonstrate the distribution of oil content for the years 2021 and 2023, along with the two-year average. For each year, the mean, standard deviation (SD), and coefficient of variation (CV%) are presented. (D) Violin plots illustrate the distribution of oil content across different years. Letters (a–c) denote significant differences across years as determined by Tukey’s HSD test at a significance level of p < 0.05. (E) Correlation plot illustrating the relationship between oil content values recorded in 2021 and 2023, featuring a fitted regression line (solid) and a 95% confidence interval represented by a shaded band. A strong positive correlation has been observed, with a correlation coefficient of r = 0.84 and a p-value of less than 0.001. (F) Ridge plots illustrate the comparison of kernel density estimates for oil content over the years, emphasizing the shift in and narrowing of the distribution from 2021 to 2023.
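The summary statistics named in this caption can be sketched in a few lines. This is an illustration only: the function names and any input values are hypothetical, and it assumes that CV% is the sample standard deviation divided by the mean, times 100, and that r is the Pearson coefficient, neither of which the caption states explicitly.

```python
import math
import statistics


def cv_percent(values):
    """Coefficient of variation in percent: sample SD / mean * 100 (assumed definition)."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of measurements."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    # Sum of cross-products of deviations from the means
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Product of the two sums of squared deviations
    den = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return num / den
```

Applied to per-accession oil content from the two seasons, `pearson_r` would give the kind of between-year correlation reported in panel E, and `cv_percent` the per-year CV% shown in panels A to C.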
Figure 2. Comprehensive analysis of soybean oil content changes across 341 soybean germplasms from five growing regions in Northern China (2021–2023). (A) Geographic distribution of soybean growing regions with the percentage of total production area indicated. (B) Seed oil content distribution by region, comparing 2021 (left panel) and 2023 (right panel) measurements using violin plots with box plots overlaid. (C) Absolute change in oil content (percentage points) between 2021 and 2023. (D) Percent change (%) in oil content by region. The table provides summary statistics, including the mean absolute change, mean percent change, percentage of samples showing an increase, and sample size for each region.
Figure 3. SNP distribution and genetic mapping data of the population. (A) LD decay of the GWAS population; (B) population structure of the soybean germplasm collection reflected by principal components; (C) the first three principal components of the 1,048,576 SNPs used in the GWAS; (D) a heatmap of the kinship matrix of the 341 soybean germplasm accessions.
Figure 4. Manhattan plots and Q–Q plots from genome-wide association analysis of soybean oil content. Left panels (a–c) show Manhattan plots where the X-axis represents SNP positions across 20 chromosomes (GM01–GM20) and the Y-axis indicates −log10(p) values. The significance threshold (dashed line) was set at −log10(p) > 6.20. Right panels (d–f) display corresponding Q–Q plots showing observed versus expected −log10(p) values. Data are presented for 2021 (a,d), 2023 (b,e), and the two-year average (c,f).
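To make the threshold in this caption concrete, the following minimal sketch converts the stated cutoff of −log10(p) > 6.20 into a raw p-value and checks individual p-values against it. The function and constant names are hypothetical and are not part of the authors' pipeline.

```python
import math

# Cutoff on the -log10(p) scale, as stated in the Figure 4 caption
THRESHOLD = 6.20

# Equivalent raw p-value cutoff: 10 ** -6.20, roughly 6.31e-07
P_CUTOFF = 10 ** (-THRESHOLD)


def neg_log10(p):
    """Transform a p-value to the -log10 scale used on the Manhattan plot Y-axis."""
    return -math.log10(p)


def is_significant(p, threshold=THRESHOLD):
    """True if the p-value lies above the dashed significance line."""
    return neg_log10(p) > threshold
```

For instance, a p-value of 2.80 × 10^−7 corresponds to −log10(p) ≈ 6.55 and passes the threshold, whereas a p-value of 1 × 10^−5 (−log10(p) = 5) does not.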
Table 1. Significant SNP markers related to oil content.

| Year | Chr | Range Start (bp) | Range End (bp) | p Value | Effect | Number of Significant SNP Loci | Note |
|---|---|---|---|---|---|---|---|
| 2021 | 8 | 8,038,113 | 9,503,562 | 8.85 × 10^−8 | −1.38 | 27 | Reduce oil content |
| 2021 | 8 | 8,439,205 | 9,501,912 | 4.78 × 10^−8 | 1.45 | 21 | Increase oil content |
| 2021 | 11 | 11,087,536 | 11,087,536 | 1.79 × 10^−7 | 0.89 | 1 | Increase oil content |
| 2021 | 11 | 11,103,604 | 11,203,604 | 0.02 | −0.86 | 1 | Reduce oil content |
| 2021 | 20 | 15,764,307 | 15,764,307 | 2.80 × 10^−7 | −1.01 | 1 | Reduce oil content |
| 2023 | 1 | 10,828,034 | 10,841,167 | 1.91 × 10^−7 | −1.08 | 3 | Reduce oil content |
| 2023 | 1 | 10,537,841 | 10,904,178 | 2.46 × 10^−7 | 1.07 | 4 | Increase oil content |
| 2023 | 8 | 9,005,247 | 9,079,037 | 1.41 × 10^−9 | −1.58 | 7 | Reduce oil content |
| 2023 | 8 | 9,005,430 | 9,078,617 | 2.94 × 10^−8 | 1.45 | 12 | Increase oil content |
| 2023 | 8 | 42,038,411 | 42,038,411 | 2.19 × 10^−7 | 1.10 | 1 | Increase oil content |
| 2023 | 13 | 45,137,495 | 45,137,495 | 1.86 × 10^−7 | −1.12 | 1 | Reduce oil content |
| 2023 | 13 | 45,140,366 | 45,163,611 | 1.56 × 10^−7 | 1.13 | 2 | Increase oil content |
| Average | 2 | 14,714,159 | 14,714,159 | 1.16 × 10^−7 | −0.97 | 1 | Reduce oil content |
| Average | 4 | 15,518,980 | 15,518,980 | 2.16 × 10^−7 | −1.08 | 1 | Increase oil content |
| Average | 5 | 16,157,283 | 16,157,283 | 1.84 × 10^−7 | 3.35 | 1 | Reduce oil content |
| Average | 7 | 16,529,749 | 16,529,749 | 1.12 × 10^−7 | 0.94 | 1 | Increase oil content |
| Average | 8 | 8,432,800 | 9,501,864 | 6.65 × 10^−8 | −1.32 | 13 | Reduce oil content |
| Average | 8 | 7,801,179 | 9,501,912 | 2.63 × 10^−8 | 1.37 | 13 | Increase oil content |
| Average | 8 | 42,039,721 | 42,039,740 | 5.65 × 10^−8 | 1.13 | 2 | Increase oil content |
| Average | 8 | 42,039,735 | 42,039,735 | 1.80 × 10^−7 | −1.03 | 1 | Reduce oil content |
| Average | 11 | 33,092,710 | 33,092,710 | 2.42 × 10^−7 | 1.04 | 1 | Increase oil content |
Table 2. Information on SNP loci associated with soybean seed oil content on chromosome 8.

| Year | Gene | Position (bp) | p Value | MAF | DNA Sequence Variation | Protein Sequence Variation | PROVEAN Score |
|---|---|---|---|---|---|---|---|
| 2021 | Glyma.08G110000 | 8,439,205 | 2.51 × 10^−7 | 0.08 | A739C | T247P | 0.031 |
| 2021 | Glyma.08G117400 | 9,074,920 | 8.62 × 10^−11 | 0.05 | G682T | G228C | 1.450 |
| 2021 | Glyma.08G117600 | 9,501,436 | 6.76 × 10^−12 | 0.05 | G698A | A233V | 1.233 |
| 2021 | Glyma.08G123500 | 9,501,620 | 2.36 × 10^−7 | 0.06 | G2281A | V761I | −0.200 |
| 2021 | Glyma.08G123500 | 9,501,695 | 5.54 × 10^−8 | 0.06 | G1650A | E550D | −0.667 |
| 2021 | Glyma.08G123500 | 9,074,920 | 2.61 × 10^−7 | 0.06 | G1567C | E523Q | −0.200 |
| 2021 | Glyma.08G123500 | 9,074,920 | 2.19 × 10^−7 | 0.06 | T1555C | C519R | −0.017 |
| 2021 | Glyma.08G123500 | 9,501,695 | 6.45 × 10^−9 | 0.06 | A1508G | E503G | −1.633 |
| 2021 | Glyma.08G123500 | 9,501,912 | 2.03 × 10^−8 | 0.06 | G1490C | L497P | −0.083 |
| 2021 | Glyma.08G123500 | 9,501,456 | 3.40 × 10^−8 | 0.06 | T1488G | F496L | 0.933 |
| 2021 | Glyma.08G123500 | 9,501,556 | 1.11 × 10^−7 | 0.06 | T1388A | I463N | 0.592 |
| 2021 | Glyma.08G123500 | 9,501,564 | 6.32 × 10^−8 | 0.06 | T1380G | F460E | −1.092 |
| 2021 | Glyma.08G123500 | 9,501,586 | 2.39 × 10^−7 | 0.06 | T1358A | V453E | 0.200 |
| 2021 | Glyma.08G123500 | 9,501,695 | 9.17 × 10^−10 | 0.06 | A1249G | K417E | 0.058 |
| 2021 | Glyma.08G123500 | 9,501,776 | 3.87 × 10^−8 | 0.06 | G1168A | D390N | 0.133 |
| 2021 | Glyma.08G123500 | 9,501,864 | 9.23 × 10^−8 | 0.06 | C1080A | D360E | −0.033 |
| 2021 | Glyma.08G123500 | 9,501,912 | 4.54 × 10^−8 | 0.06 | C1032G | N344K | 0.350 |
| 2021 | Glyma.08G123500 | 9,502,316 | 2.47 × 10^−7 | 0.05 | T628G | S210A | 0.383 |
| 2023 | Glyma.08G117400 | 9,054,741 | 4.94 × 10^−12 | 0.05 | G682T | G228C | 1.450 |
| 2023 | Glyma.08G117600 | 9,074,920 | 8.71 × 10^−14 | 0.05 | C698T | A233V | 1.233 |
| Average | Glyma.08G117400 | 9,054,741 | 7.46 × 10^−11 | 0.05 | G682T | G228C | 1.450 |
| Average | Glyma.08G117600 | 9,074,920 | 5.44 × 10^−13 | 0.05 | C698T | A233V | 1.233 |
| Average | Glyma.08G123500 | 9,500,663 | 2.60 × 10^−7 | 0.06 | G2281A | V761I | −0.200 |
| Average | Glyma.08G123500 | 9,501,454 | 1.47 × 10^−7 | 0.07 | T1490C | L497P | −0.083 |
| Average | Glyma.08G123500 | 9,501,456 | 7.61 × 10^−8 | 0.06 | T1488G | F496L | 0.933 |
| Average | Glyma.08G123500 | 9,501,695 | 3.88 × 10^−9 | 0.06 | A1249G | K417E | 0.058 |
| Average | Glyma.08G123500 | 9,501,864 | 1.66 × 10^−7 | 0.07 | C1080A | D360E | −0.033 |
| Average | Glyma.08G123500 | 9,501,912 | 1.84 × 10^−7 | 0.07 | C1032G | N344K | 0.350 |
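
The DNA and protein variation columns of Table 2 are linked by simple codon arithmetic. The sketch below assumes that the number in a DNA label such as A739C is a 1-based coding-sequence position (an assumption; the table does not say so). Under that assumption, nucleotide n falls in codon ceil(n / 3), which should equal the residue number in the protein label. All helper names are hypothetical.

```python
import re


def parse_variant(label):
    """Split a label such as 'A739C' or 'T247P' into (reference, position, alternate)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"unrecognised variant label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def codon_number(cds_pos):
    """1-based codon index containing a 1-based coding-sequence position."""
    return (cds_pos - 1) // 3 + 1


def columns_agree(dna_label, protein_label):
    """Check that the DNA position maps to the residue number in the protein label."""
    _, nt_pos, _ = parse_variant(dna_label)
    _, aa_pos, _ = parse_variant(protein_label)
    return codon_number(nt_pos) == aa_pos
```

For example, `columns_agree("A739C", "T247P")` holds because position 739 is the first base of codon 247 (739 = 3 × 246 + 1).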
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Wang, X.; Zhang, M.; Li, F.; Liu, X.; Zhang, C.; Zhang, F.; Zhao, K.; Yuan, R.; Lamlom, S.F.; Ren, H.; et al. Genome-Wide Association Study Reveals Key Genetic Loci Controlling Oil Content in Soybean Seeds. Agronomy 2025, 15, 1889. https://doi.org/10.3390/agronomy15081889
