GWAS and WGCNA Analysis Uncover Candidate Genes Associated with Oil Content in Soybean

Soybean vegetable oil is an important source of the human diet. However, the analysis of the genetic mechanism leading to changes in soybean oil content is still incomplete. In this study, a total of 227 soybean materials were applied and analyzed by a genome-wide association study (GWAS). There are 44 quantitative trait nucleotides (QTNs) that were identified as associated with oil content. A total of six, four, and 34 significant QTN loci were identified in Xiangyang, Hulan, and Acheng, respectively. Of those, 26 QTNs overlapped with or were near the known oil content quantitative trait locus (QTL), and 18 new QTNs related to oil content were identified. A total of 594 genes were located near the peak single nucleotide polymorphism (SNP) from three tested environments. These candidate genes exhibited significant enrichment in tropane, piperidine, and pyridine alkaloid biosynthesiss (ko00960), ABC transporters (ko02010), photosynthesis-antenna proteins (ko00196), and betalain biosynthesis (ko00965). Combined with the GWAS and weighted gene co-expression network analysis (WGCNA), four candidate genes (Glyma.18G300100, Glyma.11G221100, Glyma.13G343300, and Glyma.02G166100) that may regulate oil content were identified. In addition, Glyma.18G300100 was divided into two main haplotypes in the studied accessions. The oil content of haplotype 1 is significantly lower than that of haplotype 2. Our research findings provide a theoretical basis for improving the regulatory mechanism of soybean oil content.


Introduction
Soybean oil, as a main source of the human diet, is closely related to our daily lives [1].Plant oil, also referred to as plant fat, primarily originates from the seeds of plants.The composition of vegetable oil primarily consists of five fatty acids, which collectively account for 98.4% of the total oil content, including palmitic acid (16:0), stearic acid (18:0), oleic acid (18:1), linoleic acid (18:2), and linolenic acid (18:3) [2,3].Soybean oil is a prominent cooking oil and plays a pivotal role in disease prevention, so its synthesis mechanism has always been a research hotspot [4,5].Therefore, revealing the characteristics associated with oil synthesis in plant seeds can help contribute to enhancing oil yield and quality.
In plants, the accumulation of seed oil primarily occurs in the form of triacylglycerol (TAG) [6].The DGAT gene predominantly governs the process involved in triacylglycerol (TAG) synthesis.Research has found that overexpression of the DGAT gene significantly increases the accumulation of plant seed oil content [7].In soybeans, GmOLEO1 enhances oil accumulation by affecting the synthesis of triacylglycerol (TAG) [8].The GmSWEET39 gene exerts a positive influence on increasing the total oil content of soybeans and Arabidopsis [3].Research has shown that the mitochondrial gene orf188 exerts an influence on the oil content of rapeseed [9].Previous studies have demonstrated that WRI1, as a pivotal transcription factor regulating oil metabolism, can regulate key genes in glycolysis and Plants 2024, 13, 1351 2 of 14 fatty acid synthesis pathways, thereby exerting further influence on oil synthesis [10][11][12].In addition, transcription factors such as LEC1, LEC2, MYB, ABI3, and bZIP play an important role in regulating seed oil accumulation [13][14][15][16].With the development of highthroughput sequencing technology and genome-wide association analysis, it has become widely employed for the identification of QTLs/genes associated with various agronomic traits, for example, maize, soybeans, rice, and rapeseed.Currently, the identification of over 300 QTLs associated with seed oil content has been accomplished.A GWAS was employed to analyze the grain oil of 533 (305 indica subpopulation and 178 japonica subpopulation) rice accessions, where a total of 94 QTLs were identified to be associated with oil content, and the qPAL6 locus was detected for C16:0 composition [17].A total of 3,290,923 SNPs were identified with 320 soybean accessions, 29 QTLs were identified to be significantly associated with oil content, while 24 loci are likely to be new [18].Previous studies conducted GWAS analysis on 278 soybean materials using two models, and three significant QTLs were identified.Among them, the significant SNP (ss715637321 on chr20: 32835139) overlapped with the known large-effect oil QTL loci [19].Duan et al. conducted a GWAS on over 1800 soybean materials, and the identification of a significant QTL locus on chromosome 5 controlling seed thickness was identified.Further investigation revealed that the allelic variation of the candidate gene GmST05 is the main factor affecting seed size in the soybean germplasm [20].Qi et al. conducted GWAS analysis on the fatty acid content of 547 soybean materials, identified a significant SNP site on chromosome 9, and discovered a SEIPIN homologous gene that plays an important role in regulating fatty acid synthesis [21].Li et al. conducted QTL mapping of oil content in the recombinant inbred line (RIL) population, and 5 QTLs related to oil content were identified.A total of 20 candidate genes were screened [22].
Soybean oil content is a quantitative trait that is governed by multiple genes.The GWAS based on natural populations has more recombination events compared to biparental populations, thereby leading to enhanced accuracy in phenotype association [23,24].The GWAS has been employed to identify genomic regions associated with plant traits, including oil and protein content, yield, quality, and biotic and abiotic stress [25][26][27].The weighted gene co-expression network analysis (WGCNA) enables the extraction of relevant genes from phenotype data and is extensively employed for investigating intricate relationships among different types of genes [28,29].The WGCNA can be employed to further identify and prioritize potential candidate genes.
To elucidate the underlying mechanism of soybean oil biosynthesis, soybean oil content was analyzed through the GWAS using 227 soybean resources.Then, transcriptome data from 30 soybean seeds with high and low oil content were applied for WGCNA analysis.This study utilized the strategy of GWAS and WGCNA joint analysis to identify the putative regulatory genes governing oil content.

Phenotypic Variation of Oil Content
A total of 227 soybean resources were studied and the distribution is presented in Table 1.The oil content of the test population at three locations was 16.9-22.6%(Xiangyang), 17-23.8%(Hulan), and 17-24.6%(Acheng), respectively.The phenotypic variations of oil content were 4.18%, 3.62%, and 4.68% across three different environments (Xiangyang, Hulan, and Acheng), respectively (Figure 1, Table 1).The above results indicate that there are significant differences in oil content among the tested populations, as well as the abundant genetic diversity of germplasm resources, providing favorable conditions for the screening of specific germplasms.

Population Structure and GWAS Analysis
In this study, specific-locus amplified fragment sequencing (SLAF-seq) was applied to analyze 227 soybean germplasm resources.A total of 23,150 high-quality SNPs were selected (MAF > 0.05, missing data < 10%).The obtained SNPs are evenly distributed on 20 chromosomes of soybean (Figure S1A).By analyzing the principal components and phylogenetic relationships of SNPs, the changes in principal component analysis revealed an inflection point at PC3 (Figure S1B).These results showed that the first three phylogenetic relationships dominated the population structure on the association mapping (Figure S1C).Based on the pairwise relative kinship coefficients of association panel analysis, the 227 germplasm resources have low levels of genetic correlation (Figure S1D).The decay distance of LD is approximately 200 kb (Figure S2).

Population Structure and GWAS Analysis
In this study, specific-locus amplified fragment sequencing (SLAF-seq) was applied to analyze 227 soybean germplasm resources.A total of 23,150 high-quality SNPs were selected (MAF > 0.05, missing data < 10%).The obtained SNPs are evenly distributed on 20 chromosomes of soybean (Figure S1A).By analyzing the principal components and phylogenetic relationships of SNPs, the changes in principal component analysis revealed an inflection point at PC3 (Figure S1B).These results showed that the first three phylogenetic relationships dominated the population structure on the association mapping (Figure S1C).Based on the pairwise relative kinship coefficients of association panel analysis, the 227 germplasm resources have low levels of genetic correlation (Figure S1D).The decay distance of LD is approximately 200 kb (Figure S2).

Gene Enrichment Analysis of Candidate Genes
Gene enrichment analysis was performed to determine candidate genes for the regulating of oil content.The 200 kb genomic regions (100 kb on both sides) of each significant SNP locus in the GWAS results are identified as candidate genes.A total of 594 genes were located near the peak SNPs from three tested environments (Table S1).To further understand the potential function of candidate genes, the candidate genes were subjected to KEGG pathway analysis.These candidate genes exhibited significant enrichment in tropane, piperidine, and pyridine alkaloid biosynthesiss (ko00960), ABC transporters (ko02010), photosynthesis-antenna proteins (ko00196), and betalain biosynthesis (ko00965) (Figure S3).Among these identified candidate genes was Glyma.18G299300, an alpha/beta-Hydrolases superfamily protein located near rs20739 of Chr.18.This gene is an enzyme that promotes the hydrolysis of ester bonds between fatty acids and glycerol [43].Glyma.13G129900, a GDSL-like Lipase/Acylhydrolase superfamily protein (located near rs13422 of Chr.13), plays a pivotal function of regulating seed oil content [44].Glyma.03G260300(located near rs3413 of Chr.3), a 3-ketoacyl-CoA synthase 1 protein, has been proven to improve the synthesis of long-chain fatty acids in seeds [45].Glyma.18G299600(located near rs20739 of Chr.18), a Phosphoenolpyruvate carboxylase family protein, has the function of negatively regulating oil content [46].

Identification of Key Modules Possessing Candidate Genes via WGCNA
In order to further identify novel genes involved in the regulation of oil synthesis, transcriptome data from 30 soybean seeds with high and low oil content were applied for WGCNA analysis.Three modules were obtained in this study, represented by different colors in Figure 3A.Analysis of the relationship between modules-traits showed a relatively high correlation between one module and oil, including the MEblack module (r = 0.5, p = 4 × 10 −5 ), where there are 863 genes strongly associated with oil content (Figure 3B).
To understand the biological significance of co-expression networks, the genes in the MEblack module were subjected to Gene Ontology (GO) annotations and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis.The KEGG enrichment analysis revealed a significant enrichment of these genes in the metabolic pathways associated with fatty acid metabolism, fatty acid biosynthesis, fatty acid degradation, biosynthesis of unsaturated fatty acid, glycerolipid metabolism, pyruvate metabolism and alpha-linolenic acid metabolism (Figure 4A).Next, the present study performed GO annotations analysis on all genes within the MEblack module.As shown in Figure 4B, the most significant annotations terms were identified in the biological processes category, and the top five significantly enriched terms were found to be related to the pigment metabolic process (GO:0042440), a negative regulation of response to stimulus (GO:0048585), protein dephosphorylation (GO:0006470), porphyrin-containing compound metabolic process (GO:0006778), and hyperosmotic response (GO:0006972) (Figure 4B).
In order to further identify novel genes involved in the regulation of oil synthesis, transcriptome data from 30 soybean seeds with high and low oil content were applied for WGCNA analysis.Three modules were obtained in this study, represented by different colors in Figure 3A.Analysis of the relationship between modules-traits showed a relatively high correlation between one module and oil, including the MEblack module (r = 0.5, p = 4 × 10 −5 ), where there are 863 genes strongly associated with oil content (Figure 3B).To understand the biological significance of co-expression networks, the genes in the MEblack module were subjected to Gene Ontology (GO) annotations and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis.The KEGG enrichment analysis revealed a significant enrichment of these genes in the metabolic pathways associated with fatty acid metabolism, fatty acid biosynthesis, fatty acid degradation, biosynthesis of unsaturated fatty acid, glycerolipid metabolism, pyruvate metabolism and alpha-linolenic acid metabolism (Figure 4A).Next, the present study performed GO annotations analysis on all genes within the MEblack module.As shown in Figure 4B, the most significant annotations terms were identified in the biological processes category, and the top five significantly enriched terms were found to be related to the pigment metabolic process (GO:0042440), a negative regulation of response to stimulus (GO:0048585), protein dephosphorylation (GO:0006470), porphyrin-containing compound metabolic process (GO:0006778), and hyperosmotic response (GO:0006972) (Figure 4B).Furthermore, in order to obtain key genes that regulate oil synthesis, the genes in the MEblack module were screened based on a correlation greater than 0.57 as the threshold.A total of 863 linear networks were obtained and visualized using Cytoscape 3.9.1 software.As shown in Figure 5, four genes were identified in the GWAS results, including Glyma.18G300100,Glyma.11G221100,Glyma.13G343300, and Glyma.02G166100.Further analysis of candidate gene expression patterns revealed that the Glyma.18G300100,Glyma.13G343300, and Glyma.02G166100genes upregulated expression, and the Glyma.11G221100gene downregulated expression (Figure S4).Based on co-expression networks, the genes identified by the GWAS results are significantly correlated with other genes.Glyma.18G070600 and Glyma.18G300100 were found to be significantly correlated (r Furthermore, in order to obtain key genes that regulate oil synthesis, the genes in the MEblack module were screened based on a correlation greater than 0.57 as the threshold.A total of 863 linear networks were obtained and visualized using Cytoscape 3.9.1 software.As shown in Figure 5, four genes were identified in the GWAS results, including Glyma.18G300100,Glyma.11G221100,Glyma.13G343300, and Glyma.02G166100.Further analysis of candidate gene expression patterns revealed that the Glyma.18G300100,Glyma.13G343300, and Glyma.02G166100genes upregulated expression, and the Glyma.11G221100gene downregulated expression (Figure S4).Based on co-expression networks, the genes identified by the GWAS results are significantly correlated with other genes.Glyma.18G070600 and Glyma.18G300100 were found to be significantly correlated (r > 0.58).Glyma.11G221100,Glyma.04G062400, Glyma.18G070600,Glyma.04G130300,Glyma.07G273900, and Glyma.09G212100 were found to be significantly correlated (r > 0.57).Glyma.13G343300 and Glyma.18G070600 were Plants 2024, 13, 1351 7 of 14 found to be significantly correlated (r > 0.57).Glyma.02G166100 and Glyma.18G070600 were found to be significantly correlated (r > 0.57) (Figure 5, Table S2).

Gene-Based Association and Haplotype Analysis of Candidate Genes
To further elucidate the association between candidate genes and oil content, genebased associations and haplotype analysis were calculated using the GLM method.The SNP extraction of the Glyma.13G343300gene identified six SNPs, and subsequent correlation analysis revealed no significant association between these SNPs and oil content (−log(p) < 2.5).Similarly, the SNPs extracted from Glyma.11G221100 and Glyma.02G166100 were found to be three and two SNPs, respectively, and there was no significant correlation with oil content (−log(p) < 2.5).However, based on the association analysis, two SNPs were identifed in the exonic and upstream regions of Glyma.18G300100, and the SNP variation distance between exonic and upstream regions was 1273 bp (Figure 6A).The variation in the CDS region occurs on the first exonic region and the variation in exonic region belongs to synonym mutation and has no effect on protein change.The SNP markers rs57781456 and rs57782730 showed a significant association with oil content (−log(p) > 2.5).Glyma.18G300100 was divided into two predominant haplotypes in the studied accessions.The oil content of haplotype 1 is significantly lower than that of haplotype 2 (Figure 6B).

Gene-Based Association and Haplotype Analysis of Candidate Genes
To further elucidate the association between candidate genes and oil content, genebased associations and haplotype analysis were calculated using the GLM method.The SNP extraction of the Glyma.13G343300gene identified six SNPs, and subsequent correlation analysis revealed no significant association between these SNPs and oil content (−log(p) < 2.5).Similarly, the SNPs extracted from Glyma.11G221100 and Glyma.02G166100 were found to be three and two SNPs, respectively, and there was no significant correlation with oil content (−log(p) < 2.5).However, based on the association analysis, two SNPs were identifed in the exonic and upstream regions of Glyma.18G300100, and the SNP variation distance between exonic and upstream regions was 1273 bp (Figure 6A).The variation in the CDS region occurs on the first exonic region and the variation in exonic region belongs to synonym mutation and has no effect on protein change.The SNP markers rs57781456 and rs57782730 showed a significant association with oil content (−log(p) > 2.5).Glyma.18G300100 was divided into two predominant haplotypes in the studied accessions.The oil content of haplotype 1 is significantly lower than that of haplotype 2 (Figure 6B).> 0.58).Glyma.11G221100,Glyma.04G062400,Glyma.18G070600,Glyma.04G130300,Glyma.07G273900, and Glyma.09G212100 were found to be significantly correlated (r > 0.57).Glyma.13G343300 and Glyma.18G070600 were found to be significantly correlated (r > 0.57).Glyma.02G166100 and Glyma.18G070600 were found to be significantly correlated (r > 0.57) (Figure 5, Table S2).

Gene-Based Association and Haplotype Analysis of Candidate Genes
To further elucidate the association between candidate genes and oil content, genebased associations and haplotype analysis were calculated using the GLM method.The SNP extraction of the Glyma.13G343300gene identified six SNPs, and subsequent correlation analysis revealed no significant association between these SNPs and oil content (−log(p) < 2.5).Similarly, the SNPs extracted from Glyma.11G221100 and Glyma.02G166100 were found to be three and two SNPs, respectively, and there was no significant correlation with oil content (−log(p) < 2.5).However, based on the association analysis, two SNPs were identifed in the exonic and upstream regions of Glyma.18G300100, and the SNP variation distance between exonic and upstream regions was 1273 bp (Figure 6A).The variation in the CDS region occurs on the first exonic region and the variation in exonic region belongs to synonym mutation and has no effect on protein change.The SNP markers rs57781456 and rs57782730 showed a significant association with oil content (−log(p) > 2.5).Glyma.18G300100 was divided into two predominant haplotypes in the studied accessions.The oil content of haplotype 1 is significantly lower than that of haplotype 2 (Figure 6B).

Discussion
Soybean is a significant economic crop and a primary source of vegetable oil [47].Previous studies have found that many QTN loci are significantly associated with oil Plants 2024, 13, 1351 8 of 14 content.In this study, we performed genome sequencing on 227 soybean germplasm samples, and a total of 23,131 high-quality markers were identified.Of those, 44 SNPs were found to have significant association with oil content.A total of six, four, and 34 significant QTN loci were identified in Xiangyang, Hulan, and Acheng, respectively.In recent years, the combined analysis of GWAS and WGCNA has been utilized to identify some novel genes.For example, four candidate genes were identified through the integration of GWAS and WGCNA [48].This study identified four candidate genes (Glyma.18G300100,Glyma.11G221100,Glyma.13G343300, and Glyma.02G166100)associated with oil content through the integration of the GWAS and WGCNA methods.
The GWAS is extensively employed for the analysis of the genetic basis of complex traits, and the identification of numerous candidate genes associated with the regulation of controlling target traits has been accomplished.Previous researchers have conducted GWAS analysis on unsaturated fatty acids (FA), utilizing a panel of 30,000 SNPs, and found nine, five, and five QTNs associated with the levels of linoleic acid (LLA), linolenic acid (LNA), and oleic acid (OA), respectively [49].A GWAS was conducted to analyze the agronomic traits (plant height, number of nodes on main stem, branch number, stem diameter, and 100-seed weight) of 133 soybean germplasms, where a total of 59 SNPs were detected in at least two environments, and 15 candidate genes were further identified [50].Previous studies conducted GWAS analysis on the 100-seed weight of 185 soybean varieties, where a total of 31 significant QTNs were identified, and further screening revealed 237 candidate genes related to 100-seed weight [51].To evaluate the accuracy of SNPs in this study, the SNPs identified in this study were compared with previously published QTLs/SNPs.In this study, 44 QTNs related to oil content were identified in three environments in 2019.Meanwhile, the Acheng region exhibits the most significant loci, involving 34 loci.The higher number of significant QTL loci in the Acheng region compared to the other two regions may potentially be attributed to environmental factors.Out of these 44 QTNs, 26 QTNs were found to overlap with or be in close proximity to the previously identified oil content QTL.Two QTNs (rs13422 of Chr.13 and rs10231 of Chr.10) were significantly associated with oil content, and the association between locus rs13422 and oil content has been previously reported.Meanwhile, a total of 18 novel QTNs associated with oil content were identified.
Although GWAS analysis can strongly identify significant SNP-trait relationships, it may not accurately determine candidate genes.Therefore, the integrated GWAS and WGCNA joint analysis strategies can enhance the identification of candidate genes.Azam et al. (2023) used a combination of GWAS and WGCNA methodologies to identify four hub genes (Glyma.11G108100,Glyma.11G107100,Glyma.11G106900, and Glyma.11G109100)involved in TIF accumulation in soybean [52].Li et al. (2021) used a combination of GWAS and WGCNA methodologies to identify eight candidate genes regulating root growth in rapeseed [53].In this study, four candidate genes (Glyma.18G300100,Glyma.11G221100,Glyma.13G343300, and Glyma.02G166100)involved in oil synthesis were identified through the integrated GWAS and WGCNA analysis.The Glyma.13G343300 gene encoded the E3 ubiquitin ligase protein, which is homologous to Arabidopsis AT2G31510, and was shown to be involved in promoting seedling oleosin degradation and lipid droplet mobilization [54].The remaining three genes may be novel genes regulating oil synthesis.The Glyma.11G221100 gene encoded phosphoribosylformylglycinamidine, according to reports that the gene can promote the expression of flower organs [55].The Glyma.02G166100 gene encoded an unknown protein.In addition, the present study provides SNP markers for the purpose of soybean oil synthesis breeding.In this study, the Glyma.18G300100gene was identified to possess two haplotypes in the exonic and upstream regions, including Glyma.18G300100Hap1 and Hap2.The oil content of haplotype 2 exhibits a significantly higher amount than that of haplotype 1, and the result shows that Glyma.18G300100genes beneficial to haplotypes might be valuable for molecular assistant selection (MAS) of the oil content of soybean.Meanwhile, the expression level of the Glyma.18G300100gene was found to be significantly higher in high oil soybean materials compared to low oil soybean materials.The reason for the variation of Glyma.18G300100expression between different materials remains to be further explored.
We conducted GWAS analysis on the oil content of 227 soybean seeds and further found that 44 significant SNPs loci were detected.Through the integrated GWAS and WGCNA analysis, we identified four potential candidate genes (Glyma.18G300100,Glyma.11G221100,Glyma.13G343300, and Glyma.02G166100).The haplotype analysis of candidate genes further revealed that Glyma.18G300100 was divided into two haplotypes, and the oil content of haplotype 2 was significantly higher than that of haplotype 1.Therefore, variations in the exonic and upstream regions of Glyma.18G300100 can help us provide a basis for MAS of soybean oil content.

Plant Materials
This study used 227 soybean germplasm resources as experimental materials (Table S3).All materials were planted in three locations in Harbin, including Xiangyang, Hulan, and Acheng (45.80 • N, 126.53 • E) in 2019.We used a single-row plot (3 m long, 0.65 m between rows, 34 plants per row) and repeated three times for each location.After the soybeans had fully matured, we randomly selected 10 mature soybean plants from each row at each location.We put the mature soybean seeds into the sample tank, and seed oil content was quantified using the Infratec 1241 NIR Grain Analyzer (FOSS, Hoganas, Sweden).

DNA Isolation and SNP Genotyping Data Collection
Genomic DNA was extracted from the soybean samples using the CTAB method, and the quality of extraction was assessed [56].The isolated high-quality DNA was performed using specific-site amplification fragment sequencing (SLAF-seq) [57].The restriction endonucleases MseI and HaeIII (Thermo Fisher Scientific Inc., Waltham, MA, USA) were selected to generate a minimum of 50,000 sequencing tags per tested sample, ranging in length from approximately 300 bp to 500 bp.The acquired tags were uniformly distributed across the distinct genomic regions of all 20 soybean chromosomes.The sequencing libraries of each tested samples were performed based on the sequencing tags.The Illumina Genome Analyzer II system (Illumina Inc., San Diego, CA, USA) was employed in conjunction with a barcode method to generate 45 bp sequence reads at both ends of the sequencing tags from each accession library.The Short Oligonucleotide Alignment Program 2 (SOAP2) software was employed for aligning the raw paired-end reads to the reference genome of soybean (Glycine max Wm82.a2.v1) [58].The SAMtools48 (Version: 0.1.18)software was utilized for the conversion of mapping results into BAM format, facilitating the efficient filtration of unmapped and non-unique reads [59,60].Quality control of genotype data was performed using PLINK 1.9 software (--maf 0.05 --geno 0.1) (http://pngu.mgh.harvard.edu/purcell/plink/) (accessed on 21 November 2023).
For twenty lines, the genome resequencing was performed on an Illumina HiSeq 2000 sequencer, generating paired-end reads with a depth of 10-fold.These reads were then aligned to the soybean Williams 82 reference genome (Glyma.Wm82.a2) using BWA [59].The SAMtools48 software was utilized to convert the mapping results into the BAM format and perform filtration of unmapped and non-unique reads [60].The Picard package (https://sourceforge.net/p/picard/wiki/Main_Page/) (accessed on 21 November 2023) was utilized to eliminate duplicated reads.The BEDtools coverageBed program was utilized to calculate the sequence alignment coverage [61].A sequence was considered absent if the coverage was below 90%, and present if it exceeded 90%.SNP detection was carried out using the Genome Analysis Toolkit and SAMtools [60,62].The SNP annotation was conducted based on the soybean genome through the ANNOVAR package [63].

Population Structure Evaluation, Linkage Disequilibrium (LD) and Genome-Wide Association Study (GWAS)
The soybean oil content association signals were found based on 23,150 SNPs from 227 soybean germplasm resources with the compressed mixed linear model (CMLM) model of the R (version 4.2.3)software GAPIT package [64], using the first three PCA (principal components analysis) as covariates for association analysis to reduce false positives.Significant SNP markers were selected based on the threshold horizontal line with −log (p) > 4 as a significant correlation.A total of 23,150 high-quality SNPs (MAF > 0.05, missing data < 10%) were selected.The Manhattan plots depicting the oil contents for each of the three environments were established using GAPIT [64].An LDdecay diagram was established through PopLDdecay [65].The R 2 value of LD was calculated using Tassel software [66].The decay point is determined by taking half of the maximum LD value.The SoyBase database (http://www.soybase.org/)(accessed on 23 November 2023) was utilized for the prediction and annotation of candidate genes.The GO (http://www.geneontology.org/)(accessed on 23 November 2023) enrichment analysis was conducted based on the SoyBase database.The KEGG (https://www.kegg.jp/)(accessed on 23 November 2023) database was utilized for conducting pathway enrichment analysis of candidate genes.

Weighted Gene Co-Expression Network Analysis (WGCNA)
WGCNA analysis was conducted using transcriptome data from 30 soybean varieties (including 15 extremely high oil and 15 extremely low oil soybean varieties), which were obtained from Zhao et al. (2023) [67].Total RNA was extracted from the R6 stage of soybean development using TRIzol reagent (Invitrogen).The purity and concentration of RNA samples were assessed, followed by the construction of the library.The construction of the cDNA library was performed utilizing the Illumina HiSeq sequencing platform.The highquality readings were aligned to the reference genomes (Glycine max Wm82.a2.v1) using the Hisat2 v2.0.5 software.Using the R software WGCNA package to construct a weighted gene co-expression network, the gene expression profile matrix was derived from the gene expression levels of all samples [68].Firstly, clustering analysis was performed subsequent to the exclusion of samples exhibiting low correlation or those that were unable to be grouped on the dendrogram.The pick Soft Threshold function within the WGCNA package was utilized to calculate the Soft Threshold, ensuring compliance with the prerequisite of scale-free network distribution.The value of the Threshold parameter β was selected at the point where the fitting curve first approached 0.9.Subsequently, the correlation-based association between phenotype and gene modules was performed to generate an adjacency matrix based on the β value.Further transforming the adjacency matrix into a topological overlap matrix (TOM), a gene connectivity network was constructed.Finally, the gene modules were generated and clustered using the dynamic tree cut method, which is based on the eigengenes (ME) of each module.The co-expression networks were visualized using the Cytoscape 3.10.1 package.

Prediction of Candidate Genes Controlling Oil Content
The genes in the upstream and downstream 100 kb genomic regions of each significant SNP were selected as candidate genes.The identified variations in the exonic, 5 ′ UTR, and 3 ′ UTR regions of 10 high oil content soybean materials' and 10 low oil content soybean materials' candidate genes from genomic resequencing data were obtained.The gene-based association analysis was performed using the General Linear Model (GLM) method to determine SNPs or haplotypes associated with oil content by TASSEL 5.0 software [66].SNPs with threshold −log10 (p) ≥ 2.5 were set as having a significant association.

Quantitative Real-Time PCR
The expression levels of candidate genes were analyzed by quantitative real-time PCR.Total RNA extracted from soybean developmental seeds was obtained using TRIzol reagent

Figure 1 .
Figure 1.Frequency distribution of oil content in the three environments.

Figure 1 .
Figure 1.Frequency distribution of oil content in the three environments.

Figure 2 .
Figure 2. Manhattan plot and QQ plot of association mapping of oil content in soybean.(A) Xiangyang in 2019.(B) Hulan in 2019.(C) Acheng in 2019.The black line on each subgraph indicates the log10 (p value) significance threshold.

Figure 2 .
Figure 2. Manhattan plot and QQ plot of association mapping of oil content in soybean.(A) Xiangyang in 2019.(B) Hulan in 2019.(C) Acheng in 2019.The black line on each subgraph indicates the log10 (p value) significance threshold.

Figure 3 . 14 Figure 3 .
Figure 3. Weighted gene co-expression network analysis.(A) Clustering dendrogram of genes and construction of modules.(B) Phenotype and module correlation analysis heat map.

Figure 4 .
Figure 4. KEGG enrichment and GO annotations of the MEblack module.(A) KEGG enrichment of the MEblack module.(B) GO annotations of the MEblack module.

Figure 4 .
Figure 4. KEGG enrichment and GO annotations of the MEblack module.(A) KEGG enrichment of the MEblack module.(B) GO annotations of the MEblack module.

Figure 5 .
Figure 5. Co-expression network analysis of candidate genes in the MEblack module.

Figure 5 .
Figure 5. Co-expression network analysis of candidate genes in the MEblack module.

Figure 5 .
Figure 5. Co-expression network analysis of candidate genes in the MEblack module.

Figure 6 .
Figure 6.Haplotypes analysis of genes with variations related to oil content.(A) Statistical analysis of locus variations in two haplotypes of Glyma.18G300100.(B) Comparison of oil content between two different haplotypes in 20 soybean germplasms.** indicates significance at p < 0.01.

Table 1 .
Statistical analysis of oil content in soybean.

Table 1 .
Statistical analysis of oil content in soybean.

Table 2 .
Single nucleotide polymorphisms (SNPs) associated with oil content of soybean and known QTLs overlapping with peak SNP.

Table 2 .
Single nucleotide polymorphisms (SNPs) associated with oil content of soybean and known QTLs overlapping with peak SNP.