Genome-Wide Association Study Reveals Key Genetic Loci Controlling Oil Content in Soybean Seeds
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript presents a well-structured and comprehensive GWAS study focused on the identification of genomic loci associated with soybean seed oil content. The study uses advanced DNA sequencing, solid statistical methods, and computer analysis to identify important SNPs and potential genes, especially on chromosome 8, that could explain differences in oil content. The manuscript is of high scientific quality and relevance, especially for researchers and breeders focused on improving soybean seed quality.
However, some clarifications, expansions, and refinements are necessary to strengthen the manuscript, particularly in the areas of biological interpretation, methodological transparency, and scientific writing.
Comments:
Abstract
Line 25: Specify the number of replications and environments used in phenotyping.
Lines 33–36: Clarify how candidate genes were prioritized and what evidence supports their functional relevance.
Line 38: Replace vague expression with 'molecular markers with potential for marker-assisted selection (MAS).'
Introduction
Line 56: Include recent GWAS studies in soybean with diverse panels.
Line 67: Clarify the relevance of rice QTLs to soybean; otherwise, remove.
Line 84: Add discussion of transcriptional or epigenetic regulation of oil biosynthesis.
Materials and Methods
Lines 98–100: Clarify the selection criteria for the 341 accessions.
Line 111: Indicate in-row plant spacing.
Lines 119–122: Clarify if NIRS results were validated against Soxhlet extraction.
Line 141: Explain the genome-wide significance threshold and correction method.
Results
Line 170: Indicate number of accessions per year.
Lines 214–216: Avoid use of 'penetrance'; clarify proportion with increased oil.
Line 247: Include DeltaK or cross-validation plot to support fastSTRUCTURE results.
Lines 283–285: Discuss possible haplotypes/epistasis for SNPs with opposing effects.
Lines 313–328: Use tools like SIFT or PROVEAN to support the impact of amino acid substitutions.
Discussion
Lines 344–347: Expand limitations and mention alternative models like FarmCPU.
Line 367: Provide functional annotation of Glyma.08G123500.
Lines 378–381: Recommend experimental validation of candidate genes.
Conclusion
Line 402: Emphasize applicability of SNPs for genomic selection.
Additional
- Language: Suggest review by a native English speaker.
- Figures: In Figure 2, indicate the number of accessions per region.
- References: Ensure consistency and avoid duplication (e.g., references [7] and [51]).
Comments on the Quality of English Language
I strongly recommend that the authors seek assistance from a native English speaker or a professional editing service to refine the language and improve the overall readability and grammatical accuracy of the manuscript.
Author Response
Response to Reviewer 1
We are grateful to Reviewer 1 for the thorough and constructive feedback that has helped improve the scientific rigor and clarity of our manuscript.
Abstract
Reviewer Comment 1 (Line 25): "Specify the number of replications and environments used in phenotyping."
Response: We have revised the abstract to specify that the study was conducted over two growing seasons (2021 and 2023) with three replications per environment.
Reviewer Comment 2 (Lines 33–36): "Clarify how candidate genes were prioritized and what evidence supports their functional relevance."
Response: We have clarified that candidate genes were prioritized based on proximity to significant SNPs (≤10 kb) and added comprehensive functional analysis of these genes, including GO term annotations and expression profiles.
Reviewer Comment 3 (Line 38): "Replace vague expression with 'molecular markers with potential for marker-assisted selection (MAS).'"
Response: The text has been revised as suggested to specify "molecular markers with potential for marker-assisted selection (MAS)."
Introduction
Reviewer Comment 4 (Line 56): "Include recent GWAS studies in soybean with diverse panels."
Response: We have added references to recent GWAS studies that utilized diverse soybean panels for oil content analysis, providing better context for our work.
Reviewer Comment 5 (Line 67): "Clarify the relevance of rice QTLs to soybean; otherwise, remove."
Response: We have removed the rice QTL reference as it was not directly relevant to the soybean study and could cause confusion for readers.
Reviewer Comment 6 (Line 84): "Add discussion of transcriptional or epigenetic regulation of oil biosynthesis."
Response: We have expanded this section to include discussion of transcriptional regulation by MYB and AP2/ERF transcription factors, as well as epigenetic modifications that influence fatty acid biosynthesis genes.
Materials and Methods
Reviewer Comment 7 (Lines 98–100): "Clarify the selection criteria for the 341 accessions."
Response: We have clarified that accessions were selected based on: (1) geographic representation across northern China, (2) genetic diversity assessment using preliminary SNP data, (3) availability of seed materials, and (4) known variation in oil content from previous studies.
Reviewer Comment 8 (Line 111): "Indicate in-row plant spacing."
Response: In-row plant spacing information has been added to the experimental design section.
Reviewer Comment 9 (Lines 119–122): "Clarify if NIRS results were validated against Soxhlet extraction."
Response: Yes, NIRS calibration was validated using Soxhlet extraction on a subset of 30 samples (R² = 0.94, RMSE = 0.42%). This validation information has been added to the methods section.
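As an illustration of how validation statistics of this kind can be computed, a minimal Python sketch follows. It assumes that R² is the coefficient of determination of the NIRS predictions against the Soxhlet reference values (the response does not say whether R² is defined this way or as a squared correlation). The function name and the four data points are hypothetical and are not the authors' data.

```python
import math


def r2_and_rmse(reference, predicted):
    """Return (R^2, RMSE) for paired reference and predicted values.

    R^2 is computed as 1 - SS_res / SS_tot (coefficient of determination),
    which is an assumption about the definition used.
    """
    if len(reference) != len(predicted) or len(reference) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(reference)
    mean_ref = sum(reference) / n
    # Residual sum of squares between reference and prediction.
    ss_res = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    # Total sum of squares of the reference values around their mean.
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r2, rmse


# Hypothetical oil content values (%), for illustration only.
soxhlet = [18.0, 19.0, 20.0, 21.0]
nirs = [18.5, 18.5, 20.5, 20.5]
r2, rmse = r2_and_rmse(soxhlet, nirs)
```

RMSE is returned in the same unit as the inputs, here percentage points of oil content.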
Reviewer Comment 10 (Line 141): "Explain the genome-wide significance threshold and correction method."
Response: We have clarified that the genome-wide significance threshold was set at -log₁₀(P) ≥ 6.0 (P ≤ 1×10⁻⁶) using Bonferroni correction for multiple testing, accounting for approximately 1 million independent tests.
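For illustration, the relationship between the number of tests and a Bonferroni-style threshold can be sketched in Python as follows. The function name and the two alpha values are assumptions chosen for this sketch; the response does not state which family-wise alpha was used. With about one million tests, alpha = 1 (that is, 1/n) gives the reported P ≤ 1×10⁻⁶, whereas alpha = 0.05 would give 5×10⁻⁸.

```python
import math


def bonferroni_threshold(n_tests, alpha):
    """Return (p_threshold, -log10(p_threshold)) for alpha / n_tests."""
    if n_tests <= 0 or alpha <= 0:
        raise ValueError("n_tests and alpha must be positive")
    p = alpha / n_tests
    return p, -math.log10(p)


n_tests = 1_000_000  # approximate number of independent tests, as stated in the response

# Assumption: alpha = 1.0 reproduces a 1/n threshold.
p_one, logp_one = bonferroni_threshold(n_tests, alpha=1.0)

# Assumption: alpha = 0.05 is the conventional family-wise error rate.
p_005, logp_005 = bonferroni_threshold(n_tests, alpha=0.05)
```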
Results
Reviewer Comment 11 (Line 170): "Indicate number of accessions per year."
Response: We have specified that all 341 accessions were evaluated in both years (2021 and 2023).
Reviewer Comment 12 (Lines 214–216): "Avoid use of 'penetrance'; clarify proportion with increased oil."
Response: We have replaced "penetrance" with clearer terminology and specified that 78% of accessions carrying the favorable allele showed oil content above the population mean.
Reviewer Comment 13 (Line 247): "Include DeltaK or cross-validation plot to support fastSTRUCTURE results."
Response: We have added the cross-validation plot as Supplementary Figure S1 to support our population structure analysis.
Reviewer Comment 14 (Lines 283–285): "Discuss possible haplotypes/epistasis for SNPs with opposing effects."
Response: We have added discussion suggesting that opposing effects may result from different haplotype backgrounds or epistatic interactions and recommend future haplotype-based analysis.
Reviewer Comment 15 (Lines 313–328): "Use tools like SIFT or PROVEAN to support the impact of amino acid substitutions."
Response: We have conducted comprehensive PROVEAN analysis for all identified amino acid substitutions across the four key genes:
- Glyma.08G110000: T247P substitution (PROVEAN score: 0.031) - neutral/tolerated
- Glyma.08G117400: Multiple variants with PROVEAN score of 1.450 - neutral/tolerated
- Glyma.08G117600: A233V substitution (PROVEAN score: 1.233) - neutral/tolerated
- Glyma.08G123500: Multiple variants ranging from deleterious (-1.633) to neutral (0.933), indicating complex functional impacts
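The classification of PROVEAN scores depends on the cutoff applied. A minimal Python sketch follows, assuming the commonly cited default cutoff of -2.5, where scores at or below the cutoff are called deleterious and scores above it neutral. The function name is hypothetical, and the response does not state which cutoff was used. Labelling -1.633 as deleterious, as in the list above, would require a less stringent cutoff than -2.5.

```python
def classify_provean(score, cutoff=-2.5):
    """Label a PROVEAN score as 'deleterious' (<= cutoff) or 'neutral' (> cutoff).

    The default cutoff of -2.5 is an assumption; pass another value to
    see how the call changes.
    """
    return "deleterious" if score <= cutoff else "neutral"


# Scores quoted in the response above.
scores = {
    "Glyma.08G110000 T247P": 0.031,
    "Glyma.08G117400 variants": 1.450,
    "Glyma.08G117600 A233V": 1.233,
    "Glyma.08G123500 lowest": -1.633,
    "Glyma.08G123500 highest": 0.933,
}

calls_default = {name: classify_provean(s) for name, s in scores.items()}
# The same scores under a hypothetical, less stringent cutoff of -1.5.
calls_relaxed = {name: classify_provean(s, cutoff=-1.5) for name, s in scores.items()}
```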
Discussion
Reviewer Comment 16 (Lines 344–347): "Expand limitations and mention alternative models like FarmCPU."
Response: We have expanded the limitations section to acknowledge (1) population structure bias, (2) environmental specificity, and (3) lack of functional validation, and we note that future studies could benefit from alternative methods such as FarmCPU or 3VmrMLM.
Reviewer Comment 17 (Line 367): "Provide functional annotation of Glyma.08G123500."
Response: We have provided comprehensive functional annotation for Glyma.08G123500: It encodes a receptor-like kinase containing NB-ARC, protein tyrosine kinase, and leucine-rich repeat domains. The gene shows protein kinase activity (GO:0004672), ATP binding (GO:0005524), and is involved in protein phosphorylation (GO:0006468). Expression analysis reveals highest activity in roots (7.45 FPKM) and nodules (5.33 FPKM), consistent with roles in nutrient sensing, symbiotic signaling, and stress responses.
Reviewer Comment 18 (Lines 378–381): "Recommend experimental validation of candidate genes."
Response: We have added recommendations for future functional validation, including gene expression analysis across developmental stages, CRISPR/Cas9 knockout studies, and complementation analysis.
Conclusion
Reviewer Comment 19 (Line 402): "Emphasize applicability of SNPs for genomic selection."
Response: We have emphasized that the identified SNPs provide valuable molecular markers for genomic selection models and can be integrated into breeding programs for accelerated genetic gain.
Additional Improvements Made
- Language: The manuscript has been revised to improve clarity and grammatical accuracy.
- Figures: Figure 2 has been updated to include the number of accessions per geographic region.
- References: We have checked for consistency and removed the duplication between references [7] and [51].
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript presents a genome-wide association study (GWAS) conducted on a panel of 341 soybean accessions, mainly from Northern China, assessed across two growing seasons. The authors identify 119 significant SNP loci associated with oil content, with a particular focus on chromosome 8. The study aims to contribute to marker-assisted selection strategies in soybean breeding programs. The study relies on a large number of high-quality SNPs (over 1 million) and two years of replicated phenotypic data, strengthening statistical power; the use of an MLM to incorporate population structure and kinship is appropriate and widely accepted in GWAS analyses. The convergence of significant associations on chromosome 8 across years and environments adds reliability to the findings.
However, the primary QTLs and genes identified (especially on chromosome 8) overlap significantly with previously published studies, which weakens the study’s originality. In addition, the candidate genes identified were not supported by experimental data (e.g., gene expression analysis, mutant lines, or transgenic validation). Thus, the authors imply functional roles for candidate genes based solely on SNP effects, without clear functional annotation or expression evidence. Furthermore, the manuscript does not sufficiently acknowledge the constraints of GWAS-only studies or the potential false positives inherent to association studies.
I prepared specific comments about the document:
- I missed the novelty of the study. I suggest that the authors explicitly differentiate what is being newly discovered versus what confirms existing QTLs (e.g., Seed oil 11-g2). If Glyma.08G123500 has been previously reported, explain how this study adds new insight (e.g., new SNPs, broader population context).
- Avoid definitive language when suggesting functional roles for genes unless supported by additional evidence. Phrases such as “may be involved” or “potentially associated” are more appropriate.
- Enhance the discussion by including a section on the limitations of the study, such as the lack of expression validation, the limited transferability of results to other soybean populations or environments, and the potential bias due to the geographic concentration of samples.
- Incorporate functional annotation, providing GO terms, pathway involvement, or gene expression evidence (from public datasets) to better support the functional relevance of the candidate genes.
- I would like to suggest to the authors to consider breaking the discussion into thematic subsections for clarity (e.g., “Validation of GWAS Loci,” “Candidate Gene Function,” “Implications for Breeding”). I felt this section was overly dense.
Minor suggestions:
- Standardize gene nomenclature and ensure that all gene names follow accepted conventions.
- Double-check the consistency of units (e.g., percentage values) and formatting of figures/tables.
Author Response
Response to Reviewer 2
We appreciate Reviewer 2's detailed and critical assessment, which has helped us strengthen the manuscript's scientific contribution and presentation.
Reviewer Comment: "The manuscript presents a genome-wide association study (GWAS) conducted on a panel of 341 soybean accessions, mainly from Northern China, assessed across two growing seasons. The authors identify 119 significant SNP loci associated with oil content, with a particular focus on chromosome 8. The study aims to contribute to marker-assisted selection strategies in soybean breeding programs. The study relies on a large number of high-quality SNPs (over 1 million) and two years of replicated phenotypic data, strengthening statistical power; the use of an MLM to incorporate population structure and kinship is appropriate and widely accepted in GWAS analyses. The convergence of significant associations on chromosome 8 across years and environments adds reliability to the findings. However, the primary QTLs and genes identified (especially on chromosome 8) overlap significantly with previously published studies, which weakens the study's originality. In addition, the candidate genes identified were not supported by experimental data (e.g., gene expression analysis, mutant lines, or transgenic validation). Thus, the authors imply functional roles for candidate genes based solely on SNP effects, without clear functional annotation or expression evidence. Furthermore, the manuscript does not sufficiently acknowledge the constraints of GWAS-only studies or the potential false positives inherent to association studies."
Response: We acknowledge these important concerns and have made substantial revisions to address them comprehensively.
Major Comments
Reviewer Comment 1: "I missed the novelty of the study. I suggest to the author explicitly differentiate what is being newly discovered versus what confirms existing QTLs (e.g., Seed oil 11-g2). If Glyma.08G123500 has been previously reported, explain how this study adds new insight (e.g., new SNPs, broader population context)."
Response: We have added a dedicated paragraph distinguishing our novel contributions:
- Identification of 15+ novel variants in Glyma.08G123500 alone, with comprehensive PROVEAN analysis revealing both deleterious and neutral variants not previously characterized
- Integrated analysis of genetic variation, functional annotation, and tissue-specific expression for four key chromosome 8 genes
- Novel identification of Glyma.08G123500 as a receptor-like kinase potentially regulating oil content through signal transduction pathways
- Population-specific analysis providing new insights for regional breeding programs
Reviewer Comment 2: "Avoid definitive language when suggesting functional roles for genes unless supported by additional evidence. Phrases such as 'may be involved' or 'potentially associated' are more appropriate."
Response: We have revised all definitive statements throughout the manuscript to use appropriate tentative language such as "may be involved," "potentially associated," "suggests a possible role," and "likely contributes to" when discussing gene functions without experimental validation.
Reviewer Comment 3: "Enhance the discussion including a section on the limitations of the study, such as lack of expression validation, the limited transferability of results to other soybean populations or environments, and the potential bias due to the geographic concentration of samples."
Response: We have added a comprehensive limitations section discussing:
- Lack of experimental expression validation and functional studies
- Geographic specificity limiting transferability to other soybean populations
- Potential population stratification bias due to regional sampling
- Environmental specificity of associations
- Need for functional validation through gene knockout/overexpression studies
- Potential false positives inherent in association studies
Reviewer Comment 4: "Incorporate functional annotation, providing GO terms, pathway involvement, or gene expression evidence (from public datasets) to better support the functional relevance of the candidate genes."
Response: We have incorporated comprehensive functional annotations for all four key candidate genes on chromosome 8:
- Glyma.08G110000: Hydroxycinnamoyl coenzyme A-quinate transferase (GO:0016747) involved in secondary metabolism and phenylpropanoid pathway, with highest expression in reproductive tissues (Shoot Apical Meristem: 65.95 FPKM, Flower: 42.85 FPKM)
- Glyma.08G117400: PPR repeat protein (GO:0005515) involved in RNA metabolism and chloroplast/mitochondrial function, showing broad but moderate expression across tissues, potentially affecting lipid biosynthesis in organelles
- Glyma.08G117600: WD40 repeat scaffolding protein (GO:0005515) with high expression in seeds (4.04 FPKM) and pods (3.54 FPKM), indicating roles in reproductive development and potentially seed oil accumulation
- Glyma.08G123500: Receptor-like kinase with protein kinase activity (GO:0004672), ATP binding (GO:0005524), and involvement in protein phosphorylation (GO:0006468). Highly expressed in roots (7.45 FPKM) and nodules (5.33 FPKM), suggesting roles in nutrient uptake and symbiotic relationships that may indirectly affect oil content
Reviewer Comment 5: "I would like to suggest to the authors to consider breaking the discussion into thematic subsections for clarity (e.g., 'Validation of GWAS Loci,' 'Candidate Gene Function,' 'Implications for Breeding'). I felt this section was overly dense."
Response: We have reorganized the discussion into clear thematic subsections:
- "Validation of GWAS Loci and Comparison with Previous Studies"
- "Candidate Gene Analysis and Functional Annotation"
- "Breeding Implications and Marker-Assisted Selection"
- "Study Limitations and Future Directions"
Minor Suggestions
Reviewer Comment: "Standardize gene nomenclature and ensure that all gene names follow accepted conventions."
Our Response: All gene names have been standardized according to soybean genome nomenclature guidelines.
Reviewer Comment: "Double-check the consistency of units (e.g., percentage values) and formatting of figures/tables."
Our Response: We have standardized all percentage values throughout the manuscript and checked figure/table formatting for consistency. All oil content values are now consistently reported as percentages with two decimal places.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors, your research presents sufficient information about the oil content of 341 soybean accessions in different growing regions of North and Northwest China over two growing seasons, 2021 and 2023. Phenotypic analysis indicated variation in oil content, ranging from 11.00% to 21.77%, across the growing regions and between the two growing seasons. Here is my first comment. You did not describe the climate during the growing years. Were the two growing years similar in climate conditions, or were they different? What were the monthly precipitation amounts compared to the norms for the different locations? Because you detected significant differences in oil content across the three treatment groups (2021, 2023, and Average), you should present data on the climate conditions in the two growing seasons. I am also wondering why the two growing years are not consecutive, with a year missing between them. Your findings indicate significant molecular markers for soybean breeding programs related to oil content, and your article may be of great interest to readers. However, I insist that you augment the article with the aforementioned data from the two years of research and address the other notes mentioned here.
In the aim of the study on page 2, line 88, replace "multiple years" with "a two-year period" or "two growing seasons", because you have only two growing seasons, not more.
On page 11, line 304, the table is 1, not 2.
On page 13, line 331, the table is 2, not 3.
Please present the required information about the climate during the two growing years and the monthly precipitation compared to the norms for the different locations, and also correct the other noted omissions.
Author Response
Response to Reviewer 3
We thank Reviewer 3 for highlighting important environmental considerations that have improved the manuscript's context and completeness.
Reviewer Comment: "Dear authors, your research presented sufficient information about the oil content of 341 accessions of soybean in different growing regions of North and Northwest China, in two growing seasons in years 2021 and 2023. Phenotypic analysis indicated a variation in oil content, ranging from 11.00% to 21.77%, in the growing regions and between the two growing seasons."
Response: We appreciate this positive assessment of our phenotypic analysis and the recognition of the substantial variation observed in our study population.
Reviewer Comment 1: "You did not describe anything about the climate during growing years. Are the two growing years similar in the climate conditions or the two years are different. What about monthly precipitation amounts compared to norms for the different locations. I suppose because you detected significant differences in oil content across the three treatment groups (2021, 2023, and Average), you have to present data about the climate conditions in the two growing seasons."
Response: We have added a comprehensive description of climatic conditions for both growing seasons. Monthly precipitation and temperature data for all locations have been added to Supplementary Table S1.
Reviewer Comment 2: "Also I am wondering why the two growing years are not consecutive but they are missing a year."
Response: No planting occurred in 2022 because excessive rainfall and flooding during the critical planting season prevented seed production and phenotyping activities across our experimental sites.
Reviewer Comment 3: "In the aim of the study on page 2 line 88 you have to replace multiple years with a two year period or two growing seasons because you have only two growing seasons not more/multiple."
Response: Line 88 has been corrected to read "two growing seasons" instead of "multiple years" to accurately reflect our experimental design.
Reviewer Comment 4: "On page 11 line 304 the table is 1 not 2."
Response: Line 304 table reference has been corrected from "Table 2" to "Table 1."
Reviewer Comment 5: "Page 13 line 331 table is 2 not 3."
Response: Line 331 table reference has been corrected from "Table 3" to "Table 2."
Final Comment: "Your findings indicate significant molecular markers for breeding programs of soybean related to the oil content and maybe your article will be of great interest to the readers."
Response: We appreciate the reviewer's encouraging assessment. We believe that our identified SNP markers and candidate genes will indeed provide valuable resources for soybean breeding programs focused on improving oil content, particularly for northern China germplasm.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
After reviewing the revised version of the manuscript, I acknowledge that the manuscript has been improved compared to the original submission. The authors have made commendable efforts to address the reviewers' comments and enhance the scientific clarity of their work.
Among the changes, the revised version explicitly distinguishes the new contributions of this study, highlighting the identification of 15+ previously uncharacterized SNP variants in Glyma.08G123500 and the integration of PROVEAN analysis, functional annotation, and population-specific insights for regional breeding. A new dedicated section acknowledges major limitations, including the absence of expression validation, geographic restriction of samples, environmental specificity of associations, and inherent false positives in GWAS. GO terms, pathway information, and public expression datasets were incorporated to strengthen the biological interpretation of candidate genes. The discussion is now organized into thematic subsections, improving clarity and flow. However, I am still concerned about some parts of the text, which I believe should be improved or better explored:
- The revised text acknowledges that many significant loci overlap with previously reported QTLs (e.g., Seed oil 11-g2). New SNP variants are described, but the distinction between “confirmatory findings” and “novel contributions” remains somewhat limited.
- A more quantitative comparison with prior GWAS studies is needed (e.g., percentage of overlap, novelty of alleles/haplotypes, differences in population structure or environmental effects) to convincingly demonstrate the added value of this study.
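One of the quantities requested above, the percentage of significant SNPs that fall inside previously reported QTL intervals, can be sketched in Python as follows. This is a minimal illustration only: the function name, chromosome labels, positions, and interval boundaries are all hypothetical and are not taken from the manuscript or from any published QTL.

```python
def percent_overlap(snps, qtl_intervals):
    """Percentage of SNPs located inside at least one QTL interval.

    snps: list of (chromosome, position) tuples.
    qtl_intervals: list of (chromosome, start, end) tuples, ends inclusive.
    """
    if not snps:
        return 0.0
    hits = sum(
        any(chrom == q_chrom and start <= pos <= end
            for q_chrom, start, end in qtl_intervals)
        for chrom, pos in snps
    )
    return 100.0 * hits / len(snps)


# Hypothetical significant SNPs and one hypothetical reported interval.
snps = [
    ("Chr08", 8_500_000),
    ("Chr08", 9_200_000),
    ("Chr05", 1_000_000),
    ("Chr13", 30_000_000),
]
known_qtls = [("Chr08", 8_000_000, 9_500_000)]

overlap = percent_overlap(snps, known_qtls)
```

SNPs outside every interval would then be candidates for the "novel" category, subject to the interval definitions chosen.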
Author Response
Comment 1: The revised text acknowledges that many significant loci overlap with previously reported QTLs (e.g., Seed oil 11-g2). New SNP variants are described, but the distinction between “confirmatory findings” and “novel contributions” remains somewhat limited.
Response: We have added a comprehensive quantitative comparison (Section 4.4) that clearly delineates our novel contributions from confirmatory findings.
Comment 2: A more quantitative comparison with prior GWAS studies is needed (e.g., percentage of overlap, novelty of alleles/haplotypes, differences in population structure or environmental effects) to convincingly demonstrate the added value of this study.
Response: We have incorporated comprehensive quantitative comparisons across multiple dimensions: