Abstract
Soybean (Glycine max (L.) Merr.) is rich in proteins, fats, and other nutrients, and the genetic improvement of soybean protein content has long been a key research focus in breeding programs. Using genotypic and phenotypic information, this study screened target lines from a chromosome segment substitution line (CSSL) population to establish an initial mapping population for soybean seed protein content. Through single-marker analysis, a quantitative trait locus (QTL) was mapped to the region between 26,705,080 bp and 33,180,908 bp on chromosome 16 and designated qSPJ_1. A secondary segregating population was then constructed from the initial mapping results for fine mapping, which narrowed the interval to 0.076 Mb. A total of 9 candidate genes were identified within this interval. By comparing amino acid and promoter sequences between the two parents, performing quantitative real-time PCR (qRT-PCR) analysis, and conducting haplotype analysis, Glyma.16G165100 was preliminarily predicted as a candidate gene affecting soybean seed protein content. The single nucleotide polymorphism (SNP) variation sites in its promoter region were significantly associated with the variation in protein content in the resource population. This study provides important theoretical guidance for dissecting the genetic mechanism of soybean seed protein content and advancing its breeding improvement.
1. Introduction
1.1. Functions of Soybean Seed Protein
As a crucial natural macromolecule [], protein is an indispensable material basis for sustaining life activities. Soybean is an economic crop with high-quality protein [,] that is rich in amino acids essential for humans []. The health benefits of soybean protein have been recognized by the U.S. Food and Drug Administration, and soybean products have been recommended by the World Health Organization as the best health food for the 21st century [].
1.2. Influencing Factors of Soybean Seed Protein Content
The protein content in soybean seeds is influenced by multiple factors, which are mainly categorized into external and internal factors. External factors primarily include external environmental conditions such as geographical latitude, altitude, light, moisture, and temperature, as well as cultivation practices like irrigation and fertilization []. Relevant studies have shown that low CO2 concentration and short daylight hours are conducive to soybean protein formation. Excessively high or low temperatures are both unfavorable for protein accumulation, with the optimal temperature being 24 °C [,,]. Additionally, moderate fertilization and irrigation levels contribute to increased soybean seed protein content []. Internal factors refer to the genetic background of soybean itself, such as additive effects, dominant effects, and epistatic effects among genes. Soybean seed protein content is a quantitative trait controlled by multiple genes, with a complex regulatory mechanism [], and it generally exhibits a negative correlation with yield [,]. Traditional breeding methods are labor-intensive, time-consuming, and lack precision, making it difficult to break the unfavorable genetic correlation between soybean protein content and yield. Therefore, new technical approaches are needed to advance the genetic improvement of soybean seed protein content.
1.3. Research Progress on QTLs Related to Soybean Seed Protein Content
With the development of molecular biology, biotechnology has provided new strategies for crop breeding. Techniques such as marker-assisted selection, genomic selection, gene editing, and transgenesis can overcome the limitations of traditional phenotypic selection, enabling early, precise, and efficient genetic improvement. Researchers worldwide have studied the molecular genetic improvement of soybean seed protein content for several decades. QTL analysis has laid a crucial foundation for the development of molecular markers and the mining of functional genes. In 1992, Diers et al. [] published the first study on QTL mapping for soybean protein content, which attracted widespread attention in the research community. Warrington et al. [] constructed a recombinant inbred line (RIL) population from a high-yield, late-maturing cultivar and a high-protein, mid-maturing cultivar and identified 4 QTLs associated with protein content. Whiting et al. [] constructed an RIL population using the high-protein cultivar ACX790P and two high-yielding cultivars, S18-R6 and S23-T5, and detected 14 QTLs associated with protein content. Clevinger et al. [] developed an RIL population by crossing PI399084, which has relatively high protein content, with PI507429, which has relatively low protein content. Based on 12,761 SNP markers, they identified QTLs related to protein content on chromosomes 2 and 15 under four different environments. Jamison et al. [] constructed an RIL population from the high-protein, low-sucrose line R08-3221 and the high-sucrose, low-protein line R07-2000. Genotypic data generated with the SoySNP6k chip led to the identification of two protein content-related QTLs, located on chromosomes 11 and 20, respectively. Kim et al. [] crossed the soybean cultivar “Daepung” with the high-protein cultivar “GWS-1887” and detected a QTL associated with protein content at the genomic position Gm20_29512680.
To date, the SoyBase database (www.soybase.org) has recorded over 200 QTLs associated with soybean seed protein content [].
1.4. Research Progress on Genes Related to Soybean Seed Protein Content
Notable progress has been made in mining genes related to soybean seed protein content in recent years. In 2020, Wang et al. [] reported via population genetics that GmSWEET10a and GmSWEET10b underwent progressive variation and artificial selection: GmSWEET10a was strongly selected during soybean domestication, leading to larger seeds, increased oil content, and decreased protein content in cultivated soybeans, whereas the domestication and selection of GmSWEET10b lagged behind that of GmSWEET10a. These two genes coordinately regulate seed size, oil content, and protein content, playing a key role in soybean domestication and improvement. In 2022, researchers [] confirmed through transgenic experiments that GmST05, the soybean ortholog of AtMFT (Mother of Flowering Time from Arabidopsis thaliana), positively regulates soybean seed size; further studies revealed that GmST05 may affect seed oil and protein content by regulating the transcription of GmSWEET10a. In 2023, Qi et al. [] identified and validated FA9 as a major gene controlling soybean seed oil content on chromosome 9 using phenomics, genomics, and gene editing; FA9 encodes a SEIPIN protein and also exerts pleiotropic effects on seed size and protein content. In 2025, Yang et al. [] identified GmGASA12 on chromosome 8 as a key gene influencing soybean seed size and protein content; GmGASA12 encodes a gibberellin-responsive protein. Knockout of GmGASA12 synergistically increased water-soluble protein content and yield per plant, while expanding seed cells by 27.6%. The continuous identification and validation of functional genes provide a solid basis for the design-based breeding of soybean seed protein content and new insights into achieving simultaneous improvements in yield and quality.
1.5. Significance of This Study
Based on a genome-wide introgression line population containing genetic information from wild soybean, this study constructed an initial mapping population and a secondary segregating population for soybean seed protein content. Ultimately, a previously unreported QTL for soybean seed protein content, designated as qSPJ_1, was mapped on chromosome 16, with the gene conferring increased protein content derived from wild soybean. Additionally, potential candidate genes were predicted within this QTL region. This study provides new guiding information for the development of molecular markers related to soybean seed protein and offers a new reference for the application of wild soybean resources in soybean genetic improvement.
2. Results
2.1. Initial Mapping of Protein Quantitative Trait Loci (Protein QTLs)
Individual plants with homozygous backgrounds, minimal heterozygous segments, and few introgressed segments were selected from the CSSL population. Based on phenotypic data of protein content, the CSSL-646 population—with significant segregation and normal distribution of protein content—was selected as the initial QTL mapping population. The CSSL-646 population contained 7 heterozygous segments on chromosomes 8 (Gm08), 10 (Gm10), and 16 (Gm16), and 11 homozygous introgressed segments, with heterozygous and introgressed segments mainly concentrated on Gm16 (Figure 1).
Figure 1.
CSSL-646 resequencing BinMap results. Note: Blue represents the background segment; green represents the homozygous introgressed segment; red represents the heterozygous segment.
The FOSS-1241™ analyzer was used to measure the seed protein content of 79 individual plants in the CSSL-646 population. The protein content ranged from 39.53% to 44.56%, showing significant segregation and a normal distribution (Figure 2). Analysis with the Segregation Analysis (SEA) software [] indicated that the protein content trait in this population was controlled by a major gene with dominant effects, confirming that the population segregates for a single gene (Table 1).
Figure 2.
Phenotypic distribution of protein content in the F2 population.
Table 1.
Segregation analysis of soybean protein content in the F2 population.
Based on whole-genome resequencing data, SSR primers were selected from public primer libraries: 6 polymorphic SSR primers were screened for the homozygous introgressed segments on Gm08, 28 for the heterozygous segments on Gm16, and 8 for the introgressed segments on Gm16. Forty individuals with extreme phenotypes (20 high-protein and 20 low-protein) were selected from the CSSL-646 population and genotyped by polyacrylamide gel electrophoresis (PAGE). Comparison of the two extreme groups revealed significant genotypic differences on Gm16. Thus, the QTL associated with soybean seed protein content was initially mapped to the interval of 26,705,080 bp–33,180,908 bp on Gm16, designated as qSPJ_1.
2.2. Fine Mapping of Protein Locus qSPJ_1
Four individuals with heterozygous genotypes in the qSPJ_1 interval, namely 646−2−14, 646−2−15, 646−2−21, and 646−2−25, were selected from the F2 generation and grown as single rows to construct the R1 population. The protein content of the R1 population showed a normal distribution (Figure 3). SEA software analysis was performed on the protein content phenotypic data of R1 (Table 2). Twenty high-protein and 20 low-protein individuals from R1 were selected for genotyping. On Gm08, the genotypes of both groups were consistent with the SN14 background, which excluded an influence of Gm08 on protein content. Marker densification was performed within the qSPJ_1 interval, and single-marker analysis narrowed the interval to 1.0 Mb (32,403,844 bp–33,428,851 bp), flanked by markers BARCSOYSSR_16_1082 and BARCSOYSSR_16_1134.
Figure 3.
Protein content distribution histograms of the secondary segregating populations. Note: (a) Phenotypic distribution of protein content in the R1 population. (b) Phenotypic distribution of protein content in the R2 population.
Table 2.
Segregation analysis of soybean protein content in the RHL populations.
Based on the genotyping results of the R1 population, four individuals with heterozygous target segments (646−2−15−14, 646−2−21−20, 646−2−21−23, and 646−2−21−24) were selected, and their progeny (182 individuals) were used to construct the R2 population. The R2 population showed a distinct bimodal distribution of protein content (Figure 3). A chi-square test in SPSS 26.0 software [] indicated that the phenotypic segregation ratio conformed to the 1:3 Mendelian ratio expected under single-gene control. SEA software analysis confirmed that the trait in the R2 population was controlled by a major gene (Table 2). Genotyping of 40 individuals with extreme phenotypes (20 high-protein, 20 low-protein) from R2 split the target segment into two sub-intervals: 0.076 Mb (32,403,844 bp–32,480,066 bp, flanked by BARCSOYSSR_16_1082 and BARCSOYSSR_16_1087) and 0.36 Mb (33,069,984 bp–33,428,851 bp, flanked by BARCSOYSSR_16_1123 and BARCSOYSSR_16_1134).
Individuals with heterozygous genotypes in the two sub-intervals and homozygous genotypes in other regions were selected, and 9 heterozygous individuals (W1–W9) were used to construct the R3 population, which is a residual heterozygous line (RHL) population. The R3 population showed significant segregation of protein content, with values ranging from 39.95% to 47.76%. Single-marker analysis of primers within the two sub-intervals revealed that four markers (BARCSOYSSR_16_1082, BARCSOYSSR_16_1083, BARCSOYSSR_16_1084, and BARCSOYSSR_16_1087) were highly significantly associated with protein content. Thus, the qSPJ_1 interval was finally refined to 0.076 Mb (32,403,844 bp–32,480,066 bp) on Gm16 (Figure 4 and Figure 5).
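The goodness-of-fit test against a 1:3 ratio can be illustrated with a short sketch. The study ran this test in SPSS; the function below is only a stand-in written with the Python standard library, and the class counts (44 and 138) are hypothetical because the observed counts are not reported in the text.

```python
import math

def chi_square_1_to_3(n_minor, n_major):
    """Goodness-of-fit test of two observed class counts against a 1:3 ratio (df = 1)."""
    total = n_minor + n_major
    observed = (n_minor, n_major)
    expected = (total * 0.25, total * 0.75)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the upper-tail probability has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical split of the 182 R2 plants (assumed values, not from the study).
chi2, p = chi_square_1_to_3(44, 138)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

A p-value above 0.05 means the observed split does not deviate significantly from 1:3, which is the sense in which a segregation ratio "conforms" to the Mendelian expectation.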
Figure 4.
Fine-mapping process on Gm16.
Figure 5.
Phenotypic distribution of protein content in the fine-mapping population.
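The interval sizes quoted in this section follow directly from the flanking-marker coordinates. As a quick arithmetic check, the sketch below converts each reported coordinate pair to megabases; the dictionary labels and the helper name `span_mb` are illustrative choices, while the coordinates are those given in the text.

```python
# Flanking coordinates (bp on Gm16) as reported in Sections 2.1 and 2.2.
INTERVALS = {
    "initial mapping (F2)": (26_705_080, 33_180_908),
    "R1": (32_403_844, 33_428_851),
    "R2 sub-interval 1": (32_403_844, 32_480_066),
    "R2 sub-interval 2": (33_069_984, 33_428_851),
}

def span_mb(start_bp, end_bp):
    """Physical distance between two positions, in megabases."""
    return (end_bp - start_bp) / 1_000_000

for name, (start, end) in INTERVALS.items():
    print(f"{name}: {span_mb(start, end):.3f} Mb")
```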
2.3. Gene Annotation and Parental Sequence Alignment of Candidate Genes
Functional annotation of genes within the 0.076 Mb fine-mapped interval identified 9 candidate genes (Table 3).
Table 3.
Candidate gene annotation information.
The coding sequences (CDSs) of the candidate genes from SN14 and ZYD00006 were translated using the Expasy website (http://web.expasy.org/translate/), and the resulting amino acid sequences were compared. Only Glyma.16G164900, Glyma.16G165000, Glyma.16G165100, Glyma.16G165300, and Glyma.16G165600 were found to have nonsynonymous mutations (Figure 6).
Figure 6.
Amino acid changes in candidate genes in parents. Note: (A): Amino acid sequence does not change between parents; (B): Amino acid sequence is different between parents.
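The comparison behind Figure 6 amounts to translating each parental CDS and listing the positions where the amino acids differ. The study used the Expasy translate tool; the sketch below is only an illustration of the same idea with the standard genetic code, and the two short sequences are made-up toy examples, not parental sequences.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(cds):
    """Translate a CDS in frame 1, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def nonsynonymous_sites(cds_a, cds_b):
    """Return (position, residue_a, residue_b) for every differing residue.

    Only the overlapping part of the two proteins is compared; length
    differences caused by indels would need a proper alignment.
    """
    pairs = zip(translate(cds_a), translate(cds_b))
    return [(pos, a, b) for pos, (a, b) in enumerate(pairs, start=1) if a != b]

# Toy sequences (hypothetical): one synonymous and one nonsynonymous change.
cultivated_like = "ATGGCTGAAAAGTAA"
wild_like = "ATGGCCGATAAGTAA"
print(nonsynonymous_sites(cultivated_like, wild_like))
```

In this toy pair, the GCT to GCC change is silent and only the GAA to GAT change alters the protein, which is the distinction the parental comparison relies on.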
2.4. Quantitative Real-Time PCR Analysis (qRT-PCR Analysis)
To further validate the five candidate genes with non-synonymous mutations, two low-protein individuals—D646−1 and D646−2, which have the SN14 background in the target segment—and two high-protein individuals—D646−3 and D646−4, which have the ZYD00006 background in the target segment—were selected from the RHL population, with SN14 used as the control. qPCR analysis showed significant differences in the relative expression levels of Glyma.16G164900 and Glyma.16G165100 between high-protein and low-protein materials (Figure 7 and Figure 8).
Figure 7.
Significance of protein content differences among the extreme materials used for real-time quantitative analysis. Note: *** p ≤ 0.001.
Figure 8.
Specific expression analysis of candidate genes at different stages of seed development. Note: Green represents parent SN14; blue and purple represent high protein materials; yellow and red represent low protein materials.
2.5. Analysis of Promoter Elements of Candidate Genes
The 3000 bp upstream sequences of Glyma.16G164900 and Glyma.16G165100 were extracted from the sequencing data of the two parents (SN14 and ZYD00006) and compared using the PlantCARE website. The promoter regions of both candidate genes contain a large number of core promoter elements located around the transcription start site (TATA-box) and common cis-acting elements of promoter and enhancer regions (CAAT-box). In addition, the promoter regions of both genes differ between the parents to varying degrees, which may play a regulatory role in gene expression (Figure 9).
Figure 9.
Promoter element analysis of candidate genes. Note: (A): Glyma.16G164900 promoter element analysis and parent-to-parent comparison; (B): Promoter element analysis and parent-to-parent comparison of Glyma.16G165100.
2.6. Haplotype Analysis of Candidate Genes
Haplotype analysis of candidate genes Glyma.16G164900 and Glyma.16G165100 was performed in a resource population consisting of 350 soybean core germplasm accessions (Table S1), using whole-genome resequencing data. For Glyma.16G164900, two elite haplotypes were identified in the resource population, with no significant difference in protein content between them (Figure S1). For Glyma.16G165100, a total of four elite haplotypes were identified (Figure 10 and Figure 11). Among them, the average protein content of Hap_3 and Hap_4 was significantly higher than that of Hap_1. Therefore, Glyma.16G165100 was inferred to be the candidate gene associated with the soybean protein content phenotype. Hap_3 included 31 accessions, with 19 from Heilongjiang, 8 from Jilin, 2 from Inner Mongolia, and 1 each from Xinjiang and Liaoning. Hap_4 contained 16 accessions, including 9 from Heilongjiang, 4 from Jilin, and 3 from Liaoning. These results indicated that, among the elite haplotypes, soybean germplasm from Heilongjiang had a greater advantage in high protein content.
Figure 10.
Distribution Map of Haplotype Variation Sites of Candidate Gene Glyma.16G165100.
Figure 11.
Boxplot of Glyma.16G165100 Gene’s Haplotypes.
The main differential loci between Hap_3, Hap_4, and Hap_1 were concentrated in the promoter region. To further explore the association between variation in the 3000 bp upstream regulatory region of the candidate gene Glyma.16G165100 and soybean seed protein content, SNP variation in this region was analyzed in the 350 soybean accessions. A total of 6 SNP loci were identified in this region. Sequence alignment and element prediction showed that these SNPs may alter core cis-acting elements in the regulatory region (TATA-box, CAAT-box, ARE, etc.), which may in turn regulate the expression level of Glyma.16G165100 by affecting transcription initiation efficiency (Figure S2). Note: Red indicates Hap_1, green indicates Hap_2, blue indicates Hap_3, and purple indicates Hap_4. Groups sharing the same letter show no significant difference in average 100-seed weight, whereas groups with no letter in common differ significantly.
3. Discussion
Wild soybean exhibits higher genetic diversity than cultivated soybean []. Through long-term natural selection, wild soybean has acquired tolerance to abiotic stresses such as salinity-alkalinity, drought, aluminum toxicity, and phosphorus deficiency [,,]. Wild soybean also harbors numerous quality-related genes, which are associated with traits such as protein, vitamin, and amino acid content [,]. In this study, a CSSL population derived from the wild soybean ZYD00006 and the cultivated soybean Suinong 14 was used as the basic material. This population has a Suinong 14 genetic background, and each line contains a small number of introgressed wild soybean segments, which induce phenotypic variation. Lines with fewer introgressed segments were screened from the CSSL population; through continuous screening and generation advancement, a mapping population for soybean seed protein content was constructed. This mapping population has a clean genetic background and low genetic “noise”, thus ensuring higher accuracy and precision of the mapping results. Furthermore, the introgression of genetic segments from wild resources enables the detection of elite allelic variations that are absent in cultivated varieties. In this study, the alleles contributing to increased seed protein content were derived from wild soybean resources, which provides an important reference for the utilization of elite alleles from wild soybean and is conducive to broadening the genetic basis of cultivated soybean varieties.
Previous studies have mapped multiple QTLs associated with soybean seed protein content on chromosome 16. Jun et al. [] mapped two related QTLs using SSR markers, with interval lengths of 0.9 Mb (flanked by marker Satt287) and 1.7 Mb (flanked by marker BARC_025851), respectively. Among these QTLs, the one linked to marker BARC_025851 is close in position to qSPJ_1 detected in this study. Sonah et al. [] identified one QTL that simultaneously controls protein and oil content in the interval of 4,183,401–4,523,670 bp using GWAS and GBS technologies. Li et al. [] detected three QTLs associated with protein and oil content in the interval of ss715624939–ss715624938 using SNP markers. These QTLs do not overlap with the qSPJ_1 locus mapped in this study; thus, this study identified a novel QTL for soybean seed protein content, with the elite allelic variation derived from wild resources. Furthermore, within the QTL interval, this study mined candidate genes and performed haplotype analysis of these candidate genes using soybean germplasm resources from different regions, ultimately obtaining elite haplotypes associated with soybean seed protein content. Therefore, this study will provide an important reference for marker-assisted selection breeding of soybean seed protein content.
In this study, through gene sequence alignment, gene expression level analysis, and haplotype analysis, the WD-40 repeat protein family gene Glyma.16G165100 was preliminarily predicted to be a key candidate gene affecting soybean seed protein content. As one of the largest protein families in eukaryotes, the WD-40 repeat protein family is characterized by conserved repeat sequences of approximately 40 amino acids, with functions covering multiple key biological processes such as cell division, light signal transduction, cell cycle regulation, and growth and development []. Van et al. [] reported that WD-40 proteins interact with TPR family genes through their repeat motifs. TPR family proteins exhibit specific interactions with ribosomal components []. Liu et al. [] identified the WD-40 protein family gene SHREK1 in maize, which plays an important role in ribosome biogenesis and kernel development. Ribosomes are the “factories” for protein synthesis, enabling the continuous addition of amino acids through a cyclic reaction of “aminoacyl-tRNA entry → peptide bond formation → translocation”. When ribosomes move to the stop codons (UAA, UAG, UGA) of mRNA, protein synthesis enters the termination stage. Therefore, it is hypothesized that Glyma.16G165100 indirectly affects soybean seed protein content by influencing ribosome synthesis, which requires further verification.
Soybean protein content is negatively correlated with oil content, yield, and ambient temperature []. Genetic improvement of protein content requires balancing the synergy and trade-offs among multiple traits, which is also a core challenge in soybean quality breeding [,,]. This is because protein synthesis consumes large amounts of photosynthates and energy. For soybeans, the total amount of photosynthates is relatively fixed; if a large portion of photosynthates and energy is consumed for protein synthesis, the substances and energy available for other biological processes will decrease accordingly. Zhong et al. [] found that the rhizobium-induced cle1a/2a (ric1a/2a) mutants exhibited a moderate increase in nodule number, balanced carbon allocation, and enhanced carbon-nitrogen acquisition capacity. The two ric1a/2a lines showed improved seed yield and protein content while sustaining oil content, indicating that gene editing toward optimal nodulation enhances soybean yield and quality. The candidate gene Glyma.16G165100 predicted in this study has not been validated for its function through genetic transformation experiments, which represents a limitation. In subsequent studies, gene-edited mutants will be generated to confirm its function. In terms of gene function, Glyma.16G165100 may indirectly affect protein synthesis by influencing ribosome synthesis; meanwhile, various enzymes involved in oil synthesis and yield formation are also synthesized by ribosomes. Therefore, editing this gene may enable the coordinated positive regulation of soybean quality and yield.
In this study, the planting location, cultivation method, and management mode of the QTL mapping population were standardized over 4 years, and QTL mapping was performed each year to reduce the interference of environmental heterogeneity on protein phenotypes. Secondary segregating populations were constructed in different years, and the same QTL was repeatedly detected in all cases, indicating that the QTL is stable. Furthermore, haplotype analysis of the candidate gene was conducted using 3 years of phenotypic data from a germplasm population collected from other locations, and elite haplotypes were identified. To a certain extent, this supports the accuracy of the candidate gene prediction and its broad applicability. However, the study also has limitations: the QTL mapping materials were planted at a single location, the number of test years was limited, the qPCR test materials came from a relatively narrow source, and the function of the candidate gene has not been verified by genetic transformation. These limitations point out the direction for our future research.
4. Materials and Methods
4.1. Experimental Materials
4.1.1. Construction of the CSSL Population
The genetic population used in this experiment was a CSSL population previously constructed from the Heilongjiang cultivated soybean variety Suinong 14 and the wild soybean accession ZYD00006 [], where Suinong 14 served as the recurrent parent and wild soybean ZYD00006 as the donor parent. After the F1 progeny were obtained from the cross between the two parents in 2006, advanced-generation, genetically stable lines were developed through continuous backcrossing and selfing. Through genotypic screening and phenotypic identification, a total of 220 stable lines were obtained, which fully cover the soybean genome.
4.1.2. Construction of Mapping Populations
Based on whole-genome resequencing results, 23 individual plants with relatively homozygous backgrounds and few heterozygous segments were selected from the CSSL population. In 2017, these individual plants were propagated into lines, and the CSSL-646 population was finally identified as the initial QTL mapping population according to the segregation pattern of protein content. According to the initial mapping results, in 2018, individual plants with heterozygosity in the initial mapping interval and homozygosity in other regions were selected from the CSSL-646 population to construct a secondary segregating population containing 99 plants, named R1. In 2019, based on the mapping results of R1, an RHL population with 182 plants was constructed, named R2. In 2020, according to the mapping results of R2, individual plants with heterozygosity in the target segment and homozygosity in the remaining background were selected from the population to construct another RHL population with 9 plants, named R3.
4.1.3. Materials for Quantitative Real-Time PCR (qPCR)
In this study, SN14 was selected as the control material. From the R2 population, two plants (D646-1 and D646-2) homozygous for the SN14 genotype in the target segment and two plants (D646-3 and D646-4) homozygous for the ZYD00006 genotype in the target segment were selected. Each single plant was grown into a plant line, and 5 plant lines were obtained. Following the description of soybean seed development stages on the SoyBase website (https://soybase.org/, accessed on 25 November 2018), seeds from different plants in the five rows were sampled at five developmental stages (EM1, EM2, MM, LM, and DS) as materials for quantitative analysis. Three biological replicates were taken for each row at each stage.
4.1.4. Materials for Haplotype Analysis
A total of 350 soybean germplasm accessions collected from Heilongjiang Province, Jilin Province, Liaoning Province, and the Inner Mongolia Autonomous Region of China were used for haplotype analysis of the candidate gene (Table S1, Figures S3 and S4).
4.2. Experimental Methods
4.2.1. Methods for Planting and Field Management of Experimental Materials
From 2017 to 2020, the materials were grown at the Xiangyang Base of Northeast Agricultural University in Harbin (45.74° N, 126.73° E). The planting pattern was consistent across all years: 5 m row length, 60 cm row spacing, and 5 cm plant spacing, with unified management following local conventional field practices. During the growing-to-harvest period, the average temperatures over the four years were 17.66 °C, 17.51 °C, 17.21 °C, and 17.43 °C, respectively (Table 4); the average precipitation levels were 2.82 mm, 3.60 mm, 4.39 mm, and 4.20 mm, respectively (Table 5, Tables S2 and S3). The resource population used for haplotype analysis was grown at the Gongzhuling Base in Changchun from 2019 to 2021 (Tables S4 and S5).
Table 4.
Weather Conditions in Changchun from May to October.
Table 5.
Precipitation from May to October.
4.2.2. Determination of Soybean Seed Protein Content
The FOSS-1241™ Near-Infrared Grain Quality Analyzer was used to measure soybean seed protein content. During measurement, the moisture content of soybean seeds was maintained within the safe range. Each sample was measured in triplicate, and the average value of the three measurements was used as the final phenotypic data for seed protein content.
4.2.3. DNA Extraction and Detection
Young leaf samples were collected from the top of the main stem in July and stored at −80 °C. Genomic DNA was extracted using the cetyltrimethylammonium bromide (CTAB) method, following the protocol described by Doyle []. The purity and concentration of the extracted DNA were measured using a NanoDrop spectrophotometer. DNA samples were considered qualified if the A260/A280 ratio ranged from 1.8 to 2.0 and the absorption curve showed a single peak.
4.2.4. Acquisition and Identification of SSR Markers
Simple sequence repeat (SSR) markers were designed based on the Williams 82.a2.v1 reference genome sequence published on the SoyBase website. Primers were synthesized by Shanghai Sangon Biotech Co., Ltd. and supplied in powder form. Each tube was centrifuged at 4000 rpm in a desktop centrifuge, the cap was carefully opened, and the volume of double-distilled water (ddH2O) indicated on the label was added to dissolve the primers. The tubes were sealed, vortexed thoroughly, and stored at −20 °C. Polyacrylamide gel electrophoresis (PAGE) was used to identify the polymorphism of SSR markers.
4.2.5. Segregation Analysis of Traits (SEA)
The SEA software was used to predict the genetic pattern of quantitative trait genes in the biparental segregating population. The specific calculation formulas are as follows:
Probability of the Smirnov test (nW2):
Probability of the Kolmogorov test (Dₙ):
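For orientation, the two statistics named here have standard textbook forms. For an ordered sample x(1) ≤ … ≤ x(n) and a fitted distribution function F, Smirnov's statistic is nW² = 1/(12n) + Σ [F(x(i)) − (2i − 1)/(2n)]², and Kolmogorov's statistic is Dₙ = max over i of max{i/n − F(x(i)), F(x(i)) − (i − 1)/n}. The sketch below computes both statistics only; it does not reproduce the probabilities reported by SEA, and it fits a single normal distribution as a simplifying assumption, whereas SEA evaluates genetic models based on mixture distributions. The protein values are hypothetical.

```python
from statistics import NormalDist

def goodness_of_fit(sample, cdf):
    """Return (nW2, Dn) for a sample against a fitted distribution function."""
    xs = sorted(sample)
    n = len(xs)
    f = [cdf(x) for x in xs]
    # Smirnov (Cramer-von Mises) statistic.
    nw2 = 1.0 / (12 * n) + sum(
        (fi - (2 * i - 1) / (2 * n)) ** 2 for i, fi in enumerate(f, start=1)
    )
    # Kolmogorov statistic: largest gap between empirical and fitted CDF.
    dn = max(max(i / n - fi, fi - (i - 1) / n) for i, fi in enumerate(f, start=1))
    return nw2, dn

# Hypothetical protein contents (%), for illustration only.
protein = [39.8, 40.6, 41.2, 41.9, 42.3, 42.8, 43.5, 44.1]
fitted = NormalDist.from_samples(protein)
nw2, dn = goodness_of_fit(protein, fitted.cdf)
print(f"nW2 = {nw2:.4f}, Dn = {dn:.4f}")
```

Small values of both statistics indicate that the fitted model describes the observed phenotypic distribution well.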
4.2.6. Single-Marker Analysis
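Single-marker analysis was performed using the Single Marker Analysis module in Windows QTL Cartographer 2.5 software [] to identify molecular markers significantly associated with seed protein content by analyzing phenotypic and genotypic data of the mapping population. A simple linear regression model was used for data analysis: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $y_i$ is the seed protein content of individual $i$, $x_i$ is its genotype score at the marker, $\beta_0$ is the intercept, $\beta_1$ is the marker effect, and $\varepsilon_i$ is the residual error.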
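A minimal sketch of fitting such a regression for one marker is given below. The 0/1/2 genotype coding, the function name, and the toy data are illustrative assumptions rather than details taken from Windows QTL Cartographer; the sketch uses ordinary least squares and reports the slope, intercept, R², and the F statistic with 1 and n − 2 degrees of freedom.

```python
def single_marker_regression(genotypes, phenotypes):
    """Least-squares fit of phenotype on marker genotype score."""
    n = len(genotypes)
    mean_x = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(genotypes, phenotypes))
    slope = sxy / sxx                      # estimated marker effect
    intercept = mean_y - slope * mean_x
    ss_res = max(syy - slope * sxy, 0.0)   # residual sum of squares
    r_squared = 1.0 - ss_res / syy
    # F test of slope = 0 with (1, n - 2) degrees of freedom.
    f_stat = float("inf") if ss_res == 0 else (syy - ss_res) / (ss_res / (n - 2))
    return {"slope": slope, "intercept": intercept, "r2": r_squared, "F": f_stat}

# Toy data: genotype scores (assumed coding 0/1/2) and protein content (%).
genotypes = [0, 0, 1, 1, 2, 2]
protein = [40.0, 41.0, 42.0, 43.0, 44.0, 45.0]
print(single_marker_regression(genotypes, protein))
```

A marker is declared associated with the trait when the F statistic exceeds the chosen significance threshold; running the fit marker by marker across an interval shows where the association is strongest.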
4.2.7. Functional Annotation of Candidate Genes
Genes within the candidate interval were annotated using the Williams 82.a2.v1 genome as the reference. The Phytozome database (https://phytozome.jgi.doe.gov/pz/portal.html, accessed on 20 February 2019) was used to analyze gene functions and predict candidate genes [].
4.2.8. Amino Acid Sequence Alignment of Candidate Genes
Based on the genotypic data from parental resequencing, variations (SNPs and Indels) in the 3000 bp upstream promoter region, 5′ untranslated region (UTR), CDS region, and 3′ UTR of candidate genes between the two parents were analyzed. The CDSs were translated into amino acid sequences to detect differences in amino acid sequences between the parents.
4.2.9. Quantitative Real-Time PCR (qPCR) Analysis
Transcript sequences of candidate genes were obtained from the Phytozome website, and primers were designed using Primer 5 software. GmActin4 (GenBank Accession No.: AF049106) was used as the reference gene. Total RNA was extracted using the Trizol method and reverse-transcribed into cDNA. qPCR was performed to analyze the relative expression levels of candidate genes.
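The text does not state how relative expression was calculated. One widely used approach is the 2^(−ΔΔCt) method, sketched below purely as an illustration; the function name and argument names are assumptions, and the reference gene would correspond to GmActin4 here.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCt) method (assumed, not confirmed).

    ct_target_* -- Ct of the candidate gene in the sample / control
    ct_ref_*    -- Ct of the reference gene in the sample / control
    Assumes roughly 100% amplification efficiency for both primer pairs.
    """
    delta_sample = ct_target_sample - ct_ref_sample      # normalize sample
    delta_control = ct_target_control - ct_ref_control   # normalize control
    return 2 ** -(delta_sample - delta_control)
```

A returned value above 1 means the candidate gene is expressed more strongly in the sample than in the control after normalization to the reference gene; a value below 1 means it is expressed less strongly.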
4.2.10. Promoter Element Analysis
A 3000 bp sequence upstream of the start codon (ATG) of candidate genes was defined as the promoter region, which was retrieved from the Phytozome database. The Plant CARE database (http://bioinformatics.psb.ugent.be/webtools/plantcare/html/, accessed on 28 June 2019) was used to identify and annotate cis-acting elements in the promoter region. TBtools software [] was used for visualizing promoter elements.
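The definition of the promoter region can be made concrete with a short sketch. The helper below is hypothetical and is not how Phytozome retrieves sequences; it assumes the chromosome sequence is available as a string and that `atg_pos` is the 1-based genomic coordinate of the A of the ATG start codon. For genes on the minus strand, the upstream region lies at higher genomic coordinates and is returned as its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(chrom_seq, atg_pos, strand, length=3000):
    """Return up to `length` bp immediately upstream of the start codon.

    chrom_seq -- chromosome sequence as a string
    atg_pos   -- 1-based coordinate of the 'A' of the ATG (assumption)
    strand    -- '+' or '-'
    The result is always read 5'->3' on the gene's own strand and is
    truncated at the chromosome ends.
    """
    if strand == "+":
        start = max(1, atg_pos - length)
        end = atg_pos - 1
        return chrom_seq[start - 1:end]
    if strand == "-":
        start = atg_pos + 1
        end = min(len(chrom_seq), atg_pos + length)
        return reverse_complement(chrom_seq[start - 1:end])
    raise ValueError("strand must be '+' or '-'")
```

The returned sequence can then be submitted to a cis-element database such as PlantCARE.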
4.2.11. Haplotype Analysis of Candidate Genes in the Germplasm Population
Local BLAST analysis was performed to obtain SNP information for the candidate genes (Glyma.16G164900 and Glyma.16G165100) in the 350 soybean germplasm accessions. Haplotype analysis of the candidate genes was then conducted on these 350 genotyped accessions using CandiHap [] (version 1.3.2), with polymorphic SNPs located in the upstream regulatory regions and exons prioritized for haplotype construction. The phenotypic effects of the haplotypes were visualized in R (version 4.2.2) to show their associations with variation in seed protein content. Haplotypes carried by more than 5% of the accessions were defined as elite haplotypes. One-way analysis of variance (ANOVA) was performed in SPSS 26.0 to determine the effect of haplotype on phenotypic variation, and independent-samples t-tests were used to compare protein content between haplotypes, with statistical significance set at p < 0.05.
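The frequency filter and the ANOVA step can be illustrated with a small standard-library sketch. It does not reproduce CandiHap or SPSS; the function names and the input layout (one `(haplotype, protein_content)` pair per accession) are assumptions for this example, and only the F statistic is computed, not its p-value.

```python
from collections import defaultdict


def group_by_haplotype(records, min_fraction=0.05):
    """Group phenotypes by haplotype and keep haplotypes above a frequency cutoff.

    records      -- iterable of (haplotype_label, protein_content) pairs
    min_fraction -- a haplotype is kept only if it is carried by MORE than
                    this fraction of all accessions (5% in the text)
    """
    groups = defaultdict(list)
    for haplotype, value in records:
        groups[haplotype].append(value)
    total = sum(len(values) for values in groups.values())
    return {h: v for h, v in groups.items() if len(v) / total > min_fraction}


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a dict of label -> values."""
    values = [v for group in groups.values() for v in group]
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 groups and more observations than groups")

    grand_mean = sum(values) / n
    means = {h: sum(g) / len(g) for h, g in groups.items()}
    ss_between = sum(len(g) * (means[h] - grand_mean) ** 2
                     for h, g in groups.items())
    ss_within = sum((v - means[h]) ** 2
                    for h, g in groups.items() for v in g)
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")
    # Between-group mean square over within-group mean square.
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In practice, a statistics package would also supply the p-value for the F statistic (for example, `scipy.stats.f_oneway` in Python), followed by pairwise t-tests between the retained haplotypes.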
5. Conclusions
Based on a CSSL population, this study integrated genotypic data with phenotypic data of seed protein content to select elite lines for the construction of secondary segregating populations over consecutive years. Ultimately, a QTL associated with soybean seed protein content, designated as qSPJ_1, was mapped to a region on soybean chromosome 16 spanning 32,403,844 bp to 32,480,066 bp. Within this interval, through sequence alignment, expression level analysis, and haplotype analysis, the WD-40 protein family gene Glyma.16G165100 was preliminarily predicted as a potential candidate gene influencing soybean seed protein content. The identified superior haplotypes Hap_3 and Hap_4 exhibited higher average protein content. Breeders can develop molecular markers targeting these two superior haplotypes for use in marker-assisted selection to genetically improve related traits. This study utilized wild soybean resources to construct the mapping population, enabling the detection of rare elite allelic variants derived from wild resources. Notably, landraces with wild soybean ancestry were included in Hap_3 and Hap_4, which provides valuable insights for the effective utilization of wild soybean resources in the genetic improvement of soybean protein content and offers potential to enrich the genetic basis of cultivated soybeans in China and globally. Future work will focus on verifying the function of Glyma.16G165100 through genetic transformation and investigating its applicability across different environments.
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/plants14223525/s1, Table S1: A population of 350 soybean accessions; Table S2: Daily weather conditions in Harbin from 2017 to 2020; Table S3: Daily average precipitation in Harbin from 2017 to 2020; Table S4: Daily weather conditions in Gongzhuling from 2019 to 2021; Table S5: Daily average precipitation in Gongzhuling from 2019 to 2021; Figure S1: Boxplot of Glyma.16G164900 Gene’s Haplotypes; Figure S2: SNP variations in the promoter region of Glyma.16G165100 within the germplasm population; Figure S3: Distribution map of variety types among 350 accessions; Figure S4: Distribution map of origins for 350 varieties.
Author Contributions
J.C., J.X. and G.L. conceived and designed the manuscript. M.S. and Y.Z. (Yuhong Zheng) conducted data analysis. F.M., X.F. and X.S. performed formal analysis and validation. Y.Z. (Yunfeng Zhang) and M.W. conducted field management. Z.Y. and X.X. conducted phenotypic identification. Q.W., S.W. and H.J. critically revised the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Soybean Industry Technology System Northeast Mid-late Maturity Variety Improvement Position Expert (CARS-04-PS08), the National Natural Science Foundation of China (32201825), and the Natural Science Foundation of Jilin Province (20240101243JC).
Data Availability Statement
All data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| CSSL | Chromosome Segment Substitution Line |
| QTL | Quantitative Trait Locus |
| qRT-PCR | Quantitative Real-time PCR |
| SNP | Single Nucleotide Polymorphism |
| RIL | Recombinant Inbred Line |
| CDS | Coding Sequence |
| RHL | Residual Heterozygous Line |
| SSR | Simple Sequence Repeat |
| PAGE | Polyacrylamide Gel Electrophoresis |
| SEA | Segregation Analysis |
References
- Siddiqui, G.A.; Naeem, A. Connecting the dots: Macromolecular crowding and protein aggregation. J. Fluoresc. 2023, 33, 1–11.
- Wang, J.; Mao, L.; Zeng, Z.; Yu, X.; Lian, J.; Feng, J.; Yang, W.; An, J.; Wu, H.; Zhang, M.; et al. Genetic mapping high protein content QTL from soybean ‘Nanxiadou 25’ and candidate gene analysis. BMC Plant Biol. 2021, 21, 388.
- Zhou, Z.; Jiang, Y.; Wang, Z.; Gou, Z.; Lyu, J.; Li, W.; Yu, Y.; Shu, L.; Zhao, Y.; Ma, Y. Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat. Biotechnol. 2015, 33, 408–414.
- Krishnan, H.B.; Jez, J.M. Review: The promise and limits for enhancing sulfur-containing amino acid content of soybean seed. Plant Sci. 2018, 272, 14–21.
- Mejia, S.B.; Messina, M.; Li, S.S.; Viguiliouk, E.; Chiavaroli, L.; Khan, T.A.; Srichaikul, K.; Mirrahimi, A.; Sievenpiper, J.L.; Kris-Etherton, P.; et al. A meta-analysis of 46 studies identified by the FDA demonstrates that soy protein decreases circulating LDL and total cholesterol concentrations in adults. J. Nutr. 2019, 149, 968–981.
- Gao, F. Progress on Factors Influencing Protein Formation and Accumulation in Soybean. Soybean Sci. Technol. 2014, 14–19. (In Chinese)
- Piper, E.L.; Boote, K.J. Temperature and cultivar effects on soybean seed oil and protein concentrations. J. Am. Oil Chem. Soc. 1999, 76, 1233–1241.
- Xu, J.; Fang, Y.; Cheng, Y.; Wang, Y.; Guo, C. Soybean Cultivation in Low-Latitude Regions: Adaptive Strategies for Sustainable Production. Plant Cell Environ. 2025.
- Thomas, J.; Boote, K.; Allen, L.; Gallo-Meagher, M.; Davis, J. Elevated temperature and carbon dioxide effects on soybean seed composition and transcript abundance. Crop Sci. 2003, 43, 1548–1557.
- Haq, M.U.; Mallarino, A.P. Response of soybean grain oil and protein concentrations to foliar and soil fertilization. Agron. J. 2005, 97, 910–918.
- Fasoula, V.A.; Harris, D.K.; Boerma, H.R. Validation and designation of quantitative trait loci for seed protein, seed oil, and seed weight from two soybean populations. Crop Sci. 2004, 44, 1218–1225.
- Patil, G.; Mian, R.; Vuong, T.; Pantalone, V.; Song, Q.; Chen, P.; Shannon, G.J.; Carter, T.C.; Nguyen, H.T. Molecular mapping and genomics of soybean seed protein: A review and perspective for the future. Theor. Appl. Genet. 2017, 130, 1975–1991.
- Chiluwal, A.; Haramoto, E.R.; Hildebrand, D.; Naeve, S.; Poffenbarger, H.; Purcell, L.C.; Salmeron, M. Late-season nitrogen applications increase soybean yield and seed protein concentration. Front. Plant Sci. 2021, 12, 715940.
- Diers, B.W.; Keim, P.; Fehr, W.; Shoemaker, R. RFLP analysis of soybean seed protein and oil content. Theor. Appl. Genet. 1992, 83, 608–612.
- Warrington, C.; Abdel-Haleem, H.; Hyten, D.; Cregan, P.; Orf, J.; Killam, A.; Bajjalieh, N.; Li, Z.; Boerma, H. QTL for seed protein and amino acids in the Benning × Danbaekkong soybean population. Theor. Appl. Genet. 2015, 128, 839–850.
- Whiting, R.M.; Torabi, S.; Lukens, L.; Eskandari, M. Genomic regions associated with important seed quality traits in food-grade soybeans. BMC Plant Biol. 2020, 20, 485.
- Clevinger, E.M.; Biyashev, R.; Haak, D.; Song, Q.; Pilot, G.; Saghai Maroof, M. Identification of quantitative trait loci controlling soybean seed protein and oil content. PLoS ONE 2023, 18, e0286329.
- Jamison, D.R.; Chen, P.; Hettiarachchy, N.S.; Miller, D.M.; Shakiba, E. Identification of Quantitative Trait Loci (QTL) for Sucrose and Protein Content in Soybean Seed. Plants 2024, 13, 650.
- Kim, W.J.; Kang, B.H.; Moon, C.Y.; Kang, S.; Shin, S.; Chowdhury, S.; Choi, M.-S.; Park, S.-K.; Moon, J.-K.; Ha, B.-K. Quantitative trait loci (QTL) analysis of seed protein and oil content in wild soybean (Glycine soja). Int. J. Mol. Sci. 2023, 24, 4077.
- Wang, S.; Liu, S.; Wang, J.; Yokosho, K.; Zhou, B.; Yu, Y.-C.; Liu, Z.; Frommer, W.B.; Ma, J.F.; Chen, L.-Q. Simultaneous changes in seed size, oil content and protein content driven by selection of SWEET homologues during soybean domestication. Natl. Sci. Rev. 2020, 7, 1776–1786.
- Duan, Z.; Zhang, M.; Zhang, Z.; Liang, S.; Fan, L.; Yang, X.; Yuan, Y.; Pan, Y.; Zhou, G.; Liu, S.; et al. Natural allelic variation of GmST05 controlling seed size and quality in soybean. Plant Biotechnol. J. 2022, 20, 1807–1818.
- Qi, Z.; Guo, C.; Li, H.; Qiu, H.; Li, H.; Jong, C.; Yu, G.; Zhang, Y.; Hu, L.; Wu, X. Natural variation in Fatty Acid 9 is a determinant of fatty acid and protein content. Plant Biotechnol. J. 2024, 22, 759–773.
- Yang, Y.; Zhang, L.; Zuo, H.; Yang, Y.; Hu, D.; Zhang, S.; Yuan, W.; Zhai, X.; He, M.; Xu, M. GmGASA12 coordinates hormonal dynamics to enhance soybean water-soluble protein accumulation and seed size. J. Integr. Plant Biol. 2025, 67, 2401–2415.
- Cao, X.; Liu, B.; Zhang, Y. SEA: A software package of segregation analysis of quantitative traits in plants. J. Nanjing Agric. Univ. 2013, 36, 1–6. (In Chinese)
- Liu, J.; Jiang, A.; Ma, R.; Gao, W.; Tan, P.; Li, X.; Du, C.; Zhang, J.; Zhang, X.; Zhang, L. QTL Mapping for seed quality traits under multiple environments in soybean (Glycine max L.). Agronomy 2023, 13, 2382.
- Guo, X.; Jiang, J.; Liu, Y.; Yu, L.; Chang, R.; Guan, R.; Qiu, L. Identification of a novel salt tolerance-related locus in wild soybean (Glycine soja Sieb. & Zucc.). Front. Plant Sci. 2021, 12, 791175.
- Miao, N.; Zhou, J.; Li, M.; Zhang, J.; Hu, Y.; Guo, J.; Zhang, T.; Shi, L. Remodeling and protecting the membrane system to resist phosphorus deficiency in wild soybean (Glycine soja) seedling leaves. Planta 2022, 255, 53.
- Liu, Y.; Cao, L.; Wu, X.; Wang, S.; Zhang, P.; Li, M.; Jiang, J.; Ding, X.; Cao, X. Functional characterization of wild soybean (Glycine soja) GsSnRK1.1 protein kinase in plant resistance to abiotic stresses. J. Plant Physiol. 2023, 280, 153881.
- Chotekajorn, A.; Hashiguchi, T.; Hashiguchi, M.; Tanaka, H.; Akashi, R. Evaluation of seed amino acid content and its correlation network analysis in wild soybean (Glycine soja) germplasm in Japan. Plant Genet. Resour. 2021, 19, 35–43.
- Dwiyanti, M.S.; Maruyama, S.; Hirono, M.; Sato, M.; Park, E.; Yoon, S.H.; Yamada, T.; Abe, J. Natural diversity of seed α-tocopherol ratio in wild soybean (Glycine soja) germplasm collection. Breed. Sci. 2016, 66, 653–657.
- Jun, T.-H.; Van, K.; Kim, M.Y.; Lee, S.-H.; Walker, D.R. Association analysis using SSR markers to find QTL for seed protein content in soybean. Euphytica 2008, 162, 179–191.
- Sonah, H.; O’Donoughue, L.; Cober, E.; Rajcan, I.; Belzile, F. Identification of loci governing eight agronomic traits using a GBS-GWAS approach and validation by QTL mapping in soya bean. Plant Biotechnol. J. 2015, 13, 211–221.
- Li, X.; Shao, Z.; Tian, R.; Zhang, H.; Du, H.; Kong, Y.; Li, W.; Zhang, C. Mining QTLs and candidate genes for seed protein and oil contents across multiple environments and backgrounds in soybean. Mol. Breed. 2019, 39, 139.
- Wang, Y. Cloning and Expression Analysis of Three WD40 Transcription Factors in Roses. Master’s Thesis, Shandong Agricultural University, Weihai, China, 2019. (In Chinese)
- van der Voorn, L.; Ploegh, H.L. The WD-40 repeat. FEBS Lett. 1992, 307, 131–134.
- Will, N.; Piserchio, A.; Snyder, I.; Ferguson, S.B.; Giles, D.H.; Dalby, K.N.; Ghose, R. Structure of the C-Terminal helical repeat domain of eukaryotic elongation factor 2 Kinase. Biochemistry 2016, 55, 5377–5386.
- Liu, H.; Xiu, Z.; Yang, H.; Ma, Z.; Yang, D.; Wang, H.; Tan, B.-C. Maize Shrek1 encodes a WD40 protein that regulates pre-rRNA processing in ribosome biogenesis. Plant Cell 2022, 34, 4028–4044.
- Sun, J.; Li, W.; Wei, X.; Shou, H.; Tran, L.S.P.; Feng, X.; Wang, S. Mechanistic roles of GmSWEET10a/b and GmSUT1 in the oil–protein balance in soybean mature seeds at transcriptional and metabolic levels. Plant J. 2025, 123, e70435.
- Zhong, X.; Wang, J.; Shi, X.; Bai, M.; Yuan, C.; Cai, C.; Wang, N.; Zhu, X.; Kuang, H.; Wang, X. Genetically optimizing soybean nodulation improves yield and protein content. Nat. Plants 2024, 10, 736–742.
- Zheng, H.; Feng, X.; Wang, L.; Shao, W.; Guo, S.; Zhao, D.; Li, J.; Yan, L.; Miao, L.; Sun, B. GmSop20 functions as a key coordinator of the oil-to-protein ratio in soybean seeds. Adv. Sci. 2025, 12, e05181.
- Xin, D.; Qi, Z.; Jiang, H.; Hu, Z.; Zhu, R.; Hu, J.; Han, H.; Hu, G.; Liu, C.; Chen, Q. QTL location and epistatic effect analysis of 100-seed weight using wild soybean (Glycine soja Sieb. & Zucc.) chromosome segment substitution lines. PLoS ONE 2016, 11, e0149380.
- Doyle, J.; Doyle, J.; Brown, A. Analysis of a polyploid complex in Glycine with chloroplast and nuclear DNA. Aust. Syst. Bot. 1990, 3, 125–136.
- Wang, S.; Basten, C.J.; Zeng, Z.-B. Windows QTL Cartographer 2.5; Department of Statistics, North Carolina State University: Raleigh, NC, USA, 2007.
- Zhang, D.; Zhang, H.; Hu, Z.; Chu, S.; Yu, K.; Lv, L.; Yang, Y.; Zhang, X.; Chen, X.; Kan, G. Artificial selection on GmOLEO1 contributes to the increase in seed oil during soybean domestication. PLoS Genet. 2019, 15, e1008267.
- Chen, C.; Chen, H.; Zhang, Y.; Thomas, H.R.; Frank, M.H.; He, Y.; Xia, R. TBtools: An integrative toolkit developed for interactive analyses of big biological data. Mol. Plant 2020, 13, 1194–1202.
- Li, X.; Shi, Z.; Gao, J.; Wang, X.; Guo, K. CandiHap: A haplotype analysis toolkit for natural variation study. Mol. Breed. 2023, 43, 21.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).