Meta-Analyses of QTLs Associated with Protein and Oil Contents and Compositions in Soybean [Glycine max (L.) Merr.] Seed

Soybean [Glycine max (L.) Merr.] is a valuable and nutritious crop in part due to the high protein meal and vegetable oil produced from its seed. Soybean producers desire cultivars with both elevated seed protein and oil concentrations as well as specific amino acid and fatty acid profiles. Numerous studies have identified quantitative trait loci (QTLs) associated with seed composition traits, but validation of these QTLs has rarely been carried out. In this study, we have collected information, including genetic location and additive effects, on each QTL for seed contents of protein and oil, as well as amino acid and fatty acid compositions from over 80 studies. Using BioMercator V. 4.2, a meta-QTL analysis was performed with genetic information comprised of 175 QTLs for protein, 205 QTLs for oil, 156 QTLs for amino acids, and 113 QTLs for fatty acids. A total of 55 meta-QTL for seed composition were detected on 6 out of 20 chromosomes. Meta-QTL possessed narrower confidence intervals than the original QTL and candidate genes were identified within each meta-QTL. These candidate genes elucidate potential natural genetic variation in genes contributing to protein and oil biosynthesis and accumulation, providing meaningful information to further soybean breeding programs.


Introduction
Although significant crop improvements have been made internationally, there is still demand for increased food supply and quality [1]. In addition to traditional phenotypic selection, molecular breeding, often facilitated by genome sequence analysis, has been a primary tool for development of new cultivars to increase food security [1]. Like many other legumes, the demand for soybean has increased due to its nutritional importance in animal feed, potential use as an industrial raw material, for its benefits to human health [2], and use as a fuel feedstock [3].
Because soybean seed contains approximately 40% protein and 71% of the world's meal consumption is reliant on soybean, a percentage which is increasing [4], total protein content has been one of the primary quality traits on which soybean breeding has focused. Both human and animal consumption of soybean is predicted to increase as the demand for plant and animal protein as well as the world population grows [5]. About 77% of animal meal consumption is dependent on soybean meal, used as a source of protein and specific amino acids. Thus, soybean breeding programs have initiated development of new soybean cultivars which possess essential and balanced amino acid

Collection of QTLs for Soybean Seed Protein and Oil Contents and Compositions
SoyBase (available online: http://soybase.org) and recent literature were mined up to 2016, inclusive, to identify all map and QTL information for soybean seed protein and oil contents and amino acid and fatty acid profiles. A total of 84 studies were identified (Table S1) which reported a total of 1320 QTLs for protein, oil, amino acids and fatty acids, 648 QTLs with no R2 value and no genetic map information were excluded (Table S2). Thus, a total of 672 QTLs (184 for protein, 212 for oil, 156 for amino acids and 120 for fatty acids) were used in our meta-analyses (Table S2). Of these QTLs, 23 had LOD scores below 2.0, as suggested by Qi et al. [24] these were removed and separate meta-QTL analyses were conducted with only "high confidence" QTLs, which consisted of 649 QTLs (175 QTLs for protein, 205 QTLs for oil, 156 QTLs for amino acids and 113 QTLs for fatty acids) with LOD scores > 2.0 (Table S2).

QTL Projection on a Soybean Consensus Map and Meta-Analysis of Seed Compositions QTL
The soybean consensus map 4.0 [44] was used as a reference map for the projection of QTLs. Of the 672 QTLs (649 high confidence QTLs), 284 QTLs (263 high confidence QTLs) were able to be projected on the reference map. Though not evenly distributed, projected QTLs were positioned on all 20 chromosomes. With 38 (35 high confidence) projected QTLs, chromosome (Chr) 20 had the largest number of projected QTLs, whereas only three QTLs were projected on Chr 11 (Table S3). Overall, more than half of the QTLs were not projected on the reference map. Failure to project a QTL is caused by a lack of common marker pairs which flank the QTL in the original and soybean reference maps. For example, meta-analysis of amino acid QTLs was initiated with 156 QTLs which were primarily mapped with SNP markers which were largely unique to these studies, thus, only 9 QTLs were able to be projected (Chrs 1, 6, 9 and 20; Table S3).
Using BioMercator v4.2, meta-analyses were performed for seed protein and oil content separately. Meta-analyses of amino acid composition and fatty acid composition were carried out in two groups, respectively. Soybean seed protein and oil contents have a well-established negative correlation [10,45,46] and seed protein versus Cys concentration and seed protein versus Met concentration have also been reported as displaying positive or negative correlations depending on the study [5,11,38,42,47,48]. Therefore, meta-analyses of QTL for these traits known to co-vary were also carried out in predetermined combinations of protein and oil (protein+oil) and protein, Cys and Met (protein+Cys+Met).
In the first step of meta-analysis (Meta-Analysis step 1 of 2) [49], for each chromosome and trait, the best meta-QTL models with K consensus QTLs was selected according to the minimum Akaike Information Criterion (AIC) (Tables S3 and S4). With the model chosen, meta-QTLs were successfully generated from six chromosomes with the most likely positions and CIs calculated in the second step of meta-analysis (Meta-Analysis step 2 of 2) (Tables S3 and S4) [49].
A total of 55 meta-QTLs (43 high confidence meta-QTLs) were identified for all traits and trait combinations ( Table 1). The meta-QTLs represented 284 projected QTLs (263 high confidence projected QTLs). While the majority of meta-QTLs were identified only when the meta-analysis was carried out across traits and included both protein and oil QTLs ( Figure 1 and Table S4), Chr 20 possessed high confidence meta-QTLs for protein, oil, protein+oil and protein+Cys+Met (designated as mP20-#, mPO20-#, and mPCM20-#, respectively) ( Figure 2 and Table S4). Twelve meta-QTLs on Chrs 9, 14, and 15 were identified only when QTLs with LOD scores < 2.0 were included in the analysis (Table 1). Meta-analysis reduced the CI of meta-QTLs relative to the projected QTLs by a wide margin, with the projected QTLs spanning an average of 21.77 cM (30.07 cM for high confidence QTLs) and meta-QTLs spanning an average of only 3.88 cM (4.30 cM for high confidence meta-QTLs).  LG I). Projected QTL are displayed to the left of Chr 20 for each trait or trait combination. A 95% of confidence interval of each meta-QTL is represented as filled colors on the chromosome arm, with four, five, four and four meta-QTLs identified for protein, oil, protein+oil and protein+Cys+Met, respectively. A detailed view of the projected QTL contributing to each meta-QTL cluster #2 of Chr 20 (Table 1) are also shown on the right of the chromosome. In the detailed view, each bar represents a projected QTL and is color-coded for each trait (protein, khaki; oil, red; cysteine, green; and methionine, blue). LG I). Projected QTL are displayed to the left of Chr 20 for each trait or trait combination. A 95% of confidence interval of each meta-QTL is represented as filled colors on the chromosome arm, with four, five, four and four meta-QTLs identified for protein, oil, protein+oil and protein+Cys+Met, respectively. A detailed view of the projected QTL contributing to each meta-QTL cluster #2 of Chr 20 (Table 1) are also shown on the right of the chromosome. In the detailed view, each bar represents a projected QTL and is color-coded for each trait (protein, khaki; oil, red; cysteine, green; and methionine, blue).

Identification of Candidate Genes for Each Meta-QTL
In order to identify positional candidate genes potentially contributing to the meta-QTLs for each trait, the physical and genetic positions of the left and right markers of each meta-QTL's CI were obtained for those markers found on both the consensus map 4.0 [44] and G. max genome assembly, version Glyma.Wm82.a2.v1 (Gmax2.0) (available online: http://soybase.org). For those markers not positioned on either the soybean consensus map 4.0 or Gmax2.0, the next nearest marker located on both the consensus map and Gmax2.0 was selected. The meta-QTLs encompassed a total of 7412 positional candidate genes (5440 positional candidate genes for high confidence meta-QTLs), with each meta-QTL encompassing an average of 135 positional candidate genes (127 positional candidate genes for high confidence meta-QTLs) ( Table 1). Meta-QTL mPO20-2 contained the smallest number of positional candidate genes, with 14 genes within the CI of 1.83 cM (24.07-25.90 cM on Chr 20, Figure S2), whereas there were an immense 590 positional candidate genes encompassed by mPO7-6 with a CI of 17.41 cM (57.99-75.40 cM on Chr 7). While, generally, a small number of positional candidate genes were detected if the CI of the meta-QTL was small, the CIs of mP20-3, mO20-4, mPO20-4 and mPCM20-3 were only 0.61, 0.01, 0.00 and 0.14 cM, respectively; however, 49 positional candidate genes were identified within each of these meta-QTL. This was largely due to the closest marker being located at least 4 cM away on the reference map from the CI boundaries (Table 1). Thus, while the CIs of meta-QTL are generally narrower than the projected QTL from which they are synthesized, the advantage of this narrowed region in identifying positional candidate genes can only be observed when the markers can be easily translated to the genome sequences (Gmax2.0).
For the 14 meta-QTLs having less than 50 positional candidate genes, detailed functional information of candidate genes from each meta-QTL were obtained from Phytozome v. 12.0 (available online: http://www.phytozome.net; Table S5). Interestingly, each meta-QTL encompassed at least one positional candidate gene annotated with potential metabolic functions relating to protein and oil biosynthesis and/or accumulation, such as sucrose biosynthesis, glycolysis, gluconeogenesis, amino acids biosynthesis/degradation, etc. (Table 2 and Figure 2).

Meta-Analysis Aids in the Identification of Robust QTLs and Narrowing of Confidence Intervals
Meta-analysis of QTLs was developed by Goffinet and Gerber [14] in order to assist in identification of consistent and robust QTLs and to improve the precision of their genetic locations. The composition of soybean seed is a well-studied trait with numerous genetic studies having been carried out to identify QTLs for protein and oil contents and amino acid and fatty acid compositions (Table S1). A meta-analysis can aid in the synthesis of these myriad QTLs in order to identify the genetic regions robustly associated with each trait in multiple environments and genetic backgrounds. Previously, two studies have carried out meta-analyses related to QTLs for soybean seed protein and oil contents. The previous meta-analysis of protein content QTLs identified 23 meta-QTLs on 13 chromosomes with CIs of meta-QTLs ranging from 1.52 to 14.31 cM [24]. In the previous meta-analysis of oil content, 20 meta-QTLs were detected on 13 chromosomes and their CIs ranged from 1.3 to 12.35 cM [25]. The meta-analyses carried out in this study integrate the locations of hundreds of QTLs for seed protein and oil contents and amino acid and fatty acid compositions using maximum likelihood estimation with consideration of population size and additional QTL information. We have projected 284 of 672 QTLs for these traits onto a single reference map and integrated these into 55 meta-QTLs for protein, oil, protein+oil, and protein+Cys+Met to provide a genetic framework for seed protein and oil contents (Table 1).
No meta-QTLs were synthesized for fatty acids or amino acids when analyzed as independent traits. This may be due to the limited number of studies as compared to protein and oil as well as the minimal overlap of QTLs detected among those studies. However, genes which contribute to environmentally stable changes in fatty acid composition in multiple genetic backgrounds have been cloned, such as 3-keto-acyl-ACP synthase II gene (KAS II) [50], ω-3 fatty-acid desaturase gene (FAD3) [51], microsomal oleate desaturase gene (FAD2) [52] and ∆ 9 -stearoyl-ACP-desaturase gene (SACPD-C) gene [53]. Thus, it is known that at least a portion of the QTL for fatty acid composition are robust.
In Chr 20, meta-QTLs consistently possessed narrower CIs than the projected QTLs from which they were synthesized. Meta-QTLs for protein, oil, protein+oil, and protein+Cys+Met decreased CI by 13.56 (73.3%), 21.69 (81.1%), 28.63 (89.6%) and 18.67 (88.7%) cM, respectively, as compared to the projected QTL. The translation of genetic distances to the identification of positional candidate genes on a physical map is dependent on the ability to locate the physical positions of flanking genetic markers from the reference map. Accordingly, we found that the decreased confidence intervals often, but not always (e.g., mP20-3, mO20-4, mPO20-4 and mPCM20-3), correlated with a decrease in the number of positional candidate genes encompassed by the meta-QTLs in comparison to the projected QTLs. Meta-QTLs for protein, oil, protein+oil, and protein+Cys+Met decreased the average number of positional candidate genes by 367 (480 to 113), 686 (815 to 129), 526 (702 to 176) and 461 (527 to 66), respectively, as compared to the projected QTLs.
Traditionally, QTL positions are refined through the tedious process of fine-mapping which can include selecting for recombination in targeted regions and evaluating those individuals (e.g., [35]), increasing marker density (e.g., [54]), and/or "mendelizing" the QTL by generating near-isogenic lines [35,55]. In some cases, meta-analysis of QTLs may replace or strengthen these methods. For example, mPO15-2, formed from eight projected QTL is coincident with the confirmed seed protein QTL cqSeed protein-001 (available online: http://soybase.org) [56]. Satt384 (at 19.62 cM on the soybean consensus map) is positioned within the mPO15-2 CI. Satt384 was the key marker in the fine-mapping of cqSeed protein-001 and fine-mapping had previously served to decrease the interval to 535 kb based on Gmax2.0 [35]. Though the positioning of the mPO15-2 flanking markers on the reference genome are inverted relative to the reference map, making a direct interpretation of results difficult, mPO15-2 coincided with the 535 kb interval identified by fine-mapping and even further narrowed this region (Table 1).

Meta-QTLs Can Be Further Defined and Refined Through the Combined Analysis of Correlated Traits
We found that combining QTLs identified from correlated traits in a meta-analysis resulted in the identification of additional meta-QTLs and further narrowed CIs. Most of the meta-QTLs identified in this study were for combined protein and oil contents. However, on Chr 20, meta-QTLs were detected when projected QTLs from protein and oil traits were analyzed separately, in combination, as well as for combined projected QTLs from the correlated traits protein, Cys and Met (Figure 1 and Table 1). Many studies have demonstrated Chr 20 (LG I) has QTLs with a strong effect on protein content and a lesser effect on oil content in soybean seed [46,57,58]. In our study, six meta-QTL regions were identified on Chr 20. Three meta-QTL regions were identified in all four meta-analyses, protein, oil, protein+oil and protein+Cys+Met. These meta-QTL spanned from approximately 13 to 18 cM (mP20-1, mO20-1, mPO20-1 and mPCM20-1), 22 to 27 cM (mP20-2, mO20-2, mPO20-2 and mPCM20-2), and 49 to 50 cM (mP20-3, mO20-4, mPO20-4 and mPCM20-3). The remaining two meta-QTL regions were only found in some subset of the four meta-analyses (approximately 29 to 41 cM for mO20-3 and mPO20-3, approximately 54 to 67 cM for mP20-4 and mO20-5, and 75.05 to 78.48 cM for mPCM20-4). Thus, there were meta-QTL regions that were identified both only for QTL from a single trait and only for QTL from combined, correlated traits.
By combining the negatively correlated traits of seed protein and oil contents, CIs for Chr 20 meta-QTLs were reduced by an average of 3.32 cM in comparison to meta-analysis of each trait separately (Figure 1 and Figure S2). This narrowing of CIs resulted in a decrease in the number of positional candidate genes encompassed by each meta-QTL. This is exemplified by mP20-2, mO20-2, and mPO20-2 where the number of positional candidate genes was reduced from 26 and 63, to only 14, respectively (Table 1). Thus, our study suggested that meta-analysis of QTLs from not only single traits but also of correlated traits can be used to identify meta-QTLs with potential pleiotropic effects and result in narrowed CIs.3.3 The incorporation of QTLs with low LODs score into meta-analysis does not hinder analysis and provides confirmation.

Putative Functional Candidate Genes Were Identified from the Positional Candidates Encompassed by Meta-QTLs
Of the 55 meta-QTLs identified, 14 encompassed a limited number (<50) of positional candidate genes. For these 14 meta-QTLs, representing nine distinct regions on Chrs 3, 5, 6, 15 and 20, the positional candidate gene lists were mined for functional candidates based on their gene annotation ( Table 2). While functional candidates may include, for example, transcription factors and unannotated genes, here we have focused on metabolic activities that may influence seed protein and oil biosynthesis and accumulation. This lead to the identification of 69 functional candidate genes for these nine meta-QTL regions.
Meta-QTL mPO3-1 possessed 35 positional candidate genes, among which included putative phenylacetaldoxime monooxygenases and anthocyanidin 3-O-glucosyltransferases involved in anthocyanin biosynthesis [76] (Table 2 and Figure 2). Phenylacetaldoxime monooxygenase has been shown to be involved in the production of volatile organic compounds, especially phenylpropanoids and benzenoids [77]. Both enzymes have been shown to participate in pathways starting from phenylalanine, generated from Calvin cycle via shikimic acid pathway ( Figure 2). Thus, if metabolic flux is toward production of anthocyanin and other secondary metabolites, pull from the phenylalanine pool may lead to a concomitant decrease in oil accumulation.
Among the 49 positional candidate genes encompassed by mPO5-1, 11 candidate genes had predicted functions related to protein and oil accumulation and/or biosynthesis ( Figure 2 and Table 2). Of potential functional interest were the candidate genes annotated as transporters of either fatty acids (Glyma.05g028500) or nitrate. Fatty acids are transported across most membrane systems for modification and lipid assembly following synthesis in plastids [78]. Glyma.05g028500 may contribute to the transportation of chloroplast produced fatty acids into the peroxisome, where they can be incorporated into β-oxidation. Nitrate uptake from the soil is transcriptionally controlled by both nitrate and photosynthate availability. De Jong et al. [79] showed nitrate uptake and assimilation was closely coordinated with glucose to supply amino acids and protein for plant growth. Thus, the five candidate genes annotated as nitrate transporters may play a role in production of amino acids and total protein.
For mPO5-5, six of 27 positional candidates had predicted functions contributing to either the carbohydrate metabolic process or fatty acid activation and biosynthesis (Table 2 and Figure 2). Proteins coded by Glyma.05g216400, Glyma.05g216700 and Glyma.05g217100 are putatively involved in increasing the hexose phosphate pool, which is the backbone of the main metabolic pathways and exchange between cellular compartments [80][81][82]. The remaining three candidate genes encoded proteins for fatty acid activation and biosynthesis via triacylglycerol (TAG) degradation. Oil crops, such as soybean, oilseed rape and sunflower, store 20-60% of dry weight oil in the form of TAG [83]. These three candidate genes involved in TAG degradation may contribute to changes of oil accumulation in the seed.
Of the 23 positional candidate genes within mPO6-3, five putatively encode proteins involved in major metabolic processes, such as TAG degradation (Glyma.06g087100), gluconeogenesis (Glyma.06g087800), amino acid transportation (Glyma.06g088200 and Glyma.06g088300) and glycolysis (Glyma.06g088600) (Table 2 and Figure 2). As noted for several candidate genes in mPO5-5 (see as above), Glyma.06g087100 may control oil content in seed by TAG degradation. Glyma.06g087800 putatively encoded a malate dehydrogenase. Malate is oxidized and the resulting oxaloacetate is converted into hexoses by gluconeogenesis, so this candidate gene may contribute to increasing the hexose pool for production of amino acids, polysaccharides, other metabolic intermediates, and energy [84].
Meta-QTL mPO15-1 encompassed 49 positional candidate genes. Seven of these candidate genes putatively encode contributing factors to the biosynthesis of jasmonic acid, threonine, cell wall and alkaloid as well as degradation of protein and sucrose (Table 2 and Figure 2). Glyma.15g028200 annotated as a peroxidase is involved in active oxygen species-scavenging systems [85] and methylglyoxal degradation by Glyma.15g028900 may serve to increase the level of pyruvate [86].
Among the 12 positional candidate genes encompassed by mPO15-5, notable functional candidates are Glyma.15g142500 and Glyma.15g143100, which are putatively involved in cell wall degradation and acetyl-CoA biosynthesis, respectively (Table 2 and Figure 2). Enzymes in cell wall degradation of cell wall including glucan endo-1,3-β-D-glucosidase (Glyma.15g142500) involved in the complete hydrolysis of polysaccharides to glucose can lead to protein and oil accumulation [89]. Among the 26 positional candidate genes within mP20-2 and mPCM20-2 were eight functional candidates ( Figure 2 and Table 2); however, only three functional candidate genes were within the further narrowed region defined by mPO20-2. Encompassed by these three meta-QTL, is Glyma.20g024800 which putatively encodes a pyruvate kinase, the key enzyme of glycolysis. Pyruvate kinase leads to the tricarboxylic acid (TCA) cycle and provides pyruvate for the fatty acid biosynthesis in seeds [90]. In addition there are two functional candidate genes also encompassed by these three meta-QTL. Glyma.20g025200 has a putative function in ascorbic acid biosynthesis and degradation, potentially influencing the hexose phosphate pool [91], whereas Glyma.20g025400 has a putative function in asparagine biosynthesis, potentially influencing flux through the TCA cycle as well as the pool of Aspartate (Asp), a precursor to the Met and other Asp family amino acids [92].

Collection of Mapping and QTL Information for Soybean Protein, Oil, Amino Acids and Fatty Acids
Literature including scientific journals and available theses and dissertations were mined for QTL mapping studies conducted on soybean seed contents and compositions of protein and oil. Recent literature published up to and including 2016 were mined for relevant QTL mapping studies by search of Google Scholar using key words. In addition, mapping studies identified from SoyBase (available online: http://soybase.org) were also included in the study. Parents, size, crossing type and generation of population and genetic map information of each population were collected. For each QTL, name, trait, experiment location, year of experiment, names of chromosome and linkage group, LOD score, R 2 value (proportion of phenotypic variance explained), most likely position of QTL (in cM) and confidence interval (CI, in cM) were collected. If the CI of a QTL was not provided, the formula proposed by [93] was used for calculation of a 95% of CI, where N is the size of population. CI of both backcross and F 2 populations were calculated using equation (1), equation (2) was applied to QTL studies carried out using recombinant inbred line populations. The formula derived from [94] was used for the estimation of LOD, which is the explanatory power of a QTL, if the LOD value was not reported, All information was arranged by the name of the originating map. QTL information were discarded if R 2 value is unavailable from the original study.

Meta-QTL Analysis
Using the soybean consensus map [44] as a reference map, QTLs for protein content, amino acid composition, oil content and fatty acid composition were projected on the reference map by BioMercator V4. 2 [95]. For QTL projection on the soybean reference map, two default parameters were considered; (i) 0.25 was applied as the minimum value of the ratio of the flanking marker interval distance and (ii) 0.5 was applied as the minimum P value threshold for testing homogeneity of the flanking marker interval distances between original and consensus maps. If any QTL was not projected on the newly built map, it was excluded.
Meta-analyses were performed to validate and refine confidence intervals of QTLs from those QTLs projected onto the reference map. This was done using BioMercator V4.2, including algorithms from MetaQTL [49]. In Meta-analysis step 1/2 [49], the projected QTLs were clustered by each chromosome using default parameters and traits for protein, oil, amino acid, fatty acid and combined, correlated traits. A total of five criteria, which are Akaike information criterion (AIC), corrected Akaike information criterion (AICc and AIC3), Bayesian information criterion (BIC) and approximate weight of evidence (AWE), were used for examination of potential meta-QTL models with the output from ClustQTL suggesting the best model for the next step. In Meta-analysis step 2/2, Meta-QTLs were generated in accordance with the best model [49].

Identification of Candidate Genes
Flanking/closest left or right markers of the CI of each meta-QTL were selected, only if these markers were presented on both the reference map and G. max genome assembly version Glyma.Wm82.a2.v1 (available online: http://soybase.org). Detailed information of each candidate gene was obtained from SoyBase (available online: http://soybase.org) and Phytozome v12.0 (available online: http://www.phytozome.net) for those meta-QTLs encompassing 50 or fewer positional candidate genes.

Conclusions
Comparisons of QTLs across multiple populations can be difficult due in part to the limited number of common markers shared across the populations. Our meta-analyses were able to integrate and project over 250 QTLs for seed composition onto a reference map, the soybean consensus map 4.0 [44]. Meta-analysis of QTLs helps to extricate robust loci which have been identified in multiple studies and defines an, often narrower, CI for those robust loci. In our study, the CIs for meta-QTLs were further narrowed through the incorporation of QTLs from multiple, correlated traits. In combination with a physical map (Gmax2.0), the flanking markers defining the meta-QTLs were used to identify a limited list of positional candidate genes from which functional candidates for seed composition traits could be selected based on putative functional annotations of the candidate genes. These genes provide potential targets for marker-assisted selection, fine mapping and positional cloning.