The Soybean High Density ‘Forrest’ by ‘Williams 82’ SNP-Based Genetic Linkage Map Identifies QTL and Candidate Genes for Seed Isoflavone Content

Isoflavones are secondary metabolites that are abundant in soybean and other legume seeds providing health and nutrition benefits for both humans and animals. The objectives of this study were to construct a single nucleotide polymorphism (SNP)-based genetic linkage map using the ‘Forrest’ by ‘Williams 82’ (F×W82) recombinant inbred line (RIL) population (n = 306); map quantitative trait loci (QTL) for seed daidzein, genistein, glycitein, and total isoflavone contents in two environments over two years (NC-2018 and IL-2020); identify candidate genes for seed isoflavone. The FXW82 SNP-based map was composed of 2075 SNPs and covered 4029.9 cM. A total of 27 QTL that control various seed isoflavone traits have been identified and mapped on chromosomes (Chrs.) 2, 4, 5, 6, 10, 12, 15, 19, and 20 in both NC-2018 (13 QTL) and IL-2020 (14 QTL). The six QTL regions on Chrs. 2, 4, 5, 12, 15, and 19 are novel regions while the other 21 QTL have been identified by other studies using different biparental mapping populations or genome-wide association studies (GWAS). A total of 130 candidate genes involved in isoflavone biosynthetic pathways have been identified on all 20 Chrs. And among them 16 have been identified and located within or close to the QTL identified in this study. Moreover, transcripts from four genes (Glyma.10G058200, Glyma.06G143000, Glyma.06G137100, and Glyma.06G137300) were highly abundant in Forrest and Williams 82 seeds. The identified QTL and four candidate genes will be useful in breeding programs to develop soybean cultivars with high beneficial isoflavone contents.


Introduction
Soybean seeds are rich in secondary metabolites beneficial for human and animal consumption including tocopherols, phenolic compounds, saponins, and isoflavones such as genistein, daidzein, and glycitein that showed beneficial health and nutrition effects in animals and humans [1][2][3]. It is well established that isoflavones reduce menopausal symptoms, low density lipoprotein (LDL) cholesterol levels, breast and prostate cancers risks, improve the immune system [4][5][6][7][8][9][10][11], and play an important role in nitrogen fixation and defense against pathogens [12].
The objectives of this study were to construct a SNP-based genetic linkage map using the F×W82 RIL population (n = 306); map quantitative trait loci (QTL) for seed daidzein, genistein, glycitein, and total isoflavone contents in two environments over two years; identify candidate genes involved in soybean seed isoflavone biosynthesis.

Isoflavone Contents Frequency Distribution, Heritability, and Correlation
The seed isoflavone contents were normally distributed in the FxW82 RIL population based on Shapiro-Wilk's method for normality test, even though the positive or negative skewness and kurtosis value (>3) were observed in the RIL population (Table 2; Figure 1). The individual component of isoflavone also displayed small ranges of phenotypic variations in the seeds obtained from two geographically diverse field trials (Table 2, Figure 2). Daidzein 2018 in Spring Lake, NC had the highest coefficients of variation (CV) value (19.37%); however, the CV of this trait in Carbondale, IL (2020) was only 12.59% suggesting that phenotypic variability among isoflavone contents was impacted by different environmental conditions. Table 2. Seed isoflavone means, ranges, CVs, skewness, and kurtosis in the FxW82 RIL population evaluated in Spring Lake, NC (2018) and Carbondale, IL (2020). Mean and range values are expressed in µg/g of seed weight.  [35] reported that the total isoflavones ranged from 745 to 5253.98 µg/g, with highest mean of 2689.27 µg/g observed in some regions and up to 2518.91 and 1942.78 µg/g in others due to climatic conditions. Similar results have been reported by other studies [25][26][27][36][37][38]. Our results showed over 1000 µg/g and in some cases over 1100 µg/g. Therefore, the total concentrations of isoflavones are in the expected range of soybean seed. In addition, there is no premium to be given to growers for soybean seed isoflavone content and no docking is done at the grain elevator for seed isoflavone. Isoflavone concentrations vary depending on the year and environmental growing conditions. Although isoflavones are genetically controlled, environmental conditions including temperature, drought, absence or presence of diseases each year, many other biotic and abiotic factors can significantly affect the contents (by increasing or decreasing) and profile of isoflavone.
The broad sense heritability of percentage dry weight for daidzein, glycitein, and genistein across two different environments over two years appeared to be quite different. Glycitein had the highest heritability (72.4%) and the values for both daidzein and genistein were 42.8% and 42.5%, respectively, which displayed a similar fashion. The lower heritability of daidzein and genistein contents suggested that some portion of phenotypic variation was still not detected by the mapped QTL due to the complexity of these traits. The genotype-environment interactions still played a significant role in the molecular formation of daidzein and genistein molecules in soybean seeds based on our two-way ANOVA analysis because the σ GE2 is relatively high compared to that of glycitein (data not shown). It will certainly impact future breeding strategies for trait improvement based on the data we presented on these traits.
We used type I sum of squares (ANOVA (model)) function in R program to obtain the Sum Sq and Mean Sq and calculated σ G 2 and σ GE 2 for each trait (Table 3). However, σ e 2 was 0 due to limited replicates. In this study, we only had three technical replicates due to cost effect of this student-centered project, but these replicates could only be considered as one biological replicate and hence, F value and probability could not be generated ( Table 3). The FxW82 RIL population derived from two parental cultivars with different maturity groups (MGs). Forrest belongs to MG5-6 and Williams 82 to MG2-3 suggesting that the locations may play an important role on major agronomic traits including seed isoflavone. Based on our data ( Figure 2), glycitein showed less correlation with daidzein and genistein which may indicate that its production may be less impacted by environment. Furthermore, Fayetteville, NC is a subtropic favorable weather for MG6-7 soybeans while Carbondale, IL is the favorable weather for MG 4-5. Therefore, further studies of the seed isoflavone in the FxW82 RIL population in different environments would be beneficial. The correlogram demonstrates a novel correlation among these assessed traits ( Figure 2). Based on the unassorted data (all lines were included), each of the isoflavone components was positively correlated with the other sister isoflavones (p < 0.001) from the same geographical location but negatively correlated with the isoflavones from the other location inferring that the production of these isoflavones has been strongly impacted both by genotype and environmental conditions. The assorted data (lines tested in both locations) showed similar positivity, but the level of negative correlation was low ( Figure 2). To the best of our knowledge, this observation has not been described in other studies.

Seed Isoflavone Contents QTL
Both interval mapping (IM) and composite interval mapping (CIM) methods of Win-QTL Cartographer 2.5 [39] were used to identify QTL for seed daidzein, genistein, glycitein, and total isoflavone contents in the present RIL population. A total of 27 QTL that control seed isoflavone contents have been identified in this population in both NC-2018 (13 QTL) and IL-2020 (14 QTL) (Table 4, Figure 3 and Figure S1).          Figure S2.
In Carbondale, IL (IL-2020), one QTL that controls seed daidzein content (qDAID03) has been identified and mapped on Chr. 2 and one QTL that controls seed genistein content (qGEN04) has been identified and mapped on Chr. 4 (Table 4, Figure 3 and Figure S1). One QTL that controls seed daidzein content (qDAID04) and two QTL that control seed genistein content (qGEN01 and qGEN05) have been identified and mapped on Chr. 10. Two QTL that control each of seed daidzein (qDAID01 and qDAID05) and seed genistein contents (qGEN02 and qGEN06) have been identified and mapped on Chr. 12 (Table 4, Figure 3 and Figure S1). One QTL that controls seed glycitein (qGLY01) has been identified and mapped on Chr. 15 (Table 4, Figure 3 and Figure S1). Two QTL that control each of seed daidzein (qDAID02 and qDAID06) and genistein contents (qGEN03 and qGEN07) have been identified and mapped on Chr. 20 (Table 4, Figure 3 and Figure S1).
In Spring Lake, NC (NC-2020), one QTL that controls each of seed daidzein (qDAID01), genistein (qGEN03), and glycitein contents (qGLY02) have been identified and mapped on Chr. 5 (Table 4, Figure 3 and Figure S1). Two QTL that control seed genistein content (qGEN01 and qGEN04) and one QTL that controls seed daidzein content (qDAID02) have been identified and mapped on Chr. 6 (Table 4, Figure 3 and Figure S1). Two QTL that control each of seed genistein (qGEN02 and qGEN05) and glycitein contents (qGLY01 and qGLY03) have been identified and mapped on Chr. 12 (Table 4, Figure 3 and Figure S1). One QTL that controls each of seed daidzein (qDAID03) and glycitein contents (qGLY04) have been identified and mapped on Chr. 19 (Table 4, Figure 3 and Figure S1). One QTL that controls seed genistein content (qGEN06) has been identified and mapped on Chr. 20 (Table 4, Figure 3 and Figure S1). No QTL that controls total seed isoflavone contents has been identified in both years and locations.
No previous studies identified QTL that control seed isoflavone contents in the QTL region identified on Chr. 2 (qDAID03-(IL-2020), 23-25 cM), indicating that this is a novel QTL region; however, other studies identified QTL that control seed calcium content, plant height, and few other traits [40,41]. Likewise, no other studies identified QTL that control seed isoflavone contents in the QTL region identified on Chr. 4 (qGEN04-(IL-2020), 189.3-191.3 cM) which indicates that it is also a novel QTL region. The length of Chr. 4 in the soybean consensus map is only 136 cM [29,31]. Additionally, no other studies identified QTL that control seed isoflavone contents in the QTL region identified on Chr.  [29,31]. Interestingly, within the same QTL region that controls seed genistein and daidzein contents on Chr. 6 (q-GEN01-(NC-2018), qGEN04-(NC-2018), and qDAID02-(NC-2018), other studies identified QTL that control seed genistein, daidzein, glycitein, and total isoflavone contents (see a summary in [30]) which is coherent with our results making it an important genomic region to further investigate for candidate genes. Other studies identified QTL that control seed protein, oil, γ-tocopherol, and amino acids contents, and few other traits [29,31]. Interestingly, within the same QTL region that controls seed genistein and daidzein contents on Chr. 10 (q-GEN01-(IL-2020), qGEN05-(IL-2020), and qDAID04-(IL-2020), 128.8-132.9 cM), another study identified QTL that control seed isoflavone content [41,42] which is consistent with our data making it an important genomic region for gene discovery. In fact, two candidate genes have been previously identified in this region [42,43]. Two QTL regions have been identified on Chr. 12. The first region containing QTL that control seed genistein and daidzein contents (qGEN02-(IL-2020), qGEN06-(IL-2020), and qDAID05-(IL-2020), 51.2-71.9 cM). Interestingly, other studies identified QTL that control seed daidzein, genistein, glycitein, and total isoflavone contents in the same QTL region (see a summary in [30]) which makes it an important genomic region for discovering novel candidate genes. The second region contained QTL that control seed genistein, and glycitein contents (qGEN02-(NC-2018), qGEN05-(NC-2018), qGLY01-(NC-2018), and qGLY03-(NC-2018), 169.3-181.3 cM). No other studies identified QTL that control seed isoflavone contents in this second QTL region which indicates that it is also a novel QTL region. The length of Chr. 12 in the soybean consensus map is only 125 cM and the second QTL region identified here falls outside of its current limit [29,31]. No previous studies identified QTL that control seed isoflavone contents in the QTL region identified on Chr. 15 (qGLY01-(IL-2020), 212.3-218.8 cM) which indicates that it is a novel QTL region. The length of Chr. 15 in the soybean consensus map is only 85 cM and the QTL region identified here falls outside of its current limit [29,31]. Two QTL regions have been identified on Chr. 19. The first region contains QTL that control seed glycitein content (qGLY04-(NC-2018), 34.3-36.3 cM). Interestingly, other studies identified QTL that control seed genistein, daidzein, and isoflavone content within the same QTL region (see a summary in [30]). Previous studies identified also QTL for seed protein content (see a summary in [44]). The second region contained QTL that control (qDAID03-(NC-2018), 108.5-110.5 cM). No other studies identified QTL that control seed isoflavone contents in this QTL region, making it a novel QTL region. Two QTL regions have been identified on Chr. 20. Within the first region containing QTL that control seed glycitein content (qGEN05-(NC-2018), 0-2 cM), other studies identified QTL that control seed daidzein (qD20), genistein (qG20), malonyldaidzein (qMD20), malonylgenistein (qMG20), and total isoflavone content (qTIF20) [41,44] which makes it an important region to investigate further for candidate genes. In addition, other studies identified QTL for seed calcium [30,44] and sucrose contents within this QTL region as well (see a summary in [44]). Within the second region containing QTL that control seed daidzein and genistein contents (qDAID02-(IL-2020), qDAID06-(IL-2020), qGEN03-(IL-2020), and qGEN07-(IL-2020), 64.3-80.5 cM), other studies identified QTL that control seed genistein content (qGEN20, [17]) and seed daidzein and glycitein contents (qGC|proI_1 and qDZ|proI_2, [39,41] which makes it another important region to investigate further for candidate genes. In addition, other studies identified QTL that control seed phytate, stearic acid, calcium, alpha-tocopherol, and few amino acids [44].

Seed Isoflavone Candidate Genes
A total of 130 candidate genes involved in soybean isoflavone biosynthetic pathway have been identified in all 20 soybean Chrs. (Table S1); however, 16 candidate genes have been identified within or close to the seed isoflavone QTL identified in this study on Chrs. 2, 6, 10, 12, 15, 19, and 20 ( Figure 4, Table 5).

Expression Analysis
To gain insight into the role of isoflavone genes in soybean seeds, RNA-Seq analysis was conducted to check the expression levels of the 16 candidate genes that are located within or near the isoflavone QTL identified in FxW82 RIL population. Expression analysis of these genes showed that four genes, Glyma.10G058200, Glyma.06G143000, Glyma.06G137100, and Glyma.06G137300, are highly expressed in seeds of both Forrest and Williams 82 cultivars. Whereas, Glyma.19G030800 is highly expressed in Williams 82 and have a low expression in Forrest cv.; the rest of the 16 genes showed lower expressions; whereas Glyma.02G067900 and Glyma.15G156300 were not expressed neither in Forrest nor in Williams 82 cultivars ( Figure 5). Surprisingly, Glyma.10G058200 expression in Forrest cv. is higher than its expression in Williams 82. This could explain the presence of RILs from the FxWI cross that showed higher daidzein, glycitein or genistein content than the parent Williams 82 (parent with the high isoflavones content), these lines inherited most likely the beneficial alleles from both parents Forrest and Williams 82.

Conclusions
In conclusion, we constructed the FxW82 dense SNP-based genetic linkage map (2075 SNPs and 4029.9 cM covered) and identified 27 QTL that control soybean seed isoflavone contents and 16 candidate genes involved in soybean isoflavone biosynthetic pathways among which four candidate genes are highly expressed in seeds of both Forrest and Williams 82, in addition to Glyma.19G030800 that has a higher expression profile in Williams 82 compared to its expression in Forrest cv. (Figure 5).
A comparison of the Forrest and Williams 82 sequences of these four genes has shown that two of these genes have SNPs between Forrest and Williams 82 sequences, Glyma.10G058200 and Glyma.06G143000. Glyma.10G058200 has ten SNPs, one SNP is upstream 5 UTR, four SNPs are located at the intron, two SNPs are at the exon 1, one of them caused a missense mutation (A127G) and the other one caused a silent mutation (A32A). The last three are in the 3 UTR downstream region. For Glyma.06G143000, there is only one SNP located in the 5'UTR upstream region ( Figure 6). These SNPs could potentially play a role in the difference of isoflavones content between Forrest and Williams 82 cultivars. Moreover, Glyma.10G058200 and Glyma.06G143000 are highly expressed in the seed tissue of both Forrest and Williams 82 ( Figure 5). Glyma.10G058200 is associated with qGEN01-(IL-2020) QTL, qDAID04-(IL-2020) QTL and qGEN05-(IL-2020) QTL. Additionally, Glyma.06G143000 is associated with qGEN01-(NC-2018) QTL. The two genes could be useful for breeding for increased isoflavones content in soybeans.

Plant Material and Growth Conditions
In this study, we used 'Forrest' × 'Williams 82' RIL population (n = 306). The cultivar 'Forrest' was derived from the cross of 'Dyer' and 'Bragg' developed by USDA [46]. The cultivar 'Williams 82' was derived from the cross of 'Williams' and 'Kingwa' [47]. The genomes of soybean cultivars including Forrest and Williams 82 genomes are duplicated polyploid genomes with highly conserved gene-rich regions [48]. Originally, the 'Forrest' × 'Williams 82' RIL population was developed with more than 1000 RILs [49]. The genetic map used in this study was based on 306 RILs and 2075 SNP markers; however, QTL data analysis in Spring Lake, NC-2018 was based on 190 RILs.
The RIL population was evaluated in a farm in Spring Lake, NC (35.17 • N, 78.97 • W) in 2018 and in a farm in Carbondale, IL (37 • N, 89 • W) in 2020. Seeds of parents (Forrest and Williams 82) were sown directly in the field in a randomized complete block design (RCB) and 75 cm row spacing between seeds with three replicates. The plants were watered by drip irrigation and kept in the field until maturity. No pesticide, herbicide, or fertilizer were applied. In September 2018, hurricane Florence hit NC and its winds of 90+ mph damaged the fence in the farm in Spring Lake, NC and the deer damaged about 119 RILs; therefore, QTL data analysis for this location involved 187 RILs (n = 187). The plants grown in Carbondale, IL (n = 306) were not damaged.
In Spring Lake, NC (2018) during the growing season (May-Sept.), the temperatures ranged from 7.2 to 35 • C, it was partly (40%) to mostly cloudy (80%), wind speeds ranging from 55 to 90+ mph (hurricane Florence), and humidity comfort level ranged from comfortable to miserable [50]. The soil type in this location is mainly sandy (NC Sandhills). In Carbondale, IL (2020), the temperatures ranged from 7.2 to 29.4 • C, it was mostly clear (25%) to mostly cloudy (80%), wind speeds ranging from 30 to 38 mph, and humidity comfort level ranged from comfortable to miserable (weatherspark.com). The field was treated first using Firestorm (contains Paraquat dichloride) to control annual grass and broad-leaved weeds. As pre-emergent herbicide, Dual II Magnum Herbicide with longlasting control of most annual grasses and small-seeded broadleaf weeds was used to eliminate early-season weed competition. As post-emergent herbicide, Round Up Pro Concentrate (50.2% Glyphosate) was used/sprayed between the rows to control emerging weed. Weed grown inside the plastic mulch very close to the soybeans were removed manually. The soil type in this location is mainly silty clay loam (Southern IL).

Isoflavone Quantification
Mature seeds of parents Forrest and Williams 82, and the 190 RILs were analyzed for the concentrations aglycones daidzein, genistein, and glycitein. Approximately 25 g of mature seeds from each plot were ground using a Laboratory Mill 3600 (Perten, Springfield, IL, USA). Concentrations of daidzein, genistein, and glycitein were analyzed using a nearinfrared reflectance (NIR) diode array feed analyzer (Perten, Spring Field, IL, USA). The calibration equation has been updated every 6 months to 1 year and developed using the Thermo Galactic Grams PLS IQ software developed by Perten Company (Perten, Springfield, IL, USA). Thermo Galactic Grams PLS IQ from Perten (Perten) was used to develop the calibration equations, which was initially developed by the University of Minnesota. Descriptions of quantifying daidzein, genistein, glycitein and total isoflavone contents was reported by others (Akond et al., 2015 [22]; Bellaloui et al., 2012 [51]; Wang et al., 2019 [38]). The calibration equation development and updating for isoflavones was based on standard laboratory analytical methods (AOAC 2002) using High Performance Liquid Chromatography (HPLC) and use of adequate number of samples, providing sufficiently accurate estimations of isoflavones concentrations. The produced calibration equation was characterized by high correlation, indicating the accuracy of the method. The concentrations were calculated on a seed dry matter basis.

DNA Isolation, SNP Genotyping, and Genetic Map Construction
Genomic DNA of the RIL population and the parents were extracted using a standard cetyltrimethyl ammonium bromide (CTAB) method with minor modifications as previously described [52]. DNA concentration was quantified with a spectrophotometer (NanoDrop Technologies Inc., Centreville, DE, USA) and then normalized at 50 ng/µL for genotyping. SNP genotyping was performed in the Soybean Genomics and Improvement Laboratory, USDA-ARS, Beltsville, MD, USA, using the Illumina Infinium SoySNP6K BeadChips (Illumina, Inc. San Diego, CA, USA) as previously described [53]. Subsequently, SNP alleles were called using GenomeStudio Genotyping Module 2.0 (Illumina, Inc. San Diego, CA, USA). JoinMap 4.0 [54] was used to construct the genetic linkage map with a LOD score threshold of 3.0 and a maximum genetic distance of 50 cM to group markers. The linkage groups were assigned to corresponding soybean chromosomes as described in Soy-Base [29,31].

Isoflavone QTL Detection and Statistical Analysis
The broad sense (mean based) heritability analysis from two-way ANOVA was conducted using the following equation: h 2 = σ G 2 /[σ G 2 + (σ GE 2 /e) + (σ e 2 /re)] where σ G 2 (variance of genetic factor), σ GE 2 (variance of genotype-environment interactions), and σ e 2 (variance of random effect) were calculated with e (number of environment) and r (number of replicates) normalization [55]. R [56] was employed in the statistical analysis including agronomic traits, histogram of trait distribution, two-way ANOVA, and broad sense heritability using its native packages. The significant level of the assessed traits was showed using R package car (type II Wald chi-square tests) [56].

Isoflavone Candidate Genes Identification
The Glyma numbers of the isoflavone genes were obtained by searching the available data at the SoyBase [29,31] and Phytozome database [45]. The name of the isoflavone pathway enzymes ( Figure 5) were used as a query in a search of the Glycine max reference genome, version Williams 82. The obtained isoflavone genes were mapped to the identified isoflavone QTL.

Expression Analysis
The expression analysis of the genes that are located within or near the isoflavone QTL was conducted using the publicly available soybean expression database from Phytozome database [45] to infer expression profiles of isoflavone genes in the soybean reference genome Williams 82. Gene expression was estimated in FPKM (Fragments Per Kilobase of transcript per Million mapped reads).
For the Forrest cv., RNA-seq library was prepared by using four plant soybean tissues including seed, leaf, root, flower and pods as shown earlier [58]. From 100 mg of frozen grounded samples, total RNA was extracted using RNeasy QIAGEN Kit (Qiagen, Hilden, Germany). The DNase I (Invitrogen, Carlsbad, CA, USA) was used to treat the total RNA. Using Illumina NovaSeq 6000, RNA-seq libraries preparation and sequencing were performed at Novogene INC. Multiplexing and sequencing of the four libraries were done in two different lanes generating 20 million raw pair end reads per sample (150 bp). Quality of sequenced reads was assessed using fastqc, version 0.11.9. [59]. The low-quality reads and adapters were removed with trimmomatic, version V0.39, the remaining highquality reads were mapped to the soybean reference genome Wm82.a2.v1 using STAR, version v2.7.9 [60,61]. Uniquely mapped reads were counted using Python package HTseq v0.13.5. [62]. Read count normalization and differential gene expression analysis were conducted using the Deseq2 package v1.30.1 [63] integrated in the OmicsBox platform from BioBam (Valencia, Spain) [58,64].