Identification of QTNs and Their Candidate Genes for 100-Seed Weight in Soybean (Glycine max L.) Using Multi-Locus Genome-Wide Association Studies

100-seed weight (100-SW) in soybeans is a yield component trait and controlled by multiple genes with different effects, but limited information is available for its quantitative trait nucleotides (QTNs) and candidate genes. To better understand the genetic architecture underlying the trait and improve the precision of marker-assisted selection, a total of 43,834 single nucleotide polymorphisms (SNPs) in 250 soybean accessions were used to identify significant QTNs for 100-SW in four environments and their BLUP values using six multi-locus and one single-locus genome-wide association study methods. As a result, a total of 218 significant QTNs were detected using multi-locus methods, whereas eight QTNs were identified by a single-locus method. Among 43 QTNs or QTN clusters identified repeatedly across various environments and/or approaches, all of them exhibited significant trait differences between their corresponding alleles, 33 were found in the genomic region of previously reported QTLs, 10 were identified as new QTNs, and three (qHSW-4-1, qcHSW-7-3, and qcHSW-10-4) were detected in all the four environments. The number of seed weight (SW) increasing alleles for each accession ranged from 8 (18.6%) to 36 (83.72%), and three accessions (Yixingwuhuangdou, Nannong 95C-5, and Yafanzaodou) had more than 35 SW increasing alleles. Among 36 homologous seed-weight genes in Arabidopsis underlying the above 43 stable QTNs, more importantly, Glyma05g34120, GmCRY1, and GmCPK11 had known seed-size/weight-related genes in soybean, and Glyma07g07850, Glyma10g03440, and Glyma10g36070 were candidate genes identified in this study. These results provide useful information for genetic foundation, marker-assisted selection, genomic prediction, and functional genomics of 100-SW.


Introduction
Soybean (Glycine max L. Merr.), which provides 69% dietary protein and 30% oil [1], is economically imperative food and oilseed crop worldwide. The 100 seed weight (100-SW) is an essential trait in soybean yield component, affected by seed size and shape, and positively correlates with seed yield [2]. There are numerous soybean food items for various seed sizes, for (Glysoja.04g010563) was identified to be associated with SW and highly expressed during seed developmental stages [30].
To date, several studies reported QTLs/QTNs regarding soybean SW using linkage and association mapping, but the related genes in soybean are relatively limited. The possible reason for this, firstly, a high degree of linkage disequilibrium (LD) in the soybean genome makes it difficult to detect the QTNs and genes precisely. Secondly, the number of molecular markers used in soybean GWAS is relatively small with low density, which reduces the efficiency of GWAS. Thirdly, single-locus GWAS (SL-GWAS) models were used in soybean for yield traits [45][46][47], and these models are single-locus genome-wide scanning and need multiple tests correction (e.g., Bonferroni correction) that removes many significant small-effect QTNs [23]. To overcome these limitations, Zhang's group developed a series of multi-locus GWAS (ML-GWAS) methods, such as mrMLM [22], and these ML-GWAS methods were used to dissect the genetic foundations of complex traits in different crops [23,[48][49][50]. Currently, most of the studies have used SL-GWAS methods for the detection of 100-SW QTNs. However, almost no ML-GWAS articles have been found to detect QTNs for 100-SW in soybean. Therefore, more efficient studies are required to dissect the genetic basis of 100-SW, and exploring the QTNs/candidate genes associated with SW will be paramount for the genetic improvement and production of this crop.
Therefore, the objectives of this study were: (a) to dissect the genetic basis for 100-SW using the ML-GWAS methods, and compare the QTN results with those in all the previous studies; (b) to identify the seed weight (SW) increasing alleles of these QTNs for MAS, and (c) to find the potential candidate genes regulating 100-SW in the region of stable QTNs. The findings in this study will provide reliable information for MAS in soybean breeding and functional gene validation/cloning.

Plant Materials
A total of 250 soybean accessions were selected from different geographic regions of China. These soybean accessions came from 23 provinces and were disseminated in six eco-regions of China [25], and were obtained from the National Center for Soybean Improvement and Linyi Academy of Agricultural Sciences with 139 landraces and 111 cultivars.
All the accessions were planted at the Jiangpu Experimental Station of Nanjing Agricultural University (from June to October) and Experimental Station of Huazhong Agricultural University (from May to October) in 2014 (denoted as E1 and E3) and 2015 (E2 and E4). Plants were grown in 150 cm wide and 200 cm long plots according to the randomized complete block design with three replicates. The flowering time was started after six to eight weeks of emergence. The trait phenotypes were measured from five plants in the middle row of each plot, and 100-SW for each accession was averaged based on three replicates.

Statistical Analysis and Heritability Estimation
The best linear unbiased prediction (BLUP) of 100-SW for each accession was calculated using the R (http://www.R-project.org/, v3.5.0) package lme4 [51] with the following model: The aov function in the R software was used to calculate the variances of 100-SW, and mixed linear model (MLM) was used to estimate polygenic variance components and heritability [22] with the following equation: where y is the phenotypic vector; X is an incident matrix for fixed (non-genetic) effects, and α is a vector of fixed effects; is the polygenic effect with a multivariate normal distribution with zero mean, 2 g σ is polygenic variance, and kinship matrix K was calculated from marker information [52]; is the vector of residues, and 2 e σ was residual variance. The above two variance components were estimated from restricted maximum likelihood [53]. The broad-sense heritability was calculated using the following equation:

Population Structure Analysis and Genome-Wide Association Studies
RAD-seq was used to obtain high-density SNPs, while RAD-seq, genotyping of soybean accessions, methods of sequencing data calling variations, and the quality control were described in Zhou et al. [25]. In this study, a total of 43,834 SNPs with minor allele frequency (MAF) > 0.05 were used to construct a population structure using the STRUCTURE 2.3.4 software [54]. The hypothetical number of subgroups (k) ranged from 1 to 10. The length of the burn-in period for each run was set to 10,000, and the number of Markov chain Monte Carlo replications after burn-in was set to 100,000. The best k in this population was identified according to Evanno et al. [55] using STRUCTURE HARVESTER [56]. Six ML-GWAS approaches with population structure (Q) and kinship (K) were used to detect significant QTNs, including mrMLM [22], FASTmrEMMA [20], pLARmEB [57], ISIS EM-BLASSO [58], FASTmrMLM [59], and pKWmEB [60]. These methods were included in package mrMLM (https://cran.r-project.org/web/packages/mrMLM/index.html, v4.0). In the above methods, the first step was to select all the potentially associated markers, and kinship matrix K was automatically calculated. In the second step, the effects of all the selected markers were estimated by empirical Bayesian, the significances of the effects apart from zero were obtained by likelihood ratio test, and the threshold level LOD ≥ 3 (p = 0.0002) was used to determine significant QTNs [20,22,23,50,[57][58][59][60].

Elite Allele Analysis
Based on the QTN effect value and code 1 for genotype, SW increasing alleles of each stable QTN can be determined. If the QTN effect value is positive, the genotype with the code of 1 is regarded as the SW increasing allele; if the QTN effect value is negative, then alternative genotype is viewed as the SW increasing allele [22,23]. The average seed weight of the accessions with one allele was calculated to verify the QTN [61]. For each QTN, the SW increasing allele percentage in mapping population was measured as the number of accessions having SW increasing allele divided by the total number of accessions. The SW increasing allele percentage for each accession was equal to the number of SW increasing alleles divided by the total number of stable QTNs. Using the stable QTN information, the best cross combinations were predicted for the soybean breeding program. If we want to add seed weight, SW increasing allele is elite allele, while SW decreasing allele is elite allele if we want to decrease seed weight.

Prediction of Candidate Genes for 100-Seed Weight
Prediction of candidate genes for 100-SW was performed in 100 kb downstream and upstream of each stable QTN in SoyBase (http://soybase.org/; Wm82.a1.v1.1). For the screening of genes, the transcriptomic datasets of seven different seed developmental stages such as 4, 12-14, 22-24 (DAF: Days after flowering), seed weight 5-6 mg period (5-6 mgWS), cotyledon weight 100-200 mg period (100-200 mgCOT), cotyledon weight 400-500 mg period (400-500 mgCOT), and full seed maturity period (Dry seed) of soybean Williams 82 [62] were retrieved from the Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/; accession no GSE42871). This is because genes with high RPKM at these stages are related to seed size, seed weight, cotyledon, seed coat tissues, embryo, endosperm, seed storage proteins, and seed maturation protein [63]. Thus, candidate genes were determined as below [23]. Firstly, we removed all the genes with expression level < 1 in all the seven stages and selected those genes with a higher expression levels double their average expression levels in at least one seed developmental stage. Then, homologous genes related to seed weight in Arabidopsis were identified using BLAST analysis with the critical E value 1E-30. Finally, all homologous genes from soybean accompanied seed weight were selected, and considered candidate genes for 100-SW.

Gene Expression Level Analysis
The freely available RNA-Seq datasets of 14 soybean tissues [63], including whole seeds from 11 stages of reproductive tissue development (flower, pod, and seed) and three vegetative tissues (leaves, root, and nodules) were obtained from SoyBase (http://soybase.org/), in order to analyze candidate genes with special higher gene expression levels in soybean seeds. The heat maps were generated by using R software packages "pheatmap".

Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway Analysis
The Kyoto Encyclopedia of Gene and Genomes (KEGG) enrichment analysis was conducted for potential candidate genes to identify the functional categories, implemented by KEGG Orthology-Based Annotation System network software (KOBAS v3.0) [64] (http://kobas.cbi.pku.edu.cn/kobas3), with adjusted p value < 0.05 as threshold criteria.

Phenotype Variation of 100-Seed Weight
The 100-SW phenotype of each accession was the average of three replicates in each environment. The mean phenotypic values of 100-SW across 250 accessions in E1 to E4 environments were 18.39, 19.86, 17.98, and 19.22 (g), with standard deviations of 5.96, 5.58, 5.07, and 5.83 (g), respectively, and their coefficient of variations ranged from 28.08-29.07 (%) ( Table 1). The highest phenotypic value was observed in E4, whereas the lowest phenotypic value seen in E2 ( Figure 1). The continuous distribution was found in these environments ( Figure 1). Two-way ANOVA showed the significant difference of 100-SW across all the accessions (p-value < 0.01), indicating the existence of genetic variation among these accessions (Table 1). Meanwhile, the estimates of broad-sense heritabilities (h 2 B) for 100-SW in E1 to E4 environments were 93.70, 88.51, 90.15, and 83.20 (%), respectively, using polygenic and residual variances ( Table 1), suggesting that the genetic effects play an essential role in phenotypic variation.

Prediction of the Best Parental Combinations for 100-Seed Weight in Two Directions
The number of SW increasing allele for each stable QTN in 250 accessions ranged from 3 (1.2%) to 225 (90.00%). Among the above 43 stable QTNs, 21 had more than 50% SW increasing alleles in the 250 accessions, while 22 had less than 50% SW increasing alleles in the 250 accessions (Table S9). The number of SW increasing alleles for each accession ranged from 8 (18.60%) to 36 (83.72%). Among the 250 accessions, 69 had more than 22 (50%) SW increasing alleles, while 181 had less than 21 (50%) SW increasing alleles. Interestingly, eight accessions, Yixingwuhuangdou, Yafanzaodou, Nannong 95C-5, Ribendaheidou, Fujiandadou, Quxiandahuangdou, Bayueqing, and Nanchengqingpidadou had 36,36,35,34,30,29,27, and 27 SW increasing alleles, respectively. Six accessions Qinyan 1, Baihuadou, Mayidan, Mingshanhongxingjiroudou, Heibidou, and Qingcha 1 had 11, 10,9,8,8, and 8 SW increasing alleles, respectively, while these accessions had 30, 28, 33, 30, 25, and 32 SW decreasing alleles, respectively. All the accessions can be used for the soybean breeding program by increasing or decreasing the number of SW increasing alleles in one cultivar. For example, the cross between Yafanzaodou (36 SW increasing alleles) and Ribendaheidou (34 SW increasing alleles) may produce the offspring with 41 SW increasing alleles. Thus, the best five cross combinations of large and small seeds were predicted and listed in Table 4. It should be noted that some parents were repeatedly present in these predicted combinations. For example, Ribendaheidou was used as parent in three cross combinations for larger seed, while Heibidou, Mayidan, and Qinyan1 were used as parents in at least two combinations for small seeds (Table 4). Table 4. The best parental combinations predicted from genome-wide association studies for 100seed weight in soybean.

Candidate Genes Underlying the Stable QTNs for 100-SW in Soybean
A search for putative candidate genes resulted in 774 potential candidate genes located between 50 kb-100 kb up-or downstream of the above 43 stable QTNs. Among the 774 genes, 205 exhibited high expression levels at seven seed development stages. Among the 205 genes, 175 were found to have homologs in Arabidopsis. Among these homologs, 36 genes were found to be related to seed weight (Table S10). Kyoto encyclopedia of genes and genomes (KEGG, http://kobas.cbi. pku.edu.cn/kobas3) analysis from the above 36 genes indicated that nine genes were involved in eleven seed-weight-related pathways (Table S10). In RNA-seq analysis, twenty-nine genes had two times higher gene expression, as compared with Glyma03g29431, Glyma04g33610, Glyma06g07940, Glyma07g03810, Glyma07g05260, Glyma11g07523, and Glyma20g21082 in seed development stages ( Figure S7). Between the above nine and twenty-nine genes, there were six common genes, which were considered as candidate genes in this study (Table 5; Figures 4 and 5). Among these candidate genes, Glyma05g34120, Glyma06g10830 (GmCRY1), and Glyma06g16920 (GmCPK11) had known functions for seed size/weight in soybean (Table 5; Figure 4 bold text), and the others were homologous to the known genes for seed size and seed development in Arabidopsis, for example, candidate gene Glyma07g07850 underlying the stable QTN qcHSW-7-3 were homologous to AT4G00710 (BSK3), which annotate BR-signaling kinase 3 in Arabidopsis. Glyma10g03440 and Glyma10g36070 underlying the stable QTNs qcHSW-10-1 and qcHSW-10-4 were homologous to, respectively, the Arabidopsis genes AT1G03090 (MCCA) and AT1G35680 (RPL21), which are related to seed weight or development (Table 5; Figures 4 and 5). . The expressional levels [log2(RPKM + 1)] of potential candidate genes associated with seed weight in seven soybean tissues. Among the nine genes, three were previously reported seedweight genes (bold text) in soybean, three were lowly expressed genes (grey text), and three were newly identified as candidate genes (red text) to be related to seed size/weight and development.

Figure 5.
Manhattan plot in the detection of QTNs for 100-seed weight in soybean using multi-locus genome-wide association study approaches. The blue color dots were used to represent the QTNs near previously reported genes (Glyma05g34120, GmCRY1, and GmCPK11), and the red color dots were used to represent the QTNs near predicted candidate genes in this study, whereas light blue and light green color dots were used to indicate the negative log10(p-value) of each marker on the adjacent chromosomes in the first step of multi-locus approaches. The bold text candidate genes were previously reported in soybean and the remaining newly identified in this study.

Discussion
To dissect the genetic basis of 100-SW and provide SW increasing alleles for molecular breeding in soybean, the 100-SW phenotypes of 250 soybean accessions in four environments were used to be associated with 43,834 SNP markers using seven GWAS approaches in this study. As a result, we obtained some valuable results in two aspects. On one hand, 43 stable QTNs were identified, and showed significant differences of 100-SW between the two alleles (Figure S1-S6; Tables 3 and S8). Using the above 43 stable QTN information, new cross combinations were predicted (Table 4), and a number of SSR markers were obtained from overlapping and previously published QTLs and comparative genomics analysis (Tables 6 and S8). Thus, these SSR markers can be used to conduct marker-assisted selection in soybean breeding. Based on the above 43 stable QTNs, on the other hand, multi-omics analyses were used to mine candidate genes. As a result, six candidate genes were obtained in this study. Among the six candidate genes, Glyma05g34120, Glyma06g10830 (GmCRY1), and Glyma06g16920 (GmCPK11) were found to be associated with soybean 100-SW, respectively, in Li et al. [7], Du et al. [37], and Aghamirzaie et al. [66], and Glyma07g07850, Glyma10g03440, and Glyma10g36070 were new in soybean (Table 5). These new 100-SW-related candidate genes are valuable in soybean molecular biology research.

Reliability of QTNs and Application of SW Increasing Allele in Soybean Breeding
In this study, 218 significant QTNs were identified to be associated with 100-SW in soybean. Among them, a total of 43 QTNs were repeatedly identified in more than three environments and/or methods, and viewed as stable QTNs. Of these stable QTNs, 36 QTNs were identified in at least three environments/BLUP by multiple methods and 22 QTNs were detected by at least three methods in multiple environments. Among them, eight were detected in one environment by at least three ML-GWAS methods, whereas three QTNs were detected by one ML-GWAS method in at least three environments/BLUP (Tables 3 and S8). The QTNs found across different environments are reliable, i.e., Zhou et al. [8] repeatedly detected 31 QTNs associated with 100-SW in 185 soybean accessions in multiple environments, while Li et al. [7] identified 35 QTNs associated with soybean yield traits in at least three environments. Likewise, the QTNs identified by multiple methods are also reliable when several multi-locus approaches are used to evaluate the same dataset [23]. For example, 58 QTNs associated with embryonic callus-related traits have been detected by at three multi-locus methods in Ma et al. [49], seven QTNs associated with starch pasting properties-relate traits in maize have been identified by more than one method in Xu et al. [89], and all the 56 QTNs associated with seven salt tolerance-related traits have been determined by at least three multilocus methods in Cui et al. [48].
To verify the reliability of each stable QTN, we divided the 250 accessions into two groups based on their allelic types and compared the mean phenotypic values of both alleles. As a result, forty-three QTNs exhibited significant differences of 100-SW between the two alleles (Figures S1-S6), suggesting the reliability of QTNs identified in this study. More importantly, these SW increasing alleles can be utilized in molecular breeding [7,8,18,25,61].
In this study, the average number of SW increasing alleles per accession was 18.42, indicating the predominance of SW increasing alleles in cultivars after the disappearance of some alleles during artificial or natural selection process. Based on these SW increasing and decreasing alleles, we also predict some parental combinations. In these combinations, the cultivars Ribendaheidou, Mayidan, and Heibidou are repeatedly present. These predictions might be valuable for the following reasons. Firstly, three selected parents, Yixingwuhuangdou, Quxiandahuangdou, and Nannong 95C-5, were also predicted as parents for seed size traits based on the effects of elite alleles in Niu et al. [18]. Secondly, some selected parents have been widely planted in some areas owing to their high yield, e.g., Fujiandadou, Nannong 95C-5, Yixingwuhuangdou, Quxiandahuangdou, and Ribendaheidou. The similar idea can be found in rice breeding, e.g., Wang et al. [90] developed a japonica cultivar (chromosome segment substitution line) for large grain (>8.5 mm grain length × 3.2 mm grain width) by molecular breeding, demonstrating that elite alleles from different cultivars can be pyramided into a new cultivar.

Candidate Genes Underlying Stable QTNs for Seed Weight
Identification of candidate genes underlying stable QTNs is of great interest for practical plant breeding and is necessary for further gene cloning and functional verification. To date, only a few seed-weight-related genes have been identified based on association mapping in soybean. Based on functional annotations, available literature, comparative genome analysis, KEGG pathways, and gene expression data, the present study mined candidate genes regulating seed size/weight and development in soybeans. Among 774 genes within the physical regions of 43 stable QTNs, therefore, 36 genes were considered as candidate genes to be involved in seed size/weight and development. Among the 36 candidate genes (Table S10), nine were found in KEGG pathway analysis, 29 had significantly higher expressed at seed developmental stages, and there were six common genes between the nine and twenty-nine genes (Table 5; Figure 5). Among the six candidate genes, Glyma05g34120, Glyma06g16920 (GmCPK11), and Glyma06g10830 (GmCRY1) have been reported to directly control seed weight in soybean and to have seed size/weight related functions [7,37,66]. Therefore, these candidate genes are very reliable and useful in the improvement of 100-SW in soybean.
The other three candidate genes are newly identified in this study. Glyma07g07850 is homologous to BSK3, one brassinosteroid (BR) biosynthesis or signaling gene in Arabidopsis. The BSK3 gene has a decisive role in the initial steps of the BR signal transduction pathway [91], and mutant bsk3 has been known to play an important role in seed size [67]. BRs are plant hormones that regulate plant growth and development, and the deficiency of this hormone causes abnormal plant growth and hence yield reduction [92]. Physiological, cellular, and molecular mechanisms influencing plant growth and yield production also indicate the diverse role of BR in plant growth and development [93]. In addition, some factors have been found to affect the soybean 100-SW, such as seed size, hormones (ABA, BRs, GA3, and IAA), enzymes, silique development, cell growth rate, cotyledon cell number, pollen development and cell volume [13,94].
Glyma10g03440 encodes 3-Methylcrotonyl CoA carboxylase (MCCA), which is a nuclearencoded biotin-localized enzyme and also plays an important role in leucine and isoprenoids catabolism. In Arabidopsis, knockout alleles of the MCCA gene and metabolite study suggest that MCCA mutations block mitochondrial leucine catabolism, which is associated with reduced reproductive growth phenotype, including abnormal flower and silique development [68]. Glyma10g36070 encodes ribosomal protein L21 (RPL21) that is required for chloroplast and pollen development and embryogenesis in Arabidopsis [69]. However, these candidate genes require further functional validation/cloning to determine their actual role in seed weight in soybean.

Statistical Power of Multi-Locus GWAS Approaches
In this study, 218 significant QTNs for 100-SW in soybean were detected from six ML-GWAS approaches. These significant QTNs were divided into four groups. In the first group, all the QTNs are both msQTNs and esQTNs. All the QTNs in the second group are esQTNs rather than msQTNs, while all the QTNs in the third group are msQTNs rather than esQTNs. In the last group, all the QTNs are neither esQTNs nor msQTNs. Thus, we summarized their characteristics of the above four groups, such as the number of significant QTNs, the average of absolute effects, LOD score and r 2 , and the proportion of previously reported QTNs in Table 7. As a result, it is easy (the highest proportion of previously reported QTNs) to identify the QTNs in the first group and these QTNs have the largest values for QTN effects, LOD scores and r 2 , while it is relatively difficult (the lowest proportion of previously reported QTNs) to detect the QTNs in the last group and these QTNs have relatively small values for QTN effects, LOD scores, and r 2 . The above results show the advantage of our multi-locus GWAS approaches in detecting small-effect QTNs. The results support our previous recommendation that the QTNs identified by individual approaches or in individual environments are valuable in mining the genes for the trait of interest [23]. Table 7. Comparison of four kinds of QTNs for soybean 100-seed weight in this study.

Group
No. of QTNs Absolute Effect LOD Score r 2 (%) % Known QTNs 1 (Both esQTN and msQTN In addition, we also found the gap between the trait heritability (83.23-93.70%) and the sum of r 2 (24.13-35.52%) for all the QTNs identified by each approach in one environment or BLUP. This is the heritability missing in GWAS [23]. The possible reasons are the exclusion of QTN-byenvironment and QTN-by-QTN interactions in this study. Thus, it is necessary to develop the methodologies for detecting QTN-by-environment and QTN-by-QTN interactions in the near future.

Conclusions
In this study, 43 stable QTNs were detected in at least three environments/BLUP and/or by at least three ML-GWAS methods, and they showed significant differences of 100-SW between the two alleles in the GWAS population. Using these SW increasing or decreasing alleles of stable QTNs, the best five cross combinations were predicted in large or small seed directions. Among the 36 potential candidate genes from multi-omics analysis, Glyma05g34120, GmCRY1, and GmCPK11 are the known seed-size-related genes in soybean, and Glyma07g07850, Glyma10g03440, and Glyma10g36070 were identified to be candidate genes in this study.

Conflicts of Interest:
The authors declare that they have no conflict of interest.