The Identification of a Yield-Related Gene Controlling Multiple Traits Using GWAS in Sorghum (Sorghum bicolor L.)

Sorghum bicolor (L.) is one of the oldest crops cultivated by human beings which has been used in food and wine making. To understand the genetic diversity of sorghum breeding resources and further guide molecular-marker-assisted breeding, six yield-related traits were analyzed for 214 sorghum germplasm from all over the world, and 2,811,016 single-nucleotide polymorphisms (SNPs) markers were produced by resequencing these germplasms. After controlling Q and K, QTLs were found to be related to the traits using three algorisms. Interestingly, an important QTL was found which may affect multiple traits in this study. It was the most likely candidate gene for the gene SORBI_3008G116500, which was a homolog of Arabidopsis thaliana gene-VIP5 found by analyzing the annotation of the gene in the LD block. The haplotype analysis showed that the SORBI_3008G116500hap3 was the elite haplotype, and it only existed in Chinese germplasms. The traits were proven to be more associated with the SNPs of the SORBI_3008G116500 promoter through gene association studies. Overall, the QTLs and the genes identified in this study would benefit molecular-assisted yield breeding in sorghum.


Introduction
Sorghum bicolor is one of the oldest crops cultivated by human beings, and it is also one of the only crops affected by artificial selection evolution [1]. Meanwhile, the wild sorghum that is commonly distributed in Africa is a promising resource for sorghum improvement [2]. Sorghum is widely distributed all over the world and has various uses, such as in food, wine making, and so on [3]. Moreover, sweet sorghum could be used for bioenergy with genetically diverse characteristics and variation exits [4]. Sweet sorghum was considered in the model system for bioenergy crops due to its favorable low-input cost traits [5]. Due to the different climate, soil setting, and cultivation system in the cultivation area, various types in sorghum have formed different agronomic, physiological, and biochemical characteristics through natural and artificial selection in the long-term cultivation process [6]. The variations of traits laid a foundation for the exploration of related genes in sorghum.
With the increase in the world population, the demand for grain production is also higher [7]. Understanding the genetic variations in traits related to yield was a necessary basis for breeding [8]. In the process of breeding, yield has always been the concern of breeders, so analyzing the traits relating to yield could provide effective guidance to breeders [9]. of 214 landraces from diverse countries to observe six traits related to yield. Via the resequencing method, 2,811,016 markers were called. Using the emmax, fast-lmm, and lmm algorisms with controlling Q and K, we identified numbers of SNP markers associated with the traits. Finally, the SORBI_3008G116500 gene was found, which may control four traits through the annotation of genes in the LD block, and the elite haplotype was analyzed.

Population Collection and Genotyping
The landraces from different regions could provide more genetic variation; therefore, 214 landraces from China (86), India (71), America (36), Japan (14), Russia (6), and Mexico (1) were collected (Figure 1a, Table S1). The landraces laid the foundation of genetic variation for the GWAS. After that, the population was genotyped using the resequencing method. As a result, we obtained about 2198 G raw base data, and the Q30 percentage was 92.5%. The raw reads were 14.6 Gb. The average sequencing depth was 13.4, and the average coverage ratio of 10 times of population was 71.2%. After analysis, we obtained 6,738,606 SNPs and 1,336,590 indels. The transition-transversion rate was about 1.9 in all varieties, and the average heterozygosity was 21.4%. Overall, the high-quality molecular markers derived from the germplasm were obtained based on resequencing and analyzing data, which then laid the foundation for subsequent analysis. brix and height, were conducted for association mapping [31]. The previous study above showed that GWAS could be useful to analyze the population and detect the significant SNPs in sorghum.
Although there are some studies on the traits of yield, the genetic variation in other specific yield traits, such as plant height and the relationship between genetic variations with various traits, remains largely unknown. We herein collected a population consisting of 214 landraces from diverse countries to observe six traits related to yield. Via the resequencing method, 2,811,016 markers were called. Using the emmax, fast-lmm, and lmm algorisms with controlling Q and K, we identified numbers of SNP markers associated with the traits. Finally, the SORBI_3008G116500 gene was found, which may control four traits through the annotation of genes in the LD block, and the elite haplotype was analyzed.

Population Collection and Genotyping
The landraces from different regions could provide more genetic variation; therefore, 214 landraces from China (86), India (71), America (36), Japan (14), Russia (6), and Mexico (1) were collected (Figure 1a, Table S1). The landraces laid the foundation of genetic variation for the GWAS. After that, the population was genotyped using the resequencing method. As a result, we obtained about 2198 G raw base data, and the Q30 percentage was 92.5%. The raw reads were 14.6 Gb. The average sequencing depth was 13.4, and the average coverage ratio of 10 times of population was 71.2%. After analysis, we obtained 6,738,606 SNPs and 1,336,590 indels. The transition-transversion rate was about 1.9 in all varieties, and the average heterozygosity was 21.4%. Overall, the high-quality molecular markers derived from the germplasm were obtained based on resequencing and analyzing data, which then laid the foundation for subsequent analysis.

SNP Markers Selection and Population Structure Analysis
Firstly, 2,811,016 SNP markers were left after filtering with the missing rate (<0.8) and the minor allele frequency (>0.05). Through separating the markers into each chromosome, we found that the average number of SNP markers was about 281,102. The

SNP Markers Selection and Population Structure Analysis
Firstly, 2,811,016 SNP markers were left after filtering with the missing rate (<0.8) and the minor allele frequency (>0.05). Through separating the markers into each chromosome, we found that the average number of SNP markers was about 281,102. The Chr.NC_012871 had the most SNPs (327,237), while chr.NC_012876 had the least SNPs (224,286) ( Figure 1b, Table S2). Based on the SNP markers, including the population structure, phylogenetic tree, PCA, and kinship analysis, admixture software was used, and the cross-validation (CV) error was calculated (Figure 1d, Figure S1). The results show that the CV error was the smallest when K = 9. The population could therefore be divided into nine subpopulations (Figure 1d). Combined with the origin of the landraces, the landrace from China showed the highest proportion in Q1, Q3, and Q5, while the landraces from India were mainly found in Q2 and Q8 (Figure 1d). Moreover, the neighbor-joining clustering of landraces based on genetic distance was analyzed, and the result was almost consistent with that of population structure ( Figure 1c). Then, most varieties had distant phylogenetic relationships according to the kinship analysis ( Figure S2). Lastly, PCA analysis showed almost no varieties with the same background ( Figure S3, Table S3). The above results show that the population we selected was suitable for GWASs and could be used to identify the number of variants in some loci.

Phenotype Analysis
In order to elucidate the genetic base of traits related to the yield of sorghums, traits including the heading date, plant height, spike type ( Figure S4), spike length, number of internodes, and 100-grain weight were observed (Figure 2a (Table S4, Table 1). After calculating the correlation of each trait, the number of internodes had the most relevance with the heading date, while the smallest relevance was found with the spike length ( Figure 2g). The results show that the total observed six traits were abundant, but the intrinsic mechanism relating to yield formation may be different.
structure, phylogenetic tree, PCA, and kinship analysis, admixture software was used, and the cross-validation (CV) error was calculated (Figure 1d, Figure S1). The results show that the CV error was the smallest when K = 9. The population could therefore be divided into nine subpopulations (Figure 1d). Combined with the origin of the landraces, the landrace from China showed the highest proportion in Q1, Q3, and Q5, while the landraces from India were mainly found in Q2 and Q8 ( Figure 1d). Moreover, the neighbor-joining clustering of landraces based on genetic distance was analyzed, and the result was almost consistent with that of population structure ( Figure 1c). Then, most varieties had distant phylogenetic relationships according to the kinship analysis ( Figure  S2). Lastly, PCA analysis showed almost no varieties with the same background ( Figure  S3, Table S3). The above results show that the population we selected was suitable for GWASs and could be used to identify the number of variants in some loci.

Phenotype Analysis
In order to elucidate the genetic base of traits related to the yield of sorghums, traits including the heading date, plant height, spike type ( Figure S4), spike length, number of internodes, and 100-grain weight were observed (Figure 2a (Table S4, Table 1). After calculating the correlation of each trait, the number of internodes had the most relevance with the heading date, while the smallest relevance was found with the spike length ( Figure  2g). The results show that the total observed six traits were abundant, but the intrinsic mechanism relating to yield formation may be different.

LD Decay Analysis and GWAS
To estimate the linkage disequilibrium (LD) of the population, the r 2 was calculated using the Plink software. The variation in LD decay ranged from 9.9 to 70.2 kb, and the average was 17.4 K. The greatest LD decay in all chromosomes was found in NC_012875, while the least was found in NC_012870 ( Figure S5).
Based the collection of phenotype data and the population analysis, three GWAS algorisms were used to detect the QTLs. As results, 29 SNPs which exceeded the threshold (−log 10 (p) = 7.75) were observed in all six traits using the emmax (Figure 3, Table S5). The greatest number of SNPs (21) was detected in spike length trait, while the smallest numbers of SNPs (0) were found in plant height, the number of internodes, and 100-grain weight trait. Of the 21 SNPs significantly associated with spike length, 3 were found on chromosome NC_012871.2, 1 was found on chromosome NC_012872.2, 7 were found on chromosome NC_012874.2, 6 were found on chromosome NC_012876.2, 1 was found on chromosome NC_012878.2, and 3 were found on chromosome NC_012879.2. The other three and five significant SNPs on chromosome NC_012875.2 were found in the heading date trait and on chromosome NC_012874.2 in the spike-type trait, respectively (Table S5). With the calculation of the fastlmm model, in total, 1175 SNPs were found ( Figure S6, Table S6). The same positions were detected in the heading date and the spike-type trait. Among all the significant SNPs, the NC_012874.2 chromosome had the highest spike length (1079) ( Table S6). The SNP (NC_012874.2 1992075) was found in the plant height trait, while no SNP was found using emmax. Moreover, the lmm model of GEMMA was used to identify the 1320 associated SNPs ( Figure S7, Table S7). Luckily, there were two significant SNPs in chromosomes NC_012876.2 and NC_012878.2 (Table S7), while no SNPs were found in the other two algorisms (Tables S5 and S6).

Prediction and Analysis of Yield-Related Gene
Although there were numbers of QTLs in the GWAS, it was important to find the QTLs which were identified in several traits. Additionally, we could clone genes that affected multiple phenotypes and explored excellent alleles. Luckily, there were signals at the adjacent position on the NC_012877 chromosome in the traits of heading date, plant The change in heading date may lead to increases in the yield [32]. By using GWAS to detect the loci controlling the heading stage, three SNPs exceeded the threshold (−log 10 (p) = 7.75), representing the same QTL (Figure 3). The other loci did not exceed the threshold, and it may contain genes controlling heading date.
In the plant height trait, there was one common significant SNP (NC_012874.2 1992075) within one QTL using the fast-lmm and lmm algorisms. From the Manhattan plot, this peak was more continuous than other peaks (Figure 3). Therefore, a gene which controls the plant height would need to be cloned in future studies.
The shape and length of spike were also directly related to the yield of plants. In the GWAS results, 5 and 1307 SNPs were found to be significantly associated with spike traits using the lmm model. In panicle length traits, multiple SNPs on the NC_012874.2 chromosome were found to exceed the threshold (Figure 3). This indicates that there may be multiple genes related to ear length on this chromosome.
After detecting the significant SNPs in the two traits, no SNPs associated with the number of internodes and 100-grain weight were identified unfortunately. From the Q-Q plot, it is possible that the model did not match the two phenotypes ( Figure S3). Although there were no significant SNPs, there were some peaks which showed the trend of further research, such as the QTL located in the end of NC_012878 (Figure 3).

Prediction and Analysis of Yield-Related Gene
Although there were numbers of QTLs in the GWAS, it was important to find the QTLs which were identified in several traits. Additionally, we could clone genes that affected multiple phenotypes and explored excellent alleles. Luckily, there were signals at the adjacent position on the NC_012877 chromosome in the traits of heading date, plant height, spike type, and spike length (Figure 4a up). There was probably a gene with multiple functions at the locus, such as the DTH8 gene, in rice [12].
To identify whether some genes were the causal genes in the QTL, we calculated the LD block. The significant SNP was in the LD block from 52,637,266 to 52,792,992 and from 52,793,087 to 53,095,551 (Figure 4a bottom). In the LD blocks, there were 23 predicted genes (Figure 4a middle, Table S8). In all genes, the SORBI_3008G116500 was annotated as VIP5. Luckily, the homologue gene of SORBI_3008G116500 in Arabidopsis thaliana was reported to affect the heading date [33,34]. So, the SORBI_3008G116500 gene was mostly the candidate gene.
To observe the elite haplotype, three haplotypes were found according to the SNP in the promoter and gene (Table S9). The haplotype analysis showed that hap3 was superior compared to the other two based on the four traits' data ( Figure 4b). In all varieties, Hap3 provided the least percentages compared with the other two haplotypes, suggesting that the elite haplotype was not fully used (Figure 4c). Combined with the origin of the three haplotypes, the Hap1 almost spread all over the regions where we collected the germplasm. The Hap2 was most present in China, while the elite haplotype-Hap3 only existed in China (Figure 4c). This indicates that higher polymorphism in the candidate gene was found in China. Through the gene association study, the SNP in the promoter may play a more significant role in the gene function (Figure 4d). Above that, we believed that SORBI_3008G116500 was most likely the candidate gene in the QTL and the superior haplotype could be widely used for sorghum breeding, especially in other countries.
germplasm. The Hap2 was most present in China, while the elite haplotype-Hap3 only existed in China (Figure 4c). This indicates that higher polymorphism in the candidate gene was found in China. Through the gene association study, the SNP in the promoter may play a more significant role in the gene function (Figure 4d). Above that, we believed that SORBI_3008G116500 was most likely the candidate gene in the QTL and the superior haplotype could be widely used for sorghum breeding, especially in other countries.

GWAS Results and Observed Traits
In our study, the most significant SNPs were detected in the spike length trait using three algorisms. Meanwhile, the panicle length has been investigated by linkage mapping [35]. Compared with the previous QTLs, more novel QTLs were found in chromo-

GWAS Results and Observed Traits
In our study, the most significant SNPs were detected in the spike length trait using three algorisms. Meanwhile, the panicle length has been investigated by linkage mapping [35]. Compared with the previous QTLs, more novel QTLs were found in chromosome NC_012874.2 and so on (Tables S5-S7). At the same time, GWAS was also carried out on the grains per panicle and grain size in sorghum [36]. In our study, many QTLs were detected using GWAS, which improved the knowledge on the genetic variations in yield, but more QTLs would be found if we observed more traits related to yield such as grain size [37] or traits related to leaf [16]. Thus, more characteristics could be added for the GWAS, and we will obtain more about the genetic variation in the configuration factors of sorghum yield, which is more conducive to the analysis of the genetic variation in each character and the aggregation analysis of the favorable allele variation.

A More Appropriate Model or Lower Threshold for Detection
Association analysis is an important method used to analyze the genetic mechanism of complex traits [21]. Population structure analysis could reduce the impact of population classification on the division of subgroups in association analysis [38]. We found no significant loci in the number of internodes and 100-grain weight traits. The positive sites may be screened out due to the problem of model selection [17]. Additionally, the 100-grain weight trait may be a complex trait. In the further study, a higher p-value was selected for site screening. Meanwhile, 100-grain weight trait could be divided into some other related traits, such as the grain length and width, and the significant loci may be detected. Combined with gene verification, we could also clone the genes that control the traits. Finally, some other algorisms, such as restrictive two-stage multipoint (RTM) GWASs, could be used for significant SNP identification in the 100-grain weight trait [39]. Meanwhile, more careful verification was required because of the lower threshold of RTM-GWAS. In our study, the novel QTL with higher reliability was the purpose and the RTM-GWAS could be used for further study if more QTL want to be detected.

SORBI_3008G116500 Breeding
In sorghum, researchers have identified many QTLs with multiple traits [26,28,36,40,41], and so did we. After that, the genes were cloned by GWAS, and superior alleles and markers linked to the allele were found [42]. These genes would be beneficial for molecular breeding, such as marker-assisted selection. The AtVIP5 functioned in the heading date trait [43]. Loss of the mutants, such as the other VIP genes (for example, VIP2/ELF7, VIP4, VIP5, and VIP6/ELF8) also flower earlier because of silencing of FLOWERING LOCUS C (FLC) [33,44,45]. In rice, the IPI1 gene which was annotated as VIP2 modulates IPA1 protein levels, and the plant architecture, including plant height and branch number, were changed when overexpressing the IPI1 [43]. In our study, we supposed that the homologue of AtVIP5 could function not only in the heading date, but also in the spike trait. In the later stage, combined with transgenic lines or mutant lines, further validation would need to be conducted to understand the function of SORBI_3008G116500, and excellent allele markers would need to be developed for molecular marker selection in sorghum breeding, especially improving the traits including heading date, plant height, spike type, and spike length. Additionally, the Chinese sorghums have formed a rich genetic base and unique characteristics have evolved as reported previously [46]. The elite haplotype of SORBI_3008G116500 was mostly distributed in China, validating that the sorghum germplasm in China had more potential for breeding compared with the germplasm in other countries. The average precipitation is 1073.8 mm, the average frost-free period is 225 days, and the average annual sunshine total is 1956.2 h. The standard agricultural practice was followed for plant growth and development, including irrigation, fertilizer application, and pest control. All experiments used a completely randomized block design with three replicates. The authors declare that all that permissions or licenses were obtained to collect the sorghum plant, and that all aspects of the study comply with relevant institutional, national, and international guidelines and legislation for plant ethics in the "Methods" section.

Genotyping
All accessions of DNA were extracted using the DNAsecure plant kit (Qiagen, Cat. No. DP320). The truseq library construction kit (FC-121-4001) was used to build the library. DNA fragments were prepared by end repair, and PloyA tail and sequencing connector were added, followed by purification and PCR amplification. The libraries were analyzed by the Bioanalyzer (Agilent, Santa Clara, CA, USA), and the PCR products were quantified by a Qubit 3.0 fluorometer (Invitrogen, Carlsbad, CA, USA). The constructed libraries were sequenced at Illumina Hiseq Xten platform to produce 150 bp paired-end reads. The sequencing reads were extracted from the raw data; reads with an adaptor, low-quality reads, and reads with an unidentified nucleotide sequence were removed using fastp-0.20.1 (http://opengene.org/fastp/fastp, accessed on 3 December 2021) [47]. After that, the BWA and GATK software were used for alignment and variant calling for high-quality reads [48,49]. The reference was downloaded from NCBI website (ftp.ncbi.nlm.nih.gov/ genomes/all/GCF/000/003/195/GCF_000003195.3_Sorghum_bicolor_NCBIv3, accessed on 3 December 2021). Moreover, we used Plink to filter the variants [50]. On the premise of minimum allele frequency (MAF) ≥ 0.05 and missing data ≤ 0.2, 2,811,016 high-quality SNPs were obtained.

Phenotypic Trait Evaluation and Data Analysis
The heading date was collected at the heading stage. At the maturity stage, the other phenotypic data of five yield-related traits were collected. For collecting plant height trait, five plants were randomly selected to be measured. Numbers of Internode were determined after stripping leaves from the stem. The data of spike type was collected according to the "Guidelines for the conduct of tests for distinctness, uniformity and stability-Sorghum (Sorghum bicolor (L.) Moench)" in China. However, during specific measurements, the sorghum spike phenotype has a phenotype that resides between two types. We have defined the specific type in the Supplementary File. According to the guidelines for conducting tests for distinctness, uniformity, and stability-Sorghum (Sorghum bicolor (L.) Moench) [GB/T 19557. , spike length was measured from the leaf scars under the spike to the top of the main spike. Lastly, mature grains of each inbred line were harvested and dried at 50 • C until the weight became constant. After drying, we collected the 100-grain weight data. The means of all phenotypic data and correlation coefficients were calculated using SAS. Population structure (Q) was determined using Admixture software and filtered SNP markers [51]. The number of subpopulations was determined by the CV error. Relative pairwise kinship (K) was calculated using TASSEL5.0 [52]. The PCA was analyzed with SNP markers using the TASSEL 5.0, and the results were visualized using R (https://www. r-project.org, accessed on 25 May 2022). The LD was also calculated with Plink using the SNP data of whole genomes and each chromosome.

Genome-Wide Association Study (GWAS) and QTL analysis
The GWAS for the six yield-related traits was performed for the three algorisms, including lmm, fastlmm and emmax, using GEMMA, FaST-LMM, and EMMAX programs [53][54][55]. The significant value was decided by 0.05/markers. The results were visualized using R package-CMplot. The LD block was defined with LDBlockShow software [56]. The gene annotation was based on Gramene (https://www.gramene.org/, accessed on 29 July 2022). The gene association study was conducted using TASSEL5.0 [57].

Conclusions
Overall, six traits related to yield were analyzed by three different algorisms based on natural variations in 214 landraces, and many significant loci were obtained. We also analyzed the differences of the three algorisms and the detection ability for different traits. Finally, a candidate gene was obtained that could affect multiple traits and find the elite haplotype. Our results enrich the discovery of sorghum-associated SNPs, and the discovered QTLs and gene will be helpful to sorghum breeding.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/plants12071557/s1, Figure S1: The CV error computed by Admixture; Figure S2: The kinship of 214 landraces based on TASSEL5 analysis; Figure S3: Principal component analysis of sorghum association panel used based on the 2,811,016 SNPs. Different color of each landrace meant different subpopulation based on Admixture; Figure S4: The spike type in the study; Figure S5: The LD decay of the population (left) and each chromosome (right) based on R 2 ; Figure S6: The Manhattan and Q-Q plot of six traits including heading date (HD), plant height (PL), spike type (ST), spike length (SL), number of internodes (NI), and 100-grain weight (100GW) based on the Fast-lmm algorism; Figure S7: The Manhattan and Q-Q plot of six traits including heading date (HD), plant height (PL), spike type (ST), spike length (SL), number of internodes (NI), and 100-grain weight (100GW) based on the Lmm algorism; Table S1: The list of all 214 landraces; Table S2: The SNP number of chromosome.; Table S3: The result of PCA; Table S4: The all phenotype data of all 214 landraces; Table S5: The analysis of SNPs which exceeded threshold using the emmax model; Table S6: The analysis of SNPs which exceeded threshold using the fastlmm model; Table S7: The analysis of SNPs which exceeded threshold using the lmm model; Table S8: All genes in the LD block; Table S9: The all variations of three haplotypes.

Data Availability Statement:
The resequencing data were uploaded in the NCBI database (PR-JNA826146, https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA826146, accessed on 30 November 2022). The datasets supporting the conclusions of this article are included within the article and its additional files.