Identification and Allele Combination Analysis of Rice Grain Shape-Related Genes by Genome-Wide Association Study

Grain shape is an important agronomic character of rice, which affects the appearance, processing, and the edible quality. Screening and identifying more new genes associated with grain shape is beneficial to further understanding the genetic basis of grain shape and provides more gene resources for genetic breeding. This study has a natural population containing 623 indica rice cultivars. Genome-wide association studies/GWAS of several traits related to grain shape (grain length/GL, grain width/GW, grain length to width ratio/GLWR, grain circumferences/GC, and grain size/grain area/GS) were conducted by combining phenotypic data from four environments and the second-generation resequencing data, which have identified 39 important Quantitative trait locus/QTLs. We analyzed the 39 QTLs using three methods: gene-based association analysis, haplotype analysis, and functional annotation and identified three cloned genes (GS3, GW5, OsDER1) and seven new candidate genes in the candidate interval. At the same time, to effectively utilize the genes in the grain shape-related gene bank, we have also analyzed the allelic combinations of the three cloned genes. Finally, the extreme allele combination corresponding to each trait was found through statistical analysis. This study’s novel candidate genes and allele combinations will provide a valuable reference for future breeding work.


Introduction
Rice is an important food crop, and its production is of great significance to global food security, social stability, and economic development. 1000-grain weight is an essential factor affecting yield, and it can be inherited stably [1]. In addition, the factors determining 1000-grain weight can be divided into three aspects of grain shape: grain length, grain width, and grain thickness, and with the increase of grain length, grain width, and grain thickness, 1000-grain weight also increased [2]. In addition, grain shape has an essential effect on rice yield. It affects the appearance, processing, and edible quality of rice, which directly affects the market demand for rice. Across the globe, rice preferences vary markedly in grain shape: People in Europe, the United States, southern China, and Southeast Asia prefer slender rice, while people in Japan, the Korean Peninsula, and northern China prefer short, round rice [3]. In addition to grain shape, other appearance qualities of rice also include chalkiness, transparency, and color. Some studies showed that grain length was negatively correlated with chalkiness rate, while grain width was positively correlated [2]. For processing quality, in general, grain width and grain thickness were positively correlated with head rice rate, while grain length was negatively correlated. Grain shape has an essential effect on rice yield and quality traits. Therefore, identifying the genes that affect rice grain shape and elucidating their mechanism is an effective way to develop high-yield and high-quality rice.
Many factors affect the grain shape of rice, including the size and shape of the glume, the process of grain filling after flowering, and the development of grains. Genes involved in these pathways can directly or indirectly regulate the development of grain shape. The classification of cloned genes affecting grain shape generally shows the following ways to control rice grain shape: 1. endogenous hormone regulation, 2. G protein signaling pathway, 3. ubiquitin-proteasome pathway, 4. transcriptional regulation, 5. epigenetic pathway, 6. other regulatory pathways. These regulatory pathways are not independent of each other but often interweave together and act together. For example, OsSCP46 interacts with ABA-induced protein DI19-1 and participates in the abscisic acid signaling pathway. OsSCP46 coding a serine peptidase is a vital control factor for grain grouting and seed germination. Knocking out OsSCP46 would reduce grain size, grain length, grain width, and 1000-grain weight [4]. GW5 protein is a novel positive regulator of brassinolide signal transduction. It regulates the expression level of the brassinolide response gene and growth response. In addition, GW5 encodes a calmodulin-binding protein and is a significant grain width gene. The grain size of GW5 overexpressed mutants was slenderer than that of the wild type [5]. GS3 encodes a γ subunit of G protein and is a significant grain length and weight gene. GS3 plays a negative regulator role in regulating grain size. GS3 does not directly restrict grain size but competes with DEP1 and GGC2 in binding G protein β subunits [6]. OsDER1 is associated with the endoplasmic reticulum-associated protein degradation pathway. Overexpression or inhibition of OsDER1 would activate the unfolded protein response, making rice more sensitive to ER stress and significantly increasing the ubiquitinated protein level. Inhibition of OsDER1 expression would reduce both grain length and grain width [7]. Both OsmiR396a and OsmiR396c are micro RNAs that regulate the expression of OsGRFs, a growth regulator. Overexpression of OsmiR396a did not affect grain size. However, downregulation of OsmiR396a expression level would increase grain length and 1000-grain weight but decrease grain width [8]. OsmiR396c overexpressed mutants' grain length, width, and weight would fall [9]. JMJ703, an active H3K4-specific demethylase in rice, can specifically reduce the methylation level of histone H3K4. The grain length, width, and thickness of the JMJ703 mutant were reduced [10]. OsCTPS1 encodes a CTP synthase. OsCTPS1 interacts with tubulin, participates in microtubule function, promotes endosperm nuclear separation, influences endosperm early development, and positively regulates rice grain size and weight. Grain length, grain width, and grain thickness of OsCTPS1 overexpressed lines were higher than those of wild type [11]. OsARG encodes an arginase, a key enzyme in arginine metabolism. OsARG-deficient mutants had smaller grains. OsARG plays a vital role in the panicle development of rice, especially under the condition of insufficient exogenous nitrogen. Therefore, OsARG can improve the nitrogen use efficiency of rice and is a potential target gene in crop improvement [12].
In this study, we have a natural population containing 623 indica rice cultivars with rich genetic diversity for GWAS (Genome-wide association studies) analysis. We combined rich phenotypic and genotypic data and adopted a strict MLM model for GWAS analysis. We further analyzed the results of GWAS using a variety of methods. Finally, we identified three cloned genes and seven new candidate genes. In addition, we have done allele combinations research among the three cloned genes through statistical analysis. The discovery of new genes will help us better understand grain shape regulation mechanisms. In addition, the new candidate genes and allele combinations found in this experiment will provide a valuable reference for future breeding work.

Distribution and Correlation of Phenotype and Heritability of Grain Shape
In general, the phenotype of quantitative traits is normally distributed. This is because multiple genes with minor effects mainly control quantitative characteristics. However, in this study, the phenotypic distribution of grain length, grain width, grain length to width ratio, and grain circumference showed bimodal distribution except for the normal distribution of grain area (Figure 1). The distribution of phenotypes suggested that grain length, grain width, grain length to width ratio, and grain circumference were controlled by a major effect gene [13]. In contrast, grain area was controlled by multiple minor effect genes. Moreover, the wide range of phenotypic differences also indicates the rich genetic diversity of the study population. In terms of phenotypic correlation, grain length was significantly negatively correlated with grain width, while grain length was significantly positively correlated with other traits; Grain width was negatively correlated with grain length to width ratio and grain circumference, but positively correlated with grain area; Grain length-width ratio was positively correlated with grain area and grain circumference; Grain area was positively correlated with grain circumference (Figure 2). We also carried out Mahalanobis distance calculation and canonical variables analysis (CVA) for these grain shape traits, and the results were consistent with the above analysis (Table 1, Figure 3). In terms of heritability, the grain length to width ratio had the highest heritability (0.816), and the grain area had the lowest heritability (0.311). A high heritability indicates the stability of heredity, while a low heritability indicates that the character is greatly affected by the environment (Table 2).

Population Structure, Kinship, and LD Decay
As can be seen from the diagram of principal component analysis, scattered points are continuously distributed without evident clustering ( Figure 4A). There were also no significant hot spots on the kinship map ( Figure 4B). This indicates that our experimental population has no significant genetic structure and kinship. Combined with the distribution of phenotypes, it can be said that our experimental population fully conforms to the standard of the related population.
The LD decay distance determines the minimum number of molecular markers (Minimum molecular markers = genome size /LD decay distance) required for association analysis and the resolution of association analysis. As can be seen from the LD attenuation diagram, the LD attenuation distance is about 65 kb ( Figure 4C). We have 2,416,716 SNPs, which is perfectly sufficient. In general, the richer the genetic diversity of the population, the shorter the LD attenuation distance, and vice versa [14]. Previously, it was reported that the LD decay distance of rice was about 130 kb [15]. Compared with this, our LD decay distance was smaller, and the population's genetic diversity was richer. All of these will help us identify candidate genes.

Candidate Genes Screen in Important QTL Regions
We have selected 39 important QTLs for further analysis involving four-grain shape traits: grain length, grain width, grain length to width ratio, and grain circumference. Through analysis, three cloned genes were identified: GS3, GW5, and OsDER1, seven novel candidate genes were identified: Os02g0805100, Os02g0805400, Os02g0164600, CYP93G1, Os10g0344500, Os10g0344900, and Os02g0805100 ( Figure 5). Most of these genes are pleiotropic and affect multiple traits.  In terms of grain length/GL, we have screened 4 QTLs (qGL2, qGL3.2, qGL3.3, qGL5) for analysis. Moreover, three (qGL2, qGL3.3, qGL5) of the four QTL intervals contained reported genes, 2 QTLs: qGL3.3 and qGL5 were detected repeatedly (Table 3). According to the functional annotation, there are 147 genes in these 4 QTLs intervals. In addition, according to the haplotype analysis, 15 genes belong to group I, 32 belong to group II. In the two groups, 15 genes were repeatedly detected, and, in total, 32 genes were detected, including two cloned genes: GS3 and GW5 ( Figure 6A,B). GS3 was detected in all environments, and GW5 was detected except for 2018GA. Moreover, GS3 also has a greater PVE value than GW5 (Table 3), which coincides with previously reported that GS3 has a major effect on grain length, while GW5 has a minor impact on grain length. However, in the qGL2 interval, we have not identified OsmiR396a and OsmiR396c (Table 3). In qGL2, four genes (Os02g0804900, Os02g0805100, Os02g0805300, and Os02g0805400) belong to group II. According to the functional annotation, Os02g0804900 is identical to RNRL2 [16]. RNRL2 encodes a large ribonucleic acid reductase subunit involved in chlorophyll metabolism and regulates leaf color. Os02g0805100 is an auxin/IAA gene ( Figure 6C), Os02g0805300 encodes an expressed protein, and Os02g0805400 encodes a kelch repeat protein ( Figure 6D). In the previous study, we know that auxin/IAA plays a vital role in regulating rice grain shape; Furthermore, one gene, OsPPKL2, encodes a protein phosphatase containing the kelch repeat domain and plays a positive regulator role in rice grain length regulation [17]. Therefore, we selected Os02g0805100 and Os02g0805400 as candidate genes based on their function correlation with cloned grain shape genes. Although qGL3.2 had a large PVE value (20.37%, Table 3), only one gene, Os03g0400600, with the unknown function, was found in this interval.
In terms of grain length to width ratio/GLWR, we have screened 22 QTLs for analysis. Among the 22 QTLs, 2 QTLs (qGLWR3.3 and qGLWR5.2) contained cloned genes; Except for qGLWR2.4, qGLWR2.6, qGLWR2.7, qGLWR10.2, and qGLWR10.4, the other 17 QTLs were repeatedly detected (Table 3). There were 770 genes in the 22 QTLs intervals, including 15 genes belonging to group I and 51 genes belonging to group II. A total of 51 genes were found between the two groups, including two cloned genes: GS3 and GW5 ( Figure 6J,K). Both GS3 and GW5 were detected in all environments, and both GS3 and GW5 had large PVE (39.32% and 41.66%, respectively) ( Table 3) values and had important effects on grain length to width ratio. At the same time, we have not screened the cloned gene OsDER1 within the interval of qGLWR5.2. In qLWR1.2, only one gene, Os01g0589900 (Figure 6L), encodes a pentatricopeptide repeat protein. We selected Os01g0589900 as the candidate gene. In the interval of qGLWR10.10, we found six genes Os10g0343050 (expressed protein), Os10g0343200 (membrane-associated DUF588 domain-containing protein), Os10g0343400 (CSLF7 -cellulose synthase-like family F), Os10g0343951 (Hypothetical conserved gene), Os10g0344500 (MATE efflux family protein), and Os10g0344900 (MATE efflux family protein) belonging to group II. At the same time, we know that BIRG1 also encodes a MATE efflux family protein and functions as a chloride efflux transporter involved in mediating grain size [22]. Therefore, we selected Os10g0344500 and Os10g0344900 ( Figure 6M,N)

Extreme Combination of Alleles for Each Trait
Due to the limitations of detected cloned genes, only allelic combinations of grain length/GL, grain width/GW, and grain length to width ratio/GLWR were analyzed; For GL and GLWR, we will explore the combination between the GS3 and GW5 alleles. And, for GW, we will examine the cross between the GS3, GW5, and OsDER1 alleles. In terms of grain circumferences/GC, we have screened 3 QTLs: qGC2, qGC3, and qGC5 for analysis. These three QTLs intervals all contain cloned genes associated with grain shape. While only one QTL qGC3 was repeatedly detected (Table 3). There were 116 genes in the three QTLs intervals, including 15 genes belonging to group I and 27 genes belonging to group II. And a total of 27 genes were found between the two groups, including one cloned gene: GS3 ( Figure 6O). GS3 was detected in all environments except for 2018GA. MicroRNA OsmiR396a and OsmiR396c, and the genes GW5 and OsDER1, were not identified by screening. In qGC2, there is only one gene, Os02g0805100 ( Figure 6P); We already knew that Os02g0805100, which encodes auxin/IAA, was a candidate gene for grain length, and now we found that it pleiotropic, and it also a candidate gene for grain circumference.

Extreme Combination of Alleles for Each Trait
Due to the limitations of detected cloned genes, only allelic combinations of grain length/GL, grain width/GW, and grain length to width ratio/GLWR were analyzed; For GL and GLWR, we will explore the combination between the GS3 and GW5 alleles. And, for GW, we will examine the cross between the GS3, GW5, and OsDER1 alleles.
Through allele analysis, we found that GS3 had three alleles:  Figure 8A). Therefore, we hypothesized that the genetic effects of GS3 and 284 GW5 on grain length were mainly additive.   Figure 8A). Therefore, we hypothesized that the genetic effects of GS3 and GW5 on grain length were mainly additive.
(C) OsDER1, Hap A (T), Hap B (C) and Hap C (Y). The red arrow indicates the location of the start codon "ATG". The number after "ATG" is its position on the chromosome. The number on the significant SNP is its position relative to "ATG".

Discussion
Grain shape is an essential agronomic character of rice, which affects the yield and appearance, processing, and edible quality. What's more, on a global scale, there is a marked difference in the grain shape of preferred rice. Therefore, grain shape also directly affects the market demand for rice. So far, many genes related to grain shape have been cloned, and their regulatory pathways have been classified, which provides an essential theoretical basis for breeding high-yield and good-quality rice. However, additional genes controlling grain shape remain to be identified, and the effective use of the cloned genes is still lacking.
The phenotypic distribution of GL, GW, GLWR, and GC showed bimodal distribution except for the normal distribution of GS ( Figure 1). Therefore, we hypothesized that  Figure 8C). The inheritance of grain length to width ratio also showed the additive effect.

Discussion
Grain shape is an essential agronomic character of rice, which affects the yield and appearance, processing, and edible quality. What's more, on a global scale, there is a marked difference in the grain shape of preferred rice. Therefore, grain shape also directly affects the market demand for rice. So far, many genes related to grain shape have been cloned, and their regulatory pathways have been classified, which provides an essential theoretical basis for breeding high-yield and good-quality rice. However, additional genes controlling grain shape remain to be identified, and the effective use of the cloned genes is still lacking.
The phenotypic distribution of GL, GW, GLWR, and GC showed bimodal distribution except for the normal distribution of GS ( Figure 1). Therefore, we hypothesized that GL, GW, and GC have major effect genes [13]. As expected, we detected two cloned genes, GS3 and GW5, which were the major genes affecting grain length and width, respectively [5,6]. As previously reported, GS3, GW5, and OsDER1 regulate the grain shape through changes in expression level, and, in our haplotype analysis, they were mainly detected in group II. Therefore, it proves that our findings are consistent with previous reports [7,23,24]. Furthermore, we also found a significant negative correlation between GL and GW in the alleles of GS3 and GW5. For example, GS3 Hap A (AATCT) had the most extensive grain length but the smallest grain width, while GW5 Hap B (TA) had the largest grain width but the most petite grain length (Figure 8). This finding fits well with the correlation between phenotypes (Figure 2).
After completing GWAS, we found that many QTL intervals contained cloned grain shape-related genes. Still, they were not detected through significance analysis, perhaps because the PVE of these QTL intervals was too small (Table 3). However, in some QTL intervals with large PVE, we still did not detect candidate genes. Even on the Manhattan plot, this region has a distinct peak, which we suspect is due to the type II error (false negative) caused by the MLM model. Compared with the GLM model, the accuracy of the MLM model is improved, but the detection efficiency is reduced. Using multiple detection models may effectively enhance detecting genes with minor effects. Due to environmental influences and gene-to-gene interactions, although many genes related to grain shape have been cloned, these genes still lack effective use. Next, to effectively utilize the genes in the grain shape-related gene bank, allelic combination analysis was performed. As it turns out, the inheritance of GL, GW, and GLWR mainly showed additive effects ( Figure 8). Therefore, it is promising to design high-quality rice by pyramiding alleles of different genes. The seven new candidate genes were only obtained by genome-wide association study, genebased association analysis, haplotype analysis, and functional annotation. However, their actual functions need to be further verified. Unfortunately, so far, validation work such as RT-PCR, transcriptome analysis, and transgenic experiments has not been completed.

Materials
The natural population used in this study consisted of 623 varieties collected worldwide. Among them, 323 accessions were selected from the 3K Rice Genome Project (3K R.G.P.) [25] (Table S2).

Field Trials and Trait Measurements
From mid-May to the end of September in 2017 and 2018, we planted seeds in experimental fields in Gongan/GA Jingzhou and Ezhou/EZ, Hubei Province, China, respectively. Each variety was planted in 5 rows and 10 columns, and the distance between individual plants was 17 cm × 20 cm, with three replications. Field management methods are consistent with local management standards. When the rice was fully mature, seeds of 5 individual plants were collected. Wanshen SC-G seed copying apparatus was used to take photos and measure the related traits. The measured characteristics include grain length/GL, grain width/GW, grain length to width ratio/GLWR, grain circumferences/GC, and grain size/grain area/GS. For each trait, the mean value is used for GWAS analysis.

Statistical Analysis
We have used the software Graphpad Prism 9 to analyze the phenotypic data as follows: (1) Normal distribution test and (2) One-way analysis of variance. The analysis methods were the "Kolmogorov-Smirnov test" and "Brown-Forsythe and Welch ANOVA tests" respectively. At the same time, we also used canonical variables analysis and Mahalanobis distance analysis to analyze phenotypes.

Genotyping
We extracted DNA from the leaves. Then, Covaris S2 (Covaris) breaks the DNA into~500 bp fragments. NEBNext DNA Library Prep Reagent Set for Illumina (BioLabs) constructs the DNA library. Illumina Hiseq X10 platform was used to sequence the library. The reference genome was Os-Nipponbare-Reference-IRGSP-1.0 [26]. GATK was used to call single-nucleotide polymorphisms/SNPs [27]. SNPs with MAF ≥ 5% and missing rate ≤ 20% were retained. IMPUTE2 [28] is used for imputing missing genotypes, and 2,416,716 high-quality SNPs were finally obtained.

Population Structure and Kinship Analysis
We used the R package "rMVP" [29] to calculate population structure (Q) and kinship (K). All SNPs are used in the calculation. The principal component analysis/PCA is used to evaluate population structure. The principal component analysis score and relationship matrix will be used in the mixed linear model (MLM) [30] below.

Linkage Disequilibrium Analysis
We use the software "PopLDdecay" to calculate the linkage disequilibrium (LD) between pairs of markers [31]. Command: " r 2" , which squared the Pearson's correlation coefficient (r 2 ). When the correlation coefficient (r 2 ) drops to half of its maximum value, the distance across the chromosome is called the LD decay distance [15].

Genome-Wide Association Study and Candidate Genes Identification
GWAS is based on linkage disequilibrium (LD). It uses a large number of single nucleotide polymorphisms (SNPs) in the genome of the mapped population as molecular markers and combines with phenotypic variation of the population to analyze the correlation between target traits and molecular markers or candidate genes at the genome-wide level. GWAS analysis was performed using the R package "rMVP". The operation steps of "rMVP" are as follows: (1) Data Preparation. Prepare phenotype and genotype data files and calculate population structure (Q) and kinship (K) based on the genotype files. (2) Data Input. Import the above four files into the operation. (3) Start GWAS. Important option parameters are model ("rMVP" offers three models: GLM, MLM, and FarmCPU.) and threshold (0.05 or 0.1, We chose 0.1). (4) Output. Here we have SNPs that are significantly correlated. We used a mixed linear model (MLM) to reduce false positives. MLM uses the Q and K matrix to adjust for cryptic relationships and other fixed effects.
Next, considering LD decay distance, we defined the interval of significantly associated SNP ± LD decay distance as a QTL. Here, referring to the previous report, we use the LD decay distance of 125 kb [15]. To reduce QTL redundancy, if there is overlap between QTL areas, they are combined into one QTL [32][33][34]. Next, we will select some important QTLs for further analysis, which must meet at least one of the following conditions: 1 [35]. For these important QTL regions, haplotype analysis and functional annotation of the genes within the regions were performed to screen candidate genes. The method of haplotype analysis was as follows: Firstly, the SNP was classified, the SNP that caused amino acid and splicing changes and had significant p value was classified into group I; SNPs with significant p values in the promoter region were assigned to the group II [36]. Then, these SNPs will be used for haplotype analysis (A gene haplotype with corresponding materials ≥ 6 will be retained), and genes with significant differences between haplotypes and functional correlation will be screened as candidate genes.

Allele Analysis of Different Genes
The definition of alleles is similar to that of haplotypes. Still, the difference is that the significant SNPs detected in different environments and different grain shape traits are different, so the haplotypes of the same gene may differ in different environments and other grain shape traits. However, alleles were different from haplotypes. If a gene affected multiple grain traits, SNPs of the gene that were significant in different environments and other characteristics were selected for allele analysis, and the materials were classified accordingly. Therefore, alleles of a gene are the same in different environments and for different characteristics [37]. Next, we matched the material with the mean phenotypic values of the relevant grain shape traits. Finally, through statistical analysis, find the extreme allele of each gene.

Extreme Combination of Alleles for Each Trait
Each gene has an allele for an extreme phenotype for different grain traits; however, due to environmental influences and gene-to-gene interactions, the genotype of the optimal material is not necessarily a combination of the extreme allele of each gene. We use allelic combinations to classify materials, and if a combination has less than six materials, we discard the combination. Next, we matched the material with the phenotypic values of the relevant grain shape traits. And, through statistical analysis, find an extreme combination of alleles for each trait.

Conclusions
In 2017 and 2018, grain shape-related traits (grain length/GL, grain width/GW, grain length to width ratio/GLWR, grain circumferences/GC, and grain size/grain area/GS) of 623 indica rice cultivars were measured in Gongan/GA and Ezhou/EZ. And, in 2017, 623 rice were genotyped using second-generation resequencing technology. A genomewide association study was performed based on the above genotypic and phenotypic data. A total of 39 important QTLs were screened out based on genome-wide association studies. In addition, haplotype difference analysis and functional annotation were performed on the genes in these candidate intervals. Finally, three cloned genes, GS3, GW5, and OsDER1, and seven new candidate genes Os02g0805100, Os02g0805400, Os02g0164600, CYP93G1, Os10g0344500, Os10g0344900, and Os02g0805100 were found. We have also analyzed the allelic combinations of the three cloned genes GS3, GW5, and OsDER1. The results of this study will enrich the gene pool of grain shape, deepen the understanding of the regulation mechanism of grain shape, and provide a valuable reference for future molecular breeding work.