Genome-Wide Association Study for Grain Protein, Thousand Kernel Weight, and Normalized Difference Vegetation Index in Bread Wheat (Triticum aestivum L.)

Genomic regions governing grain protein content (GPC), 1000 kernel weight (TKW), and normalized difference vegetation index (NDVI) were studied in a set of 280 bread wheat genotypes. The genome-wide association (GWAS) panel was genotyped using a 35K Axiom array and phenotyped in three environments. A total of 26 marker-trait associations (MTAs) were detected on 18 chromosomes covering the A, B, and D subgenomes of bread wheat. The GPC showed the maximum MTAs (16), followed by NDVI (6), and TKW (4). A maximum of 10 MTAs was located on the B subgenome, whereas, 8 MTAs each were mapped on the A and D subgenomes. In silico analysis suggest that the SNPs were located on important putative candidate genes such as NAC domain superfamily, zinc finger RING-H2-type, aspartic peptidase domain, folylpolyglutamate synthase, serine/threonine-protein kinase LRK10, pentatricopeptide repeat, protein kinase-like domain superfamily, cytochrome P450, and expansin. These candidate genes were found to have different roles including regulation of stress tolerance, nutrient remobilization, protein accumulation, nitrogen utilization, photosynthesis, grain filling, mitochondrial function, and kernel development. The effects of newly identified MTAs will be validated in different genetic backgrounds for further utilization in marker-aided breeding.


Introduction
Wheat is one of the essential staple foods around the world. Wheat-based products are gaining increased demand because of changing dietary habits driven by urbanization along with industrialization. It is the main source of energy and starch and also provides a considerable quantity of protein, vitamins, dietary fibers, and phytochemicals that are beneficial or essential for health. Reduced secondary immunity due to protein energy malnutrition (PEM) is considered to be the most frequent cause of various diseases in human beings, and, in acute cases, clinically these are referred to as marasmus or kwashiorkor [1]. Further, severe PEM affects children's cognitive development [2]. The protein concentration and composition are the two important determinants of both nutritional and end-use quality [3]. Gluten proteins (∼80%) are the major storage proteins that influence the baking process through flour's functional properties. Grain protein content and hardness are the two key determinants of wheat grain quality to classify quality class in international trade and also to decide the suitability of the quality class to different types of end products [4].
The protein concentration in wheat is the result of genetic makeup, environmental effect, and genotype-environment interaction (GEI). GPC is a highly complex quantitative Through the GWAS approach, many marker-trait linkages were found for GPC [40][41][42], TKW [43][44][45][46], and NDVI [47][48][49][50][51] in wheat using a GWAS panel with a diverse set of genotypes. Several MTAs/QTLs have been mapped on all three subgenomes of wheat; however, more mapping studies are required as the saturation point may not be reached [52]. Further, these traits are environmentally sensitive, hence, detection and validation of consistent MTAs in multi-location or multi-year studies are important to use in marker-aided breeding. Moreover, hexaploid wheat has three subgenomes with a size of~17 Gb [53], with the limited characterization of LD decay. Thus, further studies are required to understand the genetic basis and devise marker-based breeding approach to further complement conventional breeding efforts. The present investigation aims to identify MTAs related to GPC, TKW, and NDVI in a diverse set of bread wheat genotypes in multi-environments following the GWAS method and putative candidate genes associated with the SNPs.

Genotypes and Field Experiments
A set of 280 diverse bread wheat genotypes, including advanced elite genotypes and commercial varieties, was used for the mapping study. The details of the genotypes used in the study are given in Table S1. The set of genotypes was evaluated in three different environments, i.e., E1-ICAR-Indian Agricultural Research Institute, New Delhi; E2-ICAR-Indian Agricultural Research Institute, Jharkhand; and E3-ICAR-Indian Institute of Wheat and Barley, Karnal. The weather parameters of the experimental sites during crop season 2021-22 is illustrated in Figure 1. The experiment was conducted under irrigated conditions, and planting was done from 1st to 15th November at all the locations during the year 2021-2022 Rabi (winter) season. The recommended dose of NPK in the ratio of 150:60:40 kg/ha was applied as urea and diammonium phosphate (DAP) for nitrogen, DAP for phosphorus, and muriate of potash for potassium. Biotic stresses were effectively controlled by the fungicide (tebuconazole 25% EC), pesticide (imidacloprid 30.5 SC), and pre-emergence herbicide (pendimethalin 30% EC). An augmented block design was used, in which checks (DBW 187, MACS 6222, WH 1124, and WH 1142) were replicated in each block of two rows of two-meter length.

Data Recording and Statistical Analysis
Phenotyping of 280 genotypes for GPC, TKW, and NDVI was done at three locations. The GPC was recorded with the infra-red transmittance-based instrument Infratech 1125,

Data Recording and Statistical Analysis
Phenotyping of 280 genotypes for GPC, TKW, and NDVI was done at three locations. The GPC was recorded with the infra-red transmittance-based instrument Infratech 1125, and the estimated readings were expressed at a 12.0% moisture level. A random sample of 1000 grains was counted, and the weight has been measured in an electric weighing machine. NDVI was recorded at the maximum vegetative stage (Zadok's scale 41) using a handheld crop sensor, i.e., GreenSeeker (Trimble industries, Inc., Westminster, CO, USA), which was held 50 cm above the canopy facing the center of the plot to record the NDVI. Approximately, 3-4 NDVI readings/plot were recorded, and the mean represents the NDVI reading for that particular plot. The data for various genetic parameters were analyzed using the R package "augmented RCBD" [54].

Genotyping and Quality Control (QC)
The genomic DNA was extracted using the cetyltrimethylammonium bromide (CTAB) method [55]. The pure seeds of each genotype were sown in a plastic tray with separate chambers, and the leaf samples were collected from 21-day-old seedlings from each genotype. The collected leaf samples were thoroughly washed with distilled water and dried on blotting paper. The dried leaf samples were cut into 1.5-2.5 cm length and kept in a mortar and pestle. The leaf sample was finely ground and transferred into a 2.0 mL Eppendorf tube, added with 700 µL pre-warmed 2% CTAB extraction buffer along with 0.2 vol% β-mercaptoethanol, and incubated in a water bath at 65 • C for 45 min. A total of 750 µL of chloroform/iso-amayl alcohol in the ratio of 24:1 was put into the tube and shaken thoroughly. The mixture in a tube was centrifuged at 12,000 rpm for 12 min, and the resulting supernatant was collected in a 1.5 mL tube. Later on, 700 µL cold iso-propanol was added and shaken slowly by inverting the Eppendorf tube. The tubes were kept under freezing conditions for 2 h at −20 • C and subsequently centrifuged at 12,000 rpm for 10 min, which results in DNA pellet. The DNA was treated with RNaseA, and the concentration was determined in a NanoDrop spectrophotometer. The genotyping of 280 genotypes was done using Axiom Wheat Breeder's genotyping array (Affymetrix, Santa Clara, CA, USA) consisting of 35,143 genome-wide SNPs. Stringent quality control was applied through the removal of monomorphic markers, markers with minor allele frequency (MAF) of 20%, and heterozygote frequency of >25.0%. A total of 14,790 curated markers were further utilized for the GWAS study.

Population Statistics and GWAS
Pairwise LD values (r2) were calculated using Analysis by aSSociation Evolution and Linkage (TASSEL) version 5.0 [56]. The LD block size of the whole genome, as well as individual subgenomes was calculated by fixing the r 2 threshold at half LD decay. The PCA and kinship association were estimated using GAPIT [57]. Phenotypic data of GPC, TKW, and NDVI of the GWAS panel and corresponding genotypic data were used in GWAS analysis. Significant MTAs were detected through BLINK (Bayesian-information and linkage disequilibrium iteratively nested keyway) model [39] implemented in Genome Association and Prediction Integrated Tool (GAPIT) version 3.0 [58] in the R software package. The SNPs with p ≤ 0.0001 were considered significantly associated, and R 2 reflects the percent phenotypic variation (PVE).

In Silico Analysis
The sequence information of the significant SNPs was used to search for putative candidate genes with BLAST using default parameters in the Ensemble Plants database (http://plants.ensembl.org/index.html (accessed on 23 December 2022)) of the bread wheat genome (IWGSC (RefSeq v1.0)). The genes located in the overlapping and within the region of 0.1 Mb intervals flanking either side of the linked marker were recorded as putative candidate genes. The role of the detected genes in the regulation of GPC, TKW, and NDVI was also determined by comparing with the earlier reports in both wheat and other crop plants.

Variability, Heritability, and Correlation
The genetic parameters of 280 genotypes given in Table 1, which exhibited a large range of variability across the environments for GPC, TKW, and NDVI, ranging from 09.59-16.71%, 28.38-55.98 gm, and 0.32-0.71, respectively. The percent CV was less than 8.0% in all the environments for all three traits, which ranged from 3.42-5.47% (GPC), 2.37-3.85% (TKW), and 6.58-7.65% (NDVI). Out of all the studied environments, E3 was found to be a relatively higher CV for all the traits. The trend of broad sense heritability of both GPC and NDVI are similar and lower than TKW, which recorded more than 80.0% in all the environments. The trait's mean values are presented in Figure 2 as boxplots. All three traits recorded comparatively higher trait mean values in the E3 environment compared to other environments. The lowest trait mean values for GPC were recorded at E3, whereas, the lowest mean values for TKW and NDV were recorded at E2.  The frequency distribution of GPC, TKW, and NDVI in a set of 280 genotypes tested at E1-E3 during 2021-2022 is illustrated in Figure 3. The continuous frequency distribution was observed for GPC, TKW, and NDVI. Pearson's correlation coefficient (r 2 ) was estimated and illustrated in Figure 4. The direction of the correlation between GPC and TKW was similar in all three environments, and it was negatively associated. Further, a significant and strong negative correlation between GPC and TKW was observed in E2. None of the environments recorded a significant correlation between GPC and NDVI; however, the direction of the correlation was positive in E2 and E3, whereas, it was nega- The frequency distribution of GPC, TKW, and NDVI in a set of 280 genotypes tested at E1-E3 during 2021-2022 is illustrated in Figure 3. The continuous frequency distribution was observed for GPC, TKW, and NDVI. Pearson's correlation coefficient (r 2 ) was estimated and illustrated in Figure 4. The direction of the correlation between GPC and TKW was similar in all three environments, and it was negatively associated. Further, a significant and strong negative correlation between GPC and TKW was observed in E2. None of the environments recorded a significant correlation between GPC and NDVI; however, the direction of the correlation was positive in E2 and E3, whereas, it was negative in E1. Similarly, the association between TKW and NDVI was negative and significant at E1, and also the direction of the association was similar in E3. However, the direction of the association was positive in E2.

Marker Statistics
The genome-wide distribution of SNPs is illustrated in Figure 5. After a thorough quality check on the 35K SNP array, 14,790 high-quality markers were chosen. These markers that qualify for quality control are further utilized to identify MTAs through GWAS analysis. The subgenome-wise distribution of SNPs was highest with 5649 on subgenome B, whereas, the other two subgenomes were represented similarly with 4590 (subgenome D) and 4551 (subgenome A). Similarly, chromosome-wise maximum SNPs of 1077 were identified on 1B chromosome, whereas, the lowest number of 264 SNPs were identified on the 4D chromosome.

Population Structure and LD
The PCA and kinship relationship of the GWAS panel is illustrated in Figure 6, which reveals the absence of clear-cut sub-groups. The LD was calculated by using the squared correlation co-efficient (r 2 ) of all the SNPs. The LD decay was rapid with 3.6 cM in A subgenome, followed by 5.2 cM in D subgenome and 5.7 cm in B subgenome, whereas, the whole genome LD decay was 4.9 cM. The frequency distribution of GPC, TKW, and NDVI in a set of 280 genotypes tested at E1-E3 during 2021-2022 is illustrated in Figure 3. The continuous frequency distribution was observed for GPC, TKW, and NDVI. Pearson's correlation coefficient (r 2 ) was estimated and illustrated in Figure 4. The direction of the correlation between GPC and TKW was similar in all three environments, and it was negatively associated. Further, a significant and strong negative correlation between GPC and TKW was observed in E2. None of the environments recorded a significant correlation between GPC and NDVI; however, the direction of the correlation was positive in E2 and E3, whereas, it was negative in E1. Similarly, the association between TKW and NDVI was negative and significant at E1, and also the direction of the association was similar in E3. However, the direction of the association was positive in E2.

Marker Statistics
The genome-wide distribution of SNPs is illustrated in Figure 5. After a thorough quality check on the 35K SNP array, 14,790 high-quality markers were chosen. These markers that qualify for quality control are further utilized to identify MTAs through GWAS analysis. The subgenome-wise distribution of SNPs was highest with 5649 on subgenome B, whereas, the other two subgenomes were represented similarly with 4590 (subgenome D) and 4551 (subgenome A). Similarly, chromosome-wise maximum SNPs of 1077 were identified on 1B chromosome, whereas, the lowest number of 264 SNPs were identified on the 4D chromosome.

Marker Statistics
The genome-wide distribution of SNPs is illustrated in Figure 5. After a thorough quality check on the 35K SNP array, 14,790 high-quality markers were chosen. These markers that qualify for quality control are further utilized to identify MTAs through GWAS analysis. The subgenome-wise distribution of SNPs was highest with 5649 on subgenome B, whereas, the other two subgenomes were represented similarly with 4590 (subgenome D) and 4551 (subgenome A). Similarly, chromosome-wise maximum SNPs of 1077 were identified on 1B chromosome, whereas, the lowest number of 264 SNPs were identified on the 4D chromosome.

Genome-Wide Association Studies
A set of 26 significant MTAs was detected including 16 for GPC, 4 for TKW, and 6 for NDVI ( Table 2). The details of the detected MTAs are presented in Table 2 and illustrated as Manhattan plots in Figure 7a,b. The Q-Q plots depicting the observed associations of SNPs of GPC, TKW, and NDVI compared to the expected associations after accounting for population structure are presented in Figure 7a,b.

Population Structure and LD
The PCA and kinship relationship of the GWAS panel is illustrated in Figure 6, which reveals the absence of clear-cut sub-groups. The LD was calculated by using the squared correlation co-efficient (r 2 ) of all the SNPs. The LD decay was rapid with 3.6 cM in A subgenome, followed by 5.2 cM in D subgenome and 5.7 cm in B subgenome, whereas, the whole genome LD decay was 4.9 cM.

Genome-Wide Association Studies
A set of 26 significant MTAs was detected including 16 for GPC, 4 for TKW, and 6 for NDVI ( Table 2). The details of the detected MTAs are presented in Table 2 and illustrated as Manhattan plots in Figure 7a,b. The Q-Q plots depicting the observed associations of SNPs of GPC, TKW, and NDVI compared to the expected associations after accounting for population structure are presented in Figure 7a,b.   A set of 16 significant MTAs was detected for GPC in E1, E2, and E3 on 1A, 1B, 1D, 2B, 3B, 4B, 5A, 5B, 5D, 6A, 6B, and 7A and PVE ranged from 6.2% (AX-94749397) to 11.4%

Discussion
Although yield enhancement has been the main focus of crop improvement programs across the globe for a long time, wheat quality enhancement is gaining importance only in the recent past. Wheat improvement for quality is a tedious, expensive, and time-taking process, which makes quality improvement programs slow and protracted. Further, yield and quality enhancement in wheat was mostly phenotype-based selection through conventional breeding for many years. However, genotype-based approaches can complement conventional methods in cultivar development programs. Moreover, recent efforts that led to the sequencing of the wheat genome could further enhance the potential of markerbased breeding in wheat. Several MTAs/QTLs were detected for various economic traits in wheat. However, further genetic studies are suggested using different germplasm or populations as mapping has not reached a saturation level [52]. Further, hexaploid wheat has three subgenomes with a large genome size of~17 Gb, and there is always a possibility to map new QTLs/MTAs for quality traits. In addition, ample genetic diversity is present in the unexplored gene bank accessions and elite breeding materials, which make suitable candidates to dissect the genetic basis and to identify novel MTAs through GWAS analysis.

Variability, Correlation, and GEI
The expression of GPC, TKW, and NDVI has been greatly influenced by the effects of the environment and GEI. GPC was relatively more environment-sensitive, whereas, TKW was a largely stable trait. Significant effects of environment and GEI have been described in earlier reports [5][6][7][8]79]. The GWAS panel has been evaluated in multi-environments, as GEI is an important factor to identify environment-specific and consistent QTLs. The trait's environmental sensitivity was also reflected in the trend of broad sense heritability, as TKW recorded high heritability as compared to TKW and NDVI. Similarly, TKW has recorded the lowest percent CV and highest GCV compared to the other two traits.
The negative association between TKW and GPC observed in the current study was also reported previously in several reports [42,80]. This well-established negative correlation between GPC and TKW was partly explained by the dilution effect [17,18]. This negative association may also be attributed to nutrition (particularly nitrogen) competition between TKW and GPC. Although the correlation between GPC and TKW was not significant, the direction of association was positive. The positive and significant association between GPC and NDVI was also observed in the previous studies [81]. However, the correlation between TKW and NDVI was significant and negative. Previously a significant and negative correlation between TKW and NDVI was also reported [49].

Linkage Disequilibrium
The PCA analysis of the high-quality SNPs that exhibited allelic frequency was evenly distributed without any separate sub-populations. Allelic frequency of equal distribution was obtained through the careful selection of elite breeding lines for different agroecological zones. Wheat being a self-pollinated crop generally has a larger LD block size and, hence, slowly decays [82], compared to cross-pollinated crop plants like maize, where the LD decay is rapid [83]. QTL mapping resolution may be reduced due to presence of large LD blocks and vice versa [84]. The LD decay distance of~3 cM of A subgenome is shorter than the B and D subgenomes, which have a decay distance of~5 cM. A similar LD decay pattern was also recorded previously in wheat GWAS studies [44,84,85]. The LD among the populations may vary due to various factors including population size, non-random mating, random genetic drift, selection, admixtures, mutation, pollination pattern, and recombination frequency [86,87].

MTAs
A set of 26 MTAs was identified for GPC (16), TKW (4), and NDVI (6). A maximum of 10 MTAs were mapped on the B subgenome, whereas, 8 MTAs each were mapped on the A and D subgenomes. The pattern of subgenome-wise marker distribution is also similar, as maximum markers were located on subgenome B, and an approximately similar number of markers were mapped on A and D subgenomes. In earlier studies also, a similar pattern of QTL and marker distribution among the subgenomes for grain-quality traits was reported [21,26]. Krishnappa et al. [22] studied a RIL population wherein none of the QTL was identified on the D subgenome due to a very less distribution of markers; however, the enrichment of the D genome with additional SNP markers in the same mapping population has significantly increased the power of QTL identification. Therefore, marker frequency and distribution along with the type and size of the mapping population are important determinants of QTL mapping.

Putative Candidate Genes
Through BLAST search, several putative candidate genes underlying MTAs for GPC, TKW, and NDVI were identified ( Table 3). The MTAs identified on different chromosomes of wheat are present in the gene coding regions associated with different transcription factors, transmembrane proteins, zinc finger superfamilies, etc. For instance, AX-94520919 linked to GPC encodes NAC domain (TraesCS5D02G537600) genes, which regulate protein accumulation in wheat grains. A NAC transcription factor (NAM-B1) that enhances nutrient redistribution from source to sink and accelerates senescence is encoded by the ancestral wild wheat allele [14]. Another transcription factor, i.e., the OsNAC-like transcription factor, is reported to regulate seed-storage protein concentration in rice [62]. An NAC transcription factor (HvNAM1) controls anthesis time, senescence, and grain protein content in barley [63]. NAM proteins control the movement of nitrogen, zinc, and iron from vegetative tissues to developing grains in wheat [64]. NAC transcription factors enhance the acceleration of leaf senescence, and thereby remobilize iron and zinc to seeds in rice [65]. An SNP, i.e., AX-94770504 linked to GPC, encodes folylpolyglutamate synthase (TraesCS4B02G392600) found to have a role in nitrogen utilization. In the early seedling development stage, the gene for mitochondrial folylpolyglutamate synthetase regulates nitrogen utilization in Arabidopsis [66]. Similarly, another SNP, i.e., AX-94537786 associated with GPC, encodes zinc finger, RING-H2-type (TraesCS6A02G274400), that controls the protein accumulation in wheat. In rice, the CCCH-type zinc finger protein (OsGZF1) regulates the GluB-1 promoter, a seed storage protein, and controls the accumulation of glutelin protein during grain development [60]. The C2H2 zinc finger family transcription factor regulated grain-related traits in maize [61]. Further, AX-95199688 associated with GPC encodes the aspartic peptidase domain (TraesCS7A02G208600) and was found to have a role in gluten breakdown. Gluten aspartic proteinase (GlAP 2) is associated with gluten breakdown in wheat [67].

Conclusions
The study with a set of 280 diverse bread-wheat genotypes revealed that GPC, TKW, and NDVI are quantitative traits. The negative association of GPC and TKW suggests that there is a trade-off between grain protein content and grain weight. However, GPC and NDVI are positively correlated, as both these traits are much influenced by the soil nitrogen status. A total of 26 MTAs, including 16 for GPC, six for NDVI, and four for TKW, were identified. Several putative candidate genes encoding main functions such as zinc, iron, and protein remobilization, increased nitrogen use efficiency, photosynthesis regulation, endosperm development, mitochondrial function, and stress tolerance are reported. Further, functional characterization of these putative genes to understand their role in wheat growth and development is envisaged.