Post-GWAS Prioritization of Genome–Phenome Association in Sorghum

Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
Translational and Integrated Sciences Lab, Department of Genetics, University of North Carolina Chapel Hill, Chapel Hill, NC 27599, USA
Department of Plant, Soil and Microbial Sciences, Michigan State University, East Lansing, MI 48824, USA
Arizona Experiment Station, University of Arizona, Tucson, AZ 85721, USA
Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
KnowledgeVis, LLC., Altamonte Springs, FL 32701, USA
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Agronomy 2024, 14(12), 2894;
Submission received: 23 September 2024 / Revised: 20 November 2024 / Accepted: 26 November 2024 / Published: 4 December 2024
(This article belongs to the Special Issue Advances in Data, Models, and Their Applications in Agriculture)


Genome-wide association studies (GWAS) are widely used to infer the genetic basis of traits in organisms; however, selecting appropriate thresholds for analysis remains a significant challenge. In this study, we introduce the Sequential SNP Prioritization Algorithm (SSPA) to investigate the genetic underpinnings of two key phenotypes in Sorghum bicolor: maximum canopy height and maximum growth rate. Using a subset of the Sorghum Bioenergy Association Panel cultivated at the Maricopa Agricultural Center in Arizona, we performed GWAS with permissive-filtered thresholds to identify genetic markers associated with these traits, enabling the identification of a broader range of explanatory candidate genes. Building on this, our proposed method employed a feature engineering approach leveraging statistical correlation coefficients to unravel patterns between phenotypic similarity and genetic proximity across 274 accessions. This approach helps prioritize Single Nucleotide Polymorphisms (SNPs) that are likely to be associated with the studied phenotype. Additionally, we conducted a complementary analysis to evaluate the impact of SSPA by including all variants (SNPs) as inputs, without applying GWAS. Empirical evidence, including ontology-based gene function, spatial and temporal expression, and similarity to known homologs, demonstrates that SSPA effectively prioritizes SNPs and genes influencing the phenotype of interest, providing valuable insights for functional genetics research.

1. Introduction

Genome-wide association studies (GWAS) [1] are widely used to infer the genetic basis of phenotypic traits in organisms; however, the challenge lies in selecting an appropriate statistical significance threshold (p-value) for the analysis, considering the false discovery rate [2]. A stringent significance threshold for associations could potentially miss causative variants present at lower frequencies in the population. Additionally, a stringent trait correlation filter is likely to be confounded with the population structure. While these features could potentially be highly predictive of the phenotypes, they may not provide insights into underlying biology and gene function, but rather into the breeding history or source of accessions in the population. Therefore, our objective is to utilize certain permissive-filtered GWAS thresholds that retain many more candidate Single Nucleotide Polymorphisms (SNPs) than the GWAS analysis itself would have identified as significantly associated with the phenotypic trait. This approach would likely increase false positives, leading to subsets of potentially informative SNPs whose association with the phenotypes was presumably not confounded with the population structure.
Building on the genetic markers identified by permissive-filtered GWAS thresholds, we introduce the Sequential SNP Prioritization Algorithm (SSPA), which employs a feature engineering technique commonly used in Machine Learning (ML). The core idea is to compare the phenotypic similarity among accessions (determined by differences in their normalized phenotypic trait measurements) with their genetic relatedness using statistical correlation coefficients. This process aids in prioritizing SNPs and genes that are likely to influence the phenotype under study. Our algorithm also incorporates a ranking of the prioritized SNPs based on their impact on phenotypic traits, addressing the challenge of discovering SNPs with lower effect estimates through traditional GWAS [3]. To evaluate the effectiveness of our method, we focused on the phenotypes of maximum canopy height and maximum growth rate across 274 accessions of Sorghum bicolor, a subset of the Sorghum Bioenergy Association Panel cultivated at the Maricopa Agricultural Center in Arizona. Our code and data are available on GitHub.
We primarily intended to leverage GWAS due to its proven capability to identify causal SNPs for complex phenotypic traits. Therefore, our method is designed as a post-GWAS analysis, following the selection of less stringent GWAS thresholds. In the literature, the most common approach for post-GWAS prioritization of SNPs involves annotating different genomic features, such as Expression Quantitative Trait Loci (eQTLs) and sequence-specific DNA-binding factors, with GWAS outputs to identify SNPs for further investigation [4]. These methods depend on the physical proximity of an SNP to a known genomic feature to assign relevance; however, while physical proximity can suggest function, it does not necessarily mean that the SNP will affect the phenotype of interest.
Researchers have also explored other post-GWAS methods to identify the most significant loci, which can further be utilized for functional delineation and plant breeding improvement programs [5]. Some approaches were based on computing prioritization scores using p-values, while others utilized meta-analysis, pathway-based analysis, haplotype-based analysis, etc. Cai et al. conducted post-GWAS analysis by combining gene-based association signals and gene expression data to identify putative causal genes for mastitis resistance in dairy cattle [6]. Marina et al. [7] proposed guilt by association based prioritization analysis using functional information collected from various sources and exploited fuzzy-based similarity measures to identify functional candidate genes for milk and cheese-making traits in Assaf and Churra dairy breeds. Additionally, ML and deep learning (DL) algorithms were also applied for post-GWAS analysis to prioritize disease-associated loci by annotating numerous biological features with GWAS outputs, considering prioritization as a classification problem [8].
In this work, SNP prioritization was achieved using a feature engineering approach that leverages statistical correlation coefficients to compare phenotypic similarity among accessions with their genetic proximity. This contrasts with the p-value-based methods commonly used in existing score-based post-GWAS prioritization schemes. Our goal was to establish associations between genotypes and phenotypes based on correlation rather than regression, which is often used in genome–phenome association studies, e.g., GWAS and BLUP-based methods [9]. While both regression and correlation are statistical methods for examining relationships between two variables, they can yield different insights depending on the patterns present in the dataset. Regression analysis typically models the relationship between independent and dependent variables to predict the dependent variable’s value. Conversely, correlation quantifies the strength and direction of the linear relationship between two variables, ranging from −1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no linear relationship. It therefore indicates whether any linear relationship exists between them at all.
Moreover, instead of analyzing the data at the level of individual accessions, we examined the relationships between every possible pair of accessions. This pairwise approach provides a more detailed understanding of how genetic and phenotypic similarities correspond across accessions, revealing nuanced associations that might be missed when considering accessions individually. Further, we utilized normalized phenotype measurements to compute the phenotypic similarity among accessions using a logistic growth curve model, which enhanced the robustness of our analysis against outliers and improved the comparability of the results across different environmental conditions. Additionally, we explored the prospect of our proposed algorithm SSPA by investigating the differences in results when applying a GWAS filter and when not using GWAS. In summary, the contribution of this work is as follows:
  • Proposed the SSPA algorithm that utilized a feature engineering technique based on statistical correlation for post-GWAS prioritization of SNPs likely to influence the phenotype of interest.
  • Conducted experiments using 274 accessions of Sorghum bicolor grown in Arizona, focusing on maximum canopy height and maximum growth rate phenotypes.
  • Assessed the generalizability of our method by using the phenotypic measurements of the same accessions grown in a different environment (South Carolina).
  • Investigated the impact of SSPA in prioritizing SNPs without the GWAS filtering.

2. Materials and Methods

2.1. Plant Material and Phenotype Dataset Description

The multi-year TERRA Phenotyping Reference Platform (TERRA-REF) program [10,11] studied how the environment and other factors affected the growth of agricultural products. This program aims to maximize growth and, consequently, maximize agricultural bioenergy output. Season 6 of the program was conducted in the summer of 2017, where Sorghum bicolor was chosen as the target organism due to its drought tolerance and variety of applications in food, fuel, and fiber, with potential improvement through targeted breeding. A subset of 336 sorghum accessions from the Bioenergy Association Panel (BAP) [12] was cultivated at the Maricopa Agricultural Center (MAC) in Arizona. Throughout the growing season, automated large field scanner systems monitored these plants for phenotype measurements every day or two. In addition, a small number of common phenotypes were also recorded by field scientists.
A suite of measurements, including photographs, canopy height, leaf characteristics, and soil environmental factors, collected regularly by automated systems, were utilized to analyze each accession’s growth patterns [10]. Two replicate plots of each accession were planted within a 200 m × 20 m field. Data collected by TERRA-REF are available in the public domain, along with experimental design, measurement methods, and other study metadata [11]. Our analysis utilized canopy height measurements which are available for most days during the growing season and whole-genome resequencing data. Plot-level canopy height measurements were estimated from laser scans using an algorithm calibrated to hand measurements.

2.2. Phenotype Data Preparation and Normalization

Automated measurements of canopy height over the growing season were fitted with logistic growth curves to obtain the phenotypic traits of maximum canopy height and maximum growth rate relative to growing degree days (gdd), defined as the accumulation of heat energy above 10 °C following the “Method 1” described by McMaster and Wilhelm [13]. This normalization approach can yield phenotypic traits that are less sensitive to outliers, more comparable across varying environmental conditions, and permit greater biological inference.
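As a minimal sketch of the growing-degree-day covariate, assuming daily maximum and minimum air temperatures in °C are available, the following Python functions accumulate gdd above a 10 °C base. The function names and input layout are illustrative and are not taken from the TERRA-REF code; the rule that a daily mean below the base temperature contributes zero follows Method 1 of McMaster and Wilhelm [13].

```python
def daily_gdd(t_max, t_min, t_base=10.0):
    """Growing degree days for a single day (Method 1).

    The daily mean temperature is compared with the base temperature;
    a mean below the base contributes zero rather than a negative value.
    """
    t_mean = (t_max + t_min) / 2.0
    return max(t_mean - t_base, 0.0)


def cumulative_gdd(daily_temps, t_base=10.0):
    """Running gdd total over a sequence of (t_max, t_min) pairs in deg C."""
    total = 0.0
    series = []
    for t_max, t_min in daily_temps:
        total += daily_gdd(t_max, t_min, t_base)
        series.append(total)
    return series
```

Calling `cumulative_gdd` on the season's temperature record gives one gdd value per day, which can then be paired with the canopy height measured on that day.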
First, we applied a cleaning algorithm that excluded accessions with fewer than 35 automatic measurements per season or measurements on fewer than 40% of days, resulting in a total of 326 accessions. This preprocessing step yielded a time series that could robustly estimate the parameters of two phenotypes.
Time series of canopy height and growing degree days (gdd) for each accession were modeled in a Bayesian framework with the likelihood:
$$\mathrm{height}_i \sim \mathrm{Normal}(\mu_i, \sigma^2)$$
where $\mu_i$ is the expected value of height, $\sigma^2$ is the measurement error variance, and $i$ indexes each observation. The logistic model was defined as:
$$\mu_i = \frac{c}{1 + a\,e^{-b \times \mathrm{gdd}_i}}$$
with the logistic parameters a, b, and c and the covariate gdd associated with each observation; however, we re-parameterized the model to obtain more biologically meaningful parameters by defining:
$$Y_{max} = c$$
$$Y_{min} = \frac{c}{1 + a}$$
$$R_{half} = \frac{c \times b}{4}$$
Finally, we transformed the logistic parameters to the whole real line, which allowed for faster mixing and improved model convergence:
$$c = e^{\theta_c}$$
$$a = e^{\theta_a} - 1$$
$$b = e^{\theta_b}$$
All root nodes were given wide, relatively non-informative standard priors, including Normal(0, 1000) for all transformed logistic parameters ($\theta_a$, $\theta_b$, $\theta_c$) and Gamma(0.1, 0.1) for the observation precision $1/\sigma^2$, where Normal() and Gamma() refer to probability distributions.
We implemented the above model in JAGS 4.3.0 [14] via R/rjags [15,16]. Three parallel Markov Chain Monte Carlo (MCMC) sequences were assigned dispersed starting values. The models were run until convergence was achieved at a Gelman and Rubin diagnostic value of <1.2 [17]. An additional run of 10,000 iterations with a thinning interval of 10 yielded a total of 3000 relatively independent posterior samples (1000 per chain) for each parameter of interest. The posterior medians of Ymax (maximum canopy height, cm) and Rhalf (maximum growth rate, cm/gdd) of the accessions were considered as phenotype measurements in our experiment. Figure 1 shows the logistic growth curves obtained for two sample accessions. The data and code for the logistic growth curve model are publicly available [18].
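The re-parameterization can be illustrated with a short Python sketch. This is not the JAGS model itself. It assumes the standard logistic form with a negative exponent, so that height increases with gdd, and the function names are hypothetical. Given the unconstrained parameters θa, θb, and θc, it recovers a, b, and c and the derived traits Ymax, Ymin, and Rhalf.

```python
import math


def logistic_params(theta_a, theta_b, theta_c):
    """Map unconstrained parameters back to the logistic parameters a, b, c."""
    c = math.exp(theta_c)        # c > 0
    a = math.exp(theta_a) - 1.0  # a > -1, so 1 + a > 0
    b = math.exp(theta_b)        # b > 0
    return a, b, c


def expected_height(gdd, theta_a, theta_b, theta_c):
    """Expected canopy height at a given gdd (assumed negative exponent)."""
    a, b, c = logistic_params(theta_a, theta_b, theta_c)
    return c / (1.0 + a * math.exp(-b * gdd))


def derived_traits(theta_a, theta_b, theta_c):
    """Biologically meaningful quantities derived from a, b, c."""
    a, b, c = logistic_params(theta_a, theta_b, theta_c)
    return {
        "Ymax": c,               # upper asymptote (maximum canopy height)
        "Ymin": c / (1.0 + a),   # expected height at gdd = 0
        "Rhalf": c * b / 4.0,    # slope at the inflection point
    }
```

Under this form, Rhalf is the slope of the curve at its inflection point, where height equals Ymax/2. This is why it serves as the maximum growth rate per unit gdd.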

2.3. Computation of Phenotype Similarity Matrix

Based on the normalized phenotypic measurements of the sorghum accessions, we computed the phenotype similarity matrix separately for each phenotype: maximum canopy height and maximum growth rate. We assumed that the accessions with close genetic proximity would exhibit smaller differences in their phenotypic observations. Here, each phenotype similarity matrix represents the potential correlation (similarity) among accessions. The rows and columns of these matrices correspond to the accessions, and the values indicate the similarity between pairs of accessions, ranging from 0 to 1. A higher similarity value indicates greater similarity between the accessions based on the phenotype under study. The similarity value ($s_{ij}$) between the accessions $i$ and $j$ with phenotype measurements $p_i$ and $p_j$, respectively, was calculated as follows:
  • Step 1: The difference in their phenotypic measurements was computed.
    $$x_{ij} = |p_i - p_j|$$
  • Step 2: Min-max scaling was applied to the difference ($x_{ij}$), using the following equation, to scale the difference to the range [0, 1]:
    $$x_{ij}^{scaled} = \frac{x_{ij} - x_{min}}{x_{max} - x_{min}}$$
    where $x_{min}$ and $x_{max}$ denote, respectively, the minimum and maximum values obtained over all $x_{ij}$.
  • Step 3: To obtain the similarity value, the scaled difference was subtracted from 1, so that a higher value indicates a greater similarity between accessions.
    $$s_{ij} = 1 - x_{ij}^{scaled}$$
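The three steps above can be sketched in plain Python as follows. The function name is illustrative. The sketch assumes that the minimum and maximum are taken over all pairs, including each accession paired with itself, which makes the minimum difference zero.

```python
def phenotype_similarity(p):
    """Pairwise phenotype similarity matrix from a list of measurements.

    Step 1: absolute pairwise differences.
    Step 2: min-max scaling of the differences to [0, 1].
    Step 3: similarity = 1 - scaled difference.
    """
    n = len(p)
    diffs = [[abs(p[i] - p[j]) for j in range(n)] for i in range(n)]
    flat = [v for row in diffs for v in row]
    x_min, x_max = min(flat), max(flat)
    span = x_max - x_min
    if span == 0:
        # All accessions have identical measurements: treat as fully similar.
        return [[1.0] * n for _ in range(n)]
    return [[1.0 - (v - x_min) / span for v in row] for row in diffs]
```

For example, three accessions with maximum canopy heights of 100, 150, and 200 cm give a similarity of 0.5 between neighbours and 0 between the shortest and tallest.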

2.4. Genomic Data Preparation and GWAS

We utilized the Sorghum BTX623 reference genome v3.1 for our study. Whole-genome sequencing data were obtained from the TERRA-REF project [11], where 384 sorghum accessions from the sorghum Bioenergy Association Panel (BAP) were sequenced with ~25× coverage. Shotgun sequencing (127-bp paired-end) was conducted using an Illumina HiSeq X Ten instrument (San Diego, CA, USA) at the HudsonAlpha Institute for Biotechnology. Following the alignment of sequence reads to the BTX623 reference genome v3.1 (available from Phytozome [19]), we downloaded GVCF files with variant calls from CyVerse, merged them, and applied filtering protocols as outlined in the TERRA-REF documentation [20].
As described in the documentation, the GVCF files with variant calls were generated using the GATK HaplotypeCaller tool, with the emitRefConfidenceGVCF option, and were combined using the GATK CombineGVCFs function. Joint genotyping was performed using GATK GenotypeGVCFs to create a single VCF. To control the genotyping quality, we applied hard SNP filters using GATK VariantFiltration with the following parameters: QD < 2.0 to remove variants with low confidence relative to depth, FS > 60.0 to exclude variants with excess strand bias, MQ < 40.0 to exclude variants with poor mapping quality, MQRankSum to filter variants with significant discrepancies in mapping quality between reference and alternate alleles, and ReadPosRankSum < −8.0 to reject variants where alternate alleles are disproportionately positioned near read ends. In addition, we used VCFtools to filter and recode variant calls, removed accessions lacking phenotypic observations, retained only bi-allelic sites, and excluded rare alleles with excessive missing data. This set of SNPs was utilized while exploring the impact of our proposed algorithm SSPA without the GWAS filtering.
Next, GWAS was conducted using rMVP [21] with the FarmCPU algorithm [22] for each normalized phenotype (maximum canopy height and maximum growth rate), incorporating a kinship matrix and one principal component to account for relatedness and population structure. The results were filtered with p-values of 0.001, 0.005, and 0.01, resulting in three distinct lists of SNPs. Additionally, these lists were cross-referenced with QTL regions known to be associated with height and growth obtained from the Sorghum QTL Atlas [23]. This process yielded six SNP lists for each phenotype, which were used to filter the GVCF file, resulting in a total of 12 VCF files corresponding to the two phenotypes, hereafter, referred to as experimental conditions. Detailed data and links to download the VCF files, both before and after the GWAS filtering, are available in the Supplementary Materials.
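To make the construction of the six SNP lists per phenotype concrete, the following Python sketch filters a table of GWAS results at the three p-value thresholds, with and without restriction to QTL regions. The record fields (`snp`, `chrom`, `pos`, `p`), the region tuples, and the strict inequality are assumptions made for illustration; they do not reflect the output format of rMVP or the Sorghum QTL Atlas.

```python
def in_any_region(chrom, pos, regions):
    """True if (chrom, pos) falls inside any (chrom, start, end) region."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


def build_snp_lists(gwas_hits, qtl_regions, thresholds=(0.001, 0.005, 0.01)):
    """Return one SNP list per (threshold, QTL filter) combination.

    gwas_hits: list of dicts with hypothetical keys 'snp', 'chrom', 'pos', 'p'.
    qtl_regions: list of (chrom, start, end) tuples for trait-related QTL.
    """
    lists = {}
    for t in thresholds:
        passing = [h for h in gwas_hits if h["p"] < t]
        lists[(t, "all")] = passing
        lists[(t, "qtl")] = [
            h for h in passing
            if in_any_region(h["chrom"], h["pos"], qtl_regions)
        ]
    return lists
```

Running this once per phenotype gives 2 × 6 = 12 lists, matching the 12 experimental conditions.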

2.5. Creation of SNP Similarity Matrices

Subsequently, each VCF file corresponding to an experimental condition was processed to create SNP similarity matrices, comparing the allelic similarity of every pair of accessions at that chromosomal location, hence each SNP similarity matrix corresponds to a single SNP. The values of these similarity matrices were computed with an addition of 0.5 for each matching allele between any pair of accessions, while reversed heterozygous SNPs received a value of 1.0. For instance, if a VCF file consists of ‘N’ SNPs across ‘a’ number of unique accessions, there would be ‘N’ SNP similarity matrices. Each of these similarity matrices is a square matrix of dimensions ‘a × a’, with values encoded as one of three discrete values: {0, 0.5, 1.0}.
From the 326 accessions obtained as part of the logistic growth model, we used 274 accessions that were common to both Seasons 4 and 6 of the TERRA-REF program for generating the similarity matrices (the common set of accessions was considered for future work). Thus, each SNP similarity matrix in our experiment has dimensions of 274 × 274.
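A minimal Python sketch of the scoring rule for a single bi-allelic SNP is given below. It assumes each genotype is a pair of allele codes (e.g., 0 for reference and 1 for alternate) with no missing calls; how missing genotypes were handled is not specified here, and the function names are illustrative. Sorting the two alleles before comparison makes reversed heterozygous calls (0/1 versus 1/0) score 1.0.

```python
def genotype_similarity(g1, g2):
    """Similarity of two diploid genotypes: 0.5 per matching allele.

    Alleles are sorted first so that phase order does not matter,
    e.g. (0, 1) and (1, 0) are treated as identical.
    """
    a1, a2 = sorted(g1), sorted(g2)
    return 0.5 * sum(x == y for x, y in zip(a1, a2))


def snp_similarity_matrix(genotypes):
    """a x a similarity matrix for one SNP across 'a' accessions."""
    return [[genotype_similarity(gi, gj) for gj in genotypes]
            for gi in genotypes]
```

Every entry is one of 0, 0.5, or 1.0, and for bi-allelic sites the value equals one minus half the absolute difference in alternate-allele counts.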

2.6. Sequential SNP Prioritization Algorithm (SSPA)

Thereafter, we aimed to obtain a set of prioritized SNPs individually for each experimental condition by comparing the corresponding SNP similarity matrices with the respective phenotype similarity matrix. The objective was to identify SNP similarity matrices that closely resemble the phenotype similarity matrix, i.e., SNPs wherein the genetic variations between the accessions align with the variations in the phenotypic measurements. To achieve this, we developed the SSPA using a feature engineering approach. This algorithm is inspired by the concept of Sequential Forward Selection (SFS) [24]. We employed the Pearson correlation coefficient, $\rho(X, Y)$, to measure the similarity (relationship) between the two data matrices. It is defined as the covariance of two data matrices X and Y divided by the product of their standard deviations:
$$\rho(X, Y) = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
where $\mu_X$ and $\mu_Y$ are the means and $\sigma_X$ and $\sigma_Y$ are the standard deviations of the data matrices X and Y, respectively.
SFS is a dimensionality reduction technique commonly used in ML to manage high-dimensional datasets. Its purpose is to extract a subset of features that contribute most to the prediction performance of the model. The method operates iteratively, selecting the most important features one by one based on an evaluation criterion and appending them to an initially empty candidate list. The stopping criteria can be defined as a pre-set number of features (i.e., the desired number of selected features) or the point at which the addition of further features no longer improves the evaluation criterion.
Our proposed method SSPA works in a similar fashion: We initiated the process by defining a “maximum desired number of prioritized SNPs” (e.g., 1000), and an empty candidate list. In the first iteration, we selected a SNP similarity matrix from the set of SNP similarity matrices that resulted in the highest correlation coefficient with the phenotype similarity matrix and added it to the candidate list. Starting from the second iteration, we selected the SNP similarity matrix from the remaining set (those not present in the candidate list) that, when summed up with all the similarity matrices together (matrix summation) in the candidate list, produced the highest correlation coefficient with phenotype similarity matrix. This selected matrix was then appended to the candidate list. This process was continued iteratively, each time seeking the next most cumulatively impactful SNP similarity matrix and adding it to the candidate list until the number of matrices in the candidate list reached the “maximum desired number of prioritized SNPs”. During the execution, we also recorded the highest correlation coefficient obtained at each iteration in a list, allowing us to determine the iteration index (K) with the maximum correlation coefficient. Based on K, we obtained a set of prioritized SNPs by selecting the first ‘K’ SNP similarity matrices from the candidate list. Therefore, at the end, our method generates a final SNP similarity matrix (sum of ‘K’ number of prioritized SNP similarity matrices) based on the allelic proximities among accessions maximizing the correlation coefficient with the phenotype similarity matrix. Algorithm 1 describes the complete process.
Algorithm 1. Sequential SNP Prioritization Algorithm (SSPA)
Input:
  • ‘N’ SNP similarity matrices (one per SNP) involving ‘a’ unique accessions: Sj ∈ {0.0, 0.5, 1.0}, where j = 1, 2, …, N; each Sj is of dimension a × a.
  • Phenotype similarity matrix ‘P’ across the same ‘a’ accessions, with dimension a × a.
  • Maximum desired number of prioritized SNPs: ‘n’.
Output:
  • ‘K’ prioritized SNPs, where K ≤ n.
Initialize: Candidate list: L = ∅; Correlation coefficient list: C = ∅.
  • Step 1: Compute the Pearson correlation coefficient of each Sj with P, for j = 1, 2, …, N, and select Sj with the highest correlation coefficient, denoted by Sj*.
  • Step 2: Append Sj* to L and the highest correlation coefficient value to C.
  • Step 3: For each Sj not yet in L (j ∈ {1, 2, …, N}, Sj ∉ L), compute the matrix summation of Sj with all the similarity matrices in L, and calculate the Pearson correlation coefficient of this sum with P.
  • Step 4: Select Sj from Step 3 that resulted in the highest correlation coefficient value and append it to L and the highest correlation coefficient value to C.
  • Step 5: Repeat Steps 3 and 4 until the number of SNP similarity matrices in L reaches the maximum desired number of prioritized SNPs (n).
  • Step 6: Select the index K of the list C with the maximum correlation coefficient.
  • Step 7: Select the first K number of SNP similarity matrices from L (i.e., the summation of which, termed as the final similarity matrix, resulted in the maximum correlation coefficient value with P). These SNP similarity matrices correspond to K-prioritized SNPs.
Figure 2 presents a streamlined flowchart of this algorithm. This process enables the cumulative identification of prioritized SNPs through a sequential evaluation of both local and global relationships with phenotypic traits, ultimately ranking them by their influence on the studied phenotype. In the functional analysis of these prioritized SNP sets, we refer to the term “Top k” to indicate the first ‘k’ SNPs in the prioritized SNP list.
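The steps of Algorithm 1 can be sketched in plain Python as shown below. This is an illustrative re-implementation rather than the released code. It assumes that the Pearson correlation is computed over all entries of the flattened a × a matrices, including the diagonal, and that ties are broken by input order; neither detail is specified in the text. The helper names are hypothetical.

```python
import math


def _flatten(matrix):
    return [value for row in matrix for value in row]


def _pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return float("-inf")  # undefined correlation: never preferred
    return cov / math.sqrt(var_x * var_y)


def sspa(snp_matrices, phenotype_matrix, max_snps):
    """Return (indices of the K prioritized SNPs in rank order, list C)."""
    target = _flatten(phenotype_matrix)
    flat = [_flatten(s) for s in snp_matrices]
    remaining = list(range(len(flat)))
    running_sum = [0.0] * len(target)   # sum of matrices already in L
    candidates, scores = [], []         # L and C in Algorithm 1

    while remaining and len(candidates) < max_snps:
        best_j, best_r, best_sum = None, float("-inf"), None
        for j in remaining:
            # Steps 1 and 3: correlate (sum of L) + Sj with P.
            trial = [r + v for r, v in zip(running_sum, flat[j])]
            r = _pearson(trial, target)
            if r > best_r:
                best_j, best_r, best_sum = j, r, trial
        if best_j is None:
            break
        # Steps 2 and 4: append the best SNP and its coefficient.
        candidates.append(best_j)
        scores.append(best_r)
        running_sum = best_sum
        remaining.remove(best_j)

    if not scores:
        return [], []
    # Steps 6 and 7: keep the first K entries, K = position of the maximum.
    k = scores.index(max(scores)) + 1
    return candidates[:k], scores
```

Each iteration re-evaluates every remaining SNP against the current sum, so the cost grows with the product of the number of input SNPs and the cap n. This is consistent with capping n at a value such as 1000 for large inputs.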

2.7. Prioritizing SNPs Without GWAS Filtering

As a complementary analysis, we explored the impact of our proposed algorithm SSPA when applied to all variants (SNPs) without the GWAS filtering. For this experiment, we only focused on the maximum canopy height phenotype. Due to the large volume of data and limited computational resources, we were unable to execute our method on the full GVCF file at once. Therefore, we created 10 VCF files, each containing SNPs corresponding to one of the chromosomes (detailed data available in the Supplementary Materials). The goal was to first process each VCF file individually to prioritize SNPs using SSPA following the creation of SNP similarity matrices from each chromosome. Next, we merged the prioritized lists of SNPs from each chromosome together and executed the SSPA on the SNP similarity matrices that correspond to the combined set of prioritized SNPs. This enabled us to obtain a final set of prioritized SNPs without the GWAS filtering for the analysis.

2.8. SNP Annotation and Function

The prioritized SNPs extracted by SSPA were annotated with overlapping or nearest known gene identifiers using the SnpEff package [25]. The analysis of function was carried out on either Phytozome [19], Gramene [26], Planteome [27], or the Gene Ontology [28] online databases. A list of unique gene identifiers was submitted for Gene Ontology (GO) Enrichment Analysis using PANTHER 17.0 [29] and the Sorghum bicolor reference list [30]. We performed separate enrichment analyses using the complete set of GO Biological Process, Molecular Function and Cellular Component annotation datasets. Fisher’s exact test with FDR correction was also employed.
The genes identified that neighbor the “Top 10” SNPs (from SnpEff) for both phenotypes with the GWAS filtering (with and without QTL filtering) at p < 0.001 were further explored for their functional role by using Gramene and the EMBL-EBI Gene Expression Atlas. This was also performed for the “Top 10” SNPs for the maximum height phenotype obtained without the GWAS filtering. InterPro protein domain annotations [31] were obtained by querying Gramene BioMart [26,32]. The gene IDs were also used to query the EMBL-EBI Gene Expression Atlas [33] to explore their baseline gene expression. The baseline expression data of genes showing a default minimum expression level of 0.1 transcripts per million (TPM) were downloaded for experiments E-MTAB-3839 [34], E-MTAB-4400 [35], and E-GEOD-98817 [36]. Expression heatmaps were generated using ClustVis (Beta) [37]. The Sorghum QTL data were queried from the Sorghum QTL Atlas [23] and mapped on Gramene’s sorghum genome browser. The SNPs were also queried using Gramene’s variant effect prediction (VEP) tool [38,39] to find any overlapping genes or existing SNPs in that database.

2.9. Generalizability Testing

To validate the generalizability of our proposed method in a different environment, we considered end-of-season plant height data of the same accessions grown in Clemson, Florence County, South Carolina [12]. The goal was to examine the associations (relationships) of already extracted prioritized SNPs with the phenotypic observations obtained in a new environment.
We created a phenotype similarity matrix using the end-of-season phenotypic measurements from Clemson following the process described in Section 2.3. (For Clemson, we were unable to apply the logistic growth curve model for normalization due to certain environmental factors; however, these measurements are assumed to be comparable with the normalized phenotype measurements of MAC). Next, we calculated the Pearson correlation coefficient between the Clemson phenotype similarity matrix and the final SNP similarity matrix (i.e., the summation of similarity matrices corresponding to the prioritized SNPs), obtained for each experimental condition with GWAS filtering based on the maximum canopy height of the MAC Season 6 dataset. A similar validation was also conducted with the final set of prioritized SNPs obtained without the GWAS filtering.

3. Results

3.1. Experiments on Post-GWAS Prioritization of SNPs

Figure 3 outlines our proposed approach for post-GWAS prioritization of SNPs and the evaluation process conducted in this study. For each experimental condition that corresponds to a specific phenotype and GWAS filtering threshold (described in Section 2.4), we obtained a set of prioritized SNPs using the SSPA. Table 1 summarizes the number of GWAS-filtered SNPs used as input to our algorithm and the respective results. In cases where the total number of GWAS-filtered SNPs (input) exceeded 1000, the maximum desired number of prioritized SNPs (n) was set to 1000 during the execution of SSPA. We observed that this limit did not impact the prioritization of SNPs. The highest correlation coefficient, i.e., the correlation coefficient between the phenotype similarity matrix and the final SNP similarity matrix (summation of SNP similarity matrices corresponding to prioritized SNPs), was consistently achieved at an index below 1000 for all experimental conditions. Figure 4 illustrates the trend of the correlation coefficients over the iterations while sequentially adding SNP similarity matrices to the candidate list for the experiment based on the maximum canopy height phenotype with a p-value of 0.01 without QTL filtering.

3.2. Comparison of Post-GWAS SSPA with Traditional GWAS

The traditional GWAS employs a False Discovery Rate (FDR)-based threshold in determining the significance cutoffs, which results in the identification of very few SNPs. Using a stringent GWAS threshold can lead to false negatives, while a more permissive threshold (e.g., p < 0.001) can yield numerous false positives. SSPA aims to balance these extremes by initially using a permissive threshold to expand the search space, and then narrowing down the list of interesting SNPs by analyzing the phenotypic similarity with genetic proximity through the correlation coefficient. Figure 5 presents the Manhattan plots generated through GWAS for both phenotypes, selecting the significance threshold (Bonferroni cutoff) as 0.05 in rMVP [21].
In comparing the prioritized SNPs with traditional GWAS hits, we observed the following:
  • The single SNP identified as significant for maximum canopy height by GWAS was consistently prioritized by SSPA across all experimental conditions, both with and without QTL filtering.
  • Among the 10 SNPs identified as significant for maximum growth rate by GWAS, 6 were prioritized by SSPA with p-value thresholds of 0.01 and 0.001, and 5 were prioritized with a p-value threshold of 0.005 when no QTL filtering was applied. However, only one of these 10 SNPs was consistently prioritized by SSPA across all experimental conditions with QTL filtering.
Additionally, Figure 6 displays the distribution of p-values across the ranked prioritized SNPs for maximum canopy height and maximum growth rate, with a p-value of 0.01 without QTL filtering.

3.3. Experiment on SNP Prioritization Without GWAS Filtering

As described in Section 2.7, we first performed chromosome-wise execution of the SSPA on all SNPs without the GWAS filtering, setting the maximum desired number of prioritized SNPs (n) to 1000. After merging the extracted prioritized SNPs from each chromosome, we executed the SSPA on this combined set to obtain a final set of prioritized SNPs for the analysis. While applying the algorithm to the combined prioritized set, we iterated through the entire list as the total number of SNPs was already reduced during chromosome-wise execution. In this case, the maximum desired number of prioritized SNPs (n) was set to the total number of SNPs in the set. Table 2 describes the number of SNPs used as input to our algorithm and the respective results. Figure 7 shows the trend of correlation coefficients achieved during the addition of the SNP similarity matrix to the candidate list while iterating through the combined prioritized set of SNPs obtained from all chromosomes.

3.4. Observations on Prioritized SNPs Obtained from Different Experiments

For both phenotypes, the SSPA resulted in fewer prioritized SNPs with a smaller p-value, as expected, when applied with GWAS filtering (Table 1—Prioritized #SNPs). Additionally, filtering by QTLs reduced the number of prioritized SNPs; however, it did not improve the highest correlation coefficient achieved (Table 1—Highest Correlation Coefficient). The highest correlation coefficient was achieved with the p-value threshold of 0.01 without QTL filtering for both phenotypes (highlighted in bold in Table 1). Figure 8 describes the overlap between prioritized SNPs obtained using SSPA for different experimental conditions with GWAS filtering and without the GWAS filtering, using UpSet plots [40]. We observed the following:
  • A minimal overlap in prioritized SNPs exists between the height and growth phenotypes, and this overlap occurred only at the two less stringent p-value thresholds of 0.005 and 0.01 (Figure 8B,C). At p < 0.001, there was no overlap between the prioritized SNPs for height and growth.
  • The lists of prioritized SNPs with and without QTL filtering overlap more for the height phenotype than for the growth phenotype, for which the overlap is moderate (Figure 8A–C).
  • At each p-value and for each phenotype, the QTL filtering added between 18% and 31% more unique prioritized SNPs.
  • For maximum canopy height, most of the prioritized SNPs extracted without the GWAS filtering were not among those identified with GWAS filtering (Figure 8D).
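The set comparisons summarized by the UpSet plots can be reproduced with plain set operations. The sketch below computes, for every combination of SNP lists, the SNPs that occur in exactly those lists and in no others, which is the quantity each UpSet column displays. The list names and SNP IDs are hypothetical placeholders, and the function is an illustration rather than the UpSetPlot API.

```python
from itertools import combinations


def exclusive_intersections(named_sets):
    """Map each combination of set names to the items found in exactly those sets."""
    names = sorted(named_sets)
    result = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            # Items shared by every set in the combination ...
            inside = set.intersection(*(named_sets[n] for n in combo))
            # ... minus items that also occur in any set outside it.
            others = [named_sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            members = inside - outside
            if members:
                result[combo] = members
    return result


# Hypothetical prioritized SNP lists for three experimental conditions.
snp_lists = {
    "height_p0.01": {"snp1", "snp2", "snp3"},
    "growth_p0.01": {"snp3", "snp4"},
    "height_noGWAS": {"snp2", "snp5"},
}
overlaps = exclusive_intersections(snp_lists)
sizes = {combo: len(members) for combo, members in overlaps.items()}
```

Each key of `sizes` corresponds to one filled-cell pattern in the UpSet matrix, and each value to the height of the bar above it.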

3.5. Linkage Disequilibrium (LD) in Prioritized SNPs

We examined the Linkage Disequilibrium (LD) structure of 107 post-GWAS prioritized SNPs obtained by SSPA for maximum growth rate, using a p-value threshold of 0.001 (Figure 9A). Additionally, we examined 987 prioritized SNPs for maximum canopy height phenotype without the GWAS filtering (Figure 9B). Overall, the prioritized SNPs display relatively low structure without large LD blocks. This observation aligns with SSPA’s sequential prioritization method, which reduces the inclusion of additional SNPs in LD with previously selected SNPs. In GWAS, the FarmCPU algorithm [22] also recursively incorporates SNPs, minimizing the appearance of SNPs in high LD as significant.
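For readers unfamiliar with how LD is quantified, a common summary is r², the squared correlation between the allele dosages of two SNPs across accessions. The sketch below is a generic, genotype-level approximation under that assumption; the software used to produce Figure 9 is not stated in this section, and the genotypes, function names, and the 0.8 cutoff are illustrative only.

```python
from itertools import combinations


def ld_r2(dosage_a, dosage_b):
    """r^2 between two SNPs from allele dosages (0, 1, 2); None marks a missing call."""
    pairs = [(a, b) for a, b in zip(dosage_a, dosage_b)
             if a is not None and b is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two accessions genotyped at both SNPs")
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic SNP: r^2 is undefined, reported as 0 here
    return cov * cov / (var_a * var_b)


def high_ld_pairs(genotypes, threshold=0.8):
    """List SNP pairs whose r^2 meets the (illustrative) threshold."""
    return [(a, b) for a, b in combinations(sorted(genotypes), 2)
            if ld_r2(genotypes[a], genotypes[b]) >= threshold]


# Hypothetical dosages for three SNPs across six accessions.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1, so in complete LD
    "snp3": [0, 2, 0, 2, 1, 1],  # weakly related to snp1
}
linked = high_ld_pairs(genotypes)
```

A prioritized list with "relatively low structure" in the sense used above would yield few or no pairs from `high_ld_pairs`.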
Further, we analyzed the LD patterns surrounding each of the 11 identified GWAS regions (one associated with the maximum canopy height phenotype and 10 with the maximum growth rate phenotype; Figure 5), focusing on approximately 10 kb upstream and downstream. We highlighted two notable cases where the prioritized SNPs exhibit high or low LD. Among the 987 SNPs prioritized by SSPA for the maximum canopy height phenotype without the GWAS filtering, we observed an overlap of 4 SNPs with the GWAS regions. Among these, three prioritized SNPs (“S05_1046209”, “S05_1046222”, “S05_1046234”), located on Chromosome 5, are adjacent to each other and exhibit high LD with the GWAS SNP (Figure 10A). In contrast, another prioritized SNP “S07_49287751”, located on Chromosome 7, shows low LD (Figure 10B). Figure 10C presents an LD plot that includes the SNPs that belong to one or more of the following groups: (1) the 107 post-GWAS prioritized SNPs for maximum growth rate using a p-value threshold of 0.001 without QTL filtering, (2) the 323 post-GWAS prioritized SNPs for maximum canopy height using a p-value threshold of 0.001 without QTL filtering, (3) the 987 SNPs for maximum canopy height without the GWAS filtering, and (4) the significant GWAS hits (11 SNPs). All these SNPs were located within a GWAS region (±10 kb). The plot reveals that, while a few prioritized SNPs overlap with GWAS regions, they are mostly not in LD with the GWAS hits, except for instances on Chromosome 4. Also, the GWAS hits are not in LD with one another.
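The window check behind "overlap with the GWAS regions" can be sketched in a few lines. The example parses SNP identifiers of the form `S05_1046209` (chromosome, then base-pair position) and reports which candidate SNPs fall within ±10 kb of a GWAS hit on the same chromosome. The GWAS hits are those listed in the Figure 5 caption and the four candidates are the ones named in the paragraph above; `S01_1000` is a hypothetical SNP added only to show a non-overlapping case, and the function names are illustrative.

```python
def parse_snp_id(snp_id):
    """Split an ID such as 'S05_1046209' into (chromosome number, position)."""
    chrom, pos = snp_id.split("_")
    return int(chrom.lstrip("S")), int(pos)


def snps_in_regions(candidates, gwas_hits, flank=10_000):
    """Return {candidate: [GWAS hits within +/- flank bp on the same chromosome]}."""
    hits = [(h, *parse_snp_id(h)) for h in gwas_hits]
    overlaps = {}
    for cand in candidates:
        c_chrom, c_pos = parse_snp_id(cand)
        near = [h for h, h_chrom, h_pos in hits
                if h_chrom == c_chrom and abs(h_pos - c_pos) <= flank]
        if near:
            overlaps[cand] = near
    return overlaps


# Significant GWAS hits as listed in the Figure 5 caption.
gwas_hits = [
    "S04_13525068", "S01_6237519", "S01_80084493", "S02_17462557",
    "S03_53433817", "S04_50143458", "S05_1046607", "S05_66473649",
    "S06_45415791", "S07_49287642", "S10_38420110",
]
# Prioritized SNPs named in the text, plus one hypothetical distant SNP.
candidates = ["S05_1046209", "S05_1046222", "S05_1046234",
              "S07_49287751", "S01_1000"]

overlaps = snps_in_regions(candidates, gwas_hits)
```

Positional overlap says nothing about LD on its own, which is why the text examines LD separately for the Chromosome 5 and Chromosome 7 cases.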

3.6. Analysis of Function of Prioritized SNPs

Few prioritized SNPs were mapped to known genes (Table 1 and Table 2), with the majority mapping to intergenic regions. Enrichment analysis showed that 43 unique GO terms related to biosynthetic processes, metabolic processes, and gene expression were over-represented in the annotations of the genes associated with the prioritized SNPs. The triterpenoid biosynthetic process (GO:0016104) and triterpenoid metabolic process (GO:0006722) were the most over-enriched (20.13-fold enrichment) across all the prioritized SNP lists for the maximum canopy height phenotype. These processes were also present in the prioritized SNP lists for the maximum growth rate phenotype with GWAS filtering at p < 0.01, both with and without QTL filtering. Nitrogen compound metabolic process (GO:0006807) was one of the few under-enriched (0.72-fold enrichment) GO annotations and was only present in prioritized SNP lists for the maximum height phenotype at p < 0.005, both with and without QTL filtering. The positive regulation of the amide metabolic process (GO:0034250) was the only GO term that was over-enriched without the GWAS filtering, but not with GWAS filtering. There were more over-represented GO annotations for the height phenotype than for the growth phenotype, possibly because more is known about genes for height. We did not observe significant differences between the annotations enriched in prioritized SNP lists without QTL filtering and with QTL filtering.
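As a reminder of what a fold-enrichment value expresses, the sketch below computes it as the observed number of annotated genes divided by the number expected by chance, together with a one-sided hypergeometric p-value. This is a generic formulation and may differ from the enrichment tool used in the study; the counts are invented for illustration and the function names are hypothetical.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Observed / expected count of genes carrying a GO term.

    k: genes in the study list annotated with the term
    n: size of the study list
    K: genes in the background annotated with the term
    N: size of the background
    """
    expected = n * K / N
    if expected == 0:
        raise ValueError("term is absent from the background")
    return k / expected


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(n, K) + 1))
    return favorable / total


# Illustrative counts only (not from the study).
k, n, K, N = 4, 50, 20, 1000
fold = fold_enrichment(k, n, K, N)        # expected count is 1, so fold is 4
p_over = hypergeom_upper_tail(k, n, K, N)
```

Values above 1 indicate over-representation (such as the 20.13-fold terms above), and values below 1 indicate under-representation (such as the 0.72-fold term).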
Of the “Top 40” prioritized SNPs (considered a broader range to enhance variance during analysis) extracted with GWAS filtering for both phenotypes at p < 0.001, all except 2 were in intergenic regions within 700 bp of a known transcript. One SNP, S05_1688138, a deletion event (AATGTG→A) located on Chromosome 5 at position 1,688,138 bp, was present in the intron of a gene, SORBI_3005G018800, that overlaps with QTLs known to impact dry matter growth rate and plant height (Figure A1). Another SNP, S06_45415791, lies between two pre-miRNA (MIR2118) genes, ENSRNA049481999 and ENSRNA049481970, which are conserved in plants and induce the production of phased small interfering RNAs (phasiRNAs) known to impact plant development and fertility (Figure A2) [41].
A closer look at the 85,709 bp region around another intergenic SNP, S01_50208809, shows that it overlaps with QTLs for phenotypes such as plant height, days to flowering, and tiller height. Gene SORBI_3001G265700 flanks this SNP on the right and gene SORBI_3001G265600 on the left. SORBI_3001G265700 encodes a serine-threonine-like kinase and SORBI_3001G265600 encodes an aldolase-1-epimerase-like protein (SbA1ELP). The expression profile of the 51 genes that flank and/or overlap these SNPs (according to SnpEff) showed baseline expression of 44 genes in plant structures important for growth and height, such as the root, vegetative meristem, shoot, and stem internodes (Figure A3(a–c); E-MTAB-3839, E-MTAB-4400, and E-GEOD-98817). Included in this list of 51 genes are the genes known to impact transcription factors, cytochrome P450, sugar transporters, protein kinases, leucine-rich Ser/Thr-kinase membrane proteins, and proteins involved in regulating microtubule binding and assembly that are key to regulating mitosis and growth during the cell cycle and thus height and growth. Most of these genes are expressed across all the experiments (Figure A3(d)).
Of the “Top 10” prioritized SNPs (selected to constrain the analysis to a smaller range) obtained using all variants by SSPA without the GWAS filtering, all but three were in intergenic regions within 700 bp of a known transcript. The genes most closely mapping to these prioritized SNPs were different from the genes identified with the GWAS filtering (SORBI_3001G279900, SORBI_3002G119500, SORBI_3005G116400, SORBI_3001G262900, SORBI_3002G153900, SORBI_3002G212200, SORBI_3006G040401, SORBI_3002G132000, SORBI_3007G086800, SORBI_3002G119600, SORBI_3005G116500, SORBI_3007G115500, SORBI_3001G262950, SORBI_3002G154000, SORBI_3006G040500, and SORBI_3007G086900). All these genes overlapped with known QTLs for plant height and most also overlapped with panicle length and panicle emergence QTLs. These panicle traits may also regulate plant height in the reproductive phase. The expression profile of these genes showed baseline expression in plant structures important for growth and height, such as the root, vegetative meristem, shoot, and stem internodes (Figure A4(a–c); E-MTAB-4400, E-MTAB-3839 and E-GEOD-98817). Two of these genes (SORBI_3001G262950 and SORBI_3006G040401) did not show expression in any of these studies. One gene (SORBI_3005G116400) showed expression in only one study (E-MTAB-4400), where it was expressed in leaf tissues. Most genes are common and expressed in cells of different anatomical parts of the plant and may contribute to overall growth and development (Figure A4(d)).

3.7. Colocalization Analysis of Prioritized SNPs

Additionally, we performed a colocalization analysis of the prioritized SNPs with known QTL markers, revealing that most of the prioritized SNPs exhibited minimal overlap or colocalization with the known QTLs. For instance, Figure 11A, focusing on 107 prioritized post-GWAS SNPs by SSPA for maximum growth rate, illustrates that the known QTLs (represented by their closest flanking SNPs labeled on the y-axis and positioned as the first 29 SNPs on the x-axis) are not in LD with the prioritized SNPs (depicted on the remaining portion of the x-axis). A similar analysis was conducted for the 987 prioritized SNPs obtained for maximum canopy height without the GWAS filtering, yielding the same observations (Figure 11B). These results are consistent with the lower correlation coefficients observed when using QTL filtering (Table 1) and can be partially attributed to the methodological differences between QTL analysis, which skews toward loci with large effects on phenotype, and our approach, which aimed to capture more subtle effects. While the colocalization of prioritized SNPs with known QTLs was minimal, our functional analysis (Section 3.6) demonstrated that these SNPs are located within regulatory regions of genes known to play critical roles. This suggests that genes, rather than QTLs, are the more informative genomic unit for our study.

3.8. Comparative Study with Empirical Findings

Further, we compared the prioritized genes obtained for the maximum canopy height phenotype with known height genes in sorghum [42] and found that a few map close to them (Figure 12). With GWAS filtering at a p-value threshold of 0.001 and with QTL filtering, the only known height gene with a nearby prioritized SNP is Ma1 (SOBIC_006G057866), which has three SNPs nearby and is known to regulate flowering time in sorghum [43] (Figure 12A). This reflects the influence of flowering time on plant height. Without the GWAS filtering, prioritized SNPs are identified near Ma5, Ma2, Ma1, Dw2, and Dw1 (Figure 12B). Recent work on sorghum grown in Ethiopia [44] reports the seven SNPs with the greatest impact on plant height, which fall within known genes (Figure 12A,B). With GWAS filtering at a p-value threshold of 0.001 and with QTL filtering, four of these are also near SNPs we identified as prioritized for plant height (SOBIC.003G119600, SOBIC.006G111800, SOBIC.010G085400, and SOBIC.010G066100), two are near known height genes (Dw1, SOBIC.009G223500 and Ma5, SOBIC.001G017500), and one is not near either (SOBIC.006G235400) (Figure 12A). Without the GWAS filtering, an additional SNP identified in the Ethiopian study is also near an SNP identified in this study (Figure 12B).

3.9. Generalization Experiment

As described in Section 2.9, we conducted the generalization experiment to evaluate the explanatory power of prioritized SNPs derived from the MAC Season 6 dataset, both with and without the GWAS filtering (Section 3.1 and Section 3.3), in the Clemson environment. There were 271 accessions common between the Clemson and MAC Season 6 datasets. The correlation coefficient between the Clemson phenotype similarity matrix (dimension 271 × 271) and MAC Season 6 final SNP similarity matrix (considering 271 instead of 274 accessions) was approximately 0.07 for all experimental conditions with the GWAS filtering (Table 1). To compute this, the MAC Season 6 final SNP similarity matrix was modified by excluding the rows and columns corresponding to the three accessions absent in the Clemson dataset. The correlation coefficient was observed to be 0.06 when considering the final SNP similarity matrix obtained from the combined prioritized SNP list without the GWAS filtering (Table 2).
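The matrix handling in this experiment can be illustrated with a small sketch: rows and columns for accessions missing from the second site are dropped, and the correlation is then taken over matching accession pairs. It assumes that the correlation is a Pearson coefficient computed over the off-diagonal upper triangle of each symmetric matrix, which may differ in detail from the procedure in Section 2; the accession labels, similarity values, and function names are hypothetical.

```python
from math import sqrt


def subset_square(matrix, labels, keep):
    """Drop rows and columns whose label is not in `keep`, preserving order."""
    idx = [i for i, lab in enumerate(labels) if lab in keep]
    sub = [[matrix[i][j] for j in idx] for i in idx]
    return sub, [labels[i] for i in idx]


def upper_triangle(matrix):
    """Flatten the off-diagonal upper triangle of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


# Hypothetical 4-accession SNP similarity matrix from the first site.
labels_site1 = ["acc_a", "acc_b", "acc_c", "acc_d"]
snp_sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.9],
    [0.2, 0.4, 0.9, 1.0],
]
# Hypothetical phenotype similarity matrix from the second site (acc_c absent).
labels_site2 = ["acc_a", "acc_b", "acc_d"]
pheno_sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]

common = set(labels_site1) & set(labels_site2)
snp_sub, kept1 = subset_square(snp_sim, labels_site1, common)
pheno_sub, kept2 = subset_square(pheno_sim, labels_site2, common)
if kept1 != kept2:
    raise ValueError("accession order differs between the two matrices")
r = pearson(upper_triangle(snp_sub), upper_triangle(pheno_sub))
```

In the study, the same reduction takes the 274 × 274 SNP similarity matrix to 271 × 271 before it is compared with the Clemson phenotype similarity matrix.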
To further investigate, we examined the histograms of height differences across every pair of accessions for both sites (Figure 13A,B), which indicated greater variations in height differences at Clemson compared to MAC Season 6 (MAC Season 6 plant height measurements are normalized through logistic growth curve, whereas for Clemson, end-of-season plant height measurements were used). Figure 13C highlights the differences in height differences between these two sites, revealing that approximately 55% of the pairs of accessions exhibited differences greater than or equal to 50.
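The pairwise comparison can be sketched as follows, assuming that Figure 13C is based on the absolute gap between each pair's height difference at the two sites. That interpretation, the height values, and the function names are assumptions made for illustration; no measurement unit is implied.

```python
from itertools import combinations


def pairwise_abs_diffs(heights):
    """Absolute height difference for every unordered pair of accessions."""
    return {(a, b): abs(heights[a] - heights[b])
            for a, b in combinations(sorted(heights), 2)}


def fraction_shifted(site1, site2, threshold):
    """Share of accession pairs whose height difference changes by >= threshold."""
    common = set(site1) & set(site2)
    if len(common) < 2:
        raise ValueError("need at least two accessions common to both sites")
    d1 = pairwise_abs_diffs({k: site1[k] for k in common})
    d2 = pairwise_abs_diffs({k: site2[k] for k in common})
    gaps = [abs(d1[pair] - d2[pair]) for pair in d1]
    return sum(g >= threshold for g in gaps) / len(gaps)


# Hypothetical heights for three accessions at two sites.
site1 = {"acc_a": 100.0, "acc_b": 120.0, "acc_c": 150.0}
site2 = {"acc_a": 200.0, "acc_b": 300.0, "acc_c": 260.0}
share = fraction_shifted(site1, site2, threshold=50)
```

Here one of the three pairs changes by at least 50 between sites, so `share` is one third; the corresponding figure reported above for the two field sites is approximately 55%.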

4. Discussion

The highest correlation coefficient between the phenotype similarity matrix and the final SNP similarity matrix derived from the MAC Season 6 dataset was 0.69 with the GWAS filtering (Table 1, Highest Correlation Coefficient) and 0.71 without the GWAS filtering (Table 2, Highest Correlation Coefficient). These values indicate moderately strong correlations, which is not surprising given the influence of environmental factors on height and growth demonstrated in other studies [45] and the fact that we did not directly capture the effect of the environment in this study. We hypothesize that the total number of SNPs influencing height or growth is much larger than the number identified in this study. Depending on the environmental pressures faced by the plant, different sub-groups of SNPs may become predictive. Our results from the Clemson dataset (South Carolina) support this. Additionally, the partial overlap in important SNPs with the Ethiopia study may reflect the similarities in hot and arid climates between Ethiopia and Arizona, despite the use of different accessions. This contrasts with the environmental conditions in South Carolina.
In the absence of empirical data on gene function from sorghum, we can only conclude that this method has the potential to prioritize specific regions of the genome that influence height and growth phenotypes. We found gene homology and expression data supporting the significance of some of the identified sorghum SNPs for the height and growth phenotypes. These SNPs were in the vicinity of gene homologs. For instance, tobacco aldolase-1-epimerase (NbA1ELP) is known to regulate pectin methylesterase (NbPME) antagonistically [46], and lower expression levels of tobacco NbA1ELP resulted in a dwarf plant phenotype. This suggests that the sorghum ortholog of this tobacco gene (SORBI_3001G265600, SbA1ELP) may be associated with plant height and biomass yield in sorghum. Existing GO annotations do not include these traits due to a lack of evidence from sorghum; however, this homologous sorghum gene is expressed in embryo, shoot internode, root, emerging inflorescence, and stem internodes (Figure A3(a–c)), suggesting that its expression in these tissues is likely to contribute to plant height and growth. Further validation by functional genomics studies is necessary to confirm these findings.
At the outset of this work, we envisaged developing ML models to predict the phenotype based on genomic and environmental information. However, the major difficulty in training ML models with the TERRA-REF dataset was the disparity between the vast size of an entire genome and the limited number of phenotypic observations available. Cultivating on such a large scale to obtain a comparable number of phenotypic measurements with genetic makeup is generally prohibitively expensive, necessitating some form of genomic feature selection. In GWAS, prefiltering (e.g., based on missingness, error probability, linkage disequilibrium, minor allele frequency, or other metrics) can be an effective way to reduce the dimensionality of the datasets (the number of SNPs). However, this approach has shown mixed outcomes [47], and the best method for doing so remains an active area of research.
Our proposed algorithm, SSPA, can also be used to effectively reduce the number of SNPs. This was borne out in our experiment without the GWAS filtering, which was able to locate several SNPs known to be associated with plant height. These SNPs were not retrieved after applying GWAS with the permissive-filtered thresholds that we used in this work. Although the experiment without the GWAS filtering does not explicitly account for the population structure, this analysis demonstrated the efficacy of our approach in identifying genes associated with the phenotype of interest. By reducing the number of SNPs, our method facilitates their use as input to ML models, and population structure can be accounted for later by incorporating techniques such as clustering and principal component analysis in the models [48].
Additionally, we observed that the phenotypic dataset used in our experiment consists of accessions with limited variation in their phenotypic measurements. Consequently, our approach resulted in a phenotype similarity matrix where most of the accessions are highly correlated with each other considering the phenotype under study. Figure 14 displays the histograms of phenotype similarity values across every pair of accessions, calculated with the process described in Section 2.3. We hypothesize that the extraction of prioritized SNPs based on phenotypic measurements by SSPA would be more pronounced if accessions with greater variations in their phenotypic trait measurements were selected.

5. Conclusions

This work focuses on developing an algorithm, named SSPA, for prioritizing SNPs and genes for further analysis in support of functional genetics research. Our approach incrementally and cumulatively identifies a set of SNPs that impact a certain phenotypic trait by evaluating both local and global relationships at the pairwise accession level, incorporating a ranking system to prioritize the SNPs. Additionally, this method shows potential for prioritizing genomic regions by reducing the dimensionality (i.e., the number of SNPs), which could be informative for ML/DL methods that predict complex phenotypes. Currently, the method is limited to capturing linear relationships between genotypes and phenotypes, and environmental variations are controlled by normalizing the inputs rather than explicitly incorporating them into the procedure.
In the future, this algorithm could be extended by using weighted correlation metrics to capture non-linear relationships and explicitly incorporating environmental variations into the method itself. Moreover, we used a sequential algorithm for selecting prioritized SNPs, wherein once a SNP is identified as informative, it cannot be removed from the prioritized SNP list. The process of prioritizing SNPs can be further improved by employing other feature selection techniques, such as Sequential Floating Forward Selection (SFFS). We believe that the work reported in this paper provides a novel approach that leverages feature engineering techniques for the prioritization of informative SNPs.

The following supporting information can be downloaded at:, VCF files obtained after GWAS analysis performed using rMVP with the FarmCPU algorithm at various p-value levels and the SNP similarity matrices are available here: Chromosome-wise VCF files considering all variants and corresponding SNP similarity matrices are available here: Figure S1: QQ Plot for the maximum growth rate and maximum canopy height phenotypes in Sorghum bicolor for TERRA-REF "MAC Season 6" data set.

Conceptualization, A.R., A.E.T., P.J., D.L. and A.T.; methodology, A.R. and D.P.; validation, A.E.T., P.J. and A.T.; data curation, K.S. and J.G.; writing—original draft preparation, D.P.; writing—review and editing, L.C., P.J., A.T. and A.R.; visualization, C.L.; supervision, A.R. and A.E.T.; project administration, L.C.; funding acquisition, A.E.T., A.R., D.L. and P.J. All authors have read and agreed to the published version of the manuscript.


The original data and code presented in this study are openly available at and in the supplementary files.


Author Curtis Lisle is employed by KnowledgeVis LLC. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Figure A1. Screenshot from the Gramene database. One SNP, S05_1688138, located on Chromosome 5 at 1,688,138 bp position is present in the intron of the gene SORBI_3005G018800. This gene overlaps with the QTLs recorded for dry matter growth rate and plant height phenotype. Query performed on 15 December 2022.
Figure A2. Screenshot from the Gramene database. SNP S06_45415791, located on Chromosome 6 at position 45,415,791 bp, lies between the two pre-miRNA (MIR2118) genes ENSRNA049481999 and ENSRNA049481970. miRNA2118 are conserved in plants and induce the production of phased small interfering RNAs (phasiRNAs) [41]. Query performed on 15 December 2022.
Figure A3. Gene Expression in Tissues. We took these 51 genes and queried their baseline expression profile on the EMBL-EBI gene expression atlas. Based on the representative tissue/plant structure atlas studies primarily including vegetative and reproductive growth tissues, we selected three studies to examine using a heatmap visualization, shown in (a–d). (a) E-MTAB-4400 [35]; (b) E-MTAB-3839 [34]; (c) E-GEOD-98817 [36]; (d) Overlap between [34,35,36]. This indicates that the majority were common genes, suggesting that these genes are expressed in cells of different anatomical parts of the plant and may contribute to overall growth and development.
Figure A4. Gene Expression in Tissues. We took these 17 genes and queried their baseline expression profile on the EMBL-EBI gene expression atlas. Based on the representative tissue/plant structure atlas studies primarily including vegetative and reproductive growth tissues, we selected three studies to examine using a heatmap visualization, shown in (a–d). (a) E-MTAB-4400 [35]; (b) E-MTAB-3839 [34]; (c) E-GEOD-98817 [36]; (d) Overlap between [34,35,36]. This indicates that the majority were common genes, suggesting that these genes are expressed in cells of different anatomical parts of the plant and may contribute to overall growth and development. The one gene found only in E-MTAB-4400 was expressed only in the leaf.


Figure 1. Examples of logistic growth curves illustrating the phenotypes of two sorghum accessions. Using automated canopy heights for each accession, logistic growth curves were fitted and produced maximum canopy height (Ymax) and maximum growth rate (Rhalf), which are indicated by two dashed lines. The cleaning algorithm ensured sufficient observations to produce confident estimations of the parameters.
Figure 2. Flowchart of the Sequential SNP Prioritization Algorithm (SSPA). This algorithm inputs a single phenotype similarity matrix and multiple SNP similarity matrices to determine a set of prioritized SNPs (L: candidate list, C: correlation coefficient list, SS: similarity matrix, and CC: correlation coefficient).
Figure 3. Method pipeline for post-GWAS prioritization of SNPs and its analysis. This diagram illustrates the process for a single experimental condition, beginning with the measurements of the studied phenotype for a set of accessions and GWAS-filtered SNPs. Our proposed method yielded a set of prioritized SNPs, which were further utilized for functional analysis and to test the generalization capability of SSPA. Additionally, we compared the prioritized SNP set with the traditional GWAS results and analyzed their Linkage Disequilibrium (LD) structure and colocalization with known QTL markers.
Figure 4. The correlation coefficient trend during the sequential addition of SNP similarity matrices to the candidate list over the iterations (index) in the execution of SSPA, as recorded in the correlation coefficient list (C). This analysis pertains to the phenotype of maximum canopy height with GWAS filtering applied at a p-value of 0.01. As the maximum desired number of SNPs was set as 1000 in our experiment, SSPA aimed to select a maximum of 1000 SNP similarity matrices to the candidate list iteratively. The highest correlation coefficient was observed at index 755, resulting in 755 prioritized SNPs for this experimental condition.
Figure 5. Manhattan plots generated by traditional GWAS using a threshold of 0.05 (Bonferroni cutoff), resulting in the identification of very few SNPs. The rMVP package [21] used in our experiment determines the significance cutoff (red dotted line) as 0.05 divided by the marker size. The SNPs above this cutoff are considered to be significant. The only identified significant SNP for maximum canopy height is S04_13525068, and the 10 significant SNPs identified for maximum growth rate are S01_6237519, S01_80084493, S02_17462557, S03_53433817, S04_50143458, S05_1046607, S05_66473649, S06_45415791, S07_49287642 and S10_38420110. Different colors in the plots were used to distinguish between chromosomes.
Figure 6. Distribution of p-values across the ranked order of prioritized SNPs. This analysis pertains to the post-GWAS prioritized SNPs through SSPA with the p-value < 0.01, where maximum canopy height and maximum growth rate phenotype resulted in 755 and 326 prioritized SNPs, respectively.
Figure 7. The correlation coefficient trend during the sequential addition of SNP similarity matrices to the candidate list over the iterations (index) in the execution of the SSPA, as recorded in the correlation coefficient list (C). This analysis pertains to the combined set of prioritized SNPs corresponding to the phenotype of maximum canopy height without the GWAS filtering. In this case, the maximum desired number of SNPs was set as the total number of input SNP similarity matrices, causing SSPA to iterate through the entire list. The highest correlation coefficient was observed at index 987, resulting in 987 prioritized SNPs for this experiment.
Figure 8. UpSet plots showing the overlap of the SNPs prioritized by SSPA under different experimental conditions, with and without GWAS filtering. Each plot visualizes the set intersections as a matrix in which each row corresponds to a set (SNP list) and each column to a possible intersection; the vertical bar gives the size of the intersection, and the filled cells in each column show which sets take part in it. The figure was created with UpSetPlot v0.8.0. (A) Summary of prioritized SNPs for the height and growth phenotypes with GWAS filtering at p < 0.001. No prioritized SNPs are shared between the phenotypes. (B) Summary of prioritized SNPs for the height and growth phenotypes with GWAS filtering at p < 0.005. Seven SNPs are shared between the phenotypes. (C) Summary of prioritized SNPs for the height and growth phenotypes with GWAS filtering at p < 0.01. Eleven SNPs are shared between the phenotypes. (D) Summary of prioritized SNPs for the height phenotype with and without GWAS filtering. Most of the SNPs prioritized without GWAS filtering were not identified when GWAS filtering was used.
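The intersection sizes that an UpSet plot displays can be computed with plain set operations. The sketch below does not use the UpSetPlot library; it assumes that each bar counts the SNPs belonging to exactly the marked sets and to no other set, and the function name and SNP IDs are hypothetical.

```python
from itertools import combinations


def exclusive_intersections(named_sets: dict) -> dict:
    """Map each combination of set names to the number of members found
    in all of those sets and in none of the remaining ones."""
    names = list(named_sets)
    sizes = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            inside = set.intersection(*(set(named_sets[n]) for n in combo))
            outside = set().union(*(named_sets[n] for n in names if n not in combo))
            members = inside - outside
            if members:
                sizes[combo] = len(members)
    return sizes


# Made-up SNP lists for two phenotypes.
prioritized = {
    "height": {"S01_100", "S02_200", "S03_300"},
    "growth": {"S03_300", "S04_400"},
}
print(exclusive_intersections(prioritized))
```

With two input lists this yields three possible columns: SNPs found only in the first list, only in the second, and in both.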
Figure 9. LD patterns of prioritized SNPs, indicating relatively low structure without major LD blocks. (A) The 107 post-GWAS prioritized SNPs for maximum growth rate obtained by SSPA at a p-value threshold of 0.001. (B) The 987 prioritized SNPs for maximum height obtained by SSPA without GWAS filtering.
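Pairwise LD of the kind shown in such plots is commonly summarized as r², the squared correlation between the genotypes at two SNPs. The sketch below is an illustration under stated assumptions, not the procedure used for Figure 9: genotypes are assumed to be coded as allele dosages (0, 1 or 2), and the function names and example data are hypothetical.

```python
from itertools import combinations
from math import nan


def ld_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors.

    Returns nan when either SNP is monomorphic, because r^2 is undefined there.
    """
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return nan
    return cov * cov / (v1 * v2)


def ld_pairs(genotypes: dict) -> dict:
    """Return r^2 for every unordered pair of SNPs in the input."""
    return {
        (a, b): ld_r2(genotypes[a], genotypes[b])
        for a, b in combinations(sorted(genotypes), 2)
    }


# Made-up dosages for three SNPs across four accessions.
example = {
    "snp_1": [0, 0, 2, 2],
    "snp_2": [0, 0, 2, 2],
    "snp_3": [0, 2, 0, 2],
}
print(ld_pairs(example))
```

An LD block would appear as a cluster of neighbouring SNP pairs with r² close to 1; values near 0 across most pairs correspond to the low structure described in the caption.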
Figure 10. LD patterns of prioritized SNPs within the identified GWAS regions, extended by approximately 10 kb upstream and downstream. (A,B) Four of the 987 SNPs prioritized by SSPA without GWAS filtering (red boxes) overlap with the identified GWAS regions (±10 kb). The blue box indicates GWAS hits. (C) LD plot of the SNPs that lie within ±10 kb of a GWAS region and belong to at least one of the following sets: the 107 post-GWAS prioritized SNPs for maximum growth rate (p-value < 0.001 without QTL filtering), the 323 post-GWAS prioritized SNPs for maximum height (p-value < 0.001 without QTL filtering), the 987 prioritized SNPs for maximum height without GWAS filtering, or the 11 significant GWAS hits. Different GWAS regions are indicated by a black horizontal line.
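The selection rule in panel (C) combines a union of SNP lists with a distance filter, which a short sketch can make concrete. It assumes that SNP IDs follow the pattern S<chromosome>_<position in bp>, as the IDs listed for Figure 5 suggest, and that a GWAS region is centred on a GWAS hit. The function names and all candidate IDs other than S04_13525068 are hypothetical.

```python
def parse_snp_id(snp_id: str):
    """Split an ID such as 'S04_13525068' into (chromosome, position)."""
    chrom, pos = snp_id.split("_")
    return int(chrom.lstrip("S")), int(pos)


def within_window(snp_id: str, gwas_hits, window_bp: int = 10_000) -> bool:
    """True if the SNP lies on the same chromosome as a GWAS hit and
    within window_bp base pairs of it."""
    chrom, pos = parse_snp_id(snp_id)
    return any(
        hit_chrom == chrom and abs(pos - hit_pos) <= window_bp
        for hit_chrom, hit_pos in map(parse_snp_id, gwas_hits)
    )


def select_for_ld_plot(candidate_lists, gwas_hits, window_bp: int = 10_000):
    """(list 1 OR list 2 OR ... OR GWAS hits) AND within the window."""
    union = set().union(*candidate_lists, gwas_hits)
    return sorted(s for s in union if within_window(s, gwas_hits, window_bp))


gwas_hits = ["S04_13525068"]          # significant hit named in Figure 5
candidates = [
    ["S04_13530000"],                 # hypothetical, 4932 bp from the hit
    ["S04_13600000", "S05_13525068"], # hypothetical, outside the window
]
print(select_for_ld_plot(candidates, gwas_hits))
```

Including the GWAS hits in the union keeps each hit in its own window, so the resulting plot shows the hit together with any prioritized neighbours.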
Figure 11. Colocalization analysis of prioritized SNPs (x-axis), showing minimal overlap with known QTL markers (y-axis). (A) The 107 post-GWAS prioritized SNPs for maximum growth rate obtained by SSPA at a p-value threshold of 0.001. (B) The 987 prioritized SNPs for maximum height obtained by SSPA without GWAS filtering.
Figure 12. Chromosomal location of prioritized SNPs relative to known plant height genes. Prioritized SNPs (red triangles) are scattered throughout the ten chromosomes. Known plant height-associated genes are marked with blue vertical lines. Green vertical lines show the location of SNPs important for height in sorghum grown in Ethiopia. (A) The 286 SNPs identified for maximum canopy height from a GWAS-filtered set of SNPs with QTL filtering (p-value of 0.001). (B) The 987 SNPs identified for maximum canopy height from all variants without GWAS filtering.
Figure 13. Analysis of differences in plant height measurements across every pair of accessions for the MAC Season 6 and Clemson datasets. (A) Height differences for every pair of accessions (274 choose 2) in MAC Season 6. (B) Height differences for every pair of accessions (271 choose 2) in Clemson. (C) The differences between MAC Season 6 and Clemson across every pair of accessions (271 choose 2).
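The pair counts in the Figure 13 caption follow from the binomial coefficient: 274 accessions give 37,401 pairs and 271 accessions give 36,585 pairs. The sketch below shows one way such pairwise comparisons could be set up. Taking absolute differences and restricting the cross-site comparison to pairs present in both datasets are assumptions of this sketch, and the accession names and heights are made up.

```python
from itertools import combinations
from math import comb


def pairwise_abs_differences(heights: dict) -> dict:
    """Absolute height difference for every unordered pair of accessions."""
    return {
        (a, b): abs(heights[a] - heights[b])
        for a, b in combinations(sorted(heights), 2)
    }


def cross_site_gap(diffs_site_1: dict, diffs_site_2: dict) -> dict:
    """Difference between the two sites for each pair present in both."""
    shared = diffs_site_1.keys() & diffs_site_2.keys()
    return {pair: diffs_site_1[pair] - diffs_site_2[pair] for pair in sorted(shared)}


print(comb(274, 2), comb(271, 2))

# Made-up heights (cm) for three accessions at two sites.
site_1 = {"acc_1": 150.0, "acc_2": 210.0, "acc_3": 180.0}
site_2 = {"acc_1": 140.0, "acc_2": 190.0}
print(cross_site_gap(pairwise_abs_differences(site_1), pairwise_abs_differences(site_2)))
```

Sorting the accession names before pairing gives each pair a single canonical key, so the same pair can be matched across datasets.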
Figure 14. Phenotype similarity value across every pair of accessions in the MAC Season 6 dataset. These diagrams depict the estimated phenotype similarity values following the process described in Section 2.3.
Table 1. Experimental summary of post-GWAS prioritization of SNPs using permissive-filtered GWAS thresholds. Numbers highlighted in bold indicate the experimental condition yielding the highest correlation coefficient for each phenotype.
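| p-Value | Phenotype | QTL Filtering | Total #SNPs a | Prioritized #SNPs b | Prioritized #SNPs Directly Mapped to Known Genes | Highest Correlation Coefficient c |
|---|---|---|---|---|---|---|
| p < 0.001 | Maximum Growth | w/QTL | 534 | 77 | 18 | 0.41 |
| p < 0.001 | Maximum Growth | No QTL | 1055 | 107 | 28 | 0.49 |
| p < 0.001 | Maximum Height | w/QTL | 4630 | 286 | 69 | 0.61 |
| p < 0.001 | Maximum Height | No QTL | 6003 | 323 | 84 | 0.63 |
| p < 0.005 | Maximum Growth | w/QTL | 2861 | 174 | 30 | 0.55 |
| p < 0.005 | Maximum Growth | No QTL | 5575 | 198 | 48 | 0.60 |
| p < 0.005 | Maximum Height | w/QTL | 15,888 | 538 | 119 | 0.67 |
| p < 0.005 | Maximum Height | No QTL | 20,899 | 537 | 133 | 0.68 |
| p < 0.01 | Maximum Growth | w/QTL | 6106 | 230 | 47 | 0.59 |
| p < 0.01 | Maximum Growth | No QTL | 11,698 | 326 | 82 | **0.64** |
| p < 0.01 | Maximum Height | w/QTL | 27,536 | 634 | 140 | 0.68 |
| p < 0.01 | Maximum Height | No QTL | 36,033 | 755 | 182 | **0.69** |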
# indicates the total number. a Total number of SNPs identified through GWAS (corresponding to SNP similarity matrices used as input to the SSPA). b Number of prioritized SNPs extracted using the SSPA. c Pearson correlation coefficient of phenotype similarity matrix with the final SNP similarity matrix (i.e., the summation of SNP similarity matrices corresponding to the prioritized SNPs).
Table 2. Experimental summary of executing SSPA on all variants (SNPs) without the GWAS filtering, based on maximum height phenotype. Bold-highlighted record represents the final set of prioritized SNPs selected for further analysis.
| Chromosome Number | Total #SNPs a | Prioritized #SNPs b | Prioritized #SNPs Directly Mapped to Known Genes | Highest Correlation Coefficient c |
|---|---|---|---|---|
| “Combined prioritized” SNPs from all chromosomes | 6646 | 987 | 247 | 0.71 |
# indicates the total number. a Total number of SNPs (corresponding to SNP similarity matrices used as input to the SSPA). b Number of prioritized SNPs extracted using SSPA. c Pearson correlation coefficient of phenotype similarity matrix with the final SNP similarity matrix (i.e., the summation of SNP similarity matrices corresponding to the prioritized SNPs).

