Statistical Approach of Gene Set Analysis with Quantitative Trait Loci for Crop Gene Expression Studies

Genome-wide expression study is a powerful genomic technology to quantify expression dynamics of genes in a genome. In gene expression study, gene set analysis has become the first choice to gain insights into the underlying biology of diseases or stresses in plants. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results from the primary downstream differential expression analysis. The gene set analysis approaches are well developed in microarrays and RNA-seq gene expression data analysis. These approaches mainly focus on analyzing the gene sets with gene ontology or pathway annotation data. However, in plant biology, such methods may not establish any formal relationship between the genotypes and the phenotypes, as most of the traits are quantitative and controlled by polygenes. The existing Quantitative Trait Loci (QTL)-based gene set analysis approaches only focus on the over-representation analysis of the selected genes while ignoring their associated gene scores. Therefore, we developed an innovative statistical approach, GSQSeq, to analyze the gene sets with trait enriched QTL data. This approach considers the associated differential expression scores of genes while analyzing the gene sets. The performance of the developed method was tested on five different crop gene expression datasets obtained from real crop gene expression studies. Our analytical results indicated that the trait-specific analysis of gene sets was more robust and successful through the proposed approach than existing techniques. Further, the developed method provides a valuable platform for integrating the gene expression data with QTL data.


Background
Gene expression (GE) studies including RNA sequencing (RNA-seq) and microarrays are powerful techniques for studying expression dynamics and regulation of genes in human and non-human genomes. RNA-seq has surpassed microarrays by providing better quantification of the expression of genes with higher accuracy and better reproducibility [1]. Through RNA-seq, expression levels of genes are measured in terms of discrete read counts obtained by mapping the sequence reads to the reference genome followed by quantification of transcript abundance [2]. It also allows studying alternative splicing [3], new coding and non-coding RNA transcripts, and long non-coding RNAs [4,5]. In other words, RNA-seq is much more prevalent and efficient, as it answers a much more comprehensive range of questions than the existing microarrays technology. Further, differential expression (DE) analysis is one of the significant downstream analyses hypergeometric statistical tests [26]. This approach is not so statistically sound, as it violates the basic assumptions of the hypergeometric statistical tests (e.g., sampling without replacement) [25]. To tackle these issues in analyzing the gene sets with QTL, another approach, i.e., Gene Set Analysis with QTL (GSAQ) for microarrays, has been reported in the literature and found to be better and robust compared to GSVQ [25]. However, GSAQ has some serious limitations. First, it only considers the genes present in the selected gene set but failed to use the corresponding DE scores of genes present in that gene set. Second, GSAQ treats each gene as equally important by assuming the genes are independently and identically distributed, contrary to fundamental biology. Third, GSAQ and GSVQ approaches use only the most significant genes while discarding other genes. For instance, a gene input list from microarrays is obtained by setting the arbitrary threshold(s) for fold-change and p-values as 1.5 and 0.01, respectively. With this method, marginally less significant genes (e.g., fold-change~1.499 and p-value~0.011) are missed, resulting in information loss for some essential genes. Under these circumstances, the statistical methodologies for GSA with QTL requires further improvements and advances, which will be very helpful in unraveling genotype-phenotype relationships in plants or in complex diseases.
We, therefore, propose a novel statistical approach, i.e., GSQSeq, for analyzing gene sets with trait-enriched QTL data for gene expression studies, including RNA-seq. This approach considers the genes present in the gene set and their corresponding DE scores to analyze the gene set in the presence of the trait-specific QTL data. Here, the gene sets' enrichment significance was assessed through the p-values computed using the developed test statistic(s). Further, we evaluated the proposed method's performance with respect to the existing approaches, including GSVQ and GSAQ, through performance metrics such as False Discovery Rate (FDR) and −log 10 (p-value) on multiple real crop datasets. For this purpose, we used five GE datasets obtained from microarrays and RNA-seq studies in rice. Our analytical findings indicate that the developed approach more successfully detected the QTL-enriched gene sets than the existing techniques. Additionally, GSQSeq has a robust performance over the existing traditional methods, including GSVQ and GSAQ, when assessed on multiple rice gene expression datasets. We implemented the proposed statistical approach in a freely available R software package for the benefit of users.

Real Microarray Datasets
Rice GE experimental datasets were collected from the Gene Expression Omnibus (GEO) database of NCBI for Affymetrix platforms GPL2025 (www.ncbi.nlm.nih.gov/geo/ query/acc.cgi?acc=GPL2025, accessed on 20 December 2020) [27]. We used rice GE datasets; as it is a model crop plant, a massive amount of GE and QTL datasets are publicly available, and its genome is well annotated. Here, we used the rice GE datasets for the GPL2025, as this platform contains as many as 220 microarray experiments (series) comprising 3480 samples/subjects of Oryza sativa L. compared to other platforms. Among these 3480 samples, 150 experimental samples related to four different biotic and abiotic stresses for rice were taken in this study. Initially, we collected the raw CEL file data for the cold, drought, fungal, and insect stresses from 4, 5, 2, and 1 microarray studies, respectively, in rice.

Pre-Processing of Rice Microarrays Datasets
The raw CEL files of the collected samples were processed using the Robust Multichip Average (RMA) algorithm available in the affy Bioconductor package of R [28]. This procedure involves background correction, quantile normalization, and summarization by the median polish approach [29]. Further, the log2 scale transformed expression data from RMA for the collected experimental samples were used for meta-analysis of each stress separately to remove the outlier samples (Supplementary Document S1). In other words, microarray GE samples for cold, drought, insect, and fungal stresses were integrated  Supplementary Table S2) to obtain  the meta-data. For example, the drought stress dataset, originating from 5 independent  studies available in the GEO database under the accession numbers GSE6901, GSE26280,  GSE21651, GSE23211, and GSE24048, were integrated through meta-analysis. Now, the meta-data consists of expression measurements of 57,381 genes for over 70 samples (case: 35 and control: 35). Then, these meta-datasets for the respective stresses were further used to remove the control and irrelevant features through the preliminary gene selection. This process reduces the computational complexity and data sets' dimensions. For instance, out of 57,381 genes in the drought stress, the control (123) and irrelevant (48,180) genes were filtered out by setting the fold change and p-value (from t-test) parameters as 1 and 0.05, respectively, through the preliminary gene selection. The summary and details of these datasets are given in Table 1 and Supplementary Table S1, respectively. The detailed descriptions of data collection, pre-processing, meta-analysis, and preliminary gene selection of these datasets are given in Supplementary Document S1. The QTL datasets for the stresses in rice, viz. drought, cold, insect, and fungal, were collected from the Gramene QTL database (http://www.gramene.org/qtl/, accessed on 15 December 2020) [30]. The positions of the QTLs for each stress were mapped to a reference genome through the MSU rice genome browser [31]. The lists of the respective stress-responsive QTLs and their mapped positions on the rice genome are given in Supplementary Document S4.

Real RNA-Seq Dataset
The raw sequence datasets of rice (Japonica group) under salinity stress were collected from the NCBI database (https://www.ncbi.nlm.nih.gov/, accessed on 18 December 2020). The datasets were generated from Illumina HiSeq 2000 sequencing platform and available in the GEO database with platform GPL13834 (www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GPL13834, accessed on 18 December 2020). This platform consists of 323 samples and 29 series of Oryza sativa. Among these datasets, we used the sequence data pertaining to GSE109341 accession, submitted by Formentin et al. on 18 January 2018, and last updated on 13 June 2018, to test our proposed approach [32]. Unlike other datasets, GSE109341 has a relatively large number of samples belonging to two contrasting conditions, i.e., treated vs. control. Further, the sequence datasets were generated from root and leaf tissue samples under untreated and treated plants of Vialone nano and Baldo rice genotypes. Each sample was made of 6 pooled plants with three biological replicates. The raw data, in .fastq files, of these samples were selected from the Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra/, accessed on 18 December 2020) for further Entropy 2021, 23, 945 5 of 18 statistical analysis. The detail description about the raw data and their pre-processing is given in Supplementary Document S2. The salinity stress QTL dataset was collected from the Gramene QTL database (http://www.gramene.org/qtl/, accessed on 15 December 2020) [30]. The salinity responsive QTLs' positions are mapped to the rice genome through the MSU rice genome browser [31] and are given in Supplementary Document S4.

RNA-Seq Preprocessing and Read Alignment
The single-end Illumina raw sequence reads were downloaded from the SRA database using the SRA toolkit (version 2.9.1-1). The raw reads were then preprocessed with the Trimmomatic toolkit (version 0.38), which involves removing adapter sequences, quality filtering, etc. Further, the overall quality of the preprocessed results was manually inspected using quality reports generated by FastQC (version 0.11.7). Then, the preprocessed reads were mapped with HISAT (hisat2-2.1.0) [33] on the Oryza sativa v. Nipponbare reference genome, downloaded from the MSU Rice Genome Annotation Project version 7.0 (http: //rice.plantbiology.msu.edu/, accessed on 18 December 2020) [31]). The mapping of sequence reads to the reference genome allows identifying their genomic positions. The gene coordinates file (.GFF3) was collected from the MSU rice genome browser, which helps to map the reads to spanning splice junctions to get the genes' chromosomal positions.

Transcript Assembly and Quantification
The success of the RNA-seq data analysis requires accurate reconstructions and proper quantification of all the isoforms expressed from each gene [2]. Here, we used and executed the StringTie tool (version 1.3.4d) [34] to assemble transcripts from the RNA-seq reads aligned to the genome, which primarily involves two steps. First, grouping the reads into distinct gene loci and then assembling each locus into as many isoforms. After assembling the transcripts with StringTie, we used the gffcompare tool [35] to assess the success of matching the assembled transcript with pre-annotated genes, either fully or partially. It was also used to identify the novel transcripts discovered in the mapping process.
The given experiment involved multiple RNA-seq samples generated for two varieties (with two tissue samples) under two different contrasting conditions (salinity treated vs. untreated). Hence, genes and transcripts present in one sample are rarely identical to others due to varied sequencing depth. So, they need to be assembled in a consistent manner for which the mapping results for individual samples can be compared. For this purpose, we executed the merge function implemented in the StringTie tool, which prepares a final list of genes by merging all the genes found in any of the samples.

Notations
Let, Y ij : read counts of ith (i = 1, 2, . . . , N) gene in jth (j = 1, 2, . . . , M) sample/library; Ω: collection of all genes present in the RNA-seq data (i.e., whole gene list); G: gene set selected from Ω; N: size of Ω; M: number of samples/libraries; n: size of G; µ ij : mean of ith gene in jth sample/library; θ ij (= ϕ −1 ij ) and ϕ ij : size and dispersion parameters respectively of ith gene in jth sample/library; Q: set of associated QTLs; D i : differential gene expression score for i-th gene; and T i be the threshold placed at the i-th position in gene ranked list, which divides the gene list into G and G c = (Ω − G).

Preparation of Gene Ranked List
We used the edgeR R package [36,37] to prepare the gene ranked list for the RNA-seq read count data. The edgeR tool models the Y ij through a negative binomial model, and its Probability Mass Function (PMF) is given in Equation (1). The expressions of the expected value and variance of the observed read counts (Y ij ) are given in Equations (2) and (3), respectively. For each gene, the expected value (µ ij ) is assumed to be the product of the total number of reads (i.e., library size) and the (unknown) relative abundance (Z ij ) of that gene in the current experimental condition, expressed in Equation (4). Here, V Y ij is a function of µ ij , as shown in Equation (3), and which requires the estimation of the over-dispersion parameter (ϕ ij ). So, edgeR estimates ϕ ij using a conditional Maximum Likelihood Estimation procedure, conditioning on the total read count for each gene, and an empirical Bayes procedure to shrink the dispersions toward a consensus value [36]. Here, we used the Likelihood Ratio Test (LRT) statistic(s) to calculate the DE scores of genes under a generalized linear model framework, given in Equation (5). We computed the DE scores of the genes through executing glmLRT implemented in edgeR R package [37], and finally prepared the gene ranked list. Further, we used the t-test statistic(s) to compute the DE scores of genes for preparing the gene ranked list for microarray GE data. The detailed procedure is given in Supplementary Document S3.
The PMF of the Negative Binomial distribution is expressed as: where µ ij ≥ 0; θ ij > 0 are the parameters of NB distribution, and G(.): gamma function. Then, the expected value and variance of Y ij is shown as: where s j : size factor of jth library/sample; Z ij represents the true (unknown) concentration of reads for ith gene of jth library/sample; X j is simply the binary indicator of the group membership of jth library/sample (case: 1 and control: 2); β 0i : logarithm of mean parameter for the ith gene in the reference control group; and β 1i : log fold-change parameter for the ith gene.

Proposed GSQSeq Approach
Earlier developed GSVQ and GSAQ approaches were based on the ORA of the QTL hit genes (i.e., genes overlapped with QTL regions) in the selected gene set through hypergeometric distribution [25,26]. This approach only considered the genes in the selected gene set but ignored their corresponding DE scores. Hence, we developed the GSQSeq approach that can integrate the available DE scores of the selected genes with QTL analysis of the gene sets. For this purpose, we developed a scoring function for the gene set in GSQSeq that combines features from over-representation and shifted expressionbased approaches [38]. Here, the scoring function is computed using hypergeometric distribution based on enrichment scores weighted with the DE scores computed through tests such as t-test, fold change, etc. Alternatively, GSQSeq uses the long list of genes (which should be ordered based on the DE scores) along with the corresponding vector of the DE scores. It divides the input gene list into the selected gene set (G) and not-selected gene set (G c ) based on the chosen threshold value. Then, it calculates the test statistic, given in Equation (6), for every gene set of the ordered gene list taken at each threshold value [39] by using the following procedure.
The GSQSeq uses a function to calculate the difference between the sum of differential gene expression test scores for G and G c and is expressed in Equation (6). This calculation is repeated for each threshold value, T i . It is worthy to note that the T i s are chosen based on the user's discretion under the constraint that Therefore, to perform the gene set analysis with the underlying trait-specific QTLs for GE studies, including RNA-seq data, we developed the GSQSeq approach under a sound computing framework. In other words, it can be used to evaluate the statistical significance of selected gene sets related to a specific trait based on available QTL information. Under the GSQSeq approach, the following hypotheses can be constructed for testing purposes.
H 0 : Genes in G are at most as often overlapped with the QTL regions as the genes in G c (i.e., SD GQ = 0). H 1 : Genes in G are more often overlapped with the QTL regions as compared to genes in G c (i.e., SD GQ > 0).
The above constructed null hypothesis is a competitive hypothesis as it considers the genes from both G and G c [13,40,41]. Here, the H 0 tells that the QTL hit gene set members and non-members are distributed randomly across the gene list. Now, the QTL hits of the genes present in G can be determined through the indicator function given in Equation (7).
where g i ∈ G; a and b represent the start and stop positions (in terms of base pairs) in chromosome c of the gene g i ; q ∈ Q; and d and e represent the start and stop positions (in base pairs) in chromosome c of the QTL q. The NQHits statistic [25], based on the overlapping of the selected genes with the QTL regions (given in Equation (7)), is shown in Equation (8).
It is important to note that the existing techniques, including GSVQ and GSAQ, use the NQHits test statistic given in Equation (8) for gene set testing. This test statistic considers each gene as equally important and does not consider their DE scores. Hence, we developed the SD GQ test statistic (Equation (6)). However, the SD GQ alone cannot be used for enrichment testing of gene sets, as it is unstable due to different sizes of the gene sets G and G c . Therefore, we used the Z-score transformation of the test statistic (SD GQ ) (Equation (8)) and which is expressed in Equation (9).
where E SD GQ and V SD GQ are the expected value and variance of the SD GQ , respectively. Further, we obtained the distribution of the test statistic SD GQ , given in Equation (6), under H 0 . The expressions for mean and variance of the test statistic are given in Equations (10) and (11), respectively.
where X: differential gene expression test scores of the genes in the gene set; N GQ : number of gene set members in G got QTL hits; E(.): expected value; and V(.): variance. The estimate of the expected value of SD GQ (Equation (10)) is a linear function of the expected value of the hypergeometric distribution of N GQ , the expected value for the differential gene expression test scores in the analyzed selected gene set, E(X), and the size of the selected gene set, n. The E(X) and V(X) in Equations (10) and (11) were computed through the method proposed by Newton et al. (2007) [42], i.e., calculating the mean and variance values of the DE test scores (X) for all the genes in the analyzed gene set followed by normalization.
It is worthy to note that the N GQ given in Equations (10) and (11) is same as the NQHits statistic proposed by   [25] and implemented in the existing GSAQ approach [25], and its expression is shown in Equation (8).
Here, the N GQ follows a hypergeometric distribution and its PMF can be given as: where V: total number of genes covered by the QTLs in the whole Ω and v: number of genes in G that are covered by QTLs.
The expected value and variance of the N GQ , given in Equation (8), can be expressed in Equations (13) and (14), respectively.
Under H 0 , the Z-transformation of the test statistic given in Equation (9) follows a standard normal distribution (under statistical assumptions), i.e., Z ∼ N(0, 1). Through this property, the statistical significance value for the selected gene set, G, was computed. Similarly, this procedure was repeated for all the K gene sets obtained by placing the threshold, T i , (i = 1, 2, . . . , K) (K ≤ N) at K different places in the ranked gene list. Then, we adjusted the gene sets' statistical significance values through the multiple hypothesis testing corrections, and the procedure is given as follows.
Let p 1 , p 2 , . . . , p K be the corresponding p-values for all the K gene sets, and α be the level of significance. Here, we assume that all gene sets are equally important for trait development. Hence, we employed the Hochberg procedure [43] to correct the multiple testing and compute the adjusted (adj.) p-values for gene sets. It is worthy to note that Hochberg's procedure is computationally simple, quite popular in genomic data analysis [44], and more powerful than Holm's method [45]. The algorithm for Hochberg's procedure [43] is as follows.
Step 1. If p (l) > α, then retain the corresponding null hypothesis (H (l) ) and go to the next step. Else, reject it and stop. Step and go to the next step. Else, reject all remaining hypotheses and stop.
Step 3. K. If p (1) > α/K, then retain (H (1) ). Else, reject it. Now, the adj. p-values are given recursively beginning with the largest p-value [43]: Further, based on the computed adj. p-values, the underlying QTL enrichment significance of the selected gene sets was assessed. In other words, a lesser value of adj. p-value indicates more QTL enrichment of the selected gene set for the target trait development and vice-versa. Similarly, we also computed the False Discovery Rate (FDR) for the selected gene set. The outlines and key analytical steps of the proposed GSQSeq approach are shown in Figure 1. Further, based on the computed adj. p-values, the underlying QTL enrichment significance of the selected gene sets was assessed. In other words, a lesser value of adj. p-value indicates more QTL enrichment of the selected gene set for the target trait development and vice-versa. Similarly, we also computed the False Discovery Rate (FDR) for the selected gene set. The outlines and key analytical steps of the proposed GSQSeq approach are shown in Figure 1. Various steps involved in RNA-seq data analysis are described, starting from data collection to differential expression analysis. The outlines of the analytical steps involved in proposed GSQSeq approach are described. The major steps include: (i) preparation of the gene ranked list after differential expression analysis of RNA-seq count data and selection of gene sets.

Mapping Results
From the SRA database of NCBI, a total of 542,309,740 single-end reads (with 50 base pair length) were obtained for 24 libraries. The number of read sequences in each library and the GC content is given in Table 2. The average number of reads per library was found to be 22,596,239, with a CV of 0.169 (=16.9 %). After pre-processing with Trimomatic, the above summary statistic was reduced to 22,353,215 as the mean library size with a CV of 0.171 (=17.1%). However, through pre-processing, the average library size was reduced compared to that of raw sequence datasets, but, the variability among the libraries remained unchanged. Then, the qualities of the read sequences after removal of the adapter sequences were assessed through the FastQC tool, and the results are given Various steps involved in RNA-seq data analysis are described, starting from data collection to differential expression analysis. The outlines of the analytical steps involved in proposed GSQSeq approach are described. The major steps include: (i) preparation of the gene ranked list after differential expression analysis of RNA-seq count data and selection of gene sets.

Mapping Results
From the SRA database of NCBI, a total of 542,309,740 single-end reads (with 50 base pair length) were obtained for 24 libraries. The number of read sequences in each library and the GC content is given in Table 2. The average number of reads per library was found to be 22,596,239, with a CV of 0.169 (=16.9 %). After pre-processing with Trimomatic, the above summary statistic was reduced to 22,353,215 as the mean library size with a CV of 0.171 (=17.1%). However, through pre-processing, the average library size was reduced compared to that of raw sequence datasets, but, the variability among the libraries remained unchanged. Then, the qualities of the read sequences after removal of the adapter sequences were assessed through the FastQC tool, and the results are given in Supplementary  Figures S2-S4. From the distribution of the quality scores over the base pairs, it was observed that the quality scores of the sequence reads are above 30, which indicated better qualities of the samples/libraries (Figures S2-S4). In other words, we could not trace any universally low-quality reads in the salinity stress data. Therefore, the processed sequence datasets can be used for further analysis, such as mapping to the rice reference genome followed by quantification transcript abundance. It was observed that most of the pre-processed reads (94.3%) were successfully mapped to the rice reference genome. Out of these mapped reads, 2.87% of reads were mapped to more than one position and subsequently discarded from further analysis. Through StringTie, we quantified the transcriptomic abundance of the transcripts/genes over 24 samples, resulting in the read count data of 55,801 genes. This generated read count data matrix was used to prepare the ranked gene list through the downstream DE analysis.

Rice RNA-Seq Data
The read sequence count data for each sample/library belonging to two contrasting classes, i.e., salinity treated vs. control, as given in Table 2

Rice Microarray Datasets
The log2 scale transformed expression data from the RMA for the selected experimental samples (Supplementary Document S1) for the cold, drought, fungal, and insect stresses were used to prepare the gene-ranked list through the DE analysis. Here, the DE analysis was performed through a t-test, and the test statistic(s) for the genes were computed from the t-test. The genes were arranged in descending order for the preparation of the gene-ranked list. Then, different values of the T i are placed in various positions on the gene-ranked list to select different gene sets. Through this process, gene sets of sizes such as 200, 300, 400, . . . , 2000 are selected from the ranked gene lists for each dataset.

Distribution of the Test Statistic(s)
The distribution of the NQHits statistic(s), computed through the existing GSAQ approach (under the parameter settings given in Supplementary Table S9) over the selected gene sets for the different stresses are shown in Figure 2A. Further, the distribution of the SD GQ statistic(s) computed from the GSQSeq approach is also shown in Figure 2B. The distribution of the NQHits statistic(s) calculated from the GSAQ approach indicated that the values of the NQHits statistic(s) were found to be higher for fungal stress, followed by insect stress, as compared to other datasets (Figure 2A). This trend is due to the fact that a higher number (76) of QTLs are reported for this stress, followed by 57 in fungal stress. Alternatively, the NQHits statistic is a linear function of the number of genes present in gene sets, the number of QTLs reported for that stress, and the length of the QTL regions ( Figure 2A). Similar interpretations can be made for the distribution of the SD GQ statistic(s) ( Figure 2B). However, the NQHits statistic did not consider the DE scores of the genes present in the selected gene set. Here, it is worthy to note that the SD GQ is a function of the number of genes, their respective DE scores in the gene set, the number of QTLs reported for that stress, and the QTL regions ( Figure 2B).

Distribution of the Test Statistic(s)
The distribution of the NQHits statistic(s), computed through the existing GSAQ approach (under the parameter settings given in Supplementary Table S9) over the selected gene sets for the different stresses are shown in Figure 2A. Further, the distribution of the statistic(s) computed from the GSQSeq approach is also shown in Figure 2B. The distribution of the NQHits statistic(s) calculated from the GSAQ approach indicated that the values of the NQHits statistic(s) were found to be higher for fungal stress, followed by insect stress, as compared to other datasets (Figure 2A). This trend is due to the fact that a higher number (76) of QTLs are reported for this stress, followed by 57 in fungal stress. Alternatively, the NQHits statistic is a linear function of the number of genes present in gene sets, the number of QTLs reported for that stress, and the length of the QTL regions (Figure 2A). Similar interpretations can be made for the distribution of the statistic(s) ( Figure 2B). However, the NQHits statistic did not consider the DE scores of the genes present in the selected gene set. Here, it is worthy to note that the is a function of the number of genes, their respective DE scores in the gene set, the number of QTLs reported for that stress, and the QTL regions ( Figure 2B).

Proposed Approach for Gene Set Analysis with QTLs
The NQHits and SD GQ statistic(s) failed to tell the trait-specific enrichment of the gene sets or association of genotype-phenotype relation. Therefore, we proposed the GSQSeq approach to test the gene sets' trait-specific enrichments with the underlying QTLs. We also explored the ability of the proposed GSQSeq and the existing methods, including GSVQ and GSAQ, to provide biologically meaningful insights (e.g., establishing genotype-trait-specific phenotype associations) using the real high-throughput GE datasets derived from RNA-seq and microarrays. Through all the three tested GSA approaches, we searched significantly associated gene sets enriched with underlying QTLs selected by a particular gene selection method (e.g., t-test in microarrays, edgeR in RNA-seq) in each of the datasets. The results of such analysis are shown in Tables 3 and 4. Size: size of the selected gene sets obtained from the gene expression data; −log 10 (p-value) are listed in the table; GSQSeq: proposed approach; GSVQ and GSAQ are existing approaches. Performance analysis of the proposed approach was carried out with the existing approaches on salinity stress RNA-seq data and cold stress microarray expression data. Higher the value of −log 10 (p-value), better the gene set enrichment with QTL. Size: size of the selected gene sets obtained from the gene expression data; −log 10 (p-value) are listed in the table; GSQSeq: proposed approach; GSVQ and GSAQ are existing approaches. Performance analysis of the proposed approach was carried out with the existing approaches on drought, fungal and insect stress microarray gene expression datasets. Higher the value of −log 10 (p-value), better the gene set enrichment with QTL, and vice-versa. For salinity stress RNA-seq data, the magnitude of −log10 (p-values) from the GSQSeq was found to be much higher than that of the existing GSVQ and GSAQ approaches (Table 3). This observation indicated that the GSQSeq approach more often rejected H 0 (i.e., equal salinity QTL enrichment of both selected and not-selected gene sets) than GSVQ and GSAQ approaches. Therefore, it was found that the salinity trait-specific analysis of gene sets derived from the RNA-seq study was successful through GSQSeq compared to the GSVQ and GSAQ approaches (Table 3). In other words, the GSQSeq approach performed better in terms of detecting the QTL-enriched gene sets compared to the existing methods. In order to cross-validate these findings on the same RNA-seq data related to salinity stress, we computed the FDR for the GSQSeq, GSAQ, and GSVQ approaches for all the gene sets. The results are given in Tables 5 and 6. It was observed that the computed values of FDR from the proposed GSQSeq approach for all the selected gene sets are far below those of existing GSAQ and GSVQ approaches (Table 5). Therefore, it can be inferred that the proposed GSQSeq approach was more robust than the GSAQ and GSVQ approaches for performing gene set enrichment testing with salinity trait-specific QTLs. For cold stress data obtained from microarrays, the values of −log10 (p-values) from GSQSeq were observed to be higher than those of the existing GSVQ and GSAQ approaches over all the selected gene sets (Table 3). This finding indicated that the GSQSeq approach more often rejected H 0 (i.e., equal cold QTL enrichment of both the selected and not-selected gene sets) than the GSVQ and GSAQ approaches. Further, the FDR values computed through the proposed GSQSeq approach for all the selected gene sets of sizes 200, 300, . . . , 200 were found to be least followed by the GSAQ compared to the GSVQ approach (Table 5). Similar findings were observed for drought, fungal, and insect stress datasets in rice (Tables 4-6). Therefore, it can be concluded that the proposed GSQSeq approach is much better and more robust than GSAQ and GSVQ for performing gene set enrichment testing with the underlying QTLs for the microarray-based GE studies. Furthermore, we found much greater consistency in QTL-specific gene set enrichment analysis across five different stress scenarios, viz. salinity, cold, drought, fungal, and insect, by using GSQSeq than the GSVQ and GSAQ approaches (Tables 3-6). The proposed GSQSeq approach is an improved way to perform the trait-specific analysis of the gene sets to establish genotype (polygenes)-phenotype (quantitative trait) association testing with the help of genetically rich QTL data. It is more biologically appealing to establish the association of genes (genotype) in the selected gene set with the underlying QTLs (traits/phenotypes). However, in the existing GSVQ and GSAQ approaches, the genes in gene sets are taken as input to hypergeometric distribution for performing trait enrichment analysis. These approaches violated the basic assumptions of the hypergeometric test (i.e., sampling units must be drawn without replacement) and did not consider the DE scores of the gene set's genes. Thus, the existing approaches are expected to have poor performance in terms of gene set enrichment. Furthermore, the GSQSeq approach was more successful and useful in detecting the trait-specific QTLenriched gene sets than the existing methods.
The proposed GSQSeq approach allowed us to statistically test the gene set for enrichment with the underlying QTLs (i.e., rejection of the null hypothesis of the random association of selected genes with QTLs). Through this, a p-value was assigned to each selected gene set, which is statistically meaningful to genome researchers and experimental biologists (as the value lies between 0 and 1). The gene sets with lower p-values are considered as more enriched with the underlying trait-specific QTLs, and vice-versa. It may be noted that the proposed GSQSeq technique is a two-stage approach. First, it deals with selecting gene sets through the downstream DE analysis from the large GE data. Second, it assesses the QTL enrichment significance of gene sets by using a developed parametric testing procedure. This analysis eases the interpretation of a large-scale experimental GE data by identifying trait-specific enriched gene sets. Instead of focusing on the individual QTL hit gene (i.e., genes overlapped with QTL region), researchers can focus on the QTLenriched gene sets (polygenes), which tend to be more reproducible, and more interpretable (for quantitative traits).
The proposed GSQSeq approach can be considered a valuable tool for performing gene(s) enrichment analysis in a molecular plant breeding context, as most of the plant traits are quantitative and controlled by polygenes. Further, it provides a valuable tool for integrating the GE data from RNA-seq or microarray studies with genetically rich QTL data to identify potential QTL-enriched gene sets or the sets of QTL candidate genes. These QTL-enriched gene sets may provide valuable input or hypotheses to plant breeders to design their breeding experiments.

R Software Package
To facilitate the use of the proposed gene set analysis with the QTL approach among the researchers, we developed an R software package that includes GSQSeq R package and accompanying documentation with examples. This package is supplied with the manuscript as supplementary material and is also available in https://github.com/samuofl/GSQSeq. The inputs and guidelines for the use of the GSQSeq R package are given in Supplementary Document S5. This software can analyze the gene sets for GE datasets derived from expression studies including microarrays and RNA-seq. For microarray GE data, four different gene selection methods, such as t-test, F-score, Maximum Relevance and Minimum Redundancy (MRMR), and Support Vector Machine (SVM) techniques [46], are implemented for the selection of relevant gene sets from the high-dimensional GE data. After that, a selected gene set of a particular 'size', obtained from microarray GE data, is analyzed with the underlying QTL data. Both these steps can be executed by implementing the GSQMicro function in the GSQSeq R package.
In the developed software, two popular and extensively used DE analytical methods, including edgeR and DESeq2, are implemented in the GSQSeq R package to prepare the gene ranked list for RNA-seq count data. Then, the selected genes obtained from the RNA-seq data are analyzed with the underlying QTLs to establish the genotypes with quantitative trait links. These steps are implemented in the GSQSeq function of the GSQSeq R package. Moreover, the QTL enrichment significance of the selected gene sets obtained from both the microarrays and RNA-seq can be assessed through the computed p-values. However, different user-specific parameter options are provided to the users for their desired analysis. Hence, the GSQSeq software offers opportunities to get the QTL-enriched gene sets of desired 'sizes' from the microarray and RNA-seq GE data.

Conclusions
In the last decade, GSA has become the primary choice of secondary genomic data analysis for explaining the underlying biology for high-throughput GE studies. Most of the bioinformatics studies look for statistically significant gene sets as either biological interpretations of long gene list in GE data or validation of computationally derived results from DE analysis. Traditionally, the gene sets are analyzed based on the available GO or pathway annotation data. However, such research may not establish the links between gene sets (polygenes) with the underlying quantitative traits. Therefore, in this study, we proposed an innovative statistical approach and tool, i.e., GSQSeq, to analyze gene sets with genetically rich trait data, such as QTL. This approach is an improvement over the existing GSVQ and GSAQ methods, as it considers the DE scores of the genes in the gene list in performing GSA with the QTLs. In other words, the proposed approach may be regarded as the second generation of GSA with QTL as it is an improvement over first-generation GSA methods, such as GSVQ and GSAQ (based on ORA). Through this proposed GSQSeq approach, the statistically meaningful and biologically interpretable p-values are assigned to each gene set, which profoundly describes the trait enrichment of the gene sets.
The researchers and molecular biologists may focus only on the QTL-enriched gene sets to frame their hypothesis without considering the long list of genes in the highthroughput GE studies. The comparative performance analysis of the proposed approach with respect to the existing methods indicated its better and more robust performance on multiple crop GE datasets. This approach will have immense plant biology and breeding applications for the identification of trait-enriched gene sets (as plant traits are quantitative) for stress response engineering. Further, the developed tool provides a statistically sound computational environment for integrating high-throughput GE data with genetically rich QTL data. This approach can also be extended to analyzing the GE datasets obtained from single-cell RNA-seq GE studies. Here, due to the unavailability of rice single-cell data sets, we cannot test the performance of the GSQSeq approach on such datasets.
Further, the researcher may consider testing the utility of the GSQSeq tool on other plant and non-plant GE studies as a potential future work. In future, statisticians and biologists may focus on developing the next generations of GSA with QTLs by using graph/network theoretic approaches. These new approaches will analyze the highthroughput GE data more efficiently to understand the biological systems better, which will increase the specificity, sensitivity, utility, and relevance of GSA in GE studies.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/e23080945/s1. Document S1: Rice RNA-sequencing dataset collection and pre-processing. Document S2. Differential Expression analysis of RNA-seq data. Document S3. Data collection, preprocessing, meta-analysis and preliminary gene selection for rice. Document S4: Stress(es)-specific Quantitative Trait Loci information for rice (Oryza sativa L.). Table S9  Data Availability Statement: All the secondary data used in this study are publicly available in the NCBI GEO database. The developed R package is publicly available to the users at https: //github.com/sam-uofl/GSQSeq.