massiveGST: A Mann–Whitney–Wilcoxon Gene-Set Test Tool That Gives Meaning to Gene-Set Enrichment Analysis

Gene-set enrichment analysis is the key methodology for obtaining biological information from transcriptomic space’s statistical result. Since its introduction, Gene-set Enrichment analysis methods have obtained more reliable results and a wider range of application. Great attention has been devoted to global tests, in contrast to competitive methods that have been largely ignored, although they appear more flexible because they are independent from the source of gene-profiles. We analyzed the properties of the Mann–Whitney–Wilcoxon test, a competitive method, and adapted its interpretation in the context of enrichment analysis by introducing a Normalized Enrichment Score that summarize two interpretations: a probability estimate and a location index. Two implementations are presented and compared with relevant literature methods: an R package and an online web tool. Both allow for obtaining tabular and graphical results with attention to reproducible research.


Introduction
Enrichment analysis (EA) of gene-sets is a technique typically used to uncover the phenotype of a gene-profile associated with the differential expression between two conditions [1] (e.g., treatment and control). If many genes (the gene-set) contribute to a phenotype or a cellular function, enrichment analysis tests whether a gene-set is associated with one of the two conditions [2]. The test procedures are classified as global or competitive tests [3]. In global test approaches, the test involves only genes in the gene-set. Instead, in competitive tests, the genes in the gene-set are compared with those outside the set. In this case, the test is applied to a gene-profile summarizing the differences between the two conditions. When ordered, from the highest to the lowest, a gene-profile is known as a pre-ranked list. An extensive and recent qualitative review of EA methods and tools is in [4].
To obtain the significance level, analytical methods are generally not applicable because the distributional hypothesis behind the test is not met. Computational strategies can help to estimate the null distribution by shuffling samples or genes. Since the seminal paper of [5], researchers mainly focused on shuffling samples leaving the inference from the gene-profile slightly covered. With [6], the analysis of a gene-profile becomes more central as EA was done at the level of the single sample profile.
GSEA [5] is the most adopted gene-set enrichment methodology. It is based on a modified version of the two-sample Kolmogorov-Smirnov (weighted-KS, wKS) test and is applied on a gene-profile. In this manuscript, GSEA and wKS are interchangeable. Basically GSEA consists of testing whether the distribution of scores associated with genes inside the gene-set is the same of the distribution of scores of genes outside the gene-set, i.e., H 0 :F in (x) = F out (x), toward the alternative H 1 :F in (x) = F out (x). Given the non-canonical Table 1. State-of-the-art implementation of enrichment analysis tools based on the wKS and MWW test statistics.

EA Tool
Reference Year Test Available as camera [2] 2012 MWW R function in limma package GSEA [5] 2005 wKS R package fGSEA [9] 2021 wKS R package clusterProfiler [10] 2012 wKS R package massiveGST [11,12] 2022 MWW R package/web GeneTrial3 [13] 2020 wKS/MWW web WebGestalt [14] 2019 wKS web A quantitative comparison of wKS and MWW EA algorithms, carried out by [15], states that the two methodologies are essentially equivalent in terms of significant genesets. A deeper study is in [16], where MWW and wKS are compared in the setting of weak functional signals, showing that MWW's test is the most sensible.
In this work, we propose a new implementation of the enrichment analysis based on the MWW's test (available as an easy-to-use web-based service and as an R package) called massiveGST (mGST). Current literature implementations essentially use the MWW's test to compute the p-value associated with the gene-set. Instead, we exploit the statistical information from the test to obtain a richer view of the analysis. According to [17], the normalized version of the MW's test-statistic is an estimate of probability. From such a probability, we propose two additional statistics, odds and logit2NES, that help researchers to understand the gene-set enrichment's importance beyond the trivial evaluation of p-values. In addition, we propose: (1) a new prioritization of the tabular view of gene-sets EA that includes NES, p-value, and size of the gene-set; and (2) we demonstrate that the estimate of the probability owns a new interpretation as a location index. Then, our software provides a richer set of new statistics than available algorithms.
Furthermore, the computational effort to run the analysis has been compared with the EA tools reported in Table 1.
We ignored over-representation methodologies (e.g., [18]) based on the hypergeometric test, as they follow a completely different approach and include the theoretical issues of choosing the universe set and which genes are differentially expressed.

The Normalized Enrichment Score
The Normalized Enrichment Score and the p-value come from the Mann-Whitney's test [7]. The null hypothesis H 0 :F in (x) = F out (x) states that there is no mutual dominance of the distribution functions, F in (x) and F out (x) that describe the intensities of genes, respectively, in and out of the gene-set. The alternative hypothesis states that the distribution function F out (x) dominates F in (x), i.e., H 1 :F out (x) > F in (x). Under the alternative hypothesis, the genes in the gene-set have intensities higher than those of the genes outside the gene-set. The MW test-statistic is: where I(·) is the indicator function. Basically, U is the number of times that the relation . . , m out ) is the intensity associated with the jth gene outside the gene-set, x in i (i = 1, 2, . . . , m in ) is the intensity associated with the ith gene in the gene-set, and m = m in + m out is the total number of genes in the gene-profile. With the help of the Wilcoxon [8] test-statistic, the computation of U is drastically improved as follows: where T out is the sum of rank transformed x k , k = 1, 2, . . . , m outside the gene-set. According to [17], the ratio U m in ×m out is an unbiased estimator of the probability . Given a gene-set, the event X in > X out says that "a gene randomly drawn from the gene-set has an intensity greater than the one of a second gene randomly sampled from outside the gene-set".
We define the estimate U m in ×m out of P X in > X out as the Normalized Enrichment Score (NES) of a gene-set enrichment analysis. Assuming that a gene-profile recapitulates the differential expression of treatment samples versus control, an NES close to 1 means association of the gene-set with the treatment. Instead, an NES close to 0 suggests an association with the control group. This interpretation allows us to restate NES as NES = P[the gene-set is associated with the treatment group]≈ U m in × m out .
A different way to look at the NES is the odds = NES/(1 − NES), i.e., the imbalance of the probability that the gene-set is associated with the treatment group to the probability that the gene-set has no association with it (or the gene-set is related to the control group). odds = P[the gene-set is associated with the treatment group] P[the gene-set is not associated with the treatment group] The association with the treatment is as strong as the odds diverge to infinity; it is weak when the odds approach zero. In this last case, the association is with the control groups. An odds of about 1.0 means no association, neither the treatment nor the control.
A further transformation of NES is the logit2NES = log 2 (odds).
In this version of NES, a zero value means no association, a positive value means association with the treatment group, and a negative value means association with the control group. The NES owns a descriptive interpretation as location index of the gene-set. It is the percentile rank of the gene-set, seen as a single value, in the ranking of the genes outside the gene-set (see Appendix A for the proof). When NES reaches 1, then genes in the gene-set are located at the top of the gene-profile. When NES is 0, the location is at the bottom and the association is with the control group.

Enrichments Prioritization
With the rapid growth of gene-sets collections, there is a problem of prioritizing significant results. In GSEA, gene-sets are generally ordered according to the NES or the p-value. However, this can be misleading because NES and gene-set size are dependent as shown by the following experiment.
We considered the gene-sets collection C5/BP from MSigDB [19] and the gene-profile published in [16]. Due to gene-set size, GSEA restricted the original collection to 4046 out of 7658. The same collection was used with mGST. In Figure 1, the size of the genesets (transformed as log 10 (1 + size)) has been plotted against the normalized enrichment score, both for GSEA (a) and mGST (b). The range of NES decreases as the size increases in both cases, showing a dependence. Furthermore, we measured the intensity of the dependence with the mutual information (computed with k-NN estimator implemented by [20]) obtaining MI GSEA = 0.0446, and MI mGST = 0.0902.
MI GSEA = 0.0446, and MI mGST = 0.0902 showing that exists dependence. To improve the gene-sets prioritization and give more evidence to large ones, we propose an additional gene-sets score, named relevance, that aggregates NES, p-value, and gene-set size.
Let us assume that we run a two-sided enrichment test so that some gene-sets have logit2NES ≥ 0, and some others logit2NES < 0. For the k th gene-set, k = 1, 2, . . ., in the collection having logit2NES ≥ 0, then where rank(·) is a function that associates the highest rank with the highest value of its argument, and actual-size is the gene-set size. Similarly, the relevance in the subsets of gene-sets (with index k ) such that logit2NES < 0 is Finally, given the kth gene-set,

Enrichments Visualization
We integrated the tabular results with a network-graph of gene-sets. A node represents a significant gene-set. The size of node is proportional to the size of gene-sets, while the intensity of the color is proportional to NES values. The connection between two gene-sets A and B is proportional to their similarity S(A, B). The similarity S(A, B) is computed as a convex combination of the Jaccard, δ 0 (A, B) = |A ∩ B|/|A ∪ B|, and the overlap,

Web-Based Service
A simplified functional architecture of the mGST Tool is shown in Figure 2. It is implemented in Javascript and is executed on the client host. Gene-set pre-elaboration is performed by the prepareGeneSets() function. Basically, it computes gene-profile ranking in O(m × log(m)) time, where m is the length of the gene-profile, and collects global information in appropriate data structures, such as the total number of genes and the sum of ranks.
The core of the algorithm is implemented in the computeGST() function, where, for each gene-set, ranking and test-statistics are computed in linear time. Results are collected in an interactive html table and can be exported in csv, tsv, and html formats. The computeNet() function performs additional network analysis and generates a graph representation of the results that can be exported in png format. User interface interaction features are implemented by using html5 and ajax frameworks.    GBM_9fu…ns.rnk

Choose File
Gene sets:

R Package
The R package is a collection of functions to compute the enrichment analysis and to manipulate and plot the results. The primary function is massiveGST that needs as mandatory input the gene-profile and the collection of the gene-sets. The output is a data frame arranging all the statistics introduced in the methodology section. Three functions cut_by_NES, cut_by_logit2NES, and cut_by_significance trim the data frame according to the required constraints. With the help of the S3-method, the function plot provides a graphical display for the analysis. The enrichments can be presented as a bar plot or as a network.
The logical scheme is shown in Figure 3. An extensive presentation of the package usability is in the vignette at https://cran.r-project.org/web/packages/massiveGST/ vignettes/vignette.html (accessed on 11 April 2022).

Computational Time: Comparison with Literature Methods
To assess the computational efficiency of our proposal, we designed a simulation experiment involving real data from TCGA. With the help of TCGAbiolinks [21], we downloaded data and annotations from different studies. We got gene-profiles by comparing subtypes by using a DESeq2 package [22]. The gene-profile is −log(p j ) × sign(W j ), across genes, where p j and W j are the p-value and the test-statistic of the Wald's test, respectively. In total, we collected 30 gene-profiles.
We screened nine recent literature proposals for enrichment analysis both as R-package and online service shown in Table 1.
The 30 gene-profiles, together with the C1 collection of 278 positional gene-sets from MSigDB [19], fed the nine procedures. Table A1 shows Figure 4 shows a boxplot of the experiment results. The time has been transformed as log 10 (1 + time) to bound the different ranges of each procedures. Camera pre-ranked (on average 0.02 s with 0.03 as standard deviation) and massive GST (0.27 s with 0.10 as standard deviation) own the lowest computation time in the R environment, confirming results reported in [23]. The time difference between massive GST and camera pre-ranked is because the latter applies the MWW's test and returns the p-value with an indicator of the direction of the test; instead, massive GST provides the statistics presented in the methodology section. As online service, our proposal spends 0.91 s on average (sd = 0.01), versus 13.57 (sd = 3.01) of wKS (GeneTrail3) and 14.30 (sd = 3.13) of MWW (GeneTrail3). WebGestalt (wKS) spends 84.20 s (sd = 6.69).   Table A1.

Usage of the Online Web-Tool
To run the analysis, the user needs to load two files: (a) a gene-profile (as a two columns tab-separated text format, the gene-name and the associated value), and (b) one or more gene-sets collections (in .gmt format).
The next steps are: (1) set the significance-level of the enrichments (the user can choose between the p-value, and two versions of adjusted p-values: Benjamini-Hochberk and Bonferroni), and (2) (optionally) set the threshold value of the logit2NES.
From the user's point of view, the online web tool follows the same logical scheme as Figure 3.
The significance level allows for selecting gene-sets relevant for the treatment and control. In addition, the researcher could be interested in those gene-sets strongly associated. In this case, the trimming with NES, both as location index or probability, comes into play. NES could be difficult to handle and read because it is a positive number, and people have to remember that association with treatment or controls depends on the value above or below 0.5, respectively. As a help, the logit2NES simplifies the process of interpreting the association (positive values with the treatment, negative with the control) and intuitively measuring the strongness of association (higher positive values mean strong association; lower negative values signify strong association with the control). The equivalence of values from NES, odds, and logit2NES is shown in Table 2.
To require that the probability of association of the gene-set with the treatment group be about twice the probability of non association, the logit2NES threshold can be set to 0.9 (equivalent to NES > 0.65, or odds > 1.86). Tabular versions of results are also generated (see Figure 5). The shown report respects the constrains given as input, while the full table with every gene-set can be downloaded as .csv or .tsv formats. The html version of the table can be downloaded as shown. Both the displayed table and its .html version allow the user to re-sort results according to any column. To visualize the network-graph of current results, the user can click on the network tab. Here, the similarity between any two of the gene-sets in the table is computed and the network of gene-sets is shown. The user can chose between two similarity measures, Jaccard or overlap, or any convex combination of the twos by tuning the parameter with a slider box. A second slider-box allows for setting the threshold value so that a segment joins two nodes when the similarity is above it. The network is updated in real time, as the

Conclusions, Limitations, and Future Research
Gene-set enrichment analysis is a methodology of great interest in silico experiments. Its first aim is to give a biological meaning associated with genes profiles coming as result of any analysis. Since its introduction, effort has been spent to improve the results' reliability and extend the field of application. Much attention has been devoted to global test versions, but competitive methods, requiring just a gene-profile, appear more flexible because the profile can be generated with up-to-date methodology (the case of analyzing a single cell is an example).
GSEA is the most adopted methodology, with about 32,000 citations to date. A similar approach is offered by competitive tests involving the Mann-Whitney-Wilcoxon test. To date, such a test is offered as an optional alternative in several other methodologies, but the theoretical properties have not been exploited.
In this paper, we have presented the massiveGST procedure, implemented as an R package and available as a web-tool, centered on MWW's test methodology for competitive gene-set enrichment analysis. We exploited the theoretical knowledge of the test to improve the interpretation of the enrichment results. We proposed the interpretation of the normalized version of MWW's test-statistic as an estimate of probability and as a location index in the ordered universe of genes outside the gene-set. Convincing use of this last interpretation is in [24].
As demonstrated in the simulation experiment, enrichment analysis with MWW's test generally requires low computation time. In the R environment, the massiveGST function competes with cameraPR but offers a rich set of statistics. Our online implementation is the most competitive (about 1.5 s for more than 10,000 gene-sets).
A general issue is the lack of an independent paradigm to test which method/procedure is reliable. Something has been done with the recent contribution from [23], where real datasets of pathologies have been selected and, for each of them, genes associated with the pathology have been gathered from the literature. The gene-sets containing such genes have been assumed as ground truth. The assumption is that a gene associated with a pathology should be highly differentially expressed with respect to control samples. Such a hypothesis neglects that a large subset of weak or moderate signal genes cooperates with important biological phenotypes [25], posing critical concerns on the usage of the paradigm proposed in [23] for the evaluation of EA methods.
Competitive EA methods have increased attention in applied research as they own implicit adaptability to emerging new omic technologies. It is urgent to design a comparison paradigm with large consensus to know the strengths and weaknesses of methodologies, such as those developed in other contexts (e.g., DREAM, KAGGLE, . . . [26]).
The availability of a fast methodology for EA, together and results not affected by variability induced by the computational strategy to obtain the significance, could push new contributions to methodological proposals in discovering master regulators (e.g., [27]). Such tools, starting from the estimation of gene regulatory network [28,29], apply EA methods to detect those transcription factors able to drive phenotypes.
Competitive EA methods may have several applications in fields also far from bioinformatics. Consider a list of items (think about the ranking of basketball players), sorted according to some criterium (the best player at the top), and different ways to cluster them (the teams, the ethnicity, young players, . . . ). The result of the competitive EA method with MWW will be the location in the ranked list of a consistent cluster of items (the young basketball players perform better than others, in the case that the cluster is located close to the top). Proof. Let T in be the sum of the m in ranks inside the gene-set, while T out is the sum of the m out outside, then From these relations, we obtain that MW's U statistic corresponds to U in , and We can show that U/m in ≡ m out , when T in sums the highest k ranks. In this case, m in = k, and m out = m − k; Conversely, when T in sums the lowest k ranks, then T in = k(k+1) 2 , and U k ≡ 0.
The ratio U m in ×m out is the percentile rank of the gene-set, seen as a single value, in the ranking of the genes outside the gene-set. Table A1. Results of the simulation 1 . Thirty gene-profiles were queried with MSigDB C1 collection (V.7.4) of 278 gene-sets. Six procedures come from R implementation GSEA, fastGSEA (fGSEA), clusterProfiler (CP) with fGSEA option, massiveGST (mGST), and camera pre-ranked (cPR); four more results come from online services GeneTrial3 (weighted GSEA and Wilcoxon Rank Sum test), massiveGST, GSEA offered by WebGestalt. The values are in seconds. The last two rows report the average and the standard deviation across the 30 experiments. The study corresponds to the name of the cancer collection of samples in TCGA; DOI maps to the paper from which the sub-typing of samples was obtained; control is the subtype assumed as the control group, while treatment is a second subtype. In brackets, there is the number of samples in the groups. The length refers to the number of genes in the gene-profile. File name is the file-name of the gene-profile.