Entropic Ranks: A Methodology for Enhanced, Threshold-Free, Information-Rich Data Partition and Interpretation

Featured Application: The generic applicability of the entropy-empowered rank product (RP) calculation score supports the utilization of this non-parametric, threshold-free methodology in different kinds of data. This is not restricted to the meta-analysis of different data sets, but could serve as a key methodology for the integration of data from different sources of information, in the quest for highly automated, systemic big data biological interpretation.
Abstract: Background: Here, we propose a threshold-free selection method for the identification of di[...] non-parametric [...] distribution [...] broad applicability. [...] fixed [...] to propose a methodology, which automates and standardizes the statistical selection through the utilization of established measures such as entropy, already used in information retrieval from large biomedical datasets, thus departing from classical fixed-threshold based methods, which rely on arbitrary p-value and fold change values as selection criteria and whose efficacy also depends on the degree of conformity to parametric distributions.
Methods: Our work extends the rank product (RP) methodology with a neutral selection method of high information-extraction capacity. We introduce the calculation of the RP entropy of the distribution, to isolate the features of interest by their contribution to its information content. The goal is a methodology for the threshold-free identification of the differentially expressed features, which are highly informative about the phenomenon under study. Applying the proposed method on microarray (transcriptomic and DNA methylation) data of [...] sizes and noise [...] we [...] robust convergence for the different parameterizations [...] stable cutoff [...]. Functional analysis through BioInfoMiner and EnrichR was used to evaluate the information potency of the resulting feature lists. Overall, the derived functional terms provide a systemic description highly compatible with the results of traditional statistical hypothesis testing techniques.
The methodology behaves consistently across different data types. The feature lists are compact and rich in information, indicating phenotypic aspects specific to the tissue and biological phenomenon investigated. Selection by information content measures efficiently addresses problems emerging from arbitrary thresholding, thus facilitating the full automation of the analysis.


Introduction
Data analysis of high-throughput technologies (microarrays, next generation sequencing) commonly relies on arbitrary p-value and fold change thresholds to define the reliability and relevance of a set of features, in order to partition the initial distribution into two sets. The first set, S, is further investigated for its phenotypic relevance, whereas the other is exempted from further analysis, being considered the baseline distribution with noise, either biological (coexisting, causally unrelated processes) or technical [1]. This approach as a selection philosophy is currently being debated. Specifically, regarding p-value thresholding, critics raise the issues of incomplete information, misrepresentation, misinterpretation and bias [2,3], whereas for fold change thresholding, the issues cited include the adoption of arbitrary thresholds [4] and, consequently, the potential for strong bias [5], with no theoretical underpinning for the threshold values. Ideally, selection thresholds should take into account the form of the data distributions, the presence of confounders, and the complexity of the phenomenon under investigation. Most of the approaches used also hinge on the conformity of the data distribution to a priori distributions, requiring extensive data preparation and careful application of differential behavior identification methods to ensure it [6]. Prioritizing the identification of the information content and using it to extract and identify meaningful results, rather than trying to assess it as a post-analysis process, was the main motivation for our approach.

From Rank Products to Entropic Ranks
The rank product method partially addresses these issues through a frequentist approach, by measuring the consistency of behavior across the sample groups, using the fold change (FC) criterion. When testing for up-regulation, the rank product $RP^{up}_g$ of a gene g is calculated as:
$$RP^{up}_g = \prod_{i=1}^{k} \frac{r^{up}_{i,g}}{N_i} \quad (1)$$
where k is the number of control vs. case sample pairs, $N_i$ is the total number of features and $r^{up}_{i,g}$ (for single-colour transcriptomics) is the rank of gene g in the decreasing-FC-ordered list of genes in the ith pair of control vs. case samples (i.e., $r^{up} = 1$ for a gene consistently more overexpressed than any other in all cases vs. all control samples). The $RP^{down}_g$ is calculated similarly, over the increasing-FC-ordered lists. The percentage of false positives (pfp) value for each rank product (RP) score is estimated through a permutation-based procedure outlined in the initial publication of the method [4].
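The rank product calculation of Equation (1) can be illustrated with a minimal sketch. It assumes the fold changes are supplied as one list per control vs. case pair; the function name `rank_products` and the breaking of ties by input order are illustrative assumptions, not the published implementation (which was written in R).

```python
def rank_products(fold_changes):
    """Return the up-regulation rank product of every feature.

    fold_changes[i][g] is the fold change of feature g in the i-th
    control vs. case pair (hypothetical input layout).
    """
    n_features = len(fold_changes[0])
    rp = [1.0] * n_features
    for fc in fold_changes:
        # Rank 1 goes to the largest fold change; ties keep input order.
        order = sorted(range(n_features), key=lambda g: fc[g], reverse=True)
        for rank, g in enumerate(order, start=1):
            # Multiply in the normalized rank r_{i,g} / N_i for this pair.
            rp[g] *= rank / n_features
    return rp


# Three features measured in two pairs; feature 0 is always ranked first.
scores = rank_products([[3.0, 1.0, 2.0], [2.5, 0.5, 1.5]])
```

Smaller scores indicate features that are consistently ranked near the top across pairs, which is why the ordered RP list starts with the most consistently up-regulated features.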
The initial Rank Product implementation in Bioinformatics, whilst addressing a number of the aforementioned issues, retains arbitrariness by trimming the final list either through thresholding of a calculated pfp/p-value or by directly choosing the number of genes. Moreover, it tends to behave over-optimistically with increasing data dimensionality, as shown in recent works [7] (see also "Supplementary Material 1-Method").
Our work aims to measure the information content of the RP distribution, implement a data-driven, non-parametric partitioning, provide a workflow for high-throughput data analysis and unbiased information extraction (Figure 1), and generalize the approach to various data types. We introduce the calculation of entropy over a transformation of the RP distribution, followed by a clustering procedure in order to identify the most consistent cutoff point successfully separating information-rich features from noise-dominated ones, without human intervention. The methodology operates upon the non-parametric RP distribution, relegating pfp and the p-value (which may be calculated empirically or parametrically) to quality indicators of the analysis instead of decision criteria. Consequently, it allows for improvement of the pfp calculation methods and adoption of new approaches [7,8], ensuring result reproducibility and comparability as long as the RP calculation process itself remains unchanged.
Figure 1. Entropic ranks workflow. Entropic ranks builds upon the rank products methodology and assesses its results in terms of the progression of Shannon entropy over a sliding window.

Entropy in Biostatistics
The utilization of entropy in bioinformatics has been tentative, for the most part, leading to specific implementations instead of producing consistent classes of methodologies governed by standard practices. Nevertheless, we identify three main categories of such implementations. Firstly, the usage of entropy as a measure for the optimization of classification techniques. In most implementations, this is achieved prior to the classification, through dimensionality reduction approaches driven by the evaluation of entropy in order to reduce feature redundancy [9][10][11]. However, there is a recently introduced approach [12], in which entropy is used to directly weight and rank fractional Fourier transform coefficients, upon which a further clustering approach is based.
Secondly, the evaluation of entropy as a main component of the decision process. Various approaches include patient stratification [13], inferring regulatory networks using transfer entropy [14] and identification of periodical biological processes in time series data [15]. However, there exists an approach more closely based on Shannon entropy and conceptually closer to our work than others, which attempts to identify differential expression in RNA count data [16].
Lastly, a unique implementation utilizes entropy evaluation upon the variability of genome regions [17] in order to evaluate their information content in contrast to uniformity, a contrast we also use during the partitioning of the rank product distribution in the proposed methodology.

Rank Product Requirements
Rank product methodology is applied upon four basic premises [4] considered valid in high-throughput signal distributions:
• S << N (N: the full set of features);
• Independence of measurements between replicate arrays;
• The intensity of each feature over the range of samples is largely homoscedastic;
• The majority of non-zero fold changes between the sample groups are independent of each other.

The Segmentation Problem
Our working hypothesis is a direct corollary of the ordered RP distribution being an ordered set; namely, that the first n elements of the RP distribution correspond to features of high information content with respect to the phenomenon under scrutiny, and subsequent ones are to be excluded from further analysis. Our partitioning process is functionally identical to a threshold-driven usage of rank products, using optimal pfp thresholds, chosen individually for each experiment (see Table 1 and "Supplementary Material 1-Method"). The generalized approach we aim to create should automatically adapt to each experiment, consistently separating signal from noise while eschewing the bias introduced by thresholding approaches.

Partitioning the RP Distribution
The RP score distribution (i.e., for upregulated features) is an ordered set, beginning with a steep ascent, converging to a linear distribution for features following the null hypothesis and finally diverging again for the last few features (i.e., downregulated) (see "Supplementary material 1-Method"). The set of features most significant in describing the phenomenon investigated consists of the first n elements of that distribution, necessarily encompassing at least part of the steep ascent of RP scores. Conversely, the linear part of the distribution corresponds to stochastically behaving features, which should be excluded. Moreover, the RP score distribution carries information regarding each feature's consistency of behavior across replicates and is rigidly determined by the structure and form of the original data, deflecting external computational bias (in contrast to pfp calculation). Consequently, n will be robustly determined by partitioning the RP distribution.

Differences of Consecutive RP Scores and Entropy
Given that the RP distribution is ordered, we can calculate the distribution of the differences between consecutive terms ($RP^{up}_{i+1} - RP^{up}_i$, $i \in [1, N-1]$ when testing for upregulation), shown in Figure 2B, which is more intuitive to understand than the initial RP distribution and proves more apt for the partitioning process described below. The initial ascent, at least partly overlapping with any significant differentially behaving features, is represented by a descending distribution of RP differences. Features following the null hypothesis (undifferentiated behavior) are expected to achieve ranks at random, resulting in a set of mostly undifferentiated RP scores [18]. An extreme case of this behavior was observed on applying rank sums (which has a sparser set of values) on the GSE12288 data set (see "Supplementary Material 1-Method", Part 3), where lengthy areas of genes adhering to the null hypothesis resulted in equal rank sum scores. In less extreme cases, undifferentiated rank product (or sum) scores result in non-zero differences between these consecutive, "null-hypothesis-abiding" terms of the ordered set, stochastically oscillating around values near zero.
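The transformation described above amounts to sorting the RP scores and differencing neighbouring terms. A minimal sketch follows; the function name `rp_differences` is an illustrative assumption and does not come from the published tool.

```python
def rp_differences(rp_scores):
    """Differences between consecutive terms of the ordered RP distribution.

    The scores are sorted in increasing order first, so every returned
    difference RP[i+1] - RP[i] is non-negative.
    """
    ordered = sorted(rp_scores)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# N scores yield N - 1 differences.
diffs = rp_differences([0.5, 0.1, 0.2])
```

For N features the result has N − 1 entries, matching the index range given in the text.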
Figure 2. (A) Entropy score distribution for the GSE60767 gene expression data set. Parameters used were: window size 220, 20 bins for discretization, and advancement of the sliding window gene-by-gene. The initial, low-entropy set is clearly defined (initial blue-colored area). The low-entropy area near the end is attributed to a single difference value oscillation, lowering the entropy of the 220 windows containing it. (B) The corresponding distribution of consecutive RP score differences. The initial, signal-dominated area always overlaps with genes selected by the original rank products. The latter, noise-dominated area represents features to be excluded.
Given that RP scores reward systematic behavior of features across samples (and thus convey information on it), these two distinct patterns can be considered to represent the difference between systematic behavior, attributable to differentiated processes (evident in the steep descent of RP differences), and noise-dominated behavior following the null rank product hypothesis (evident in stochastically oscillating RP differences). On this basis, we aim to utilize the calculation of entropy to partition this distribution into an initial, low-entropy and information-rich area and a high-entropy, information-poor subsequent area. The former will be considered to be the set S, containing n features. The latter will be considered the set of features with a low signal-to-noise ratio, excluded from further analysis.

Partitioning the Entropy Distribution
Using a sliding window over the distribution of differences, we calculate the entropy of values within each instance of the sliding window in nats, using the Shannon entropy:
$$H = -\sum_{k=1}^{p} \theta_k \ln \theta_k$$
as modified by a Dirichlet-multinomial regularization resulting in:
$$\hat{H}^{Bayes} = -\sum_{k=1}^{p} \hat{\theta}^{Bayes}_k \ln \hat{\theta}^{Bayes}_k$$
where $\hat{\theta}^{Bayes}_k = \frac{y_k + a_k}{n + A}$, $A = \sum_{k=1}^{p} a_k$ and $a_k = 1$, as per the Bayes-Laplace uniform prior, reflecting our prior knowledge for the bin counts used during discretization [19], with each bin having an equal a priori chance of accommodating a result. Here $y_k$ is the count of values in bin k, p is the number of bins, and n in this formula is the number of values in the window (not the cutoff n of the partitioning). Figure 2A shows the resulting entropy score distribution for the GSE60767 gene expression data set. The initial, low-entropy area represents S. Subsequently, entropy rises sharply and oscillates around higher values, due to stochastic oscillations of the RP differences (small in absolute value, as seen in Figure 2B). This pattern holds across different parameters of the sliding window and bin count, as well as across different data types, with consistency analogous to that of the RP distribution's overall shape.
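To make the sliding-window calculation concrete, here is a minimal sketch of the regularized entropy with a uniform prior of one pseudo-count per bin. Two details are assumptions, since the text does not state them: the equal-width bins span the range of each individual window, and the function names are hypothetical.

```python
import math


def regularized_entropy(values, bins=20, prior=1.0):
    """Shannon entropy in nats with theta_k = (y_k + a_k) / (n + A)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumption: bins span this window's range
    counts = [0] * bins
    for v in values:
        # The maximum value is clamped into the last bin.
        k = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[k] += 1
    total = len(values) + prior * bins  # n + A, with a_k = prior for all k
    entropy = 0.0
    for y in counts:
        theta = (y + prior) / total
        entropy -= theta * math.log(theta)
    return entropy


def sliding_entropy(differences, window=220, bins=20):
    """Entropy of every window, advancing one feature at a time."""
    return [regularized_entropy(differences[s:s + window], bins)
            for s in range(len(differences) - window + 1)]


# Four values spread evenly over four bins give the maximum, ln(4).
h = regularized_entropy([0, 1, 2, 3], bins=4)
```

Because every bin receives a pseudo-count, empty bins never produce a logarithm of zero, and a window whose values pile into a single bin scores well below the maximum of ln(p).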
We note that there may be numerous low-entropy areas over the entropy distribution. The RP distribution is an ordered set, with ordering corresponding to decreasingly statistically significant and consistent differential behavior under the fold change criterion. Thus, our hypothesis is that the very first low-entropy area corresponds to features with consistently differentiated behavior between the two populations, hence more strongly tied to the phenomenon under study compared to subsequent areas. Conversely, the high-entropy areas (including any local minima of entropy, resulting from random oscillations in low-signal areas as seen in the right part of the distribution) correspond to features associated with the rank products null hypothesis (achieving ranks with a uniform probability distribution [18]), and are thus not informative on the phenomenon. Consequently, we now need to consistently identify the first low-entropy area in an unsupervised manner. This segmentation should be achieved while eschewing the need for user-defined arbitrary thresholds, thus avoiding the re-introduction of the bias inherent in thresholding approaches. The finely detailed structure of the entropy distribution prohibits stable convergence of change-point algorithms, which would be the first approach to partitioning ordered distributions. They exhibit high variation of performance, wholly reliant on the specifics of the sliding window and bin count. This class of algorithms also fails to converge reliably on the RP distribution itself. For all the above, see also "Supplementary Material 1-Method".

K-means Clustering to Calculate n
To reliably overcome the partitioning problem of the RP distribution, we evaluated the performance of a selection of clustering methodologies, aiming to categorize the entropy values into a high-entropy and a low-entropy cluster. Maximum consistency against outliers and incidental complexity was exhibited by the K-means algorithm. A direct corollary of our hypotheses so far is that if the low-entropy cluster starts at the beginning of the RP score list (which is an ordered set), then there is at least a single, highly informative, differentially behaving feature. If so, n will be defined as the number of features preceding the first element of the high-entropy cluster (Figure 2A, leftmost blue area). The rationale is that the first high-entropy value will correspond to the first sliding window containing no information-rich features (in terms of behavioral pattern relevant to the comparison performed). Due to the ordering of the RP distribution, all features following a rejection are to be rejected as well. This procedure proves highly resistant to perturbations concerning the specifics of the sliding window and the number of bins used in entropy calculation. Repeating this procedure over a range of window sizes and bin counts yields a small set of suggested values for n, one of which exhibits prominent consistency. This will mark the cutoff defining the set of highly informative features, forgoing the need for external thresholds (to pfp or even entropy values), thereby decoupling the selection process from calculated statistical scores and relegating their use to post-selection result assessment.
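The cutoff rule can be sketched as a two-cluster K-means on the one-dimensional entropy values, followed by a scan for the first high-entropy window. This is an illustration under stated assumptions rather than the published R code: the centroids are initialized at the minimum and maximum entropy, "prominent consistency" is read as the most frequent suggested n, and all function names are hypothetical.

```python
from collections import Counter


def two_means_labels(values, max_iter=100):
    """1-D K-means with k = 2; True marks the high-entropy cluster."""
    low_c, high_c = min(values), max(values)  # assumed initialization
    labels = [False] * len(values)
    for _ in range(max_iter):
        labels = [abs(v - high_c) < abs(v - low_c) for v in values]
        low = [v for v, is_high in zip(values, labels) if not is_high]
        high = [v for v, is_high in zip(values, labels) if is_high]
        if not high:  # degenerate case: all entropy values are equal
            break
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (low_c, high_c):
            break  # centroids stopped moving
        low_c, high_c = new_low, new_high
    return labels


def cutoff_n(entropies):
    """Number of windows preceding the first high-entropy one."""
    labels = two_means_labels(entropies)
    if labels[0]:
        return 0  # list does not start with a low-entropy area
    for index, is_high in enumerate(labels):
        if is_high:
            return index
    return len(labels)  # no high-entropy cluster found (not covered in text)


def most_consistent(candidates):
    """Assumed consensus rule: the most frequent suggested n."""
    return Counter(candidates).most_common(1)[0][0]


n = cutoff_n([0.5, 0.6, 0.55, 2.0, 2.1, 0.7, 2.05])
```

In the example, the later low-entropy value (0.7) is ignored because only the first low-entropy area counts, so the cutoff is placed before the fourth window.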

Implementation
The development, actualization, testing, and verification through analyses were all performed in R v3.4.1, using open source R packages from Bioconductor in RStudio and usegalaxy.eu, in order to ensure transparency and reproducibility. Full citation of the packages is offered in "Supplementary Material 1-Method".
In the interests of reproducibility and platform independence afforded by a dockerized implementation, the dockerfile for the tool is hosted at: https://github.com/Hector-Xavier/Entropic_Ranks_docker.

Evaluation Criteria
A threshold-free, adaptive, generalized selection process, like the one proposed, should be evaluated according to the following criteria: (a) specificity of the selected features in terms of biological relevance, (b) sensitivity to weak biological signals, (c) performance on data sets of varying noise content, and (d) generality in terms of reliable performance across different types of experiments. To test against these criteria, we selected and analyzed a range of published, publicly accessible data sets, each tied to one or more of the aforementioned criteria. The chosen data sets (Table 2) and workflows used during analysis are presented more fully in "Supplementary Material 2-Data Sets". Biological relevance of the results was assessed with the Gene Set Enrichment Analysis [20] tools EnrichR [21,22] and BioInfoMiner [23]. As an additional measure of verification, we compared the performance of our method with the rank products and rank sums, both in their original form [4] and their recent implementation [7], as well as with the implementation of our entropic analysis upon the Rank Sum statistic (referred to as "entropic sums"). Moreover, SRP127667 was also tested for differential expression using EdgeR to verify the baseline biological relevance of results under a standard methodology, since the application of rank statistics on RNAseq count data is novel. Overall, our approach exhibited consistent behavior on real as well as simulated data. Functional analysis of the derived feature lists showed that entropic ranks provides increased specificity at a concise list size, supporting the argument of efficient rejection of noise-dominated features. In short, entropic ranks lists produced highly relevant functional term lists in all cases, whereas other methodologies (rank products, "entropic sums", hypothesis testing, etc.) show variable performance according to experiment setup and thresholding values, ranging from comparable performance to lack of results.
Discussion of the implementations and summaries of results is found in "Supplementary Material 1-Method". Discussion of the data sets and results of all methods applied on them are presented in "Supplementary Material 2-Data Sets". All files, tables and plots created are contained in "Supplementary Material 3-Output".
This approach was adopted because standard methodologies of comparison, such as list overlap, are inappropriate for two specific reasons. Firstly, they allow the assessment of interchangeability of two methods given similar thresholding choices, whereas entropic ranks were created to be a threshold-independent feature selection process. Secondly, statistical testing methodologies evaluate the value distributions in each population, whereas our approach is driven by patterned behavior, thus preferring genes with higher fold changes as a byproduct of its function, instead of its main focus.

Simulated Data
In order to assess performance on simulated data with known truth values, as is standard practice, we selected a simulated RNAseq count table data set with spiked values [24]. It has been created using a random number generator for the express purpose of benchmarking differential expression methodologies and consists of 10 samples, 5 "cases" and 5 "controls". Out of the 12,500 features, 1250 known features have spiked values.
Our method highlights features in a manner different from usual statistical testing: instead of relying on thresholding of statistical values, it trims the resulting lists according to the identification of pockets of organized and consistent behavior among the features investigated. Consequently, direct comparison to approaches such as t-testing can be difficult, especially on simulated data tailored to fit hypothesis testing. Simulated data sets created with random number generators exhibit none of the underlying biological constraints present in real data. Moreover, the differentially expressed features in biological systems tend to be organized in consistent networks.
These differences lead us to expect that our method, which detects patterns of expression instead of statistical distributions, will underperform compared to hypothesis testing approaches in a simulated, spiked value data set. Even more importantly, our method aims to assess the information content of features' behavior across populations. Simulated data created using random number generators by definition exhibit highly stochastic behavior in the absence of biological constraints, which are difficult to model. Assessment of such a data set, in terms of information content should be expected to return few findings, if any, as there are no consistent patterns to be detected. Moreover, removal of the spiked values should reduce the findings even further, possibly eliminating them altogether.
Indeed, entropic ranks underperformed in the identification of the spiked features as compared to limma on the simulated data, as can be seen in "Supplementary Material 3-Output". Testing the null hypothesis by removing the spiked values reduced performance even further, leading to the identification of 23 differentially expressed features by entropic ranks compared to a single differentially expressed feature returned by limma. However, investigation of these 23 "false positives" showed that they represented rows for which the random number engine had failed to create properly uniform distributions across samples. Instead, these features were highly differentially expressed (15-fold or more) between populations, usually with a single outlying value in one of the two populations. Both under the null hypothesis and when using the full data, entropic ranks was more robust than rank products and "entropic sums" against false positive results. Its robustness was comparable with that of rank sums, which has convergence issues with high data dimensionality (see "Supplementary Material 1-Method", Part 5).
This level of sensitivity to patterned behavior and robustness against false positive discoveries should be considered features of the method. Moreover, the plots created by entropic ranks during entropy calculation show very high entropy near the beginning of the distribution, and tend to oscillate around lower values further on. This pattern shows an initial low-entropy cluster behaving more like the noise-dominated areas than what we see in real data sets. Such behavior could help the experimenter identify poorly structured data sets.

Series GSE12288
The set provides microarray gene expression data of leukocytes from 110 patients with Duke coronary artery disease (CAD index > 23) and 112 control subjects (CADi = 0) [25]. It is provided as a studied data set of a known pathology upon which the baseline specificity of the method can be assessed. Moreover, we can compare our method to a mainstream analysis workflow by comparing our results to the list of 160 genes identified in the original publication as significantly (rho > 0.2, p < 0.0027) correlated with the CAD index.
A comparison of cardiovascular diseases associated with the gene lists identified in the original publication and through our methodology using the Comparative Toxicogenomics Database (CTD) [26] set analyzer is presented in Table 3. The gene lists do not overlap, with the exception of a single gene, CDC42, which has been shown to function as an anti-hypertrophic molecular switch in the heart [27,28]. BioInfoMiner was used to map the list onto the human phenotype [29] and MGI Mammalian [30,31] ontologies, highlighting a number of inflammatory response terms and T-cell activation processes, associated with abnormalities of the hematopoietic system. Mapping our list onto the Reactome ontology [32,33] through BioInfoMiner highlights a Selenocysteine synthesis process, which has been shown to have antioxidant effects [34]. EnrichR mapped the resulting gene list onto dbGaP [35], ranking "hypertension" as the top term by combined score. At the same time, the sampled tissue was clearly identified through Jensen TISSUES [36], ARCHS4 TISSUES [37] and Human Gene Atlas [38] as "blood", "peripheral blood" and "whole blood", respectively. When we mapped the 160-gene list from the original publication using BioInfoMiner, it returned relevant results, especially in the human phenotype and MGI Mammalian ontologies, but it failed to achieve both the breadth and specificity of the results returned by the list generated using our approach (see "Supplementary Data 3-Output", under GSE12288 for the full results).

Series GSE69486
In order to assess specificity even when the input signals are weak due to technical or biological (phenotype-associated) reasons, we used a data set containing microarray gene expression data of fibroblast cells from 10 samples of patients with bipolar disease and two control samples. One way to assess specificity is the capability of the resulting list of differentially expressed genes to identify the cell population. Sensitivity is evaluated by whether functional analysis of the gene list identifies ontological terms associated with the neurological pathologies underlying the bipolar condition (as was the hypothesis of the original study).
EnrichR successfully identified the cell population as "Fibroblast" through ARCHS4. Achilles fitness decrease [39] highlighted "GB1-central nervous system" as the most significant term by rank-based ranking. More phenotype-specific results were achieved through BioInfoMiner-mediated mapping of the list onto MGI Mammalian, which is densely described. This mapping includes a distinct branch of terms related to nervous system abnormalities. Of note is the presence of MMP3 in the highly connected gene list produced by BioInfoMiner, as it has been shown to be tied to bipolar disease [40]. Moreover, Reactome highlighted the term for "synthesis of prostaglandins (PG) and thromboxanes (TX)", which have been tied to bipolar disease [41] and are used as markers in relevant pharmacological research [42].

Series GSE60767
Chosen to assess performance in data of high noise content and to assess the sensitivity of the proposed method, this data set contains microarray gene expression data from 312 leukocyte samples of healthy adult males from the highly polluted industrial region of Ostrava and 154 healthy male control samples from Prague [43]. The study aimed to investigate differential gene expression induced by chronic exposure to elevated pollution levels. Due to the weakness of the biological signal and the need to address a significant batch effect induced by beadchip performance, standardized t-testing identified no statistically significant (p-value < 0.05) differentially expressed genes within each of the three sampling seasons, even with very low log fold change (lfc) thresholds (<0.1).
Our methodology was able to identify genes with differential behavior with regard to the city of origin in all three sampling seasons, as seen in Table 1. Mapping the resulting gene lists through EnrichR consistently identified "diabetes mellitus, type 2" through OMIM disease and OMIM expanded [44], a connection supported by past research [45]. BioInfoMiner mapping of the gene lists onto Gene Ontology [46,47] and MGI Mammalian ontologies highlights terms characteristic of a response to increased levels of particulate matter and associated pollutants. There are generalized inflammation indicators which have been tied to cardiovascular syndromes and lung cancer [48], as well as terms relating to the development of the nervous system, which can be attributed to the toxic metal load of particulate matter particles [49].

Series GSE42861
This methylation microarray data set was selected in order to test the performance and specificity of the proposed method on DNA methylation profiles of a known pathology [50,51]. It also allows evaluation of the generality of the proposed method, given that DNA methylation platforms contain many more probes, have a different distribution of values (M-values) and are also greatly influenced by blood cell population perturbations between samples. The study explores the methylation profiles of peripheral blood leukocytes from patients with rheumatoid arthritis compared to healthy controls. We opted to apply our methodology to the subpopulation of the data set consisting of samples taken from 50- to 60-year-old men and women who had never been smokers, in order to reduce potential confounders. Out of the 44 samples thus selected, 20 were from patients with rheumatoid arthritis and 24 were control samples.
Using EnrichR, Jensen DISEASES identified rheumatoid arthritis as the most significant term by rank-based ranking. Of particular note is the presence of allograft rejection terms at the top of the lists of both KEGG 2016 [52] and WikiPathways 2016 [53], pointing to the triggering of the same basic mechanisms in the course of the disease. BioInfoMiner mapping of the list onto Gene Ontology, Human Phenotype, MGI Mammalian and Reactome provides highly overlapping results. There is an overarching inflammatory response with terms specific to T-cell activation. Gene Ontology highlights the "telomere maintenance" term (Figure 3), which has been an area of active study as to its implication in autoimmune syndromes [54-56]. Furthermore, in the Human Phenotype ontology, highlighted terms include autoimmunity and rheumatoid arthritis. Lastly, the highly connected genes identified through BioInfoMiner for these four ontologies have a strong presence of the major histocompatibility complex family (HLA-C, HLA-DRB1, HLA-DQB1, HLA-DQA1, HLA-DRB1), a finding in agreement with one of the studies citing the data set [51]. Even with a reduced sample number, the proposed method extracted a biological signal highly relevant to the phenotype and in line with the findings of the original study for the full data set.

Series SRP127667
In order to further test the generality of the proposed method, we applied it to RNAseq count table data, which follow different distributions than gene expression and methylation data. The application of rank statistics to RNAseq count data is novel within the bioinformatics field. Nevertheless, in the field of astrophysics, RPs have been successfully used in pipelines performing occultation [57] and gravitational wave [58] event verification. These "discovery enumeration" phenomena are Poisson point processes, similar to the discovery-based formation of RNAseq count tables from RNA reads. The requirements of RP hold for the count tables except for the independence of variance, which has been shown to affect statistical threshold selection [59], a step we do not perform. We therefore extend the verification testing of our non-parametric methodology to RNAseq data, aiming to assess the specificity of the method despite the biological and computational problems generated by RNAseq data as opposed to microarray transcriptomics. We compared the gene counts obtained from RNA sequencing of cardiac myocytes from 10 adult patients with terminal heart failure to three control samples from the BioProject study SRP127667. Additional analysis of these data using edgeR verified the relevance of the entropic ranks results.
EnrichR was used to map the differentially expressed genes to Panther 2016 [60], Jensen DISEASES and Reactome. Panther 2016 highlighted as the first term by combined ranking the Wnt signaling pathway, which has been implicated in cardiovascular syndromes [61]. Jensen DISEASES terms ranked first by combined score were "hypertension", "coronary artery disease" and "cerebrovascular disease". Reactome highlighted as the second term by combined ranking the Ca2+ pathway. Using BioInfoMiner to map the list onto the Gene Ontology and Human Phenotype ontologies showed terms related to thrombosis abnormalities, tyrosine phosphorylation of Stat3 protein, and the regulation of body fluid levels through the urinary system, the last of which is a known regulatory mechanism of blood pressure (Figure 4). Lastly, the highly connected genes identified through BioInfoMiner are specifically associated with pharmaceuticals prescribed for cardiovascular conditions.
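The rank product statistic referred to above can be illustrated on a toy count table. The following is a minimal sketch and not the pipeline used in this study: the function name `rank_products`, the pseudocount of 1 and the ranking of pairwise log ratios are assumptions made for the example. It only shows that the statistic depends on the ordering of genes within each comparison and not on the distribution of the counts themselves.

```python
import math
from itertools import product


def rank_products(cases, controls, pseudocount=1.0):
    """Rank product per gene for up-regulation in cases versus controls.

    cases, controls: lists of samples, each sample a list of counts per gene.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    n_genes = len(cases[0])
    log_rank_sums = [0.0] * n_genes
    n_pairs = 0
    # Every case sample is compared with every control sample.
    for case, ctrl in product(cases, controls):
        ratios = [
            math.log2((case[g] + pseudocount) / (ctrl[g] + pseudocount))
            for g in range(n_genes)
        ]
        # Rank 1 goes to the gene with the largest log ratio in this pair;
        # ties are broken by gene index because the sort is stable.
        order = sorted(range(n_genes), key=lambda g: -ratios[g])
        for rank, g in enumerate(order, start=1):
            log_rank_sums[g] += math.log(rank)
        n_pairs += 1
    # Geometric mean of the ranks, computed in log space for stability.
    return [math.exp(s / n_pairs) for s in log_rank_sums]


if __name__ == "__main__":
    toy_cases = [[100, 10, 50], [120, 8, 55]]   # two hypothetical case samples
    toy_controls = [[10, 12, 50]]               # one hypothetical control sample
    print(rank_products(toy_cases, toy_controls))
```

Genes that rank consistently near the top across all pairwise comparisons receive a small rank product, whatever the scale of the underlying counts.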

Conclusions
We present and evaluate a methodology, which extends the rank products method to create a generalized framework for threshold-independent selection of differentially expressed features, according to the information content of their behavior.
The biological interpretation of the functional analysis performed on each data set supports the capability of our method to separate information-rich data from noise, avoiding the limitations and pitfalls of fold change and p-value thresholding, which are inherent in statistical testing approaches such as t-testing. Fold change and pfp are computed, but are relegated to quality indicators for the evaluation of the experiments and subsequent analysis instead of being used as decision criteria.
The analytic workflow we apply relies solely on elementary preprocessing, normalization and signal correction bioinformatic techniques, to ensure reproducibility of results and transparency of the comparative evaluation. No further processing steps aiming to force values to conform to a specific kind of distribution were used, which points to the generalized character of the proposed method, as well as the ease of its introduction into broader data analytic scenarios.
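The idea of selecting by information content instead of a fixed threshold can be illustrated with an entropy profile computed along a sorted rank product list. The sketch below is an assumption-laden illustration and not the published implementation: the function names, the window size, the step and the per-window histogram binning used to estimate Shannon entropy are all choices made for the example.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins, lo, hi):
    """Shannon entropy (in bits) of a histogram of `values` over [lo, hi]."""
    # A zero-width range falls back to width 1.0 so all values share one bin.
    width = (hi - lo) / n_bins or 1.0
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def entropy_profile(sorted_rp, window=4, step=1, n_bins=4):
    """Entropy of each sliding window along an ascending rank product list.

    The binning range is taken per window (an assumption of this sketch),
    so a window with large gaps between values scores lower than a window
    of evenly spaced values.
    """
    profile = []
    for start in range(0, len(sorted_rp) - window + 1, step):
        chunk = sorted_rp[start:start + window]
        profile.append(shannon_entropy(chunk, n_bins, min(chunk), max(chunk)))
    return profile


if __name__ == "__main__":
    # Hypothetical sorted rank products: two outlying features, then a dense bulk.
    toy_rp = [1.0, 1.2, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5]
    print(entropy_profile(toy_rp, window=4, step=1, n_bins=4))
```

In this toy list the first window, which contains the two outlying features, has lower entropy than the last window drawn from the evenly spaced bulk. A larger `step` gives a coarser and cheaper profile, a smaller one a finer profile.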
Further comparison between the results of enrichment analysis following entropic ranks and rank products, as well as other methodologies (detailed in "Supplementary Material 2-Data Sets") shows that the identification of differentially expressed features by the proposed method provides highly specific information with respect to the experiment. Manual assessment often suggests higher specificity of results both when compared to larger and smaller lists, as well as when lists overlap (as in comparisons with rank products/rank sums results) and when containing different sets of features, as a result of using different families of methodologies (e.g., hypothesis testing performed in our lab or the results of the initial publication of a data set).

Features of Entropic Ranks
In summary, the proposed method extends the rank product methodology by incorporating the measurement of information content as an integral part of the analysis and interpretation. Firstly, the selection of significant genes is based on the distribution of all genes over the entire populations, rather than on evaluating each gene independently. Information-poor data, such as simulated data sets [24] with a very low signal-to-noise ratio, exhibit a starkly different entropy distribution, without a defined, initial, low-entropy area followed by a stable high-entropy area with minimal oscillations (see "Supplementary Material 2-Data Sets" and "Supplementary Material 3-Output"). Secondly, the methodology is applicable to a broad array of selection problems and data types, as long as they conform to the basic assumptions made by the rank products methodology. Thirdly, it departs from the adoption of arbitrary or empirical statistical thresholds, exploring the information density of the distribution and cherry-picking clusters of high information content through rigorous entropic analysis. Fourthly, the automation of the partition process is possible, allowing for unsupervised and unbiased analytical processes to be applied. The fifth feature is the ability to freely adjust the analytic granularity (by changing the sliding window step) to more refined or coarser inspection, enabling solutions of varying computational cost and level of convergence. Lastly, another advantage of this method is the potential for integration of data from different sources or dissection levels into the same analysis, as long as they can be transformed to similar, ranked value distributions.

Funding: Irene Liampa's PhD thesis is supported by a scholarship from the State Scholarship Foundation in Greece (IKY) (Operational Program "Human Resources Development-Education and Lifelong Learning" Partnership Agreement (PA) 2014-2020).
We acknowledge support of this work by the project "ELIXIR-GR: Hellenic Research Infrastructure for the Management and Analysis of Data from the Biological Sciences" (MIS 5002780) which is implemented under the action "Reinforcement of the Research and Innovation Infrastructure", funded by the Operational Program "Competitiveness, Entrepreneurship and Innovation" (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund).