Cancer Segmentation by Entropic Analysis of Ordered Gene Expression Profiles

The availability of massive gene expression data has posed the challenge of how to curate, process, and extract useful information from it. Here, we describe the use of entropic measures as discriminating criteria in cancer using the whole data set of gene expression levels. These methods were applied to classify samples as tumor or normal tissue for 13 tumor types, with a high success ratio. Using gene expression ordered by pathways results in complexity-entropy diagrams. The map allows the clustering of tumor and normal samples, with a high success rate for nine of the thirteen studied cancer types. Further analysis using information distance also shows good discriminating behavior but, more importantly, allows discrimination between cancer types. Together, our results allow the classification of tissues without the need to identify relevant genes or impose a particular cancer model. The procedure used can be extended to classification problems beyond the results reported here.


Introduction
Entropic magnitudes, as defined in information theory and related areas, have been used in a wide range of fields [1] beyond communication and physics, including literary analysis [2][3][4], painting [5], and music [6], among others. The advantage of entropic variables is the general framework on which they are founded, which allows them to be used in any system that conveys information in a broad sense. At the same time, they can be used to explore the emergence of patterns, as opposed to noise understood as a random ordering of symbols, without, in principle, resorting to a specific model of the data.
DNA is all about information, so it is no surprise that information entropy has been used to characterize the sequence of nucleotides [7][8][9][10], despite earlier works considering such analysis limited [11,12]. In this work, entropic magnitudes are used as a tool for classifying tissues into normal and tumor samples. Beyond the specific application of this contribution, the procedure presented generalizes straightforwardly to other sequence analyses.
More than two decades of experimental development have resulted in the availability of techniques to simultaneously probe thousands of gene expression levels from a single sample [13][14][15][16][17]. The access to such a huge data set has posed the challenge of meaningfully harvesting the information it contains while discriminating, for a given purpose such as cancer diagnosis, the useful from the noise. Data analysis methods have been used over the years to feature-select and classify gene expression data [18][19][20][21]. Among others, this includes the use of statistical methods [22,23], neural networks and machine learning [24,25], and singular value decomposition-based methods [26,27]. In almost all cases, data analysis must effectively reduce the large feature set to identify a subset of distinctive genes that carry enough relevant information for the study at hand [28][29][30]. Without an underlying model, such a reduction is still a challenge that has not been robustly solved.
It is known that the identification of subtle patterns in gene expression can be used as the basis for classifying tissues, differentiating cancerous from non-cancerous samples [31][32][33][34][35][36]. This has been used not only for exploratory-diagnosis tasks but also to understand which transformations in cancerous cells are relevant and to identify the genes responsible for them [15]. Recent studies have looked into the relationship between gene expressions and pathways in both coding and non-coding mutations of genes for a large number of cancer genomes [37]. The relevance of pathways can be essential to exploring the cell circuitry and the causes of cancer. It can also prove helpful in designing new experiments with ad hoc perturbed cell conditions to test different hypotheses.
Entropic measures derive from several sources, starting with information theory [38] and including other fields such as Kolmogorov complexity [39], dynamical systems, and ergodic theory, among others. We applied entropic analysis to several samples from the Cancer Genome Atlas (TCGA) database for thirteen cancers. Two main results will be reported. First, our analysis, using complexity maps and information distance, shows strong discriminating behavior between tumor and normal tissue samples without the need for gene number reduction, avoiding sophisticated feature extraction procedures. Second, there is clear evidence that ordering the expression data along generic reaction pathways is relevant to characterizing cancer in terms of organization and pattern production, a behavior that cannot be explained solely on the basis of the relative abundance of the different gene expressions.

Gene Expression Level and Gene Expression Coding
Gene expression data and their corresponding normal tissues are taken from the TCGA portal (https://portal.gdc.cancer.gov/) (accessed on 10 September 2020). RNA-Seq data in the FPKM format are used. The data contain expressions for 60,483 protein-coding genes, RNA genes, and pseudogenes. Gene expression levels are coded into three classes with a three-value alphabet χ = {−1, 0, 1} as follows. A value of 0.1 is added to each gene expression level, allowing for geometric averaging, which is chosen due to the long tails in the gene expression distribution function. The geometric average is performed over the set of normal tissue samples, which defines a reference value e_r for each gene expression. The differential expression is then e/e_r, and the base 2 logarithm is used to define a fold variation: e_f = log_2(e/e_r). We take e_f < −1 as downregulated and assign a value of −1 to the gene expression coding; e_f > 1 is taken as upregulated, and the value 1 is assigned; 0 is assigned otherwise.
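As an illustration of this coding step, the following minimal sketch (in Python, with NumPy) assumes `expr` holds the FPKM values of one sample and `normal` a genes-by-samples matrix of normal-tissue values; the names and array layout are illustrative, not those of our implementation:

```python
import numpy as np

def code_expression(expr, normal, pseudo=0.1):
    # Shift by 0.1 so the geometric mean is defined for zero counts
    expr = np.asarray(expr, dtype=float) + pseudo
    normal = np.asarray(normal, dtype=float) + pseudo
    # Geometric average over the normal samples gives the reference e_r per gene
    e_r = np.exp(np.log(normal).mean(axis=1))
    # Fold variation in base-2 logarithm
    e_f = np.log2(expr / e_r)
    # -1: downregulated, +1: upregulated, 0: otherwise
    code = np.zeros(expr.shape, dtype=int)
    code[e_f < -1] = -1
    code[e_f > 1] = 1
    return code
```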

Reactome Pathways and Ordered Sequences
For the coding of the biological pathways, the Reactome database was used. A list of pathways and the genes identified as participating in them (https://reactome.org/download/current/Ensembl2Reactome.txt) (accessed on 5 February 2021) was ordered lexicographically. An ordered sequence of gene codes was built by parsing the Reactome list and assigning to each entry the corresponding code (see Figure A1 in Appendix A.1).
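A minimal sketch of this ordering step could look as follows; it assumes the Ensembl gene identifier sits in the first column of Ensembl2Reactome.txt and the Reactome pathway identifier in the second, and that `code` maps gene identifiers to their coded values in {−1, 0, 1} (the column layout is an assumption to check against the downloaded file):

```python
def ordered_sequence(mapping_path, code):
    entries = []
    with open(mapping_path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            gene, pathway = fields[0], fields[1]
            entries.append((pathway, gene))
    # Lexicographic order: first by pathway identifier, then by gene
    entries.sort()
    # Keep only genes present in the coded expression data
    return [code[g] for _, g in entries if g in code]
```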

Entropic Measures
Consider a bi-infinite sequence s of values s_i at position i (−∞ < i < ∞), taken from a discrete alphabet χ. The (Shannon) block entropy of length L of the sequence is defined as H(L) = −∑_{s(L)∈χ^L} p[s(L)] log p[s(L)], where p[s(L)] is the probability of finding a finite subsequence s(L) of length L in the bi-infinite string s. There are |χ|^L possible finite sequences of length L. If the logarithm in the definition of block entropy is taken in base 2, the units are bits. The entropy density is then h = lim_{L→∞} H(L)/L; it measures the entropy per symbol when an infinite number of symbols have been observed and, therefore, the irreducible randomness per symbol of the sequence. The entropy density is zero for a constant, a periodic, and a quasi-periodic string, while it attains its maximum value log |χ| for a completely random sequence.
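For small L, the block entropy can be estimated directly from empirical block frequencies, as in the following naive sketch; for the sequence lengths used here, such direct estimates are only practical for small L, which is why the Lempel-Ziv estimates described below are used instead:

```python
from collections import Counter
import math

def block_entropy(s, L):
    # Empirical estimate of H(L) in bits from the frequencies of
    # length-L blocks in a finite sequence s (any indexable sequence)
    blocks = Counter(tuple(s[i:i + L]) for i in range(len(s) - L + 1))
    n = sum(blocks.values())
    return -sum((c / n) * math.log2(c / n) for c in blocks.values())

# H(L)/L approaches the entropy density h for large L; for a periodic
# string it decays toward zero, e.g. block_entropy("01" * 500, 4) / 4
# is about 0.25 bits per symbol and keeps shrinking with L
```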
Consider h(L) = H(L) − H(L − 1); the effective measure complexity (E) [40], also known as excess entropy [41], can be defined as E = ∑_{L=1}^{∞} [h(L) − h]. E is the mutual information between the two halves of the bi-infinite string [1,42]. It measures the memory, or the amount of information, that one half of the sequence carries about the other half. We will be using plots of E vs. h, an example of what is called a complexity-entropy map. Complexity-entropy maps show the trade-off between randomness or pattern production, as measured by h, and pattern recurrence or memory, as measured by E; they have been used (under different names) in studies of an extensive set of systems [1,4,43].
Finally, the information distance between two sequences comes from Kolmogorov complexity theory [44]. The Kolmogorov complexity K(s) of a string s is the length of the shortest program capable of reproducing the string when run on a Universal Turing Machine [39]. A pattern-recurrence-driven sequence will have a Kolmogorov randomness that grows slowly (e.g., as log L) with the sequence length, such that its normalized value K(L)/L → 0 as L → ∞. The same ratio goes to 1 for a random sequence. The information distance between two sequences s and q is defined as d(s, q) = [K(sq) − min{K(s), K(q)}]/max{K(s), K(q)}, where sq represents the concatenation of both strings. Information distance measures how innovative one string is with respect to the other from an algorithmic perspective: it measures the length of the shortest program that can transform one string into the other. Information distance has been used to study the relationship between instances of a system in a wide set of areas, for example, to extract phylogenetic evolutionary trees from the genes of mammals [39].
All entropic measures are estimated using the Lempel-Ziv approximation.

Lempel-Ziv Estimates
When finite data are considered, the entropy rate has to be estimated [42,45], as the asymptotic and limiting values are unreachable. The Lempel-Ziv factorization [46] procedure has been used to estimate the entropy density.
Consider the following factorization of a sequence s = s_1 s_2 … s_N: F(s) = s(1, p_1) s(p_1 + 1, p_2) … s(p_{k−1} + 1, N), where each p_k is a natural number, and s(p, q) is the substring s_p s_{p+1} … s_q. Each symbol s_p is drawn from a finite alphabet χ of cardinality |χ| (e.g., χ = {−1, 0, 1}, |χ| = 3). F(s) is called an exhaustive history of the sequence s if any factor s(p_{j−1} + 1, p_j) is not a substring of the string s(1, p_j − 1), while s(p_{j−1} + 1, p_j − 1) is, except perhaps for the last factor.
For example, the sequence 0110111001101010011111111100110101111, taken from a binary alphabet χ = {0, 1}, has the following factorization, where a dot separates each factor: 0·1·10·11·100·110·101·00·111·1111·1100·1101·01·111. The Lempel-Ziv complexity C(s) is then the cardinality (number of factors) of the exhaustive history F(s); in the above example, C(s) = 14.
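The dotted factorization above can be reproduced with a short dictionary-based incremental parse (an LZ78-style variant of the Lempel-Ziv factorization); the sketch below is an illustration, not the implementation used in this work:

```python
def lz_factorize(s):
    # Incremental parse: extend each candidate factor while it matches
    # an already seen phrase; the last factor may repeat a phrase
    phrases, factors, i = set(), [], 0
    while i < len(s):
        j = i + 1
        while j <= len(s) and s[i:j] in phrases:
            j += 1
        factors.append(s[i:j])
        phrases.add(s[i:j])
        i = j
    return factors

s = "0110111001101010011111111100110101111"
f = lz_factorize(s)
print(".".join(f), len(f))  # 14 factors, matching the example
```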
In general, C(s) for a length-N string is bounded by [46] C(s) < N / [(1 − ε_N) log N], where ε_N = 2 [1 + log log(|χ|N)] / log N. We used log x ≡ log_{|χ|} x to simplify the notation. ε_N is a slowly decaying function of N, leading to an asymptotic value for large enough N. Ziv [47] proved that, if s is the infinite-length output of an ergodic source with entropy rate h, then, almost surely, lim_{N→∞} C(s(1, N)) log N / N = h. The use of the Lempel-Ziv factorization for estimating the entropy density of finite-size sequences has proved robust even for short strings [45]. For a 10^4-length sequence, which will be used in this study, the order of magnitude of the error bound is around 10^{−2} [48]. The Lempel-Ziv factorization procedure was implemented in in-house software (written in C++, with a run time below one minute for each data set) and has been used in previous studies [4,49,50].

The effective measure complexity E is estimated using a random shuffle procedure [51], E ≈ M_max [h(s^(M_max)) − h(s)], where s^(M) is a surrogate string obtained by partitioning the string s into non-overlapping blocks of length M and performing a random shuffling of the blocks. The shuffling for a given block length M destroys all correlations between symbols for lengths larger than M while keeping the same symbol frequency. M_max is chosen appropriately, given the sequence length, to avoid fluctuations. In spite of the fact that this estimate is not strictly equivalent to E, it is expected to behave in a similar manner [51]. As already explained, the information distance d(s, q) comes from the use of the algorithmic randomness [52] K(s) = |s*| of a string s, that is, the length of the shortest program s* capable of producing the string s when run on a Universal Turing Machine.
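A sketch of these estimates, building on the `lz_factorize` function above and assuming the sequence is held as a string over a k-symbol alphabet, could read as follows (the form of the E estimate follows the block-shuffle argument given above and is an assumption, not the exact expression of [51]):

```python
import math
import random

def h_lz(s, k=3):
    # Entropy-density estimate h ~ C(s) log N / N, logs in base |X| = k
    N = len(s)
    return len(lz_factorize(s)) * math.log(N, k) / N

def surrogate(s, M):
    # Non-overlapping length-M blocks, randomly shuffled: correlations
    # beyond M are destroyed, symbol frequencies are kept
    blocks = [s[i:i + M] for i in range(0, len(s), M)]
    random.shuffle(blocks)
    return "".join(blocks)

def excess_entropy(s, M_max, k=3):
    # E ~ M_max * (h(s^(M_max)) - h(s)), per the shuffle estimate above
    return M_max * (h_lz(surrogate(s, M_max), k) - h_lz(s, k))
```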
It is known [39] that the entropy density h is also given by h = lim_{L→∞} K(s(1, L))/L. From this result, it follows that [49] K(s) can be estimated as N_s h(s), where N_s is the length of s, leading to d(s, q) ≈ [N_{sq} h(sq) − min{N_s h(s), N_q h(q)}] / max{N_s h(s), N_q h(q)}. Again, we estimate the entropy density via the Lempel-Ziv factorization and, from there, d(s, q).
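Combining the pieces, the distance estimate can be sketched as follows, with K(x) replaced by the length of x times its Lempel-Ziv entropy-density estimate `h_lz` from above:

```python
def info_distance(s, q, k=3):
    # K(x) is approximated by len(x) * h_lz(x), as in the estimate above
    ks, kq = len(s) * h_lz(s, k), len(q) * h_lz(q, k)
    ksq = len(s + q) * h_lz(s + q, k)
    return (ksq - min(ks, kq)) / max(ks, kq)
```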

Results
Gene expression data for 13 tumor types were studied (a table with the TCGA nomenclature for the cancers studied is published as Table A1 in Appendix A). The data contain expressions for 60,483 protein-coding genes, RNA genes, and pseudogenes. Gene expression levels are coded into three classes with a three-value alphabet χ = {−1, 0, 1}, corresponding to under-, normal-, and over-expressed, respectively. The list of gene expression classes is ordered using a list of pathways and the genes identified as participating in them. Such a list can be considered a data sequence.

Complexity-Entropy Maps
For every cancer type, the entropy density and E were first estimated from the randomly ordered sequence that results from the unordered nature of the gene expression data. The corresponding complexity-entropy map was plotted, as shown for the COAD data in Figure 1a. The cancer samples tend to have higher values of entropy density. While segmentation is possible using the entropy density value, no relation was found with the E values. Upon sorting the gene expression classes with the pathway list, a different picture emerges, as E strongly correlates with the h value. Figure 1b shows that segmentation is now possible considering both the h and the E values. For the samples of cancer tissue, there is an order-of-magnitude increase of E compared to the unsorted strings, which points to the increasing appearance of structuring in the sorted data set.
From now on, all results refer to the sorted data. Normal tissue exhibits a mean (median) h value of 0.304 (0.251) (see Table 1), while, for the tumor samples, the mean (median) value is 1.012 (1.013), a 3.32 (4.04) fold increase. For E, the mean (median) increases 3.85 (4.43) times, from 0.537 (0.460) to 2.067 (2.037). The Euclidean distance in the complexity-entropy map between the mean (median) value of the normal samples and that of the tumor samples was calculated to be 1.686 (1.751). Similar results were found in the other cancer types. Structuring may come from two processes. The first is a change in the symbol fractions; in this case, the symbols represent one of the three classes of up-, down-, and normal-regulated genes, so this process refers to a change in the fraction of symbols in each class. The second is the ordering of the symbols. Structuring by symbol fraction change usually dominates when one of the symbols becomes dominant in the data.
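For concreteness, the quoted mean-value separation is just the Euclidean distance between the two mean points in the (h, E) plane:

```latex
d_{\mathrm{mean}} = \sqrt{(1.012 - 0.304)^2 + (2.067 - 0.537)^2}
                  = \sqrt{0.708^2 + 1.530^2} \approx 1.686
```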
Tumor samples show a larger number of up- and downregulated genes than the normal samples, at the expense of the other genes. Figure 1c shows the change in symbol fractions when the normal samples are compared with the tumor samples. However, when we plot the complexity-entropy map of the difference between the sorted and the unsorted set values (∆E vs. ∆h, where ∆E = E_sorted − E_unsorted and similarly for ∆h), as shown in Figure 1d, it becomes evident that the increase in E is mainly the result of symbol rearrangement, as any contribution coming from symbol fraction changes is canceled. The same can be said for the increase of the h values.

A straightforward discrimination procedure was used based on the complexity-entropy map. For every sample, the Euclidean distance to the mean (median) points of the normal and tumor classes was calculated. The sample was assigned to the class whose mean (median) point lies at the shortest Euclidean distance. With such criteria, the success rate of detecting normal tissue was 0.976 (0.976), and that of detecting tumor tissue was 0.979 (0.983).
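This discrimination rule amounts to a nearest-class-mean classifier in the (h, E) plane; a minimal sketch, using the COAD mean values quoted above, is:

```python
import numpy as np

def classify(point, normal_mean, tumor_mean):
    # Assign the sample's (h, E) point to the class with the closer mean
    d_n = np.linalg.norm(np.asarray(point) - np.asarray(normal_mean))
    d_t = np.linalg.norm(np.asarray(point) - np.asarray(tumor_mean))
    return "normal" if d_n < d_t else "tumor"

# Example with the COAD mean values quoted above
print(classify((0.9, 1.8), (0.304, 0.537), (1.012, 2.067)))  # tumor
```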
Similar analyses were performed for the rest of the cancer types, and the complexity-entropy map for each is published in Appendix A. Table 2 summarizes the results. While good discriminating results are robust, above 0.9 for most cancer types, the present procedure had a lower performance for the PRAD, STAD, and THCA cancer types. The worst discriminating fraction was that of the PRAD samples, with values around 0.7.

Table 2. Success fraction using the Euclidean distance as the discriminating criterion in the complexity-entropy map. The mean and median values for the normal and tumor points were calculated. A given sample is ascribed to the class whose mean (median) (h, E) point has the smaller Euclidean distance to the sample point. Each sample from a given tumor type was used as a test sample to compute the success ratio. The fraction corresponds to the ratio between the number of correctly classified samples and the total number of samples in a given class.

If the complexity-entropy map for the mean values of all cancer types is plotted (Figure 2), two groups can be identified. One, on the upper right of the plot, corresponds to the tumor tissue data, exhibiting larger h and E values. The second group corresponds to the normal tissues, at the lower-left corner, with smaller h and E values. The poor discriminating performance of the PRAD, STAD, and THCA samples comes from the fact that their mean values for the cancer (PRAD, THCA) and normal (STAD) tissue data cannot be included in their corresponding groups.

Discrimination by Information Distance Measure
The information distance matrix for all samples within a cancer type was computed; we show in Figure 3 the dendrogram corresponding to the COAD tumor type (the dendrograms for all cancer types are published in Appendix A).
From the dendrogram of Figure 3, it is clear that the normal and tumor tissues are clustered in the plot hierarchy. Interestingly enough, all the branches of the cancer tissue show deeper levels of derivation from the root. This result could motivate further exploration of the structure of the dendrogram in terms of the cancer gene expression in the individual tissues. Discrimination of normal and tumor tissues could be made from the dendrogram alone, yet we used a different approach.

Figure 3. The information distance d(s, q) between all samples of a given cancer type was calculated, resulting in a distance matrix. The dendrogram is built from the distance matrix using the Phylip suite of programs [53]. Blue corresponds to normal, and red to tumor tissue data. The inset shows the details at the root of the dendrogram.
For the distance matrix, a discriminating procedure was designed based on a majority rule. For each sample, the information distance to the other tissue samples was considered, and the types of the nearest neighbors were used as a majority discriminating criterion. The sample was assigned to the type, tumor or normal, of the majority of its neighbors (Figure 4). For each cancer type, the number of neighbors to consider was taken between 3 and 8 and optimized for performance.
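This majority rule is a k-nearest-neighbor vote on the information-distance matrix; a minimal sketch, assuming `D` is the distance matrix and `labels` the list of sample types, is:

```python
import numpy as np

def knn_label(D, labels, i, k=5):
    # k nearest neighbors of sample i by information distance, self excluded
    order = np.argsort(D[i])
    neighbors = [j for j in order if j != i][:k]
    # Majority vote: +1 for a tumor neighbor, -1 for a normal one
    votes = sum(1 if labels[j] == "tumor" else -1 for j in neighbors)
    return "tumor" if votes > 0 else "normal"
```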
Consider a neighborhood formed by the eight nearest samples. For the COAD cancer type, if we assign a value of −1 to each cancer neighbor and a value of 1 to each normal neighbor, the average neighbor score for a randomly chosen sample is −6.94, showing that there are many more cancer samples than normal samples. However, the average neighbor score for the normal samples is 5.64, pointing to the fact that normal samples surround normal samples, while the average score for the cancer samples is −7.98, showing that cancer samples surround cancer samples.
A second discriminating procedure was also used, where each neighbor is weighted by its distance; no significant difference was found. The success fraction for all cancer types is reported in Table 3. While the success rate for the normal tissue is lower in the PRAD case, compared to the one using the Euclidean distance in the complexity-entropy map, the success rate for the cancer tissue is much higher, up to 0.994, similar to the values obtained for the higher-ranking success rates in Table 2. Even the smallest success fraction for the cancer tissue was as high as 0.973, for the KIRP samples. Care must be taken, though, with such high success fractions, as the number of normal tissue samples relative to tumor samples is biased towards the tumor samples, which results, statistically, in a larger number of tumor neighbors for the random case. The same comment can be made on the low success rate for the normal tissue, as, statistically, there are fewer normal neighbors than tumor ones. Improvement should be expected when more normal tissues are included in the sample set.

Table 3. Success fraction using the information distance as the discriminating criterion. From the distance matrix, the nearest neighbors of a given sample are taken, and a majority rule is used to classify the sample as tumor or normal type. The fraction corresponds to the ratio between the number of correctly classified samples and the total number of samples in a given class.

Finally, it was studied whether the distance matrix allows for discriminating between types of cancer. The KIRC and KIRP kidney cancers were chosen for two reasons. On the one hand, they are two types of tumors from the same organ; therefore, discrimination from the genetic data extracted from the sample tissue could make clinical sense. On the other hand, both mean value points are too close in the complexity-entropy map to consider discrimination from such an analysis. Expression reference values were calculated using the set of normal tissue values from both the KIRC and KIRP data sets. The expression classes were determined, as already explained, using these common reference values. The distance matrix was calculated between all samples from both tumors, KIRP and KIRC, and the normal tissues. The distance matrix and dendrogram are shown in Figure 5, where each cancer type and the normal samples are grouped in mostly separate clusters. Using the discrimination procedure described above, 0.998 of the cancer tissues were identified as tumors and, from there, 0.939 were correctly identified as KIRC-type tumors and 0.913 as KIRP-type. In addition, 0.981 of the normal tissues were correctly identified as such. The same analysis was carried out for the LUSC and LUAD cancer types. The success fraction for cancer tissue was 0.994. Fractions of 0.839 and 0.984 of the LUAD and LUSC cancers were correctly discriminated, respectively. The dendrogram is shown in Appendix A.5.

Discussion
This work reports the application of entropic measures that proved useful in analyzing an extensive gene expression data set. The measures seem to allow, at least for the scope of this study, the segmentation of tumor tissue samples from normal tissue samples for various cancer types. The sorting of the data along pathways enhanced the segmentation abilities of the applied techniques. One may ask if this points to features beyond cancer discrimination. Gene expression studies based on pathway analysis have shown that, even when no significant variation of any single gene expression level is involved, studying groups of genes by their functional role identifies the difference between samples from healthy subjects and subjects with some types of diseases or abnormal conditions, such as diabetes or the epithelial tissues of smokers [27,54]. Such studies emphasize that the analysis of gene expression levels alone may be insufficient to recognize or characterize a disease condition. In the hallmark paradigm of cancer [55,56], changes in gene expression in tumor cells result in the reprogramming of the cell circuitry to sustain the so-called hallmarks of cancer. The intracellular integrated circuitry is divided into distinct subcircuits, each of which performs some specialized function in normal cells and is modified in cancer cells. In this picture, the effect of varying gene expression levels is relevant as it modifies specific pathways supported by the cell circuitry. Our studies emphasize such a picture, as the pathway analysis of gene expression profiles via entropic measures can be interpreted as grasping some of the relevant features of this circuitry modification.
Entropic measures have been used to study the pattern gain at different levels of the grammar hierarchy in written language [4]. In our case, the ordered sequence has two possible organization levels, one within the pathways and the other from the position of the pathways in the list. At both levels, there is no a priori "natural" ordering, and the lexicographic order used is, in principle, as good as any other ordering scheme. However, the ordered data immediately show the emergence of a clear trend between the entropy density and E. The simultaneous increase of entropy density and E is a fingerprint of increased complexity: the disorder, as measured by h, can still accommodate enough pattern recurrence, as measured by E. In this sense, the analysis of the tumor samples shows a more complex pathway-ordered gene expression profile. This complexity does not arise mainly from changes in the fraction of samples in each symbol class, although it is clear that up- and downregulated genes increase at the expense of the non-regulated ones. The role of gene expression levels in connection with the appearance of cancer is usually discussed, but what is pointed out here is that this does not tell the whole story. There is no real physically meaningful order for the pathways, so randomization breaks any pattern at the intra-pathway ordering level, leading immediately to the disappearance of any relation between pattern production and pattern recurrence, as E drops an order of magnitude.

Conclusions
We have shown that, in several cases and within the scope of the present study, entropic measures allow the discrimination of cancer samples from normal samples through a class analysis of gene expression levels. With both complexity-entropy maps and information distance, classification was possible with a high success rate within a cancer type, above 98% for several cancer types. Furthermore, information distance allows for better discrimination when more than one cancer type is involved. The study of two types of kidney tumors allowed the correct identification of the tumor type with a very high success rate. The study also suggests that increasing the number of samples, as seen in the joint analysis of the KIRC and KIRP data, enhances the robustness of the classification procedure. The emergence of a clear trend in the complexity-entropy map upon sorting the gene expression levels through an ordered pathway list allowed us to identify that, as crucial as upregulated and downregulated genes are, the whole context matters: one needs to look into the pathways that are activated or turned off in cancer samples.
Further experimental work is needed to evaluate whether our findings are relevant in practical tumor discrimination, and work along this line is planned. In addition, the reported results point to extending the method to explore whether it can be used to study different stages of cancer development and the changes occurring along its evolution.

Acknowledgments:
We thank Rolando Pérez Rodríguez for fruitful discussions and for pointing us to interesting connections with current cancer models. A.G. acknowledges the Cuban Program for Basic Sciences and the Office of External Activities of the Abdus Salam International Centre for Theoretical Physics. E.E-R. acknowledges the continued support of the AvH foundation and the support of the Max Planck Institute for the Physics of Complex Systems through a visiting grant. The research was carried out under the Platform for Bioinformatics project of BioCubaFarma, Cuba.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Figure A2. Complexity-entropy map for the median values of all studied cancer types. The median value for the two classes, tumor and normal, was calculated for each of the 13 tumor types. As a general trend, the plot shows that tumor samples exhibit larger h and E values than normal tissue. Observe that, for the STAD and BLCA cancer types, the (h, E) point for the normal samples is unusually high compared to the other sample types, while the points for the tumor samples of the PRAD and THCA types are much lower than those of the other cancer types.