# Multivariate Surprisal Analysis of Gene Expression Levels


## Abstract


## 1. Introduction

## 2. Surprisal Analysis of 2D Arrays

The data are organized in a matrix **X** of dimension I by J, where I, the number of gene transcripts, is typically much larger than J, the number of conditions for which the expression levels of the transcripts have been measured. Each column of **X** is a distribution over the gene transcripts i measured for the given condition j. Each row is the distribution over the conditions j measured for a given transcript i. Each reading $X_{ij}$ is the measured expression level of transcript i under condition j. In a typical experiment involving many cells, $X_{ij}$ is an average value over the fluctuations that can occur from cell to cell because the individual cell is a finite system [15].

Surprisal analysis represents $X_{ij}$ as a distribution of maximal entropy subject to constraints. We assume that the same constraints operate for all the conditions, but different constraints can be more important for different conditions. The number of constraints, C, is typically smaller than the number of conditions, J, and therefore the constraints by themselves are not sufficient to determine a unique distribution. Among all the distributions that are consistent with the constraints, surprisal analysis determines the unique distribution whose entropy is maximal. When such a distribution exists, it can be constructed analytically by the method of Lagrange undetermined multipliers [4,5,22]. Surprisal analysis employs the following expression for the logarithm of the expression level of gene i under condition j:

$$\mathrm{ln}\,{X}_{ij}=\sum_{\alpha}{G}_{i\alpha}{\lambda}_{\alpha j}$$

The weights of the individual transcripts in the transcription pattern α are the values $G_{i\alpha}$. These values for the transcripts are common for all the different conditions.

We define the matrix **Y** such that its elements are the logarithms of the expression levels, ${Y}_{ij}=\mathrm{ln}\left({X}_{ij}\right)$. The ranges of the indices i and j are quite different, 1 ≤ i ≤ I, 1 ≤ j ≤ J, and so **Y** is a rectangular I by J matrix, where typically I >> J. Therefore **Y** is a singular matrix, but we can ‘diagonalize’ it in the manner of SVD as

$$\mathbf{Y}=\mathbf{G}\,\mathbf{\Omega}\,{\mathbf{V}}^{T}$$

where the columns of **G** and **V** are orthonormal. The matrix **G** is an I by J rectangular matrix and its rank is therefore J or lower. The matrix **Ω** is a diagonal J by J matrix made up of the J singular values ${\omega}_{\alpha}$ of **Y**. If we want to express **Ω** as a function of **Y**, we get

$$\mathbf{\Omega}={\mathbf{G}}^{T}\,\mathbf{Y}\,\mathbf{V}$$

The columns of **G** and **V** in Equation (3) are respectively the left and right eigenvectors of **Y**. The number of non-zero eigenvalues of **Y** is limited by its smallest dimension, which for us is the number of conditions J. The left and right eigenvectors of **Y** can be obtained as the normalized eigenvectors of the two covariance matrices that can be built from **Y**: the “small” J × J matrix **Y**^T**Y** and the “large” I × I matrix **YY**^T. The matrix **Ω** in Equation (3) is obtained from the eigenvalue equation

$${\mathbf{Y}}^{T}\mathbf{Y}\,{\mathbf{V}}_{\alpha}={\omega}_{\alpha}^{2}\,{\mathbf{V}}_{\alpha}$$

where **Ω**^2 is the diagonal matrix of the eigenvalues ${\omega}_{\alpha}^{2}$ and the vector **V**_{α} the corresponding eigenvector. The maximum rank of **Y**^T**Y** is its dimension J. The large I by I matrix **YY**^T is also of maximum rank J and has the same non-zero eigenvalues as **Y**^T**Y**. The eigenvectors **G**_{α} of **YY**^T correspond to the same eigenvalue as the eigenvector **V**_{α} of the **Y**^T**Y** matrix. Knowing the eigenvalues ${\omega}_{\alpha}$ and the normalized eigenvectors **V**_{α} is sufficient to determine the vectors **G**_{α}, α = 0,…, J−1, from the data:

$${\mathbf{G}}_{\alpha}=\mathbf{Y}\,{\mathbf{V}}_{\alpha}/{\omega}_{\alpha}$$

where **V**_{α} is normalized.
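The 2D construction above is straightforward to carry out numerically. A minimal NumPy sketch, using synthetic lognormal expression levels as a stand-in for real data (the matrix sizes and the random data are illustrative assumptions, not the data of this paper):

```python
import numpy as np

rng = np.random.default_rng(1)
I, J = 200, 5                                   # transcripts x conditions (hypothetical sizes)
X = rng.lognormal(mean=2.0, sigma=0.3, size=(I, J))  # synthetic expression levels
Y = np.log(X)                                   # surprisal matrix Y_ij = ln X_ij

# SVD: Y = G Omega V^T; NumPy returns singular values in decreasing order,
# so index 0 plays the role of the base line alpha = 0
G, omega, Vt = np.linalg.svd(Y, full_matrices=False)

# matrix of Lagrange multipliers Lambda = Omega V^T, so that Y = G Lambda
Lam = np.diag(omega) @ Vt

# G_alpha can equivalently be obtained from the data as Y V_alpha / omega_alpha
G0 = Y @ Vt[0] / omega[0]
assert np.allclose(G0, G[:, 0])

# the rows of Lambda are orthogonal, with squared norms omega_alpha^2
assert np.allclose(Lam @ Lam.T, np.diag(omega**2))

# keeping all J terms reproduces the surprisal exactly
assert np.allclose(G @ Lam, Y)
```

Truncating the sum over α after the first few patterns gives the low-rank surprisal fits discussed in Section 4.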

The matrix of the Lagrange multipliers, $\mathbf{\Lambda}=\mathbf{\Omega}{\mathbf{V}}^{T}$, so that $\mathbf{Y}=\mathbf{G}\mathbf{\Lambda}$, is a partial de-diagonalization of the diagonal, most compact form of the data, **Ω**. The vectors **G**_{α} and **V**_{α} are orthonormal. Therefore, the rows of the matrix of the Lagrange multipliers **Λ** are also orthogonal.

The results above apply when the rows of **Y** are not mean centered. When the rows of the matrix **Y** are mean centered, the matrix **Y**^T**Y** becomes rank deficient and has one zero eigenvalue with the same corresponding uniform eigenvector. The remaining J−1 eigenvalues are unchanged compared to the non-mean-centered case. The 2D array analysis provides a clear meaning for the base line term of Equation (2). The analysis of the gene transcripts that dominate the $G_{i0}$ constraint shows that the pattern of the base line corresponds to the essential machinery of the cell [4]. For α > 0, the biological interpretation of the transcription patterns is defined by the analysis of the values of the Lagrange multipliers ${\lambda}_{\alpha j}$ for the different conditions j. Each constraint $G_{i\alpha}$ corresponds to a pattern in the gene expression associated with the phenotype α. We will therefore use the terms transcription pattern and phenotype interchangeably, where by phenotype we mean a biological process in which all the participating genes act coherently (see Equation (1) or (7)). The analysis of the phenotypes α > 0 and of the gene expression patterns was shown to characterize the progression of cancer in cell lines [5,24].
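The effect of row mean centering can be checked directly: if each row of **Y** sums to zero, the uniform vector is annihilated by **Y** and therefore becomes a zero-eigenvalue eigenvector of **Y**^T**Y**. A small NumPy check (sizes and synthetic data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 100, 6
Y = np.log(rng.lognormal(mean=1.0, sigma=0.5, size=(I, J)))

# mean center each row (each transcript) over the J conditions
Yc = Y - Y.mean(axis=1, keepdims=True)

# small covariance matrix; eigh returns eigenvalues in ascending order
w, V = np.linalg.eigh(Yc.T @ Yc)

# the uniform vector with amplitudes 1/sqrt(J) is now an eigenvector
# with eigenvalue zero, since Yc @ ones(J) = 0 by construction
assert abs(w[0]) < 1e-8
assert np.allclose(np.abs(V[:, 0]), 1.0 / np.sqrt(J))
```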

## 3. Generalization of the Surprisal Analysis for a Multivariate Array

The data are organized in a tensor **T**, for which the three dimensions are the number of gene transcripts, I, the number of cell types, J, and the number of patients, K. The elements of the tensor are defined as ${T}_{ijk}=\mathrm{ln}\,{X}_{ijk}$, where $X_{ijk}$ is the measured expression level for transcript i of cell type j of patient k. The number of transcripts, of the order of 20,000, is typically much larger than the number of cell types or the number of patients. We therefore have a rectangular 3D I × J × K tensor where I >> J ≈ K. This inequality, typical of gene expression data where the number I of transcripts is very large, plus the fact that we typically want the expression levels to remain an axis, dictates the compaction of the 3D data to a 2D form, as discussed next.

#### 3.1. Representing the Tensor Surprisal **T** in Matrix Form

The matrix **T**^{(1)} is a 2D I × (J × K) matrix; **T**^{(1)} is made of K side-by-side I × J slices. The matrix **T**^{(2)} is of dimension J × (K × I) and is made of I side-by-side J × K matrices, and the matrix **T**^{(3)}, of dimension K × (I × J), is made of J side-by-side K × I matrices. The three matrices **T**^{(1)}, **T**^{(2)}, and **T**^{(3)} are called the flattened matrices and correspond to the three ways to represent the tensor as a 2D matrix. These three flattened matrices are the basis for the higher-order SV decomposition of the tensor [26]. Each of these flattened rectangular matrices can be decomposed using the 2D SVD procedure as explained in Section 2 above (Equations (11)–(13)).

Performing the SVD of **T**^{(1)}, **T**^{(2)}, and **T**^{(3)} is algebraically the same procedure as discussed in Section 2 above for the 2D surprisal matrix. In particular, since the rows of the three matrices are not mean centered, for each of them the normalized eigenvector of the small covariance matrix that corresponds to the largest eigenvalue has uniform amplitudes, given by $1/\sqrt{\mathrm{dim}}$, where dim is the smallest dimension of each matrix. In Equations (11)–(13), **U** is the I × JK matrix of the left eigenvectors of **T**^{(1)}, **V** the J × J matrix of the left eigenvectors of **T**^{(2)}, and **W** the K × K matrix of the left eigenvectors of **T**^{(3)}.
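The three flattenings are easy to misread, so a concrete sketch helps. The index ordering below is one possible convention consistent with “K side-by-side I × J slices” (the exact column ordering is an assumption; conventions differ between references), built for a small synthetic tensor, with **U**, **V**, and **W** recovered as left singular vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 30, 4, 6                      # transcripts, cell types, patients (toy sizes)
T = np.log(rng.lognormal(2.0, 0.3, size=(I, J, K)))   # synthetic ln expression levels

# T1: I x (J*K), K side-by-side I x J slices
T1 = np.concatenate([T[:, :, k] for k in range(K)], axis=1)
# T2: J x (K*I), I side-by-side J x K slices
T2 = np.concatenate([T[i, :, :] for i in range(I)], axis=1)
# T3: K x (I*J), J side-by-side K x I slices
T3 = np.concatenate([T[:, j, :].T for j in range(J)], axis=1)

# left singular vectors of each flattening
U = np.linalg.svd(T1, full_matrices=False)[0]   # I x (J*K)
V = np.linalg.svd(T2, full_matrices=False)[0]   # J x J
W = np.linalg.svd(T3, full_matrices=False)[0]   # K x K
assert U.shape == (I, J * K) and V.shape == (J, J) and W.shape == (K, K)

# because the rows are not mean centered and share a large common component,
# the leading cell-type pattern is nearly uniform for these synthetic data
v0 = V[:, 0]
assert abs(float(v0 @ np.full(J, 1 / np.sqrt(J)))) > 0.99
```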

The tensor **Ω** is the core tensor. Its dimensions are JK × J × K, which correspond to the numbers of non-zero eigenvalues of the flattened matrices **T**^{(1)}, **T**^{(2)}, and **T**^{(3)}, respectively. The elements of the core tensor can be computed from its JK × JK flattened matrix **Ω**^{(1)} [26] (Equation (15)).

To do so, one takes the Kronecker product of the **V** and **W** matrices (see Equations (12) and (13)), which leads to a square JK × JK **Ω**^{(1)} matrix. The matrices **U** and **T**^{(1)} are defined in Equation (11) above. Equation (15) is the practical form that can be used to compute the core tensor as a flattened matrix. Formally, the core tensor **Ω** is defined from the n-mode multiplication [26,27] of the input tensor **T** by the three matrices **U**, **V**, and **W**:

$$\mathbf{\Omega}=\mathbf{T}{\times}_{1}{\mathbf{U}}^{T}{\times}_{2}{\mathbf{V}}^{T}{\times}_{3}{\mathbf{W}}^{T}$$

Here “×1” means to multiply **T** from the transcript direction I by the **U** matrix of the left eigenvectors of **T**^{(1)} (Equation (11)), “×2” means to multiply **T** from the cell type direction J by the matrix **V** of the left eigenvectors of **T**^{(2)} (Equation (12)), and “×3” means to multiply **T** from the patient direction K by **W** (Equation (13)). By doing so, one obtains the core tensor **Ω** of dimensions JK × J × K, whose flattened expression is given by Equation (15).
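The n-mode products are compactly expressed as an `einsum` contraction. In the NumPy sketch below (synthetic data; the unfolding convention and sizes are illustrative assumptions), the core tensor is computed and the full expansion is checked to reproduce the input tensor exactly, since **U**, **V**, and **W** are complete orthonormal bases for each direction:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 30, 4, 6
T = np.log(rng.lognormal(2.0, 0.3, size=(I, J, K)))

# left singular vectors of the three flattened matrices (see Section 3.1)
U = np.linalg.svd(np.concatenate([T[:, :, k] for k in range(K)], axis=1),
                  full_matrices=False)[0]                 # I x JK
V = np.linalg.svd(np.concatenate([T[i, :, :] for i in range(I)], axis=1),
                  full_matrices=False)[0]                 # J x J
W = np.linalg.svd(np.concatenate([T[:, j, :].T for j in range(J)], axis=1),
                  full_matrices=False)[0]                 # K x K

# core tensor: Omega = T x1 U^T x2 V^T x3 W^T, of dimensions JK x J x K
core = np.einsum('ijk,ir,js,kt->rst', T, U, V, W)

# the expansion over all patterns reproduces the data exactly
T_rec = np.einsum('rst,ir,js,kt->ijk', core, U, V, W)
assert np.allclose(T_rec, T)
```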

In the tensor expansion, r labels the transcript patterns, where $U_{ir}$ is an element of **U**, the left matrix of **T**^{(1)} (Equation (11)). Similarly, j is a cell type index, so s labels the cell type patterns, where ${V}_{js}$ is an element of the left matrix of **T**^{(2)} (Equation (12)), while t is a patient pattern index.

#### 3.2. The Tensor Form of the Surprisal

To obtain the tensor form of the surprisal, we bring the matrix of the transcript patterns **U** up front in the summation. Rearranging the order of summations, one has the 3D surprisal form as a summation over the different constraints labeled by r. We define the tensor of the Lagrange multipliers **Λ**, of dimension JK × J × K, $\mathbf{\Lambda}=\mathbf{\Omega}{\times}_{2}{\mathbf{V}}^{T}{\times}_{3}{\mathbf{W}}^{T}$, whose flattened matrix form is ${\mathbf{\Lambda}}^{(1)}={\mathbf{\Omega}}^{(1)}{\left(\mathbf{V}\otimes \mathbf{W}\right)}^{T}$. The surprisal is then the product of the tensor of the Lagrange multipliers, **Λ**, and a matrix representing the constraints, **U**.
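This factorization of the surprisal into constraints times Lagrange multipliers can be verified numerically. The sketch below (synthetic tensor; the unfolding convention is an illustrative assumption) forms **Λ** = **Ω** ×2 **V**^T ×3 **W**^T and checks that **U** times the flattened **Λ** recovers the flattened data:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 30, 4, 6
T = np.log(rng.lognormal(2.0, 0.3, size=(I, J, K)))

T1 = np.concatenate([T[:, :, k] for k in range(K)], axis=1)      # I x JK
U = np.linalg.svd(T1, full_matrices=False)[0]
V = np.linalg.svd(np.concatenate([T[i, :, :] for i in range(I)], axis=1),
                  full_matrices=False)[0]
W = np.linalg.svd(np.concatenate([T[:, j, :].T for j in range(J)], axis=1),
                  full_matrices=False)[0]

# core tensor Omega (JK x J x K)
core = np.einsum('ijk,ir,js,kt->rst', T, U, V, W)

# Lagrange-multiplier tensor: Lambda = Omega x2 V^T x3 W^T, shape JK x J x K
Lam = np.einsum('rst,js,kt->rjk', core, V, W)

# flatten Lambda with the same column ordering as T1; then T1 = U Lam1
Lam1 = np.concatenate([Lam[:, :, k] for k in range(K)], axis=1)  # JK x JK
assert np.allclose(U @ Lam1, T1)
```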

#### 3.3. The Lagrange Multipliers

The elements of the vector **U**_{0} are not uniform. They define the gene transcript pattern that corresponds to the base line. The vector **U**_{0} plays the role of the vector **G**_{0} in the 2D case.

The Lagrange multipliers can also be obtained from the flattened matrix **T**^{(1)} of dimension I × JK, as discussed above. This matrix has JK non-zero eigenvalues. Its SV decomposition is given by Equation (11), where **M** is the matrix on the right of the decomposition.

#### 3.4. 2D Limiting Forms of the Tensor Form of the Surprisal

When the data are averaged over patients, one obtains a 2D limiting form in which only the leading patterns of **T**^{(3)} have been taken into account. Similarly, when averaging over cell types, one obtains the 2D limiting form for the patient phenotypes (Equation (29)).

## 4. Analysis of the Correlations between Cell Type and Patient in a Cohort of 6 Patients

#### 4.1. Multivariate Surprisal Analysis

The weights of the terms in the tensor expansion are the ${w}_{rst}$ given by Equation (23). The highest ones are plotted in Figure 1. The base line element, $\Omega_{000}$, dominates, and with a weight (Equation (23)) equal to 0.99721 the base line accounts for 99.7% of the data. Also in 2D expression level vs. time data the base line dominates [4], and it typically accounts for about 98% of the data. The next highest two elements are three orders of magnitude smaller and correspond to averaging over patients (1-1-0) and averaging over cell types (2-0-1). The three highest weight terms together represent 99.8% of the data. To interpret these results in terms of cell and patient phenotypes, we plot the amplitudes $\Omega_{rst}V_{js}W_{kt}$ of the terms 1-1-0 and 2-0-1 as a function of the combined cell type-patient index, Figure 2. Thereby it is seen that the 1-1-0 element of the tensor is a cell type phenotype and provides the distinction between basal and luminal cells, while the 2-0-1 is a patient phenotype, which distinguishes between patient P6 and the rest of the patients. The amplitudes plotted in Figure 2 are the terms in the tensor expansion, Equation (14), and are the main contributors to the tensor of the Lagrange multipliers ${\lambda}_{r\left(jk\right)}$ defined in Equation (24). To ascertain that these amplitudes dominate, we show in Figure 3 the heat map of the s and t terms that contribute to a given ${\lambda}_{r\left(jk\right)}$. It is seen that indeed the largest terms are the 1-1-0 and 2-0-1 terms. The averaging-over-cell-types term 1-0-1 is also shown in Figure 2. The pattern of its amplitude on the jk cell type-patient index is very similar to that of the 2-0-1 term. Its weight amounts to 0.01% of the data. We also show in Figure 2 the amplitudes of a disease phenotype, the 3-2-0 term. This term has opposite amplitudes on the benign and cancer cells. As we further discuss in Section 4.2 below, a disease phenotype needs to have the value 2 for the index s. The 3-2-0 term is the highest weight term of this type that appears in the highest weight list. It is not dominant (see Figure 1). This term corresponds to the disease cell phenotype (s = 2) and an averaging over patients (t = 0).
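The ranking of terms can be reproduced on synthetic data. In the sketch below the weights are taken as the normalized squared core elements (our reading of Equation (23), which is not reproduced in this excerpt), and the base line term 0-0-0 dominates because the log expression levels share a large common component:

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K = 200, 4, 6
T = np.log(rng.lognormal(2.0, 0.3, size=(I, J, K)))   # synthetic ln expression levels

U = np.linalg.svd(np.concatenate([T[:, :, k] for k in range(K)], axis=1),
                  full_matrices=False)[0]
V = np.linalg.svd(np.concatenate([T[i, :, :] for i in range(I)], axis=1),
                  full_matrices=False)[0]
W = np.linalg.svd(np.concatenate([T[:, j, :].T for j in range(J)], axis=1),
                  full_matrices=False)[0]
core = np.einsum('ijk,ir,js,kt->rst', T, U, V, W)

# weights: normalized squared core elements (assumed form of Equation (23))
w = core**2 / np.sum(core**2)

# the base line 0-0-0 carries almost all of the weight for such data
assert np.unravel_index(np.argmax(w), w.shape) == (0, 0, 0)
assert w[0, 0, 0] > 0.95
```

Sorting `w.ravel()` in decreasing order gives the analogue of the weight list plotted in Figure 1.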

#### 4.2. Dominant Phenotypes Along Each Tensor Direction

The cell type phenotypes are defined by the eigenvectors **V** of the flattened matrix **T**^{(2)}, Equation (12). These eigenvectors are also those defining the phenotypes in the 2D limit given by Equations (27) and (28). Their importance is ranked by the magnitude of the eigenvalues. The eigenvalues of **T**^{(2)} are reported in Table S1 of the supplementary materials. As discussed, the largest eigenvalue corresponds to a uniform eigenvector whose amplitudes are given by $1/\sqrt{J}$, and this defines the base line. The amplitudes of the eigenvector corresponding to the next largest eigenvalue reflect the distinction between basal and luminal cells (see Figure 6a), while the next largest phenotype discriminates between benign and cancerous cells, Figure 6b. Beyond that, the phenotypes become less secure because they are increasingly contaminated by noise. To assess the biology associated with disease, genes that discriminate between benign and cancerous cells were characterized into pathways using DAVID Bioinformatics [35]. Of note, the most significant pathways included immune response and several immune-related terms, supporting a role for the immune system in tumor development (Table S2).

The patient phenotypes are defined by the eigenvectors, **W**, of the flattened **T**^{(3)} matrix (Equation (13)). Those eigenvectors also define the patient phenotype in the 2D analysis given by Equation (29). Similarly to the analysis of the “cell type” phenotypes, the “patient” phenotypes obtained from the tensor decomposition can be compared to those defined by the 2D SVD analysis of the average of the logarithm of the data over cell types, as well as to those obtained for each cell type individually.

## 5. Conclusions

## Supplementary Materials

Table S1. Eigenvalues of the flattened matrix **T**^{(2)} that defines the cell phenotype in the tensor analysis and of the 2D surprisal analysis carried out on the average over patients of the ln of the input data. Table S2. Gene analysis of the cell phenotype.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Alhassid, Y.; Levine, R.D. Connection between maximal entropy and scattering theoretic analyses of collision processes. *Phys. Rev. A* **1978**, *18*, 89–116.
2. Levine, R.D. Information theory approach to molecular reaction dynamics. *Annu. Rev. Phys. Chem.* **1978**, *29*, 59–92.
3. Levine, R.D.; Bernstein, R.B. Energy disposal and energy consumption in elementary chemical reactions. Information theoretic approach. *Acc. Chem. Res.* **1974**, *7*, 393–400.
4. Kravchenko-Balasha, N.; Levitzki, A.; Goldstein, A.; Rotter, V.; Gross, A.; Remacle, F.; Levine, R.D. On a fundamental structure of gene networks in living cells. *Proc. Natl. Acad. Sci. USA* **2012**, *109*, 4702–4707.
5. Remacle, F.; Kravchenko-Balasha, N.; Levitzki, A.; Levine, R.D. Information-theoretic analysis of phenotype changes in early stages of carcinogenesis. *Proc. Natl. Acad. Sci. USA* **2010**, *107*, 10324–10329.
6. Zadran, S.; Remacle, F.; Levine, R.D. miRNA and mRNA cancer signatures determined by analysis of expression levels in large cohorts of patients. *Proc. Natl. Acad. Sci. USA* **2013**, *110*, 19160–19165.
7. Remacle, F.; Levine, R.D. Statistical thermodynamics of transcription profiles in normal development and tumorigeneses in cohorts of patients. *Eur. Biophys. J.* **2015**, *44*, 709–726.
8. Zadran, S.; Arumugam, R.; Herschman, H.; Phelps, M.E.; Levine, R.D. Surprisal analysis characterizes the free energy time course of cancer cells undergoing epithelial-to-mesenchymal transition. *Proc. Natl. Acad. Sci. USA* **2014**, *109*, 4702–4707.
9. Mora, T.; Walczak, A.; Bialek, W.; Callan, C.G., Jr. Maximum entropy models for antibody diversity. *Proc. Natl. Acad. Sci. USA* **2009**, *107*, 5405–5410.
10. Lezon, T.R.; Banavar, J.R.; Cieplak, M.; Maritan, A.; Fedoroff, N.V. Using the principle of entropy maximization to infer genetic interaction networks from gene expression patterns. *Proc. Natl. Acad. Sci. USA* **2006**, *103*, 19033–19038.
11. Aghagolzadeh, M.; Soltanian-Zadeh, H.; Araabi, B.N. Information theoretic hierarchical clustering. *Entropy* **2011**, *13*, 450–465.
12. Margolin, A.A.; Nemenman, I.; Basso, K.; Wiggins, C.; Stolovitzky, G.; Dalla Favera, R.; Califano, A. ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. *BMC Bioinform.* **2006**, *7* (Suppl. S1), S7.
13. Margolin, A.A.; Califano, A. Theory and limitations of genetic network inference from microarray data. *Ann. N. Y. Acad. Sci.* **2007**, *1115*, 51–72.
14. Yeung, M.K.; Tegner, J.; Collins, J.J. Reverse engineering gene networks using singular value decomposition and robust regression. *Proc. Natl. Acad. Sci. USA* **2002**, *99*, 6163–6168.
15. Shin, Y.S.; Remacle, F.; Fan, R.; Hwang, K.; Wei, W.; Ahmad, H.; Levine, R.D.; Heath, J.R. Protein signaling networks from single cell fluctuations and information theory profiling. *Biophys. J.* **2011**, *100*, 2378–2386.
16. Schneidman, E.; Still, S.; Berry, M.J.; Bialek, W. Network information and connected correlations. *Phys. Rev. Lett.* **2003**, *91*, 238701.
17. Rosvall, M.; Bergstrom, C.T. An information-theoretic framework for resolving community structure in complex networks. *Proc. Natl. Acad. Sci. USA* **2007**, *104*, 7327–7331.
18. Quigley, D.A.; To, M.D.; Kim, I.J.; Lin, K.K.; Albertson, D.G.; Sjolund, J.; Pérez-Losada, J.; Balmain, A. Network analysis of skin tumor progression identifies a rewired genetic architecture affecting inflammation and tumor susceptibility. *Genome Biol.* **2011**, *12*, R5.
19. Nykter, M.; Price, N.D.; Larjo, A.; Aho, T.; Kauffman, S.A.; Yli-Harja, O.; Shmulevich, I. Critical networks exhibit maximal information diversity in structure-dynamics relationships. *Phys. Rev. Lett.* **2008**, *100*, 058702.
20. Alter, O. Genomic signal processing: From matrix algebra to genetic networks. In *Microarray Data Analysis: Methods and Applications*; Korenberg, M.J., Ed.; Humana Press: Totowa, NJ, USA, 2007.
21. Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. *Science* **1999**, *286*, 531–537.
22. Gross, A.; Levine, R.D. Surprisal analysis of transcripts expression levels in the presence of noise: A reliable determination of the onset of a tumor phenotype. *PLoS ONE* **2013**, *8*, e61554.
23. Gross, A.; Li, C.M.; Remacle, F.; Levine, R.D. Free energy rhythms in Saccharomyces cerevisiae: A dynamic perspective with implications for ribosomal biogenesis. *Biochemistry* **2013**, *52*, 1641–1648.
24. Kravchenko-Balasha, N.; Remacle, F.; Gross, A.; Rotter, V.; Levitzki, A.; Levine, R.D. Convergence of logic of cellular regulation in different premalignant cells by an information theoretic approach. *BMC Syst. Biol.* **2011**, *5*, 42.
25. Wei, W.; Shi, Q.H.; Remacle, F.; Qin, L.D.; Shackelford, D.B.; Shin, Y.S.; Mischel, P.S.; Levine, R.D.; Heath, J.R. Hypoxia induces a phase transition within a kinase signaling network in cancer cells. *Proc. Natl. Acad. Sci. USA* **2013**, *110*, E1352–E1360.
26. De Lathauwer, L.; De Moor, B.; Vandewalle, J. A multilinear singular value decomposition. *SIAM J. Matrix Anal. Appl.* **2000**, *21*, 1253–1278.
27. Kolda, T.G.; Bader, B.W. Tensor decompositions and applications. *SIAM Rev.* **2009**, *51*, 455–500.
28. Tucker, L.R. Some mathematical notes on three-mode factor analysis. *Psychometrika* **1966**, *31*, 279–311.
29. Alon, U. *An Introduction to Systems Biology*; CRC Press: Boca Raton, FL, USA, 2007.
30. Willamme, R.; Alsafra, Z.; Arumugam, R.; Eppe, G.; Remacle, F.; Levine, R.D.; Remacle, C. Metabolomic analysis of the green microalga Chlamydomonas reinhardtii cultivated under day/night conditions. *J. Biotechnol.* **2015**, *215*, 20–26.
31. Omberg, L.; Golub, G.H.; Alter, O. A tensor higher-order singular value decomposition for integrative analysis of DNA microarray data from different studies. *Proc. Natl. Acad. Sci. USA* **2007**, *104*, 18371–18376.
32. Ponnapalli, S.P.; Saunders, M.A.; Van Loan, C.F.; Alter, O. A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms. *PLoS ONE* **2011**, *6*, e28072.
33. Sankaranarayanan, P.; Schomay, T.E.; Aiello, K.A.; Alter, O. Tensor GSVD of patient- and platform-matched tumor and normal DNA copy-number profiles uncovers chromosome arm-wide patterns of tumor-exclusive platform-consistent alterations encoding for cell transformation and predicting ovarian cancer survival. *PLoS ONE* **2015**, *10*, e0121396.
34. Zadran, S.; Remacle, F.; Levine, R.D. Microfluidic chip with molecular beacons detects miRNAs in human CSF to reliably characterize CNS-specific disorders. *RNA Dis.* **2016**, *3*, e1183.
35. Huang, D.W.; Sherman, B.T.; Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. *Nat. Protoc.* **2008**, *4*, 44–57.

**Figure 1.**The 15 highest weights ${w}_{rst}$ (Equation (23)) of the core tensor plotted on a logarithmic scale in decreasing order. Note the sheer drop beyond 0-0-0. Including terms on or beyond 4-0-3 only serves to fit the experimental noise in the data. See also Figure 4.

**Figure 2.** The magnitude of the two highest weight terms (r-s-t = 1-1-0, 2-0-1) of the tensor form of the surprisal as a function of the combined jk index, cell type and patient. The first two highest terms, 1-1-0 and 2-0-1, already provide a very good fit for the surprisal. The phenotypes r = 1 and r = 2 are shown in Figure 3 and Figure 4 below. While r = 1 mainly corresponds to the distinction between basal and luminal cells with a small modulation due to the patients, r = 2 corresponds to the distinction between P6 and P1 and the rest of the patients with a small modulation due to the cell type. In this sense, 2-0-1 is more clearly a patient phenotype than 1-0-1. The term 3-2-0 is the first disease term. As shown, it distinguishes between benign and cancer cells in that it changes its sign. It is the 12th term in the tensor expansion (see Figure 1 for the weights ${w}_{rst}$ of the different terms) and so it does not contribute much to the recovery of the input data.

**Figure 3.** The magnitudes of the 24 terms $\Omega_{rst}V_{js}W_{kt}$ (see Equations (17) and (18)) that contribute to the value of a Lagrange multiplier ${\lambda}_{r\left(jk\right)}$, plotted as a heat map as a function of the cell phenotype index s and the patient phenotype index t; r is the index of the constraint. Analyzing the relative importance of the different terms for a given constraint r and a given cell type and patient gives access to personalized medicine. Top left: r = 1, j = beba, k = P4; top right: r = 2, j = beba, k = P4; bottom left: r = 1, j = calu, k = P6; bottom right: r = 2, j = beba, k = P4. Note that the terms s = 1, t = 0 and s = 0, t = 1 dominate all panels. In addition, the heat maps clearly indicate that more terms contribute significantly to a given cell type for patient P4 than for P6. This can be understood because several patient phenotypes contribute to the description of P4.

**Figure 4.**Amplitudes ${M}_{\left(jk\right)r}$ of the two dominant patterns, r = 1,2, of the bivariate transcription data matrix with 24 columns labeled by both cell type and patient. The r = 1 pattern is high on basal cells and low on luminal for each patient. The r = 2 pattern distinguishes patient 6 from the rest.

**Figure 5.** Scatter plots of the recovered data vs. the input data for a given cell type and patient, and for a 2D computation that is done for data averaged over cell types. Shown in panels (**a**) and (**b**) are the cell type caba for P2 and calu for P6, respectively. The recovered data are computed for an increasing number of terms in the surprisal expansion. In the tensor analysis, panels (**a**) and (**b**), one (only the base line), three, eight, and fifteen terms are kept. There are 24 combinations of patients and cell types, and there are thousands of transcription levels for each pair. For all 24 pairs, three terms suffice to characterize the data. Eight terms are needed for representation of details, and from twelve terms on the fit is to the noise in the data. The RNA profile calu of P6, which is characterized by the first patient phenotype in the tensor analysis, is better fitted than the caba profile of P2, which is spread over several patient phenotypes. In panels (**c**) and (**d**), where the data are averaged over cell types, two terms beyond the base line α = 0 capture most of P6 (**d**), which is characterized by the first patient phenotype. Many more terms are needed for P2, because its RNA profile is described by several patient phenotypes.

**Figure 6.** Amplitudes of the dominant cell type eigenvectors $V_{js}$, (**a**) for s = 1 and (**b**) for s = 2, on the four cell types, j = 1,2,3,4. Bold full line (red online), computed from the diagonalization of the **T**^{(2)} matrix, cf. Equation (12). Bold dashes (blue online), computed from the diagonalization of the 2D matrix obtained by averaging the logarithm of the data over patients (see Section S1 of the supplementary materials). Thin lines, computed by 2D SVD analysis of the surprisal for each one of the six patients. For the dominant phenotype, which corresponds to the distinction between basal and luminal cells, all methods of analysis concur. s = 2 is the “disease” phenotype and distinguishes between benign and cancer cells. While there is a wider dispersion from one patient to the next, the tensor analysis and the average over patients give similar results.

**Figure 7.** Amplitudes $W_{kt}$ of the patient eigenvector, t = 1, on the six patients, k = 1,2,…,6. Bold full line (red online), computed from the diagonalization of the **T**^{(3)} matrix (Equation (13)). Bold dashes (blue online), computed from the diagonalization of the 2D matrix obtained by averaging first over cell types and then performing surprisal analysis. Thin lines, computed for each of the four cell types individually. The patient-cell type correlations are manifested strongly in the patient phenotype analysis, which leads to a larger dispersion among the patient analyses for given cell types. Note, however, that the tensor analysis and the averaging over cell types give similar results.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Remacle, F.; Goldstein, A.S.; Levine, R.D.
Multivariate Surprisal Analysis of Gene Expression Levels. *Entropy* **2016**, *18*, 445.
https://doi.org/10.3390/e18120445
