Multivariate Surprisal Analysis of Gene Expression Levels

We consider here multivariate data, by which we mean data where each point i is measured for two or more distinct variables. In a typical situation there are many data points i, while the range of each variable is more limited. If there is only one variable, the data can be arranged as a rectangular matrix where i indexes the rows and the values of the variable label the columns. We begin with this case and then proceed to the more general case, with special emphasis on two variables, when the data can be organized as a tensor. An analysis of such multivariate data by a maximal-entropy approach is discussed and illustrated for gene expression in four different cell types of six different patients. The different genes are indexed by i, and there are 24 (4 by 6) entries for each i. We used an unbiased, thermodynamic, maximal-entropy-based approach (surprisal analysis) to analyze the multivariate transcriptional profiles. The measured microarray data are organized as a tensor array whose two minor orthogonal directions are the different patients and the different cell types. The entries are the transcription levels on a logarithmic scale. We identify a disease signature of prostate cancer and determine the degree of variability between individual patients. Surprisal analysis determined a baseline expression level common to all cells and patients. We identify the transcripts in the baseline as the “housekeeping” genes that ensure cell stability. The baseline and two surprisal patterns satisfactorily recover (99.8%) the multivariate data. The two patterns characterize the individuality of the patients and, to a lesser extent, the commonality of the disease. The immune response was identified as the most significant pathway contributing to the cancer disease pattern.
Delineating patient variability is a central issue in personalized diagnostics, and it remains to be seen whether additional data will confirm the power of multivariate analysis to address this key point. The collapsed limits, where the data are compacted into two-dimensional arrays, are contained within the proposed formalism.


Surprisal Analysis and Why the Weights of the Deviations are Ensemble Properties
Surprisal analysis (1,2) of the microarray data (3,4) provides a representation for the logarithm of the expression level of the transcripts. For the array data of patient n we use the form, Equation (1) of the main text:

\ln X_i^n(c) = \ln X_i^{o,n} - \sum_{\alpha \ge 1} G_{i\alpha}^n \lambda_\alpha^n(c)    (S1)

Surprisal analysis seeks to keep few terms in the sum in Equation (S1) while still providing an accurate representation of the experimental data. In the data used here there are six patients, n = 1, 2, ..., 6. The zeroth term, \ln X_i^{o,n}, is the baseline part of the level of expression. It is essentially the same for all patients (see section III of the SI). This is definitely not a uniform distribution, and different transcripts do have fold differences in their level. It is this variation of the baseline as a function of the gene index, i, that separates our work from studies where entropy is used as a statistical measure of dispersion. \alpha = 1, 2, ... labels the different possible transcription patterns that cause a deviation of the measured expression level from the baseline. \lambda_\alpha^n(c) is the weight of pattern \alpha for cell type c. The superscript n labels the patient for whom the microarray data were taken. G_{i\alpha}^n is the weight of gene i in the transcription pattern \alpha. The purpose of this section is to demonstrate explicitly that the weights \lambda_\alpha^n(c) for different values of the cell type c (and fixed patient index n) are computed as ensemble averages over the expression levels of the transcripts i.
To determine the succession of terms in Equation (S1), all the way to an exact representation, we proceed as follows. Taking the microarray data in the form of expression levels X_i^n(c) for transcript i of cell type c, we take the logarithm of each entry. Say that there are A cell types that were measured. We form the A-by-A symmetric connectivity matrix C whose matrix elements are given by

C_{c,c'} = \sum_i \ln X_i^n(c)\, \ln X_i^n(c')    (S2)

Equation (S2) is a central relation in our discussion. It shows the sense in which the elements C_{c,c'} of the matrix C are computed as a sum over the index i of the transcripts. Technically speaking, the matrix C is (not quite, but almost) the covariance matrix of the cell types, where the variation is over the different transcripts.
If we introduce the N-by-A matrix Y, where Y_{ic}^n = \ln(X_i^n(c)) and N is the number of measured transcripts, we can write the covariance matrix as the matrix product C = Y^T Y, where the superscript T denotes the transpose of the matrix. We drop the superscript n to simplify the notation, but we emphasize that the matrix C is computed for patient n.
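The construction of C can be sketched in a few lines; the array sizes and random entries below are purely illustrative stand-ins for real microarray data.

```python
import numpy as np

# Illustrative stand-in for microarray data: N = 500 transcripts measured
# in A = 4 cell types; entries play the role of expression levels X_i(c).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 100.0, size=(500, 4))

# Y_ic = ln X_i(c): the N-by-A matrix of log expression levels.
Y = np.log(X)

# C = Y^T Y: the A-by-A symmetric connectivity matrix; its element
# C_{c,c'} is the sum over transcripts i of ln X_i(c) * ln X_i(c'),
# exactly as in Equation (S2).
C = Y.T @ Y

assert C.shape == (4, 4)
assert np.allclose(C, C.T)  # symmetric by construction
```

Each element of C is thus literally a sum over all transcripts, which is the sense in which the subsequent weights are ensemble properties.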
C is a symmetric matrix and can be diagonalized. There are A eigenvectors and eigenvalues. We here discuss the common case where all the eigenvalues of C are positive. We then write the eigenvalue equations as

C \psi_\alpha = \omega_\alpha \psi_\alpha    (S3)

The eigenvectors are normalized by the condition

\sum_c \psi_\alpha(c)\, \psi_\beta(c) = \delta_{\alpha\beta}    (S4)

The weights \lambda_\alpha^n(c) of the different phenotypes are computed by an equation analogous to that for \lambda_{\alpha=0}; namely, we write for each term

\lambda_\alpha^n(c) = \sqrt{\omega_\alpha}\, \psi_\alpha(c)    (S5)

Equation (S5) is the desired technical conclusion: the cell-type dependence of the weights \lambda_\alpha^n(c) is determined, see Equation (S2), by averaging the transcription fold levels over all the transcripts.
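A minimal numerical sketch of the diagonalization step and of Equation (S5), again on synthetic data. The names omega, psi, and lam are our own labels for ω_α, ψ_α, and λ_α(c); `numpy.linalg.eigh` is used because C is symmetric.

```python
import numpy as np

# Synthetic log-expression matrix Y: 50 transcripts, A = 4 cell types.
rng = np.random.default_rng(1)
Y = np.log(rng.uniform(1.0, 100.0, size=(50, 4)))
C = Y.T @ Y  # symmetric, positive semi-definite connectivity matrix

# Diagonalize C: eigh returns real eigenvalues omega and orthonormal
# eigenvectors psi (one eigenvector per column).
omega, psi = np.linalg.eigh(C)
order = np.argsort(omega)[::-1]      # order patterns by decreasing eigenvalue
omega, psi = omega[order], psi[:, order]

# Equation (S5): weight of pattern alpha for cell type c is
# lam[alpha, c] = sqrt(omega_alpha) * psi_alpha(c).
lam = np.sqrt(omega)[:, None] * psi.T

# Consistency check: summing over all patterns reconstructs C,
# i.e. C_{c,c'} = sum_alpha lam[alpha, c] * lam[alpha, c'].
assert np.allclose(lam.T @ lam, C)
```

The final assertion makes the "ensemble average" point concrete: the weights λ_α(c) inherit their cell-type dependence entirely from C, which was built by summing over all transcripts.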

The Disease Pattern as an Ensemble Average
Two clear disease patterns, healthy vs. diseased and benign vs. cancer, emerge from the analysis of a cohort of patients. Individuality patterns emerge from the analysis of a cohort of cell types. The disease pattern (and other biologically meaningful deviations) appears in terms of the dependence of the weight of the pattern on the different cell types. Individuality appears in terms of the dependence of the weight of the pattern on the different patients. However, if we look not at the weights but at the actual transcription pattern deviations, the G_{iα}^n's, they are not well correlated and more often than not are only poorly correlated, as shown in Tables S1 and S2 below. Why, then, does averaging clearly bring out the ensemble properties?
We discuss a simple toy model that both mathematically and intuitively exhibits the properties we wish to demonstrate. Being a toy model, it is only a caricature of the far more complex biological reality. The tradeoff is that it makes rather clear how an ensemble can exhibit average properties that are not easily discerned in any of its individual members.
Consider tossing a die. It is a regular die with six faces labeled 1 to 6. We make a list of the number showing up on each toss. We consider an experiment that is 100 tosses. It is certain that in any experiment the frequency of the different faces coming up will not be the same. (It just cannot be the same: 6 × 16 = 96 and 6 × 17 = 102, so no equal integer count for all six faces sums to 100.) The multinomial distribution can readily be manipulated to show that when we average over all possible experiments the average frequency of any one face is exactly 1/6. The theorem states that the average frequency exactly equals the probability. The theorem holds for any number of tosses and, as we shortly prove, it also holds even if the die is biased. Conclusion: a result that does not hold for any particular experiment holds exactly for the ensemble average.
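The averaging claim is easy to check numerically; a small simulation (with an arbitrary seed and experiment count of our own choosing) shows that the per-face frequency, averaged over many 100-toss experiments, approaches 1/6.

```python
import random
from collections import Counter

random.seed(42)

def face_frequencies(n_tosses=100):
    """Run one experiment of n_tosses fair-die tosses; return the six frequencies."""
    counts = Counter(random.randint(1, 6) for _ in range(n_tosses))
    return [counts[face] / n_tosses for face in range(1, 7)]

# In any single experiment the six frequencies cannot all equal 1/6,
# but the ensemble average over many experiments converges to 1/6.
n_experiments = 20_000
avg = [0.0] * 6
for _ in range(n_experiments):
    for face, f in enumerate(face_frequencies()):
        avg[face] += f / n_experiments

assert all(abs(a - 1 / 6) < 0.005 for a in avg)
```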
Lastly, we make a partial average as follows. Do not ask for the value of the face; ask only whether it is even or odd. In other words, lump faces 1, 3, and 5 together, and ditto for faces 2, 4, and 6. In any one experiment (with one experiment representing a hundred tosses) the frequency of even faces showing up will be quite close to 1/2. This is so even though the frequencies of the individual faces can differ considerably. Conclusion: a partial average goes some significant way towards recovering an ensemble average.
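The effect of lumping can be seen in a single simulated experiment; the tolerance below is deliberately generous, since one 100-toss experiment still fluctuates.

```python
import random

random.seed(7)

# One experiment of 100 tosses of a fair die.
tosses = [random.randint(1, 6) for _ in range(100)]

# Lump the faces by parity: the even-face frequency in a single
# experiment is already close to 1/2 ...
even_freq = sum(1 for t in tosses if t % 2 == 0) / len(tosses)

# ... even though the six individual face frequencies scatter around 1/6.
face_freqs = [tosses.count(face) / len(tosses) for face in range(1, 7)]

assert abs(even_freq - 0.5) < 0.2  # roughly a 4-sigma band for 100 tosses
```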
It is also possible to lump faces 1, 2, and 3 together and ditto for faces 4, 5, and 6. Same conclusion.
To prove the theorem for any number, N, of tosses, we consider a biased die where the probability of showing face i, i = 1, 2, ..., 6, is p_i. For an unbiased die p_i = 1/6. The multinomial distribution states that the probability of performing an experiment where face i is recorded N_i times, i = 1, 2, ..., 6, is

P(N_1, N_2, \ldots, N_6) = \frac{N!}{N_1! N_2! \cdots N_6!} \prod_{i=1}^{6} p_i^{N_i}

Note that this is already a partial average. The probability of one particular sequence of N tosses where face i is recorded N_i times, i = 1, 2, ..., 6, is \prod_{i=1}^{6} p_i^{N_i}. Averaging N_i over the multinomial distribution gives \langle N_i \rangle = N p_i, which is the theorem: the average frequency exactly equals the probability. If all faces are equally probable the proof yields \langle N_i \rangle = N/6. For any N that is not a multiple of 6, \langle N_i \rangle is then not an integer, even though any observed value of N_i must be an integer.

Table S1. Eigenvalues of the flattened matrix T^{(2)} that defines the cell phenotype in the tensor analysis, and of the 2D surprisal analysis carried out on the ln of the input data averaged over patients.