A New Modified Histogram Matching Normalization for Time Series Microarray Analysis

Microarray data is often utilized in inferring regulatory networks. Quantile normalization (QN) is a popular method to reduce array-to-array variation. We show that in the context of time series measurements QN may not be the best choice for this task, especially not if the inference is based on continuous time ODE model. We propose an alternative normalization method that is better suited for network inference from time series data.


Introduction
When studying temporal changes in gene expression levels via microarrays, it is important to reduce any systematic biases, since one should only compare measurements on an equal footing. Normalization is the term used for the varying techniques applied to microarray data to achieve this reduction. Several approaches have been proposed and quite thoroughly investigated in the past [1][2][3][4]. The strongest effect in terms of time series analysis arise from inter-array normalization procedures, since each array represents a time point. Therefore, in this paper we focus only on inter-array normalization.
Given microarray time series data, one may try to unravel the structure of an underlying gene network by constructing a mathematical/computational model and to test whether this can reproduce the observations. If this is the case such a model can be very useful for experiments in silico and assist in designing experiments. Popular models include graphical models [5][6][7][8] and systems of ordinary differential equations (ODEs) [9][10][11][12]. The usefulness of ODEs is underlined by the fact that they can be used to realistically simulate the dynamics of gene expressions [13,14]. We consider the effect of normalization on these models by first generating artificial data with a known network and by investigating how well we can reconstruct the original network after normalization is applied to the data. For this we compare the performances of two different normalization methods: the often employed quantile normalization (QN) and a newly developed "modified histogram matching normalization (MHMN)". In addition we apply both algorithms also to real large scale microarray data. In the following section, we present our algorithm in detail and in the results section we discuss the results of our experiments. A Mathematica code that performs the new normalization is provided on request. In the Appendix we show both normalization procedures using a simple toy example data.

Algorithm for Modified Histogram Matching Normalization
In this section we describe our new algorithm, which we refer to as modified histogram matching normalization. We assume that the expression data have dimensions M × N , i.e., M genes and N time points. For convenience we illustrate the procedure step by step in the Appendix. To normalize the data, we proceed as follows: (1) First sort all data in the whole data matrix according to magnitude from low to high; The difference with QN is that instead of averaging over the rows of column-wise sorted data, these expression levels are scaled to fit the appropriate bins in the histogram of the distribution of expression levels over time. In this way the variation present in the original data is preserved. This is important especially when doing network inference based on a continuous ODE-model that relates the rate of change of an expression level of one gene to the expression levels of other (regulating) genes. For a discrete Boolean-type network inference, QN is accurate enough, because it preserves the rank information. However, for a ODE-based inference, which is essentially continuous time approach, it is crucial to have more than merely the rank information.

Results
As mentioned in the introduction, time series data are often analyzed using graphical models or systems of ODEs. To not lose our focus because of technical details, we use here the simplest representative of graphical models, namely correlation. In spite of its simplicity, it is often a very useful tool to give an impression of expression level data and the potential connections between the genes involved. We investigate, both with simulated and real microarray data, how the correlations deviate from the "ground truth" after applying QN and MHMN normalization.
In the subsequent analysis, we investigate how well QN and MHMN allow reconstruction of the parameters in linear ODEs. The latter are the simplest representatives of ODEs. Because the temporal dynamics of gene network are often simulated using ODEs [10,12,15], this is a critical test. Since knowing the parameters in the ODEs corresponds to knowing the interactions between genes, it is important that the estimation of parameters is not blurred by normalization.

Effects on Correlation
We want to see how the correlations of time series measurements are affected by the two different normalization methods. To give an impression of the results, we take as example a dataset with 8 genes and 8 time points, stored in an 8 × 8 array shown in Figure 1. From Figure 1, we may conclude that MHMN normalization preserves the correlation structure of the data matrix somewhat better than QN for this particular case. To obtain a more quantitative measure on the performance of MHMN and QN, we need to test them on a variety of datasets. To simulate time series data, we have generated 500 random time series data with expression levels of 8 genes and 16 time points. Three genes show clear temporal patterns, which are increasing, decreasing, and sinusoidal. The other five genes have random values of varying levels. Then we apply to each dataset a random plate-effect by choosing randomly one time point and adding a random amount of additive noise to the data at this time point. Next, we apply the two normalizations to both noiseless and noisy data. Finally we compare the correlation matrices of the results with the correlation matrices of the original data. As a measure to capture the differences between matrices, we use a popular matrix norm, the Frobenius norm. From the simulation results shown in Figure 2, it is immediate that MHMN results in significantly more accurate estimates than QN.

Effects on Correlation on Real Data
The real microarray data we employ in this section is the time series data generated by Sokolović et al. in their study on the effects of fasting on murine transcriptomics [16] in various organs including the liver and the brain. We used the data that consists of expression levels in brain tissue of Mus musculus at 5 time points after 0, 12, 24, 36, and 72 h of fasting. Data has dimensions 22, 680 × 5 and the correlation matrices are of size 22, 680 × 22, 680 . In network inference, one typically is interested in some subset of genes instead of the total set of genes. By focusing on a small subset one can formulate hypotheses that are experimentally easier to test. To simulate small network inference, we randomly choose sample sets of ten genes and compare their correlations before and after normalization with and without time-point specific noise. Similarly to the simulated experiments in the previous section, we plotted the histograms of Frobenius norms of errors generated by 1000 samples in Figure 3. We observe that indeed when the data is very large, the errors in correlations have similar distributions for QN and MHMN. According to the Kolmogorov-Smirnov test, the null hypothesis that the datasets have the same distribution at the 99% confidence level is not rejected. Also, when we compare the correlations of the total set of genes, the total residuals in correlations are identical up to three significant digits. This suggests that in the case of very large datasets MHMN performs similarly to QN.

Effects on Reverse-Engineering via ODEs
In this section we simulate time series data using linear ODEs. First we generate random gene networks with a fixed number of genes and varying number of edges. Such a network correspond to a binary adjacency matrix, where a matrix element (i, j) (i = row number, j = column number) has value 1 if gene j is connected to gene i and zero otherwise. With this same adjacency matrix, we can set up a system of linear ODEs by replacing all ones with appropriate random numbers to model the strength of interaction/regulation from gene j to gene i. The resulting matrix M is used in the linear ODE system dx dt = M x, with x the vector containing the gene level expressions. Finally we choose a set of random initial values within a reasonable range. By integrating the ODE we obtain continuous time solutions x(t). We sample the obtained solutions with regular intervals to obtain time series data. An example network and the corresponding simulated data are shown in Figure 4.
The task is now to infer the entries in matrix M from these samples, first using the samples directly, second, using the samples after applying MHMN and QN-normalization respectively. To simulate a plate effect, we put additive noise to the samples at a randomly chosen time point.
We performed series of simulation experiments with small systems of ODEs (4-5 nodes). With or without time-point specific noise (plate effect), MHMN consistently resulted in better parameter inference than QN. In each experiment a random set of max 25 parameters need to be inferred. In both cases the Kolmogorov-Smirnov test leads to rejecting the null hypothesis that the datasets have the same distribution at the 99% confidence level. In Figure 5 we show the distributions of log errors in the inferred parameters.

Conclusions
We have shown that the popular normalization method, quantile normalization (QN), while being effective in equalizing the data from array to array, was in our analysis inferior to MHMN for ODE-based time series analysis. It is also less suitable for correlation-based analysis, when the number of rows in the data is not large. Time series analysis is important in practice. For example, when we consider tissues of a developing organism, one expects to see temporally meaningful changes in expression levels. QN is less suitable for time series analysis because it loses the information on the differences of expression levels between arrays. QN produces arrays that consist of exactly the same numerical values in each column (time point). See Appendix for an illustration. The modified histogram matching normalization (MHMN) that we propose here does in some sense the opposite. After the initial histogram matching, where each array has a distribution of the whole dataset, within each bin the values are scaled to preserve the contrast between different expression values. The largest difference between the two methods we see in the parameter estimation task, when using an ODE-model. Parameter inference is crucial in determining both the topology of the network and the nature of interactions in the network. Based on our simulation studies, we recommend MHMN rather than QN for ODE-based network inference using time series microarray data that is susceptible for array-to-array noise.