Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories

Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.


Introduction
Reverse engineering of gene networks from gene expression data is an active research area in systems biology (see [1] for a recent review). Such reverse-engineered networks provide biologists with a wealth of information on the relationships between genes and gene products. Clearly, the quality of generated networks is directly related to the size and quality of input data. With respect to the size of input data, the large number of publicly available experiments has been sufficient for network construction methods, and this has only increased over the past decade. However, when the input data is collected from public databases that contain the results of thousands of experiments conducted in hundreds of different labs around the world, a stringent pre-processing pipeline that includes appropriate quality control procedures is necessary.
Pre-processing methods designed for differential expression studies can also be used for the purpose of network inference. A recent review of the standard techniques used in pre-processing microarray data for differential expression studies can be found in Marbach et al. (2012) [2]. These pre-processing methods include quality control, normalization and filtering procedures. Quality control procedures are used to remove outlier microarray chips. Normalization procedures make it corresponding microarray data files (CEL files) from National Center for Biotechnology Information (NCBI) web portal. For ArrayExpress, we downloaded the search results for Affymetrix ATH1 microarrays as a text file, which included the web URLs to the corresponding data files. We then parsed the text file and downloaded the CEL files corresponding to the search results. Downloading files from NASC database involved scrapping the HTML webpages related to the experiments and downloading the CEL files by following the appropriate HTML links.
We also added to our datasets collection Affymetrix ATH1 datasets prepared by Aluru et al. (2012) [7]. Since Aluru et al. (2012) [7] had collected all the data files submitted to public databases before 2010, we ignored all the database entries that were submitted prior to this date.

Removal of Duplicates and Bad CELs
Some experiments were found in multiple data repositories, potentially due to duplicative submissions to improve discoverability. We used the MD5 hash function to remove exact duplicate CEL files. We also noticed that some duplicate CEL files are not exact copies. However, the descriptions provided by submitter are highly similar in most cases. We inspected the CEL files for matching descriptions-first programmatically for exact matches, and then manually for uncovering highly similar descriptions and further inspection.
After eliminating all the duplicate CEL files, we manually scanned all the CEL files and removed those with irregular hybridization patterns.

Quality Control and Normalization
After the removal of duplicates, we filtered CEL files using standard quality control (QC) measures typical to the Affymetrix platform [6]. We used the "simpleaffy" BioConductor library [5] to inspect scale factors and presence of BioB spike-ins. Chips having more than three times the mean scale factor for a given submission or having no BioB spike-ins were removed.
We then removed outlier CEL files within each submission as follows. We computed "Relative Log Expression" (RLE) and "Normalized Unscaled Standard Errors" (NUSE) values within each submission. RLE and NUSE measures should be centered on zero and one, respectively. As a submission in a database is a set of CEL files, each CEL corresponding to a microarray assay conducted under different conditions by the same group, we expect an outlier CEL to have smaller spread of RLE and NUSE values. CEL files with interquartile range higher than 0.75 for RLE and NUSE were removed. Also, CEL files that were 0.075 from the required center for RLE and NUSE are identified as outliers and removed.
After the removal of CEL files that failed the above quality control procedures, we applied a normalization procedure so that gene expressions can be compared across experiments. We first converted the raw probe intensity values into expression values using the standard MAS 5.0 procedure with a scaling factor of 1000. Then, expression values are transformed to log2 space and mean centered, i.e., G[i, j] is updated to G i,j -M i , where G i,j is the raw expression value of gene i in chip j, M i is the average gene expression of gene i and G[i, j] is the normalized expression value of gene i in chip j. Finally, we used the "limma" Bioconductor library [4] to perform quantile normalization. Table 1 shows the number of submissions, which we refer to as experiments in the rest of this paper, and the number of CEL files left after quality control procedures. Note that this table also includes 3546 CELs prepared by [7].

Data Filtering
In differential gene expression studies, where the main goal is to select genes with significant up or down regulation, genes with low expression range are eliminated because this indicates that those genes have no vital role in the pathway being studied. Filtering genes with low expression range is important in reverse-engineering gene networks also because most construction methods use correlation measures, and in order for a correlation measure between any two genes to be valid, the genes are expected to cover a wider expression range.
In the data filtering step, probes with lower expression range are identified using inter-quartile range (IQR) measure of the expression values. Typically, a threshold q is selected based on prior knowledge or empirical analysis, and all the probes having IQR < q is eliminated. In previous work where a gene network from a dataset of 3546 Affymetrix ATH CEL files was reported [7], an IQR threshold of 0.65 was selected. In Section 3.1, we discuss the effect of using this threshold on our dataset collection.
Clearly, lowering the value of the threshold q increases the number of genes that survive the filter, and consequently increase the size of the network. While it is desirable that the number of genes included in the network is as high as possible, simply reducing the IQR threshold to allow larger number of genes can produce spurious results in network construction due to invalid correlation measures, and thus undermining the primary goal of reverse-engineering networks.
A method for estimating the IQR threshold is expected to achieve two fundamental objectives: (1) As demonstrated by Bourgona et al. (2010) [11], filtering of genes by a criterion independent of the statistical test used to detect the strongly differentiated genes increases the detection power of multiple testing. This is applicable for network construction methods also, because the valid correlations are evaluated using an interaction-by-interaction statistical testing. (2) The set of genes that survive the filtering should reflect the dataset's characteristic variability of expression values.
The method proposed by Kapetis et al. (2012) [12] to select the IQR threshold is based on [11] and is designed to accomplish the first objective. The method by Kapetis et al. (2012) [12] achieves this objective by identifying the probe-sets with the highest variability in signal across arrays. It computes the first derivative of the IQR profile and identifies its minimum point i.e., the steepest slope of the IQR profile plot.
We now demonstrate this method by applying it on the complete dataset. Figure 1, shown below, is the plot of number of probe-sets that survive a given IQR threshold value. When the IQR threshold is zero, all the probe sets pass the threshold. As IQR increases, the number of genes that pass the filter reduces gradually. Kapetis et al. (2012) [12] call this plot as the IQR profile plot.  [12] select this minimum value as the threshold. We refer the reader to [11,12] for further details on this method. We used the following method to estimate the threshold q, which achieves both of the desired objectives. After computing the gene expression IQR values of all the 22,810 probe-sets, we constructed a histogram plot of the 22,810 IQR values. Based on the plot, we selected q as the IQR value x, which has the maximum number of genes in the bin. The histogram bin size is selected to be granular enough to allow us to select a unique value of the threshold q. Figure 3, shown below, demonstrates the selection of threshold q by plotting the histogram plot of IQR values for a complete dataset of all the CEL files.  [12] select this minimum value as the threshold. We refer the reader to [11,12] for further details on this method. The derivative plot of the IQR profile of the complete dataset is shown in the Figure 2 below and its minimum occurs when IQR threshold is 0.44. Kapetis et al. (2012) [12] select this minimum value as the threshold. We refer the reader to [11,12] for further details on this method. We used the following method to estimate the threshold q, which achieves both of the desired objectives. After computing the gene expression IQR values of all the 22,810 probe-sets, we constructed a histogram plot of the 22,810 IQR values. Based on the plot, we selected q as the IQR value x, which has the maximum number of genes in the bin. The histogram bin size is selected to be granular enough to allow us to select a unique value of the threshold q. Figure 3, shown below, demonstrates the selection of threshold q by plotting the histogram plot of IQR values for a complete dataset of all the CEL files. We used the following method to estimate the threshold q, which achieves both of the desired objectives. After computing the gene expression IQR values of all the 22,810 probe-sets, we constructed a histogram plot of the 22,810 IQR values. Based on the plot, we selected q as the IQR value x, which has the maximum number of genes in the bin. The histogram bin size is selected to be granular enough to allow us to select a unique value of the threshold q. Figure 3, shown below, demonstrates the selection of threshold q by plotting the histogram plot of IQR values for a complete dataset of all the CEL files.
The selected q, by being the IQR value corresponding to the bin with maximum number of genes, i.e., the most common IQR expression value in the dataset, represents the characteristic IQR value of the dataset. Any value to the right of the selected q will reject the representative majority and can potentially leave out genes showing sufficient activity within the dataset.
We also found that in our datasets, the steepest slope of IQR plot always appears before the maximum point of the IQR histogram plot (See Figure 3 for example). Since the method by Kapetis et al. (2012) [12] uses the steepest slope of the IQR plot, the value q selected by our method is at least as high as the value selected by their method. Our method, while being simpler and easier to compute compared to Kapetis et al. (2012) [12], accomplishes both the objectives. The selected q, by being the IQR value corresponding to the bin with maximum number of genes, i.e., the most common IQR expression value in the dataset, represents the characteristic IQR value of the dataset. Any value to the right of the selected q will reject the representative majority and can potentially leave out genes showing sufficient activity within the dataset.
We also found that in our datasets, the steepest slope of IQR plot always appears before the maximum point of the IQR histogram plot (See Figure 3 for example). Since the method by Kapetis et al. (2012) [12] uses the steepest slope of the IQR plot, the value q selected by our method is at least as high as the value selected by their method. Our method, while being simpler and easier to compute compared to Kapetis et al. (2012) [12], accomplishes both the objectives.

Probe Annotation
An Affymetrix ATH chip is a raw image containing the intensity values of 22,810 probe-sets. Due to continuing changes in annotation of the probe-sets of Arabidopsis thaliana, many probe-sets match with multiple genes and vice versa. We used the annotation files available with The Arabidopsis Information Resource (TAIR) [13] and Affymetrix [14], and created an initial map that contained all the 22,810 probesets. We first removed all the probes that were characterized as "no_match", i.e., probes with no corresponding Arabidopsis Gene Identifier (AGI). Then, we used this map to remove those genes that map to more than three AGIs. Finally, we ran a clustering algorithm that placed two probes in the same cluster if they share at least one AGI. For each such cluster, a probe-set that mapped to the fewest number of AGIs was selected as its representative and all other probe-sets from the cluster were removed.

Classification
Our final dataset consists of 11,760 CEL files that survived the quality control procedures described in Sections 2.2 and 2.3. We call this dataset of 11,760 CEL files as the "complete dataset". To the best of our knowledge, this is the largest collection of microarray datasets for Arabidopsis thaliana. In this complete dataset, if we remove all the genes having IQR expression value less than 0.65 (as suggested by Aluru et al. (2012) [7]), only 13,384 genes remain for network construction. This is much less than the 15,495 genes that survived data filtering in the previous analysis of 3546 Affymetrix ATH CEL files reported in Aluru et al. (2012) [7].
As our dataset collection is inclusive of and more than thrice the size of the dataset used in Aluru et al. (2012) [7], it is reasonable to expect that the enlarged dataset should result in inclusion of more genes and discovery of more interactions. This conundrum of generating smaller networks with more data is due to the fact that gene interactions are tissue or process specific. The inclusion of a large number of

Probe Annotation
An Affymetrix ATH chip is a raw image containing the intensity values of 22,810 probe-sets. Due to continuing changes in annotation of the probe-sets of Arabidopsis thaliana, many probe-sets match with multiple genes and vice versa. We used the annotation files available with The Arabidopsis Information Resource (TAIR) [13] and Affymetrix [14], and created an initial map that contained all the 22,810 probe-sets. We first removed all the probes that were characterized as "no_match", i.e., probes with no corresponding Arabidopsis Gene Identifier (AGI). Then, we used this map to remove those genes that map to more than three AGIs. Finally, we ran a clustering algorithm that placed two probes in the same cluster if they share at least one AGI. For each such cluster, a probe-set that mapped to the fewest number of AGIs was selected as its representative and all other probe-sets from the cluster were removed.

Classification
Our final dataset consists of 11,760 CEL files that survived the quality control procedures described in Sections 2.2 and 2.3. We call this dataset of 11,760 CEL files as the "complete dataset". To the best of our knowledge, this is the largest collection of microarray datasets for Arabidopsis thaliana. In this complete dataset, if we remove all the genes having IQR expression value less than 0.65 (as suggested by Aluru et al. (2012) [7]), only 13,384 genes remain for network construction. This is much less than the 15,495 genes that survived data filtering in the previous analysis of 3546 Affymetrix ATH CEL files reported in Aluru et al. (2012) [7].
As our dataset collection is inclusive of and more than thrice the size of the dataset used in Aluru et al. (2012) [7], it is reasonable to expect that the enlarged dataset should result in inclusion of more genes and discovery of more interactions. This conundrum of generating smaller networks with more data is due to the fact that gene interactions are tissue or process specific. The inclusion of a large number of experiments unrelated to the scope of a gene interaction can negatively affect its discovery. In addition, this may also lead to filtering out of a relevant gene in the network due to lack of sufficient dynamic range of expression over the entire dataset. As a consequence, simply enlarging the dataset and processing it as a whole does not lead to larger networks with more comprehensive discovery of novel gene interactions. We postulate that proper categorization of experiments is crucial to quality of gene network constructions, and demonstrate this as follows.
We classified the collection of 11,760 CEL files into tissue-specific and process-specific classes. We developed a two-step classification procedure. In the first step, we search for keywords related to different plant tissues and the physiological processes within the description and metadata available for each submission in the database. Based on the keywords identified, we made a preliminary partition of the 11,760 CEL files. In this preliminary assignment, one submission can be assigned to multiple categories. In the second step, we manually scanned the metadata available for each experiment and refine the preliminary classification further to resolve any error in the assigned categories. Table 2 shows the list of tissue and process categories we use for classification and the corresponding number of experiments and CEL files under each category.

Network Construction Methods
Though we focus primarily on pre-processing techniques in this paper, we briefly discuss network construction methods here because the effectiveness of the proposed pre-processing techniques is evaluated in the context of network construction. Reverse-engineering networks are a well-studied problem and many different methods with varying capabilities have been developed for network construction (for example, [15][16][17][18]).
For the purpose of evaluating our pre-processing methods, we use two different network-reverse engineering methods. In both of these methods, a correlation measure is evaluated for every pair of genes in the dataset, and correlations that are deemed significant are reported as edges in the output network. One uses the Pearson correlation coefficient as the correlation measure, while the other uses Mutual Information.
Pearson correlation coefficient (PCC) is a measure of linear relationship between any two variables of interest and it ranges from +1 to −1, where +1 indicates the strongest positive correlation, while −1 a strong negative correlation. We use the pcor function in R to compute PCC. After measuring PCC for every pair of genes, we use the method used by Mao et al. (2009) [8] to estimate the threshold above which a correlation can be considered significant.
Mutual Information (MI) between two random variables x and y is a measure of dependence between x and y. We use the TINGe software [19] to construct MI network. TINGe is a parallel gene network construction method that uses a B-Spline based method to estimate MI values [20] between every pair of genes and, then evaluates statistical significance using permutation testing. It also uses data processing inequality (DPI) to eliminate indirect relations between two genes. We refer the reader to [19] for further details on TINGe.

Results
We discuss data filtering and classification strategies under the context of gene network construction. First, in Section 3.1, we analyze the results of applying the data filtering procedure on the complete dataset. Then, in Section 3.2, we discuss the reasons for the rejection of genes in the complete dataset by the filtering procedure. We also describe our rationale for classification, and show how classification aids in our goal to include as many genes as possible without having to compromise on the stringency of data filtering procedures. Finally, in Section 3.3, we analyze the results of PCC and MI networks built using the tissue-specific and process-specific datasets.

Data Filtering Applied to a Complete Dataset
As noted earlier, if we eliminate in the complete dataset the genes with IQR values less than 0.65 (a threshold value suggested by Aluru et al. (2012) [7]), only 13,384 genes remain for network construction. Note that compared to Aluru et al. (2012) [7], which applied the same threshold to 3456 observations, we are 1027 genes short. When we select the IQR threshold using the histogram plot method described in Section 2.4, the number of genes that survive the IQR filtering increases to 18,806. Figure 3 shows the IQR histogram plot constructed for the complete dataset.
Though the histogram method of selecting the IQR threshold is an improvement over using the threshold suggested by Aluru et al. (2012) [7], it still leaves out many of the~22,000 genes that constitute the Arabidopsis thaliana genome. To determine why certain genes were excluded, we selected a few genes that were rejected by the data filtering procedure(s), and compared their expression level distribution. We observed that the relative expression levels of these genes remained low in a significant percentage of the experiments, although their expression levels were high in a few specific sets of experiments. As a result, these genes did not show the dynamic range of expression needed for co-expression analyses, even though the number of observations has increased. Figure 4 shows the histogram plot of expression values of one of the rejected genes.

Data Filtering Applied to Classfied Datasets
In eukaryotic organisms such as plants, gene expression levels vary according to tissue and/or the biological processes. Many genes are expressed only in those tissues and processes they actively participate in. Recent studies in maize have demonstrated this fact by constructing modules that show tissue-specific expression levels of groups of genes [21]. Therefore, in order to exploit our collection of datasets to the best possible extent, we categorized the experiments based on specific plant tissues or the biological process. Table 3 shows the number of genes that survived the data filtering procedure for each of the tissuespecific and process-specific datasets.

Data Filtering Applied to Classfied Datasets
In eukaryotic organisms such as plants, gene expression levels vary according to tissue and/or the biological processes. Many genes are expressed only in those tissues and processes they actively participate in. Recent studies in maize have demonstrated this fact by constructing modules that show tissue-specific expression levels of groups of genes [21]. Therefore, in order to exploit our collection of datasets to the best possible extent, we categorized the experiments based on specific plant tissues or the biological process. Table 3 shows the number of genes that survived the data filtering procedure for each of the tissue-specific and process-specific datasets. Even though the number of genes covered by the individual classified datasets is in the range of 17,000-19,000, the total number of genes covered is 20,146, which exceeds the number of genes that pass the data filtering procedure for the complete dataset (Section 3.1). With the help of classification, we are able to increase the number of genes covered by the gene network construction from 15,498 in the previous work reported in Aluru et al. (2012) [7] to 20,146.

Networks Construction
We constructed both PCC and MI networks for each of the tissue-specific and process-specific datasets. We then constructed three union networks-the union of all tissue-specific networks (tissue union), the union of all process-specific networks (process union), and the union of all the networks (full union). We built these union networks from the constituent networks as a set union of their vertices and edges. Apart from the tissue-specific and process-specific networks, we also constructed networks for the complete dataset of 11,760 observations so as to compare the effect of the classification on reverse-engineering gene networks. Table 4 shows the number of genes covered by both PCC and MI reverse-engineered networks using the complete dataset in two cases-(A) when the IQR threshold q is set to the value of 0.65 and (B) when the IQR threshold q is set using the histogram plot method described in Section 2.3. We use (A) to indicate the former case and (B) for the latter. Also shown in Table 4 are the results of large-scale genome networks constructed previously for Arabidopsis thaliana by [7][8][9]. Table 4. Number of genes covered by reverse-engineered networks. Shows the sizes of networks constructed from complete dataset when (A) IQR threshold q = 0.65 and (B) q is selected from the histogram plot. Also shown in the table are the sizes of networks constructed by [7][8][9].

Network
Size For a given dataset, we observe that (i) all the genes included in the dataset may not be included in the network; and (ii) the number of genes included in the network depends upon the construction method. For example, in the case of dataset (B), the MI network fails to cover 200 genes present in the input dataset, whereas 14,866 genes are absent in the PCC network.
With respect to the second observation, we note that MI networks are able to include a larger number of genes compared to PCC networks. For example, in the case of dataset (B), the MI network has 14,866 genes more than the PCC network. The MI network also covers all the genes covered by the PCC network. The reason for this difference is that the PCC network measures only the linear correlations between the genes, whereas the MI network can measure both linear and non-linear relationships.
The results given in Table 4 show that in both the network construction methods, some genes are lost because of the nature of the correlation measures. Table 4 also demonstrates that the networks constructed with the 11,760 size complete dataset are, in fact, smaller than the corresponding MI and PCC networks constructed in earlier works by Aluru et al. (2012) [7] and Mao et al. (2009) [8], respectively, even though both [7,8] use significantly smaller datasets. These factors reinforce the need for pre-processing techniques that include as many genes as possible by carefully managing the diversity of large-scale datasets. We achieve this goal by means of classification, as the following results show. Table 5 shows the network sizes of the PCC and MI networks of the classified datasets described in Table 3.
The tissue and process union of PCC networks has 7560 and 6858 genes, respectively. The full union PCC network has a total of 8429 genes and has 131,648 edges. In the case of the MI networks, the tissue and process union of the MI networks has 19,247 and 20,045 genes, respectively-significantly larger than the corresponding PCC union networks. The full union MI network has a total of 20,126 genes and 638,051 edges. discussed these standard pre-processing methods for quality control, normalization and data filtering used in network construction methods. However, as we discussed in Sections 2.4, 3.1 and 3.2, they prove to be insufficient for inferring genome-scale networks from a large collection of gene expression data. When the goal is to construct networks for eukaryotic organisms with tens of thousands of genes from a large dataset, the process of aggregation from multiple data sources introduces a limit in the utility of the downstream network construction methods. We observe this shortfall in Section 3.3 (Table 4) with the complete dataset. The data filtering procedure commonly used for the Affymetrix ATH Platform leaves us with only 18,806 genes (Section 3.1), which in turn leads to MI and PCC networks of sizes of 18,606 and 3940, respectively (Table 4).
On closer examination of genes rejected by data filtering (Figure 4), we found that they have a lower expression range in many experiments and are hence rejected. Contrary to expectations, we observed that the increase in the number of experiments does not automatically translate to an increase in the number of genes available for network construction, which in turn lead us to this fundamental insight: because of the spatial and temporal expression of genes in eukaryotic organisms, certain genes are actively expressed in only specific tissues and biological processes. Therefore, the activity of these genes becomes less recognizable when all the experiments are analyzed in aggregate.
Our solution to address this problem is to devise a classification procedure that assigns the datasets to the tissue-specific and process-specific categories described in Section 2.4. Each of these classified datasets is subjected to independent pre-processing runs. Though the number of genes covered by classified datasets is in the range of 17,000 to 19,000 (Table 3), 20,146 genes were covered in total (Section 2.6). By increasing the total coverage via classification, we are able to increase the number of genes covered by the largest union network to 20,045 (Section 3.3). To the best of our knowledge, this is the largest gene network constructed for Arabidopsis thaliana using only publicly available microarray data.
Though we describe our methods in the context of Arabidopsis thaliana, they are general in nature and can be applied to any organism. An interesting open question is whether, for a given dataset, the category names can be automatically derived from the data sizes and metadata available.

Conclusions
In this paper, we describe a method to pre-process large-scale microarray datasets, specifically in the context of reverse-engineer networks. Our novel pre-processing method led to the construction of the largest network from microarray datasets for Arabidopsis thaliana. We hope that the generality of our method enables the construction of genome-scale networks of other eukaryotic organisms of interest, and hence the discovery of novel interactions.