Construction and Analysis of Gene Co-Expression Networks in Escherichia coli

Network-based systems biology has become an important method for analyzing high-throughput gene expression data and gene function mining. Escherichia coli (E. coli) has long been a popular model organism for basic biological research. In this paper, weighted gene co-expression network analysis (WGCNA) algorithm was applied to construct gene co-expression networks in E. coli. Thirty-one gene co-expression modules were detected from 1391 microarrays of E. coli data. Further characterization of these modules with the database for annotation, visualization, and integrated discovery (DAVID) tool showed that these modules are associated with several kinds of biological processes, such as carbohydrate catabolism, fatty acid metabolism, amino acid metabolism, transportation, translation, and ncRNA metabolism. Hub genes were also screened by intra-modular connectivity. Genes with unknown functions were annotated by guilt-by-association. Comparison with a previous prediction tool, EcoliNet, suggests that our dataset can expand gene predictions. In summary, 31 functional modules were identified in E. coli, 24 of which were functionally annotated. The analysis provides a resource for future gene discovery.


Introduction
Escherichia coli (E. coli) is an abundant bacteria in the intestine of humans and animals. It is a single-celled prokaryote that is widely used as a model organism in biology research. The E. coli genome size is about 4.64 M and encodes 5416 genes. Well-developed microarray technology has been used to investigate genome-wide gene expression thanks to its low cost. For convenience, we refer to gene as the corresponding probesets on the microarray throughout the manuscript when focusing on genes rather than probesets. There is extensive E. coli transcriptome data deposited in the public databases. These data include gene expression data under various conditions, such as different nutrients, growing stages, and gene mutations [1]. Scientists have been endeavoring to mine transcriptional regulation networks, gene expression networks, and protein-protein interaction networks by mathematical models [2]. It is still a challenge to reanalyze and discover the underlying information or knowledge within high-throughput data. Through the Google Scholar search engine, we found that the application of weighted gene co-expression network analysis (WGCNA) to E. coli has not yet been comprehensively reported. Although the metabolic network of E. coli has been constructed [3,4], only one recent study mentioned the construction of an E. coli gene network from 524 arrays, which focused on the statistical aspect by comparing four algorithms, including WGCNA [5]. Another study applied a community detection algorithm to the network of interactions identified with the context likelihood of relatedness (CLR) method from 730 arrays [6]. Three other studies have investigated the regulatory network of E. coli using hundreds of microarrays [7][8][9]. The EcoliNet database, which aims to provide function information for E. coli genes, collects co-functional gene networks for E. coli from seven distinct types of data, including co-expression pattern data from 132 microarrays [10].
The aims of our analysis were as follows: (1) to integrate the independent datasets, which will help to overcome the limitations caused by sample size and statistical method bias; (2) to summarize the thousands of genes on the array into tens of modules, which will help simplify data complexity and clarify the E. coli biological functions by modules, systematically; and (3) to provide organization or connection information within module genes, which may infer the potential function of unannotated genes by guilt-by-association rationale [11].
Here, the gene co-expression network for E. coli was constructed from 1391 microarrays. Thirty-one modules were identified, representing various aspects of E. coli biological functions. Hub genes bearing high connectivity may play roles in module function. Unknown genes were also annotated.

A Gene Co-Expression Network for E. coli Was Successfully Constructed
The biggest gene co-expression network to date has been constructed ( Figure 1). The top 2000 highly connected probe pairs were visualized by Cytoscape. The highest connected gene within the whole network is probe 1761661_s_at, which is annotated as an intergenic region; however, it localizes within the putative adhesin gene ycgV as inferred from its physical coordinate. Also, probe 1761661_s_at is highly connected in the Black module, whose members are enriched with plasma membrane protein-coding genes and the two-component system related genes (Table 1).
Cells 2018, 7, x FOR PEER REVIEW 2 of 11 [5]. Another study applied a community detection algorithm to the network of interactions identified with the context likelihood of relatedness (CLR) method from 730 arrays [6]. Three other studies have investigated the regulatory network of E. coli using hundreds of microarrays [7][8][9]. The EcoliNet database, which aims to provide function information for E. coli genes, collects co-functional gene networks for E. coli from seven distinct types of data, including co-expression pattern data from 132 microarrays [10]. The aims of our analysis were as follows: (1) to integrate the independent datasets, which will help to overcome the limitations caused by sample size and statistical method bias; (2) to summarize the thousands of genes on the array into tens of modules, which will help simplify data complexity and clarify the E. coli biological functions by modules, systematically; and (3) to provide organization or connection information within module genes, which may infer the potential function of unannotated genes by guilt-by-association rationale [11].
Here, the gene co-expression network for E. coli was constructed from 1391 microarrays. Thirtyone modules were identified, representing various aspects of E. coli biological functions. Hub genes bearing high connectivity may play roles in module function. Unknown genes were also annotated.

A Gene Co-Expression Network for E. coli Was Successfully Constructed
The biggest gene co-expression network to date has been constructed ( Figure 1). The top 2000 highly connected probe pairs were visualized by Cytoscape. The highest connected gene within the whole network is probe 1761661_s_at, which is annotated as an intergenic region; however, it localizes within the putative adhesin gene ycgV as inferred from its physical coordinate. Also, probe 1761661_s_at is highly connected in the Black module, whose members are enriched with plasma membrane protein-coding genes and the two-component system related genes (Table 1).   WGCNA identified 31 stable co-expressed modules within which the genes have a similar expression pattern ( Figure 2). To test the stability of the modules, the connectivity correlations between the original one and the one sampled 1000 times were calculated and expressed as the mean ± SD. All modules, except the Lightgreen module, show an average connectivity correlation smaller than 0.8. The most stable module is the Darkgrey and the least stable is the Steelblue. WGCNA identified 31 stable co-expressed modules within which the genes have a similar expression pattern ( Figure 2). To test the stability of the modules, the connectivity correlations between the original one and the one sampled 1000 times were calculated and expressed as the mean ± SD. All modules, except the Lightgreen module, show an average connectivity correlation smaller than 0.8. The most stable module is the Darkgrey and the least stable is the Steelblue.

Each Module Performs Distinct Functions
The database for annotation, visualization, and integrated discovery (DAVID) online tool was used to characterize the function of these identified modules. Twenty-four of them were annotated with significant Gene Ontology (GO) or Kyoto Encyclopedia of Genes and Genomes (KEGG) terms (Table 1). E. coli functions can be categorized into relatively independent functional modules. For instance, Blue is associated with secretion, whose coding proteins localize outside the cell. Magenta and Lightyellow both encode the peptidoglycan-based cell wall, but Lightyellow involves fatty acid oxidation and Magenta involves amine biosynthesis. Seven of these modules did not detect any function: Floralwhite, Lightgreen, Lightsteelblue1, Purple, Skyblue3, Steelblue, and Violet, corresponding to 34, 109, 37, 276, 45, 59, and 53 probesets, respectively. Among them, Floralwhite, Lightgreen, Lightsteelblue1, Skyblue3, and Steelblue are the eight least stable modules. The absence of DAVID terms being detected may be due to their relatively fewer probesets and lower module stability.   Figure 2. Correlation of intramodule connectivity for each module after sampling 1000 times (mean ± SD). Connectivity was calculated by randomly sampling half of the 1391 samples 1000 times.

Each Module Performs Distinct Functions
The database for annotation, visualization, and integrated discovery (DAVID) online tool was used to characterize the function of these identified modules. Twenty-four of them were annotated with significant Gene Ontology (GO) or Kyoto Encyclopedia of Genes and Genomes (KEGG) terms (Table 1). E. coli functions can be categorized into relatively independent functional modules. For instance, Blue is associated with secretion, whose coding proteins localize outside the cell. Magenta and Lightyellow both encode the peptidoglycan-based cell wall, but Lightyellow involves fatty acid oxidation and Magenta involves amine biosynthesis. Seven of these modules did not detect any function: Floralwhite, Lightgreen, Lightsteelblue1, Purple, Skyblue3, Steelblue, and Violet, corresponding to 34, 109, 37, 276, 45, 59, and 53 probesets, respectively. Among them, Floralwhite, Lightgreen, Lightsteelblue1, Skyblue3, and Steelblue are the eight least stable modules. The absence of DAVID terms being detected may be due to their relatively fewer probesets and lower module stability.
RNA methyltransferase activity -represents no significant GO or KEGG terms detected.

E. coli Hub Gene Screening
In E. coli gene co-expression modules, each gene has a different contribution, that is, different connectivity. The higher the connectivity, the more important the gene in the module. Therefore, the contribution of a gene to a module can be simply described by its connectivity. The hub genes were screened by their connectivity in each module [12]. The screened hub genes are shown in Table 2. For instance, the top 200 connections of the Darkorange and Darkred modules are visualized in Figure 3. In E. coli gene co-expression modules, each gene has a different contribution, that is, different connectivity. The higher the connectivity, the more important the gene in the module. Therefore, the contribution of a gene to a module can be simply described by its connectivity. The hub genes were screened by their connectivity in each module [12]. The screened hub genes are shown in Table 2. For instance, the top 200 connections of the Darkorange and Darkred modules are visualized in Figure 3.

Module-Based Analysis of Gene Expression Variation
Analysis of gene expression variation in each module found modules performing housekeeping functions. Tan, Green, and Darkgrey have relatively low gene expression variation (lowest three), suggesting their housekeeping functions, such as ncRNA metabolism, translation, and lipopolysaccharide biosynthesis. Brown4, Darkorange, and Plum1 have relatively high gene expression variation (top three), mainly corresponding to stress responses, such as iron transmembrane transporter activity, pathogenesis, and anaerobic respiration (Figure 4).

Module-Based Analysis of Gene Expression Variation
Analysis of gene expression variation in each module found modules performing housekeeping functions. Tan, Green, and Darkgrey have relatively low gene expression variation (lowest three), suggesting their housekeeping functions, such as ncRNA metabolism, translation, and lipopolysaccharide biosynthesis. Brown4, Darkorange, and Plum1 have relatively high gene expression variation (top three), mainly corresponding to stress responses, such as iron transmembrane transporter activity, pathogenesis, and anaerobic respiration (Figure 4).

Module-Based Analysis of Gene Connectivity
Module gene connectivity includes intramodule connectivity and intermodule connectivity. Intramodule connectivity represents the relationship between genes within a specific module. Intermodule connectivity equals the total network connectivity minus intramodule connectivity, indicating the contribution of genes in coordinating functions across modules. Module-based gene connectivity analysis revealed that Black has the lowest intermodule connectivity while Darkorange has the highest intermodule connectivity ( Figure 5). Darkorange is enriched with pathogenesis genes. Connectivity analysis indicates its important roles in coordinating module functions.

Module-Based Analysis of Gene Connectivity
Module gene connectivity includes intramodule connectivity and intermodule connectivity. Intramodule connectivity represents the relationship between genes within a specific module. Intermodule connectivity equals the total network connectivity minus intramodule connectivity, indicating the contribution of genes in coordinating functions across modules. Module-based gene connectivity analysis revealed that Black has the lowest intermodule connectivity while Darkorange has the highest intermodule connectivity ( Figure 5). Darkorange is enriched with pathogenesis genes. Connectivity analysis indicates its important roles in coordinating module functions.

Module-Based Analysis of Gene Expression Variation
Analysis of gene expression variation in each module found modules performing housekeeping functions. Tan, Green, and Darkgrey have relatively low gene expression variation (lowest three), suggesting their housekeeping functions, such as ncRNA metabolism, translation, and lipopolysaccharide biosynthesis. Brown4, Darkorange, and Plum1 have relatively high gene expression variation (top three), mainly corresponding to stress responses, such as iron transmembrane transporter activity, pathogenesis, and anaerobic respiration (Figure 4).

Module-Based Analysis of Gene Connectivity
Module gene connectivity includes intramodule connectivity and intermodule connectivity. Intramodule connectivity represents the relationship between genes within a specific module. Intermodule connectivity equals the total network connectivity minus intramodule connectivity, indicating the contribution of genes in coordinating functions across modules. Module-based gene connectivity analysis revealed that Black has the lowest intermodule connectivity while Darkorange has the highest intermodule connectivity ( Figure 5). Darkorange is enriched with pathogenesis genes. Connectivity analysis indicates its important roles in coordinating module functions.

Modules That Correlate with Experimental Conditions
Modules can be independent units that perform biological functions. Linking the modular gene expression with experimental conditions may help to discover modules functioning in specific conditions. For example, inorganic phosphate limitation induces the highest Darkorange expression (Table S3). Darkorange is associated with pathogenesis and may be involved in responses to Pi availability and in the coordination of its virulence gene response [13]. Several modular genes are critical genes that respond to Pi limitation as reported, such as sepZ (1759760_s_at), escV (1761005_s_at), EspB (1765196_s_at), alkaline phosphatase (1767525_s_at), and Ler (1761795_s_at). Many other genes have been reported to be associated with virulence, such as intimin (1760207_s_at) [14], SepQ (1759472_s_at), SepL (1764730_s_at), EscC (1767809_s_at), EscD (1767897_s_at) ( Table S4).
The highest Darkred expression is induced when E. coli is cultured with bovine serum albumin (BSA) at the logarithmic growth phase [15]. Darkred is associated with flagella, which are important for biofilm formation by E. coli. Many Darkred genes are related to flagella, such as flgF (1759235_at), fliN (1765241_s_at), fliG (1766530_s_at), and flgG (1767435_s_at) ( Table S4). Seventeen genes are annotated as hypothetical proteins. Our results may provide information for mining novel flagellum-related genes.

Comparison with Previous E. coli Networks
To compare our modules with those identified by a community detection algorithm [6], we overlapped the module functions and the Green and Darkred module member genes. Among the 13 modules from the Trevino study, 11 had GO annotations and all 11 modules were covered by our study with the same biological function (Table S5). We found 77 overlapping genes between the structural constituent of ribosome genes (the total was 106 in the Trevino study) and the Green module (Table S6). Another example is the 72 flagella genes, which were identified as the most significant GO-enriched community in the Trevino study. In our study, the Darkred module was annotated as flagellum and contains 87 potential genes that may be associated with flagella. Fifty-six genes overlap between these two lists (Table S7). In our study, the flagellum module also contains some genes annotated as hypothetical proteins.
Some databases provide network-based E. coli gene function predictions, such as EcoliNet. An example here is the Darkred module, where all module genes should be associated with flagella. The 11 genes with unknown function were submitted to EcoliNet for function prediction, which found that 9 of these genes are predicted to be associated with flagella (Table S8). These results show that our analysis can expand the candidate gene list for flagella.

Discussion
WGCNA has been extensively applied for gene co-expression network construction in many species. The application in human brain transcriptome analysis revealed co-expressed modules in different brain regions [16]. In plant research, WGCNA identified cell-type specific and endoderm differentiation-associated gene co-expression modules [17]. Here, we analyzed 1391 E. coli microarrays and identified 24 modules with biological functional annotations. As E. coli is a simple organism, the results are easy to interpret. Traditional microarray analysis methods compare the control and treatment group, or multiple groups by ANOVA. When the data contains enormous amounts of samples, it is hard to analyze. WGCNA helps to reduce data dimension/complexity to only several modules that are complementary to the traditional analysis method. WGCNA does not require prior knowledge, so it may provide information for genome annotation.
E. coli biological functions can be categorized into four types: metabolism, ion transport, behavior, and pathogenesis. Metabolism modules include amino acid metabolism (Brown4, Darkturquoise, Magenta), carbohydrate metabolism (Black), lipopolysaccharide biosynthesis (Darkgreen, Darkgrey), sulfur metabolism (Darkolivegreen, Orangered4), and glycerol metabolism (Paleturquoise). Ion transport modules are Ivory and Grey60. The behavior module is Darkred. The pathogenesis modules are Blue, Darkmagenta, and Darkorange. Each module has a different function, suggesting the robustness of WGCNA.
Gene function prediction by co-expression approaches is based on the similarity of gene expression profiles across multiple samples or conditions. More data would help to increase the predictive power [18]. Therefore, all the data under various conditions were pooled to obtain more general module results. The advantages of our study is that it is the largest sample size to date. WGCNA assumes that groups of genes that encode proteins primarily or exclusively and function Cells 2018, 7, 19 8 of 10 together in a pathway or molecular complex will be coordinately transcribed. The modules may help to annotate genes with unknown function. The Darkred module example shows that our results are comparable to those previously reported.
Further identification of hub genes may help to reveal their important role in modules. For instance, the Darkorange SepD hub gene may play roles in E. coli pathogenesis [19]; Black Z4148 is annotated as a hypothetical gene that might participate in carbohydrate metabolism, inferred from module function. Our results provide information for gene function and experimental validation. Lastly, the module-based analysis has revealed potential housekeeping and stress response modules by gene expression variation. Individual genes with high intermodule gene connectivity may suggest a role as module coordinator. For example, Darkorange may sense signals from the environment and play roles in intracellular signaling. In sum, our analysis is the first to construct an E. coli gene co-expression network and provides clues for candidate genes suitable for experimental validation.

Data Preparation
A total of 165 GEO Series (GSE) datasets from E. coli were downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (GEO) database [1]. These datasets include 1391 GSM files, representing 1391 biological samples. Details about these experiments and samples are provided in Tables S1 and S2. To improve the reproducibility and simplicity of the data analysis, only Affymetrix E. coli Genome 2.0 Array data were processed. Raw gene chip data were analyzed by Expression Console (Version 1.2) by MAS5.0 method. Probe-level gene expression data were retrieved.

Weighted Gene Co-Expression Network Analysis (WGCNA)
The WGCNA package was run on R (Version 2.3.2) to construct a gene co-expression network and identify modules with the following parameters: networkType = 'signed', softPower = 12, minModuleSize = 30, deepSplit = 4. Briefly, a weighted correlation network was created by calculating the correlation coefficients with the power β [20]. The β = 12 was chosen as a saturation level for a soft threshold of the correlation matrix based on the criterion of approximate scale-free topology. The weighted network was transformed into a network of topological overlap (TO), an advanced co-expression measurement that considers not only the correlation of two genes with each other but also the extent of their shared correlations across the weighted network [20]. The TO matrix was then used to group highly co-expressed genes by hierarchical clustering. Finally, a dynamic tree cut algorithm was used to cut the hierarchical clustering tree, and modules were defined as the branches resulting from this tree cutting [21]. Each module was summarized using singular value decomposition so that each module eigengene (ME) represents the first principal component of the module expression profiles [20]. Thus, ME explains the maximum amount of variation of the module expression levels and is considered the most representative gene expression in a module.
Module stability was tested as the average correlation between the original connectivity and the connectivity from 1391 samples that were randomly sampled 1000 times. The process was run for every module.

Module-Based Qualitative and Quantitative analysis
Genes in each of the identified co-expressed modules were annotated by the database for annotation, visualization and integrated discovery (DAVID 6.7) [22]. In DAVID, the term enrichment was defined as the Benjamini-adjusted Fisher exact test p-value. To visualize the whole network or specific module, the WGCNA exportNetworkToCytoscape function and Cytoscape tool [23] were used.
For gene expression variation analysis, the relative standard deviation for each gene in a module was calculated and the average values for each module were visualized. For connectivity analysis, the intermodule connectivity/total connectivity value was calculated for every gene and then the average value for each module was visualized.

Comparison of Gene Prediction Using Published Data
The E. coli network constructed by Trevino [6] and the EcoliNet gene-prediction tool [10] were used to compare candidate gene lists. For direct comparison, we overlapped the modules with the same biological function. Genes with unknown function in Darkred module were submitted to EcoliNet [10] for prediction.
Supplementary Materials: Supplementary materials are available online www.mdpi.com/2073-4409/7/3/19/s1. Table S1: E. coli datasets used in our study; Table S2: 1391 E. coli samples' characterization; Table S3: Samples that have the highest or lowest ME value in a specific module; Table S4: Probes modular assignment, connectivity, and their functional annotation; Table S5: Comparison of seven communities with GO annotations identified in the Trevino study at fmin = 4; Table S6: Comparison of ribosome modular genes identified by the Trevino study and our study; Table S7: Overlap of flagellum modular genes identified by the Trevino study and our study; Table S8: 11 genes with unknown function in Darkred module were submitted to EcoliNet for function prediction.