Symmetry in the Language of Gene Expression: a Survey of Gene Promoter Networks in Multiple Bacterial Species and Non-σ Regulons

The language of gene expression displays topological symmetry. An important step during gene expression is the binding of transcriptional proteins to DNA promoters adjacent to a gene. Some proteins bind to many promoters in a genome, defining a regulon of genes wherein each promoter might vary in DNA sequence relative to the average consensus. Here we examine the linguistic organization of gene promoter networks, wherein each node in the network represents a promoter and links between nodes represent the extent of base pair-sharing. Prior work revealed a fractal nucleus in several σ-factor regulons from Escherichia coli. We extend these findings to show fractal nuclei in gene promoter networks from three bacterial species, E. coli, Bacillus subtilis, and Pseudomonas aeruginosa. We surveyed several non-σ transcription factors from these species and found that many contain a nucleus that is both visually and numerically fractal. Promoter footprint size scaled as a negative power-law with both information entropy and fractal dimension, while the latter two parameters scaled positively and linearly. The fractal dimension of the diffuse networks (d B = ~1.7) was close to that expected of a diffusion limited aggregation process, confirming prior predictions as to a possible mechanism for development of this structure.


Introduction
Genomes display symmetry on several levels of organization.On the most basic level, the DNA double helix is symmetric about its central axis [1].There is also self-similarity distributed along the length of a DNA molecule as 1/f correlations and power-law scaling in nucleotide base composition and in the abundance of genome components including gene types [2][3][4][5][6][7].Several studies now show a fractal organization of chromatin [8][9][10] and fractal folding and compaction of whole genomes [11].In medicine, the fractal dimension of nuclear chromatin has been examined for the prognosis of cancer [12][13][14].
While many of these examples of symmetry derive from studies of structural genomics, others draw from functional genomics-cases in which genome behavior is symmetric as revealed by Gene Regulatory Networks (GRNs; [15][16][17]).In a GRN, nodes represent genes and arcs or arrows represent regulatory relationships between genes.Typically there are a few highly connected genes, or global regulators, that organize much of the transcription in a genome; most regulators control only one or a few genes.This highly skewed distribution of control linkages across genes tends to follow a power-law relationship [18,19], showing that there is symmetry in the rate at which regulatory work is partitioned across the genome.
These fractal genetic systems are embedded within other levels of self-similarity in living systems.Many cellular interaction networks also exhibit scale-free architectures including protein-protein interaction and metabolic networks [18,[20][21][22][23].There is fractal structure found at the level of tissue organization [24], as well as whole-organism physiology [25] and development [26], and there are patterns of allometric scaling that explain biological pattern ranging from biochemistry to ecosystem organization [27][28][29].
In the present study we consider symmetry at the interface between structural and functional genomics-at the DNA-protein binding event that initiates transcription.A critical stage of gene expression is the binding of RNA polymerase and its helper proteins during transcription [30].Typically there is a promoter region slightly upstream of the transcription start site that is recognized by one or more regulatory proteins, or transcription factors; these proteins aid in the positioning and activation of RNA polymerase so that binding is both efficient and specific [31][32][33].Many of the regulatory proteins are dimers and bind to symmetric palindromic sequences in the promoter region [30,34].Some transcription factors bind to more than one place in the genome and some bind to hundreds-the latter being an example of a global regulator.These global regulators define a regulon of genes that may be turned on or off in concert in response to a given cellular need [33,35].In bacteria, the sigma factors (σ) are important global regulators that mediate responses such as heat shock and nitrogen response.
Gene Promoter Networks (GPNs)-as opposed to Gene Regulatory Networks (GRNs)-are systems-level representations of the DNA-based language used to initiate gene expression [36].More literally, a GPN is a network-based rendering of base pair-sharing among the promoter elements within a genome, typically promoters within a regulon.Consider the hypothetical GPN comprised of nodes A, B, and C. Each node represents a promoter sequence bound by the same DNA-binding protein.Links or edges form between nodes (promoters) in the GPN based on the number of base pairs shared by promoter pairs.For example, given promoters A (AATA), B (AATT), and C (GCTA), there would be the following weighted edges formed between node pairs: A -B (w = 3 base pairs), A -C (w = 2 bp), and B -C (w = 1 bp)(see Supplementary Figures).
GPN analysis was first applied to -factor regulons in the bacterium Escherichia coli [36].The promoters in these GPNs displayed considerable sequence variation and were not well-represented by a single consensus motif.Given this, and that the regulons contained numerous promoters, the full GPNs were exceedingly dense and contained numerous weak links representing the sharing of few bases between promoters.In order to detect fine topological structure it was necessary to apply a thresholding method, removing a subset of the edges and retaining only those within a certain interval.Such thresholding is standard procedure in studies ranging from protein interaction networks [37] to neurological networks in the brain [38] and social networks [39].The thresholding of the GPNs at the phase transition (break up of the giant component) revealed nuclei of high-weight edges that represented high bp-sharing among promoters.These GPN nuclei displayed a strong fractal structure [36].In general, fractal structures display a self-similar symmetry across spatial scales [40], and the presence of a fractal core in the GNP suggested a self-organizing complexity interfacing the evolution of genome regulatory elements and the grammar of transcriptional regulation.
More recent work on GPNs [41] involved in silico simulations intended to identify mechanisms that could produce a fractal nucleus.Evolutionary factors that contributed to the development of fractal topology in modeled GPNs included promoter duplication and attractive forces arising from DNA-protein binding chemistry; repulsive forces due to binding chemistry were ruled out as unimportant under the conditions examined.However, random evolution of the promoter set introduces an intrinsic repulsion among promoters-in that most random promoters differ considerably in sequence composition from any optimal consensus promoter.These findings were notable since it is thought that repulsion is a critical causal agent in the development of fractality in most networks [42].Collectively these patterns supported a weak version of the diffusion limited aggregation model first posed to explain the fractal nuclei in GPNs [36], positing that fractality arises by the random accretion of promoters around the periphery of a GPN, and repulsive forces increase the fractal signal.
These studies of GPNs prompt certain questions about the evolution of gene regulation.Why is it that a given protein regulator binds a multiplicity of promoter sequences in large regulons as opposed to one or a few sequences presenting an optimal binding chemistry?A partial answer is that suboptimal binding events present the opportunity to fine-tune gene expression using additional transcription factors that aid the binding, transcription factors that may be responsive to other conditions in the cell [30].But is this selective advantage of flexibility the driving force of promoter evolution, or simply a response to a stochastically driven promoter evolution in which promoters blink on and off all across the genome?While it is likely that promoters arise by either random mutation or duplication events, it is unclear to what extent natural selection directly controls the frequency of promoters in a population or species.Instead of a presence/absence of a simple consensus promoter, it is likely we must consider the extended promoter phenotype that must be evaluated in the context of whether or not the cell can form an adequate transcription factor set that will facilitate RNA polymerase function.This loosens the selective forces on the promoter bases, and allows a more diffuse and random process of accretion into a GPN, compared to natural selection rigidly maintaining base composition near an optimal consensus sequence.This plasticity of the language controlling gene expression, a lack of simple 1:1 mappings, is also found in human languages.While linguists [43] have held for some time that language must be functional but not overly rigid, recent studies [44,45] suggest that human languages emerge abruptly at the phase transition between referentially useless systems and overly indexed systems.Here the explanation entails consideration of the balance between the opposing interests of the listener and the speaker in a communicative exchange.The listener attempts to get as much information as possible, which is best achieved when there is a 1:1 mapping between words and objects.The approach employs mutual information, a metric from Shannon's information theory [46], to measure this mapping.By contrast, the speaker seeks to minimize his effort in forming sentences, which is more difficult to do when there are many words to choose from.Here, information entropy can be used to quantify the lexical diversity.Large studies of semantic networks [47] support this view of language evolution by showing that synonym sets display some of the topological features we have found in bacterial gene regulatory systems.
Our current work tests the generality of prior findings on fractality in GPNs, and evaluates pattern within the context of information theory.Specifically we address the following: (a) Do other regulons (other than -factors) contain fractal nuclei in their GPNs?(b) Do other species (other than E. coli) contain fractal nuclei in their GPNs?(c) Is there a quantitative relationship between information entropy of the promoter signals in a regulon and the fractal dimension of a GPN nucleus?Do these factors co-vary with promoter footprint size?(d) In addition to addressing these questions, we offer greater detail in the methods used to study GPNs in this fashion (see Supplementary Information).

Predicted Promoters
Promoter sequences were obtained from Virtual Footprint [48,49] which offers promoter prediction tools that interface with several prokaryotic genome libraries including Prodoric [48] and RegulonDB [50].We used the Regulon Analysis option, downloading several non-σ factor predictions for Escherichia coli (strain K12), Bacillus subtilis (strain 168), and Pseudomonas aeruginosa (strain ATCC 15692/PAO1).Default settings were used, though sensitivities were varied (0.5-1.0).The single pattern option (rather than bipartite) was used yielding single block promoter footprints (~10-20 bp) without a spacer.Alignments used were as provided by Virtual Footprint.Regulons were sampled haphazardly with a preference for those that included >100 genes and exhibited some type of visually non-random structure.

GPNs and Thresholding
Perl script was used to extract and process promoter sequence information from the Virtual Footprint flat files.Pairwise similarities between promoter sequences i and j (A ij ) were evaluated as the number of bp shared.These weighted edge values were used to form the adjacency matrix, A. A network or graph G was generated based on the matrix A (see Supplementary Figure 1).Networks were visualized using Pajek [51,52], a program used to work with large networks.Projections were rendered with the Kamada and Kawai [53] algorithm.
Thresholding was achieved through serial x-sections [39] in which only edges with value x are retained, all others are removed (see Supplementary Figure 1).For our purposes x was a positive integer no larger than F (footprint size).Separate subgraphs (G') were formed from the total graph (G) based on a sliding x-threshold (see Supplementary Figure 2).In each subgraph, the largest connected component was extracted from G and evaluated for the number of nodes (graph size) and number of edges.The largest (maximal) connected component (LCC) is the subgraph G' containing all nodes still interconnected in the largest group after removal of edges not meeting the threshold criterion.An LCC is a giant component when it contains at least half the nodes present in the full graph G.Note that for these regulons taking m-slices (retaining edges above a certain threshold) at the upper phase transition returned comparable LCC sizes (node counts) as did x-sections, so we somewhat arbitrarily chose to report x-sections.
We report the LCC at the upper phase transition of these serial extractions.The phase transition in graph size occurs around the threshold that fractures the giant component into numerous smaller connected components (see Supplementary Figure 2).

Assessment of Fractal Structure
Python script was written to implement the renormalization method of Song et al. [54,55] and evaluate the fractal dimension of the LCCs extracted from the GPNs.The program utilized NetworkX [56,57], an open source Python package for the analysis of complex networks.The renormalization method is a graph coloring exercise founded on the traditional box-covering method of fractal measurement.In brief, for a given box length (l B ), or shortest path length between nodes, each node is colored in a fashion such that neighbors of like color are no further away than the current box length.Then the network is renormalized by collapsing adjacent nodes into a single node if they share the same color (see Supplementary Figure 3).This enforces the graph coloring rule that no two adjacent nodes can share the same color.The value N B then gives the minimum number of boxes of length l B required to cover the graph of N B nodes, and is equal to the graph size (node count) following renormalization.Considering a range of box lengths, a plot of l B versus N B on a log-log scale will be linear for networks with a fractal topology (see Supplementary Figure 3).On a normalized series of graphs with minimum size N, the fractal dimension d B is obtained from linear regression of the log-log transformation of the general scaling relation:

~
The coefficient of determination (R 2 d ) was used to assess the fit of the renormalization data to this fractal model.

Information Entropy
Base conservation was evaluated using Shannon's information entropy [36] as described for DNA sequences [58,59]: where H(l) is the entropy or the amount of uncertainty regarding base composition (b: A, C, G, T) at position l in the promoter sequence.The frequency of each base at position l is denoted by p b,l and is calculated across the set of promoter sequences within a GPN or regulon.R Sequence represents the amount of information present at position l, and is calculated as the observed uncertainty minus the maximum possible uncertainty (2) under a 2-bit system: 2 In our analyses, the value I RMean denotes the arithmetic mean of R Sequence (l) across the l positions in the promoter footprint.We used the entropy calculator from the HCV database website [60] maintained by the Los Alamos National Laboratory to obtain summary estimates of R Sequence for each regulon, and WebLogo [61] to obtain figures of the sequence logos [59].

Footprint Size, Information Entropy, and Fractal Dimension
We explored the correspondence between footprint size (F), information entropy (I RMean ), and fractal dimension (d B ) using regression.MATLAB (The MathWorks Inc., Natick, MA, USA) was used to fit the linear model Y = βX + A + ε in which x and y were the independent and dependent variables, respectively.All three pair-wise relationships were explored, separately, given the three parameters F, I RMean , and d B .We examined relationships in which neither variable was transformed and in which both variables were log 10 -transformed; a linear fit in the latter circumstance indicates a power-law relationship, and this type of association has been found among numerous genome properties [4].The fit was judged by the coefficient of determination (R 2 ), and the corresponding correlation coefficient (r) is also reported.

Visual Pattern
Symmetric patterns were visually evident at upper phase transitions in several of the GPNs studied (Figure 1).There were other GPNs that showed no obvious visual structure and we do not consider these any further in the present study.The type of symmetry varied across GPNs.Several of the large networks (containing many nodes) displayed radial symmetry around an open center (e.g., Figure 1A, K).Others, typically smaller GPNs, displayed bilateral symmetry along a linear axis (e.g., Figure 1L, M), while some showed a combination of bilateral and radial symmetry (e.g., Figure 1I, N).A few of the GPNs were very dense with numerous links between closely related node pairs (e.g., Figure 1G, J).  1 for more details.

Fractal Dimensions
Most of the GPN x-sections taken from phase transitions exhibited a strong self-similarity as measured by the method of Song et al. [55], as seen in the fractal dimension (d B , Table 1).The average fractal dimension was d B = 2.118.The lowest observed was d B = 1.534 (Figure 1M) for a long linear symmetry; the highest observed was d B = 3.415 (Figure 1J) for a highly dense network.The fit of the fractal relationship was generally high for these (R 2 d = 0.906-0.978).

Table 1.
Promoter prediction settings for Virtual Footprint downloads along with footprint size, information entropy, and the outcome of fractal analyses.[48] (Prod, Prodoric [48]; Reg, RegulonDB [50]); S, sensitivity setting on Virtual Footprint; X, x-section critical value for thresholding; for example, X = 17 implies that all edges were removed from the network except those of weight 17 bp shared between promoter pairs; F, footprint size of promoter in base pairs; I RMean , arithmetic mean across base positions of the position-specific information (R Sequence ) for a regulon of promoters; d B , fractal dimension of GPN for upper phase transition x-section as calculated by method of Song et al. [55]; R 2 d , coefficient of determination for regression of log-log transformed plot of l B versus N B .
The promoter sequences were highly variable and departed from the consensus sequence considerably in many but not all regulons.Sequence-level information, I RMean , ranged between 0.927 (Fur) and 1.557 (−35 box of AlgU).

Scaling of Footprint Size, Information (I RMean ), and Fractal Dimension
Fractal dimension (d B ) scaled negatively with footprint size (F) (Figures 2, 3; Table 2).The power-law relationship (log-log transformation) was strongest (R 2 = 0.569) and significant (P = 0.002).The relation can be seen in Figure 2 in that the sequences near the base are generally longer than near the top.The smallest footprints ranged in size from 9-13 bp and displayed the highest fractal dimensions (group mean d B = 2.937), and these were topologically densest, containing numerous edges between very similar promoters (Figure 1).This contrasted with the larger footprint GPNs which generally had lower fractal dimensions (group mean d B = 1.790) and a greater likelihood of visually evident bilateral or radial symmetry.F, footprint size of promoter in base pairs; d B , fractal dimension of GPN for upper phase transition x-section; I RMean , average Shannon's Information index based on measures of sequence entropy; β, slope of regression line (power-law coefficient for log-log scaling); A, intercept of regression line; R 2 , coefficient of determination for regression; r, correlation coefficient; F stat , F statistic for regression; P, P-value for regression.
Information entropy (I RMean ) also scaled negatively with footprint size (Figure 2, 3; Table 2).Again, the power-law relationship gave the best and significant fit (R 2 = 0.795; P = 0.001).The relationship can be seen in Figure 2 whereby the smaller promoter motifs tend to have greater sequence conservation overall.
Information entropy scaled positively, not negatively, with fractal dimension (d B ).Moreover, the scaling was best described by a simple linear model (R 2 = 0.734; P = 0.001), not the log-log transformed model (Figure 3, Table 2).

Discussion
Our findings demonstrate that a fractal symmetry is present in the nucleus of GPNs of bacteria including but not limited to E. coli.We have expanded the scope of the first study of E. coli GPNs [36] to include two other species, B. subtilis and P. aeruginosa.This taxonomic coverage spans considerable diversity within the bacteria.Bacillus represents the phylum Firmicutes whereas both Escherichia and Pseudomonas are in the Proteobacteria, each representing orders Enterobacteriales and Pseudomonadales, respectively [62].
Our results also show that fractal organization is not limited to the σ-factors.Several, though not all, of the GPNs we examined displayed strong visual self-similar structure including bilateral and radial symmetry, and these displayed a good fit to the power-law relation expected for fractal networks [54,55].It should be noted, though, that ours was not a quantitative survey across all regulons available through the Virtual Footprint database.A principle challenge to estimating the frequency of fractal nuclei relates to the fact that the algorithm used by the Virtual Footprint site to detect promoters is adjustable by sensitivity.This is a desirable property given the complexity of transcription [33,63], but it leads to a variety of possible GPN outcomes for any given regulon defined by a DNA-binding protein.Thus, we took it as our task simply to determine if fractal nuclei are present beyond the domain of the original study [36], outside of the E. coli σ-factor system.
On a very general level, fewer than half of the regulons considered showed visually obvious fractal structure in the LCC taken from the upper phase transition of their GPN (we have not shown all of them here).Yet, it is not clear that different processes are necessarily at play in the fractal versus non-fractal GPNs since many of the regulons lacking an obvious visual symmetry contained fairly few promoters to begin with, and network size poses an inherent constraint on fractal complexity.Our results are in keeping with the first study [36] which was small in scope but found that several but not all (3 of 4) of the -factor regulons had a fractal nucleus.
The data on the fourteen fractal nuclei revealed several scaling relationships which offer insights into dynamics that might yield symmetry in the grammar of transcription.Although further research is warranted, some reasonable patterns emerged.Footprint sizes of the DNA-binding proteins scaled negatively as a power-law with both fractal dimension and information, and the latter two attributes scaled directly and positively.
The scaling of footprint with information shows that small footprints exhibit more sequence conservation and so are more likely to be uniformly constrained by natural selection.Larger footprints often contain spacer regions in which DNA base composition varies more widely across promoters compared to the distal portions of the footprint.Indeed, the promoter data used in the first study of σ-factors [36] had been provided by RegulonDB [50] as a two block footprint with spacer size information.Putative spacer regions are evident in sequence logos of several of the larger regulons shown in Figure 2. Bases in these spacer regions might be constrained by natural selection though perhaps under diversifying selection in which some promoters within a regulon demand one base at a given position while other promoters within the regulon require another base.It also is likely that some of these bases are free to vary and are neutral to selection.Given that the DNA double helix undergoes a complete turn in roughly ten base pairs [1], it is noteworthy that many of the conserved sites in Figure 2 are separated by spacers of ten or twenty bases (one or two full turns around the double helix), suggesting that the binding sites occur on the same side of the DNA.Such patterns have been noted in other DNA-protein binding studies [58,64,65] and argue that the spacer positions may be less accessible to the binding of the protein.
The fractal dimensions of the GPN nuclei had a significant negative power-law correspondence with footprint size and a simple linear scaling with information.Before addressing these relationships it is perhaps useful to consider in greater detail what fractal dimension means in this context.
In general, fractal dimension gives a quantitative measure of self-similarity, that is, how many smaller parts are revealed as one rescales the graininess or magnification with which an object is viewed [66].In the method we used to appraise fractal dimension [55], the fractal scaling coefficient (d B ) implies the rate at which the network changes in size (log 10 of the number of nodes) with each change of box length (log 10 of the relevant scale in number of edges).Thus a GPN with d B = 3.0 (i.e., line with slope = −3.0)drops in size much more quickly as box length is increased compared to one with d B = 2.0.Those with d B = 3.0 (e.g., Figure 1G and J) contain dense groups of highly related promoters, and on renormalization the size of the groups changes rapidly whereas change on renormalization is more gradual for the GPNs with d B = 2.0 (e.g., Figure 1A).
The fractal dimension gives us some indication of how rapidly mappings amplify or condense at the phase transition x-section.We focused on the x-section positioned at the phase transition which is the part of the sectional series at which the giant component experiences the greatest change in size (number of nodes).If there is a correspondence between human language and cellular DNA-based communication, this phase transition represents the point at which mappings between symbols and objects changes from overly general (many diffuse connections) to overly specific (few but specific connections).
Promoters with small footprints are generally highly conserved in their sequence and their GPNs scale rapidly under renormalization during the fractal analysis.This is because there are numerous connections amongst highly related promoters in these higher x-sections, and renormalizing across larger and larger scales (number of edge steps between promoter nodes in a GPN) serves to collapse the graph rapidly (see Supplementary Figures) resulting in a higher scaling exponent.For regulons involving promoters with a larger footprint, these tend to be less conserved in their sequence.Thus each promoter node is connected to fewer other nodes of high sequence similarity in these higher x-sections, and renormalization collapses the network more slowly, leading to a smaller scaling exponent; these GPNs are more diffuse and spread out, yet fractal nonetheless.

Conclusion
The fractal symmetry of these bacterial GPNs supports the view that natural communicative systems emerge as a self-organizing form of complexity.For some time it has been known that DNA contains a self-similarity [3,5] comparable to Zipf's scaling law first observed in human textual corpora [67].And there is continued interest in understanding functional aspects of self-organizing complexity as it pertains to behavior in the cell and the genome [68][69][70][71].In the present study we have probed the generality of the observation that there exists a symmetric interface between the signals used to initiate gene expression.The GPN permits visualization of this grammatical context of transcription events within a regulon.Linguists have found with human language that words are embedded within a wider context of usages which support and give form to meanings [43,44], and in bacterial gene regulation this contextual framework is also present, symmetric, and self-organizing.

Figure 3 .
Figure 3. Relationships between footprint size, information entropy, and fractal dimension across fourteen nuclei of GPNs from three bacterial species.

Table 2 .
Relationships between footprint size, information (I RMean ), and fractal dimension across nuclei of fourteen GPNs from three bacterial species based on linear regression of original data and log-log transformations of data.