Recent Advances in the Computational Discovery of Transcription Factor Binding Sites

The discovery of gene regulatory elements requires the synergism between computational and experimental techniques in order to reveal the underlying regulatory mechanisms that drive gene expression in response to external cues and signals. Utilizing the large amount of high-throughput experimental data, constantly growing in recent years, researchers have attempted to decipher the patterns which are hidden in the genomic sequences. These patterns, called motifs, are potential binding sites to transcription factors which are hypothesized to be the main regulators of the transcription process. Consequently, precise detection of these elements is required and thus a large number of computational approaches have been developed to support the de novo identification of TFBSs. Even though novel approaches are continuously proposed and almost all have reported some success in yeast and other lower organisms, in higher organisms the problem still remains a challenge. In this paper, we therefore review the recent developments in computational methods for transcription factor binding site prediction. We start with a brief review of the basic approaches for binding site representation and promoter identification, then discuss the techniques to locate physical TFBSs, identify functional binding sites using orthologous information, and infer functional TFBSs within some context defined by additional prior knowledge. Finally, we briefly explore the opportunities for expanding these approaches towards the computational identification of transcriptional regulatory networks.


Introduction
The gene is the fundamental unit on the genomic DNA which contains the required information to carry out the biological functions of cells.The expression of genes i.e. mRNA synthesis can be measured efficiently in a high-throughput fashion and such expression patterns are characteristic of cellular responses to external stimuli [1].It is widely accepted that these responses are mainly driven by the interactions between transcription factors (TFs) and their corresponding transcription factor binding sites (TFBSs) on the proximal promoters of the target genes [2,3].However, with a large number of genes in eukaryotic genomes, deciphering how these interactions evolve to control the expression of tens of thousands of genes (~ 35,000 genes in human) remains an open question.Recent studies [4] have shown that the underlying regulatory mechanisms are complex, dynamic (especially in higher organisms) and can be arranged in multiple hierarchical levels such as the sequence, the chromatin, and the nuclear level.
The sequence level, also the best-studied level of gene regulation, is characterized by the linear organization of transcription units and cis-regulatory elements considered as the regulatory code which governs gene expression.These cis-regulatory elements i.e. binding sites which are more important when found on the proximal promoters form a highly flexible and context-dependent structure [5] for each gene [6][7][8].Furthermore, in eukaryotic cells genomic DNA is 'packed' into an efficient structure, called chromatin, composed of nucleosomes that consist of approximately 147bp of DNA wrapped around a protein octamer [9,10].This structure not only packs DNA but also creates an added layer of gene regulation which ensures correct gene expression and accessibility to DNAdependent processes e.g.gene transcription, DNA repair, and DNA replication.The overall process of the transcription process encompassing the nuclear architecture and/or the complex spatial arrangement of genes, gene clusters, chromatin, and regulatory DNA elements [11,12] is far beyond the scope of any single review and hence we only focus on the sequence level aiming at discovering cis-regulatory elements on the proximal promoters.
Two of the most important functional elements in gene regulation are transcription factors and their binding sites on the promoters of their target genes.A TF is a protein which binds to specific DNA binding motifs that can be present multiple times on the same promoter of a gene or on different promoters of different genes.The transcription factor binding sites where a TF binds are usually short (5 -15bp) and degenerate but highly selective through evolution [13].A gene can have multiple alternative promoters [14,15] and each promoter frequently contains a large number of binding sites (10 -50 binding sites) for 5 -15 different TFs [16].Therefore, a more comprehensive understanding of these elements and their interactions will provide a deeper understanding of the regulatory pathways within cells and potential functions of individual genes and/or gene clusters [17].
Although various approaches have been developed, we are still limited by both experimental and computational techniques in order to detect these binding sites and understand their interplay with corresponding TFs as well as their role in the transcription process.However, recent high-throughput technologies which identify high-affinity binding sequences e.g.ChIP-chip [18,19], SELEX [20,21] revealed the genomic regions to which a particular TF is bound, providing a powerful resource for discovering binding sites of transcription factors.Even though the information collected through such methods is substantial, how to best annotate and interpret such a large volume of data and how the data can be explored to predict novel binding sites are still open issues.
Traditionally, one has attempted to extract a set of promoters of a set of genes similar in function or expression pattern using a fixed-size length of the promoter sequence and then search for statistically overrepresented subsequences, also called motifs and considered as TFBSs if they match with some known TF profiles i.e. similar to known binding sites of a TF.A large number of tools have been developed using a variety of algorithmic approaches and underlying models.The relative advantages and disadvantages of each approach is making selection among them as well as developing novel tools become a non-trivial task.Consequently, one tried to build up testing datasets to measure their performance as well as identify the strengths and limitations [22][23][24].Although there remain a number of difficulties in constructing the testing datasets and the accuracy measurements, a general view from these benchmarks is that in silico predictions is still lack of corresponding to in vivo experiments.The main reason is that binding motifs are short, degenerate and contain lack of information i.e. encoded by only four types of characters (A, T, C, and G), leading to the fact that most binding sites are found as random hits throughout the genomic DNA.Additionally, regulatory elements are not randomly distributed; they tend to form clusters with a particular structure, cis-regulatory modules [5,25].Therefore, recent studies have tried to combine with additional information such as gene expression, gene annotation, phylogenetic footprinting and/or search for composite motifs instead of single motifs to increase the sensitivity of the methods [25][26][27][28].
A number of excellent reviews have addressed a variety of critical issues.Typically, Brazma et al. [29] classified motif discovery methods following the motif model (deterministic or statistical, pattern driven or sequence driven), the scoring function, and the search strategy.Pavesi et al. [30] provided a very comprehensive discussion about different algorithmic methods and approaches to the problem (similarly in Wasserman [31] and Bulyk [17]).And then since individual binding sites are lack of information for algorithmic methods and recent advances have moved to model the regulatory regions or combine with other biological lines of evidence, Sandve et al. [16] proposed an integrated framework to divide the trend following the description of motif discovering models such as single motifs, composite motifs (cis-regulatory modules), gene level -how several modules interact together to regulate a gene, and genome level -how several classes of modules interact together to regulate a set of genes.Alternatively, with the idea of exploring the interdependence between computational and experimental techniques in this aspect, Elnitski et al. [32] made a summary on the synergism between in silico, in vitro and in vivo identification of transcription factor binding sites.And later, Das et al. [26] surveyed again different approaches combining with other biological evidence in motif discovery.Therefore, in this review we would like to concentrate on the developing strategies that detect physical TFBSs on the proximal promoters of target genes and some promising preliminary results on identifying functional-relevant binding sites.The remainder of this manuscript will discuss issues related to the representation of binding motifs, promoter identification and then review the basic approaches for the identification of TFBSs.Finally, we briefly explore the possibility to infer transcriptional regulatory networks under the aspect of promoter analysis.

Binding site representation
Assuming a list of DNA binding sites for some TF is available, one of the very first questions is how to best represent and characterize the information contained in these sites for further analysis.The goal is to find a representation that matches as closely as possible all the binding sites in the collection and is clearly distinguished from the background.From the point of view of string processing, a simple and widely-used concept is the consensus sequence in which the most frequent character at each position is chosen to represent in the binding motif at that position.However, some positions might consist of characters of equivalent frequencies and thus a more complex pattern, the IUPAC sequence [33,34] was used to characterize the diversity of those binding sites (Figure 1a).Although this representation works well for highly conserved and short binding motifs, it is defined somewhat arbitrarily and removes much of the information in the original set of binding sites.In a case for yeast TF ABF1, for instance, two IUPAC sequences (RTCRYYNNNNACG or RTCRYNNNNNACG) have been published and used as a relatively precise description of ABF1 binding sites [35].However, these representations failed to recognize the binding site SCPK01 on PYK1 promoter from position -610 to -598 which was showed to be bound by TF ABF1 experimentally [36].Consequently, a more precise representation was proposed to utilize almost all binding site information, known as the nucleotide distribution matrix or position weight matrix (PWM) [35,37,38], which has been proven very successful in various problems in DNA and protein sequence analysis [35,39].The PWM is a matrix of scores (e.g.occurrences, frequencies) with four rows corresponding to four DNA bases and m columns, each of which is a position in the binding motif.The basic assumption of the PWM is that the base-pairs at different positions are statistically independent and thus the fitness score of a matched oligonucleotide 'p' with this profile is the sum of the fitness at each position.This representation reflects the extent to which a position is conserved within the binding motifs and thus the higher the similarity, the higher the fitness is.
The main weakness of the PWM approach stems from the assumption is that the positions contribute independently and additively to the total activity of the binding site.However, position dependence may exist on the binding sites and has been experimentally and/or statistically verified in some cases [40].For example, using a new quantitative multiple fluorescence relative affinity assay Man et al. [41] showed that position 16 and 17 on the operator DNA were not independent in the interactions with its TF, Salmonella bacteriophage repressor Mnt; or in another case, when Ellrott et al. [42] applied χ 2 test on the 71 binding sites of TF hepatocyte nuclear factor 4α HNF4α, a significant dependence was found between several pairs of positions e.g.position 4 and 8, 4 and 11.Therefore, more comprehensive representations were introduced to capture the potential dependence between positions in binding sites, such as maximal dependence decomposition [43], hidden Markov model [44,45], Markov chain optimization [42], as well as a more flexible approach based on variable-order Bayesian network which combines PWM, Markov models and Bayesian network model to fit with each particular subset of binding sites of a TF [46].
However, despite the limitations of the basic PWM approach, it is still the leading model in the search for discovering potential TFBSs.In fact, besides its intuitive representation and fast computation, it has been shown to be comparable at least, and in some case outperforms, other more complicated models e.g.fixed-order Markov models that are usually over-fitted due to a limited .Top-left is the collection of binding sites, each of which is called an oligo or conserved sequence; oligos can be aligned with gaps to maximize the motif content but in this case, it is a gap-free alignment.Several models have been displayed and lastly an advance model of PWM is presented [47]; the normalized formula is inferred from the original equation to ensure the rule that the fitness score of a matched oligo can be estimated by taking the sum of the fitness at each position.The 'bold' part is the core region of the binding sites i.e. the most conserved region in the binding motif model.Bottom-right is the sequence logo that can quickly visualize the specificity of the conserved information in each column.(b) A brief look on the history of binding motif models.Starting from the first simple representation, consensus sequences, one has developed more advance models to characterize the binding motifs of TFs.However, due to the nature of the binding sites e.g.short, degenerate, etc., the problem has become a challenge and the proposed strategies have been modified when applied to higher eukaryotes e.g.search for composite motifs (a set of TFBSs) instead of single motifs, combining additional lines of biological evidence in detecting TFBSs (phylogeny, co-expression, and/or co-function).
Raw PWM (Position Frequency Matrix) The score in each cell is normalized by: training data [46].Therefore, emphasis has been given to strategies that optimize the PWM instead of building more complicated models.For example, the scores in the cells of the matrix can be transformed to improve the specificity of the binding motif model (e.g.convert frequencies to probabilities, adding pseudo-count, taking logarithms, etc. [30,37]) and the binding sites can be aligned before creating the PWM [35].In some cases, the information content (IC) of the PWM, or some similar form, is made use to select a suitable number of binding sites for creating the binding motif model [30,47,48]; Additionally, other significant efforts have been devoted towards enhancing the power of the PWM in order to better discriminate between real binding sites and the background e.g.random data or nonregulatory regions (Figure 1b).In this direction, Gershenzon et al. [49] proposed 16-row matrices to replace the 4-row PWMs; Sandelin et al. [50] tried to classify TFBSs into TF families based on the constrained binding sequence diversity for groups of structurally related TFs to create familial binding profiles; Hannenhalli et al. [51] computationally divided the binding site collection of a TF into two subsets corresponding to two-child PWMs to increase the binding specificity of TF profiles.As earlier noted, however, the short length of the binding sites makes them appear fairly redundant and predictive methods are often replete with false positives.Therefore, given that the main question concerns the actual identification of TFBSs and effective the location of the promoter, searching becomes a more critical issue than simply optimizing the representation.

Promoter identification
The first step towards discovering TFBSs is identifying the set of promoters.In principle, they are defined as the upstream regions proximal to the transcription start sites (TSSs) of genes; however, their length is still not clearly defined among different studies although it is one of the most important factors affecting to the computational predictions.Numerous activities have been proposed such as the recent experiment known as genome-wide open chromatin map that integrates high-throughput sequencing and genome-wide titled array technologies has been performed to identify DNase I hypersensitive sites within human primary CD4 + T cells [52].Such activities aim at better defining proximal promoter lengths which are subsequently incorporated in commercial tools, such as [53].
Besides experimentally identified promoters, a number of computational methods have been proposed to predict promoter regions.Available tools include PromoterInspector [54], DragonGSF [55], EnSemPro [56], and have all been thoroughly reviewed [57,58].Prediction tools can be classified into two main categories, signal-based approaches which rely on conserved signals relevant to promoters, e.g.TATA box, CAAT box, CpG islands, and content-based approaches that utilize conserved motifs to distinguish between promoters and non-promoter regions [59].Several models have been shown to be promising but due to the complexities of the genome structure, large-scale predictions are still difficult [60].
The structure of promoters, especially in mammals, is a complex which can be considered as a mini-structure of a gene where regulatory elements are interspersed within a large number of regions non-conserved and unknown function [60].Traditionally, it has been assumed that the combinatorial interaction of multiple transcription factors with the gene promoter is sufficient to explain the process of transcription.However, recent studies provided results to show that a large proportion of mammalian genes possess multiple transcription start sites (TSSs) and thus multiple promoters driving gene expression in a context-specific manner [61][62][63].Specifically, in a recent study Singer at el. [15] developed and employed a custom microarray platform to show that nearly 35,000 alternative putative promoters are present on around 7,000 human genes.Furthermore, each set of unique combination of TFBSs in the promoter will determine its temporal and spatial expression in a specific context [60] (Figure 2).These observations significantly increase the complexity of understanding gene regulation and the transcription process in general, and create a huge challenge for both computational and experimental methodologies of TFBS identification.
Figure 2. Data complexities in TFBS prediction.(a) Alternative promoters usually occur for genes in higher eukaryotes e.g.nearly 35,000 alternative putative promoters are present on around 7,000 human genes [15].For a specific gene, different promoters are activated to drive the gene expression in different corresponding contexts.(b) Alternative sets of combinatorial TFs regulate the transcription process even though only one promoter is activated in these contexts.M 1 , M 2 , M 3 are three example transcriptional modules (a set of TFs or corresponding TFBSs) activated to regulate the transcription process; module M 1 is present on two cases whereas only a part of M 2 is functional in the other case e.g.human RANTES/CCL5 gene consists of different set of functional TFBSs in different cell types [60].

Discovery of physical TFBSs
One of the first questions related to TFBS identification would be how to detect a conserved motif in a given set of sequences.The problem can be simply stated as follows: given a set of N sequences that are overrepresented, i.e. motifs present in S at a statistical significant rate.The fundamental assumption is that if the sequences are promoters of genes, then conserved motifs can be assumed to be potential binding sites for TFs. (a) There have been a wide range of possible applications for such in silico motif discovery methods.First, they greatly assist experimental studies aiming towards detection of the collection of binding sites for a given TF [32].ChIP-chip assays, for example, identify genomic regions to which a TF of interest binds.However, locating exact sites where the TF binds might be very difficult due to the limitations of the assays.As a result, once the DNA sequences to which the TF binds have been collected motif discovery algorithms, e.g.consensus [64], Gibbs sampling [65], MEME [66], are then applied to locate the exact binding sites.Secondly, if one identifies a set of genes that can be considered as regulated by some common TF(s), then one can begin to search computationally for conserved motifs in the corresponding promoter to infer regulating TFs.The underlying assumption of such a computation is that the common patterns are the likely functional ones.Furthermore, motif discovery algorithms can also assist in cross-species extrapolation to improve the specificity of finding TFBSs on a gene promoter.Once a set of corresponding promoters of a gene across multiple species have been extracted, motif discovery algorithms are used to detect conserved sub-sequences in this promoter across species in an attempt to identify all potential cis-regulatory elements (discussed more details in the next section).Because of the importance of this problem, a variety of algorithms as well as computational tools have been developed for those problems above for the past twenty years (Table 1).However, generally speaking the core algorithms can be classified into two categories: combinatorial and probabilistic [26,30,67].Exhaustive search with pattern-based scoring (combinatorial category) is the starting point of discovering conserved motifs in a set of promoter sequences [67].Due to magnitude of the search space, methods were further improved by exploring sequence-based exhaustive search [68] and also consensus search [69].The probabilistic-based methods employ two main algorithms e.g.Gibbs sampling [65] and MEME [66] and have also been used extensively for motif discovery tools.The basic idea is to continuously reduce the search space and the false positive matches by more accurately representing the motif models.
However, it is important to realize that although a large number of TFs has already been identified, and more are being identified, through numerous high-throughput activities emanating from the decoding of the human, in silico analysis is further hindered by the fact that only a fraction of those can currently be mapped to known and well characterized profiles [53,70,71] (around 600 human TFs in www.genomatix.de vs. approximately 1,850 TFs found in human [72]).When conserved motifs are predicted computationally that are not present in available collections, these are then considered as novel binding sites and/or regulatory regions but they are set aside for further investigation.Therefore, besides such motif discovery methods, another approach to detect potential TFBSs is directly scanning known TF profiles and scoring to determine whether or not the matches are potential binding sites.
Given that the scoring metric would assign relative importance to alternative binding sites in motif discovery methods [29,73,74], it is of equal importance to score directly the subsequences of interest in terms of their potential of being binding sites compared to known TF profiles.Despite the large number of alternative representation models and their associated scoring function, the most widelyused approach is still the one based on the PWM model and the sum fitness function, as discussed above.Given, therefore, that the sum fitness is used, which based on the relative abundance of bases in a specific position based on scanning the TF profiles, the strategy to predict whether or not a site is a binding site is among the most critical factors.Therefore, major emphasis is placed on developing strategies that score a candidate oligo and identify the thresholds for the prediction.A typical approach is based on core similarity matches (Figure 1a) to reduce the number of false positive matches [47].Furthermore, the threshold for each PWM is optimized so that a maximum of three matches are allowed in 10,000bp of non-regulatory test sequences (coding sequences excluding first exons and genomic repeats).This is the approach used in tool MatInspector in Genomatix [47].As an alternative strategy, [48] implemented P-Match in TRANSFAC to select the optimized thresholds so that the false positive rate is minimum and/or the false negative rate reaches some user-defined threshold.The threshold for minimum false positive rate is the one at which no match is found on the background set of exon sequences; and the threshold for false negative rate α is the rate at which α% of binding sites in the collection used to build the TF profile are not detected by that threshold using leave-one-out cross validation.Besides determining is the magnitude of a score threshold, both approaches also make use of the concept of TF family profiles [50,51] with some variations to reduce the redundant matches in scanning TF profiles on a promoter sequence.Generally speaking, the key idea here is using prior knowledge such as known TF profiles to predict the most probable TFBSs on promoter sequences with a minimum false positive matches; for example, those PWMs that represent similar DNA patterns will be assigned into the same TF family [47].

Discovery of functional relevant TFBSs
All the above approaches focus on identifying either experimentally or computationally putative binding sites.Regardless of the approach used, however, physical binding does not necessarily imply functional activity.As such, a major question concerns the functional characterization of the putative binding sites.Although, one can never be certain of the true activity of a TF, unless an appropriate experiment is conducted, computational approaches aim at reducing the number of alternatives on which further experimentation is conducted.Therefore, the next critical question to explore is whether we can identify those TFBS that are more likely to be functional among a set of candidates.Probably as expected, this has turned to be a very challenging question, particularly in higher eukaryotes.However, it has also served as an endless source inspiration for developing numerous computational strategies.
Before delving into the specific details of the computational approaches, it is instructive to classify the putative functional binding sites into two major categories: (a) general vs. context specific functional binding sites.The former aims at identifying cis-regulatory elements of a single gene that are conserved across multiple species based on the evolutionary hypothesis (so-called phylogenetic footprinting) whereas the latter searches for overrepresented TFBSs across multiple genes of a single species that share common characteristics in a specific context e.g.co-expression and/or co-function.

Phylogenetic footprinting
With the advent of novel high-throughput technologies, a large number of genomic sequences of different species have been sequenced, catalogued and annotated, making it possible to explore the conserved information among orthologous genes in an effort to enhance TFBS prediction.The basic underlying assumption of comparative genomics, or phylogenetic footprinting, is that functional regions evolve under constraints and thus at a lower rate than non-functional regions.Therefore, it is hypothesized that well conserved regions in a set of orthologous sequences survived due to their special functional implications, making them become promising candidates for functional cisregulatory elements [75].Preliminary evidence seems to support the hypothesis that conservation does imply so kind of, yet to be determined, significance.For instance, Cliften et al. [76] sequenced six Saccharomyces species and verified that many TFBSs are conserved across species and also located in conserved blocks although the blocks are often times much longer than the binding sites.Similarly, Gibbs et al. [77] demonstrated that regions with high-scoring PWM matches that are conserved across human-mouse-rat genomic alignment provided a 44-fold increase in the specificity of the predictions compared to those that are not conserved.Therefore, utilizing the information from orthologous genes across multiple species is becoming a useful paradigm in predicting putatively functional binding sites as well as reducing the false positive matches in motif discovering methods.Now given a gene of interest, one begins constructing a global [78,79] or local multiple sequence alignment [80][81][82] of orthologous promoter sequences and then identifies conserved regions which are considered as regulatory regions of the gene.However, all these tools assume that all nucleotides are alike or use a well-established substitution matrix to penalize the insertion, deletion, or substitution in the alignment, and thus they may not align properly non-coding DNA sequences of orthologous genes [82].Another critical limitation is that for closely related species, the alignment is obvious but impossible to distinguish functional elements from the surrounding non-functional regions like the case of four Saccharomyces species from the sensu stricto group, two species from the sensu lato group and one petite-negative species for example [83].On the other hand, when the species are too far apart evolutionarily as the case of rbcS gene in 10 plants shown span ~ 760 million years of evolution, only 3 conserved sites that are known regulatory elements each of 9 bp long are present in around 500 bp in the 5' upstream regions [84].Therefore, given that cis-regulatory elements are short, degenerate fragments, and present on such a large number of non-functional, diverged regions, regular sequence alignment methods are usually failed to detect properly these short conserved regions [85].
To overcome the problems associated with alignment, motif discovery algorithms e.g.Consensus, Gibbs sampling, MEME have been utilized along with the set of orthologous promoter sequences as the input data, [83,86].As a result, it is more likely to have a functional binding site as well as reducing false positive matches if some TF profile is located on a conserved region in the set of promoter sequences; especially, if the region is conserved among sequences from distantly related species.As such the match would be subject to selection pressure and more likely to be functional.A number of available tools such as PhyloCon [87] and CONREAL [88], explored those ideas.
However, in the aforementioned approaches phylogenetic relationships of the given sequences are not explored.Therefore, results are still highly biased to favor relationships between sequences located on closely related species [85].A series of later models have attempted to incorporate the phylogenetic information into the search strategy, including the phylogenetic relationships between sequences and/or the binding site evolution model.Specifically, FootPrinter [85,89] used a standard phylogenetic tree to estimate the significance of each conserved motif; EMnEM [90] applied the Jukes-Cantor (JC) model [91] with a fixed substitution rate for the evolutionary model of regulatory elements; PhyME [92,93] and PhyloGibbs [94,95] used a model suggested by Sinha et al. [96] which is similar to Felsenstein's molecular evolution model [97] to model the binding site evolution; MONKEY [98] employed Felsenstein's molecular evolution model [97] and allows users to select between the JC [91] and HKY [99] models for the background.Finally, Gertz et al. [100] proposed a model that employed a more detailed evolution model for binding sites based on the work of [99].Besides such tools that find conserved regions in a set of orthologous promoters independently with known TF profiles, some attempt has been made to incorporate both into a single search method e.g.PhyloScan [101].

Context-specific search
While it is recognized that not all binding sites found on a promoter will be functional elements, it is also recognized that functional sites are not activated simultaneously or independently of condition, or environment, since the cooperation of TFs is highly dependent on context [102][103][104][105][106]. Human RANTES/CCL5, a member of the CC-or β-subfamily chemotactic cytokines for instance, appears to have six functionally characterized short regulatory elements on its promoter that mediate its transcription initiation.However, not all six elements are activated simultaneously in any specific tissue in five cell types analyzed and the elements are also highly selective under different stimulating signals regulating gene expression [107].Consequently, a critical question is to establish a relationship between binding sites and the context in which these sites become functional.The term 'context' here is used in a way that implicitly refers to a set of potentially co-regulated genes e.g.genes that appear either to exhibit correlation in their expression patterns or to be involved in similar functions in a specific condition and/or tissue [103,108,109].Two elements become critical in this direction: (i) knowledge of the set of potentially co-regulated genes, and (ii) the context-specific nature of functionality.
We must, however, realize that TFs in higher organisms are more likely to cooperate with nearby bound factors in a combinatorial manner to regulate the transcription process rather than function isolation.As an example, gene even skipped (eve) is known to be regulated by at least five TFs (Twi, Tin, dTcf, Mad, and Pnt) binding to a 312bp MHE enhancer located ~6kbp downstream of the eve coding region.The corresponding binding sites of these TFs form a cis-regulatory modules that has been shown to occur significantly less than randomly expected and also more likely responsible for the regulation of other genes if present on their promoters.Therefore, the field has been shifting from the single motif detection to the discovery of composite motifs i.e. cis-regulatory modules.From a computational point of view, the concept of cis-regulatory modules also helps to reduce the number of false positive matches and make the search strategy more efficient.A cis-regulatory module is in general defined as a cis-regulatory element which is the smallest functional unit to have a biological role in transcriptional regulation [53].It consists of a set of individual binding sites of TFs on the proximal promoter region of a gene.A module is mainly characterized by two factors: composition and structural constraints.Composition is a set of non-overlapping binding sites, whereas structural constraints are the strand orientation to which the corresponding TF binds, the order and the distance between binding sites [110].A variety of tools have been developed to search for modules in a set of promoter sequences without taking into account the structural constraints (also called composite motifs) e.g.Cluster-Buster [111], CisModule [112], MSCAN [113], CisPlusFinder [114] , ModuleMiner [115], DiRE [116].Alternatively, when cis-regulatory modules are associated with their structural constraints (now so-called TF-modules), it refers to methods that detect TFBSs using known TF profiles (FrameWorker [110], CMA [117]).
The main idea in this direction is to use prior knowledge to identify the set of potentially coregulated genes and then search the corresponding promoter set for common and/or significant cisregulatory modules (Figure 3).Earlier studies assumed that a cluster of coexpressed genes could be under the same regulatory mechanism, e.g.co-regulation [118,119] or co-function [120].However, more recent evidence suggests that co-expression alone is not enough to infer the existence of common regulatory mechanisms and instead additional information is required [108,121], especially in higher organisms.Specifically, recent studies have shown that genes sharing similar expression patterns can participate in a number of different biological functions and/or genes in the same pathway can exhibit different patterns of expression [122,123].Moreover, the underlying gene regulation is shown to be tissue and/or condition specific and the TFs that drive the gene expression are very flexible in function and activity under different conditions [103][104][105][106]. Therefore, defining the context in which a set of genes are more likely to be co-regulated poses a formidable challenge to researchers.
A number of assumptions have been suggested and preliminary results appear promising.For example, Segal et al. [124] attempted to generate testable hypotheses like 'regulator X regulates coexpression module Y under conditions W' with the assumption that co-expressed genes across a set of conditions will be co-regulated.The work was done on yeast with some cases experimentally verified successfully.Similarly, Elkon et al. [125] analyzed a set of coexpressed gene that are cell-cycledependent and found eight TFs whose binding sites are significantly overrepresented in their promoters, or Long et al. [109] identified statistically overrepresented cooperating TFBSs in the 1 kbp upstream sequence set of each biological-process gene group in GO.

Figure 3. (a)
A brief overview of how computational models are developed.Motif discovery methods consist of three main components: the binding site representation, the scoring function, and the search strategy.There are alternative approaches for searching novel motifs, including scanning for TFs with known profiles.Phylo-tools incorporate phylogenetic relationships among species or their corresponding promoter sequences, and the binding site evolution model to improve the search strategies.Besides, context-specific information can also be explored to predict functional binding sites and then infer a set of context-relevant TFs for further analysis.
(b) Different strategies to predict cis-regulatory elements using additional biological knowledge.Two main strategies exist: single gene, multiple species and single motif discovery methods vs. multiple genes, single species and cis-regulatory module discovery methods.The concept of cisregulatory modules was introduced to capture the biological aspect as well as enhance the specificity of the search, especially in higher organisms; besides, other combinatorial strategies have also been developed in the literature (dash arrows).Besides the two main strategies for discovery functional binding sites, i.e. single gene, multiple species, single motifs for detecting general functional binding sites; and multiple genes, single species, composite motifs plus context for predicting context-specific functional binding sites as well as relevant condition-specific activated TFs, other combinatorial strategies have also been developed in the literature.For example, combining co-expressed genes with phylogenetic footprinting for single motif discovery [87,[126][127][128] or for composite motif discovery [129].On the other hand, some studies require a collection of promoter sequences with known module sites to serve as training data for building predictive models [130][131][132][133].These approaches seem to be more accurate both in detecting single TFBSs [42,46] and in predicting TF-modules [132,133].However, these models have to be built carefully, cannot be easily inferred automatically, and a number of parameters need to be determined using training data.

Inference of transcriptional regulatory networks
Automatic inference of regulatory networks is an essential step in bridging the gap between the raw expression data and the mechanistic understanding at the molecular level.Better predictions of such networks will find widespread application towards efforts to delineate the impact of external stimuli on cellular responses.Although gene expression can reveal some part of the picture, such results are often difficult to interpret without an understanding of relevant pathways and networks [134].Promoter analysis provides suggestions to which TFs are relevant to the response as well as sketch out a preliminary picture of the interplays between TFs and gene promoters that orchestrate the gene expression of thousands of genes due to changes in the environment.With the assumption that if a corresponding binding site of TF A is present on the promoter region of gene B, or statistically overrepresented on a gene set B, B will be considered as regulated by A, while the regulation can be either activation or repression.Several computational tools such as PAINT [135,136], CARRIE [137,138] have been developed to automatically produce the transcriptional regulatory network given a set of genes.Although much work needs to be done, these tools can provide preliminary testable hypothesis.

Concluding Remarks
In this review, we have summarized the current state in characterizing promoter sequences for the search for putative transcription factor binding sites.We addressed the elements associated with the representation and mining of the genomic information and characterized the basic methods, algorithms and computational tools.Future success of such endeavors is expected to have major impact on biological and clinical applications.However, a major point of concern and a most critical open question refers to the possibility of establishing de novo link between putative binding sites and actually functional binding sites.Function prediction from sequence information is an open question in a number of areas of computational biology and transcription regulation is no exception.Regardless of our present inability to establish that link, the availability of methods, like the ones described in this review, are of paramount important as they allow for the systematic generation of critical testable hypotheses.

Figure 1 .
Figure 1.Binding site representation.(a)Illustration of several motif models for human factor ETS1. Top-left is the collection of binding sites, each of which is called an oligo or conserved sequence; oligos can be aligned with gaps to maximize the motif content but in this case, it is a gap-free alignment.Several models have been displayed and lastly an advance model of PWM is presented[47]; the normalized formula is inferred from the original equation to ensure the rule that the fitness score of a matched oligo can be estimated by taking the sum of the fitness at each position.The 'bold' part is the core region of the binding sites i.e. the most conserved region in the binding motif model.Bottom-right is the sequence logo that can quickly visualize the specificity of the conserved information in each column.(b) A brief look on the history of binding motif models.Starting from the first simple representation, consensus sequences, one has developed more advance models to characterize the binding motifs of TFs.However, due to the nature of the binding sites e.g.short, degenerate, etc., the problem has become a challenge and the proposed strategies have been modified when applied to higher eukaryotes e.g.search for composite motifs (a set of TFBSs) instead of single motifs, combining additional lines of biological evidence in detecting TFBSs (phylogeny, co-expression, and/or co-function).
the observed frequency of base b at position i and b p is the background frequency of base b (usually 25% as neutral distribution across the genome is assumed).

Table 1 .
Selected resources and relevant tools for in silico TFBS identification.