1. Introduction
Bacterial promoters are DNA sequences located upstream of the transcription start site (TSS) that are crucial for recognition by the RNA polymerase (RNAP) [1]. Promoters initiate transcription owing to their affinity for the RNAP and are therefore essential for maintaining cellular homeostasis. Promoter affinity is largely determined by two conserved hexamers at positions −10 and −35 upstream of the TSS [1,2]. Promoter efficiency is also modulated by a spacer region of 17 ± 3 bp between the hexamers and by the adjacent nucleotide sequences (UP element, extended −10 element; Figure 1A), which are less conserved than the −10 and −35 promoter elements [3,4].
The bacterial RNAP consists of six subunits: the core RNAP, composed of five subunits (α2ββ′ω), and the sigma (σ) subunit or σ factor (Figure 1B). The σ factor reversibly binds to the core RNAP, modulating the DNA-binding characteristics of the enzyme and increasing its affinity for promoters [1]. The most abundant sigma factor is σ70 (σD), encoded by rpoD and responsible for almost all gene expression. Different σ factors have been identified, each one competing for the core RNAP and initiating the transcription of different genes associated with a specific nutritional status or environmental condition [1,5]. Alternative sigma factors targeting different promoters have been characterized in E. coli. These include σ32 (σH), encoded by rpoH and involved in the heat shock stress response; σ38 (σS), associated with stationary phase regulation; σ24 (σE), related to extracytoplasmic functions; σ54 (σN), relevant to nitrogen metabolism; and σ28 (σF), associated with flagellar synthesis and chemotaxis [5]. Each σ factor recognizes and facilitates the binding of RNAP to different promoters with distinct consensus sequences, which hampers the in silico identification of promoter sequences in bacterial genomes [6].
Accurate and fast prediction of promoter sites associated with a σ factor remains a troublesome issue in genomics and molecular biology, despite being highly relevant for studies of gene expression patterns, genetic regulatory networks, and synthetic biology, wherein newly synthesized nucleotide sequences can incorporate undesired promoter sequences [6,7,8]. Traditional low-throughput experimental methods (e.g., DNA footprinting, primer extension, electrophoretic mobility shift assay) are slow. Meanwhile, even high-throughput technologies (e.g., RNA-seq, systematic genomic evolution of ligands by exponential enrichment) [9,10,11] cannot keep pace with the explosive amount of genomic data generated during the last decade [12,13]. In silico prediction from sequence information has also been explored, first with position weight matrices [14,15,16] and later with machine-learning (ML) techniques. The latter stand out because they do not require manual assembly of the characteristics or patterns to be detected, are alignment-free methods that do not require comparison with known sequences or databases, and work on raw information [6,7,17,18,19,20,21,22]. This feature is particularly relevant because one of the biggest problems in identifying promoters is precisely their high variability. Depending on the overall context, a sequence may or may not be part of a promoter: the so-called Pribnow box, formed by TATAAT (the −10 promoter element), is typically associated with promoters, but its identification alone does not define one.
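The point about the Pribnow box can be illustrated with a minimal sketch that scans a sequence for windows close to TATAAT. The example sequence, the function names, and the one-mismatch tolerance are assumptions chosen for illustration; they are not taken from the works cited above.

```python
PRIBNOW = "TATAAT"  # consensus -10 hexamer named in the text


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def scan_motif(seq: str, motif: str = PRIBNOW, max_mismatches: int = 1):
    """Return (offset, window, mismatches) for every window close to the motif."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        d = hamming(window, motif)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits


# Hypothetical sequence, not taken from any database.
example = "GCGTATAATGCCATGACTATACTGG"
hits = scan_motif(example)
for offset, window, d in hits:
    print(offset, window, d)
```

A hit only marks a candidate: whether the window belongs to a promoter depends on the surrounding context (for instance, a −35 element at a suitable spacing), which is what the learned models discussed next try to capture from raw sequence.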
To date, there are still few neural-network proposals for promoter sequence classification capable of making predictions with specificity and sensitivity values greater than 80%, or even 90%. These techniques generally use data from promoters recognized by the σ70 factor [7,20]. In bacteria, they have been tested on model strains (E. coli str. K-12 substr. MG1655 and Bacillus subtilis subsp. subtilis str. 168), using already characterized promoter sequences as positive samples and random fragments of coding sequences (CDS) as negative samples. These methods generally treat the task as a binary classification problem, labeling each sequence as either a promoter or a non-promoter [6,7,20].
In recent years, various strategies have been explored to address this problem, with an emphasis on machine-learning techniques. Notably, the following works are based on convolutional neural networks (CNNs). Reference [20] proposed CNNProm, a CNN consisting of a one-dimensional (1D) convolution layer followed by max-pooling and a fully connected ReLU layer, with a sigmoid output. Its dataset contains bacterial, human, mouse, and plant sequences, of 81 nucleotides for bacteria and 251 for the rest. These were coded with one-hot encoding, in which each of the four nucleotides (A, C, G, T) is mapped to a distinct four-element binary vector. For negative examples, random coding sequences were selected. Qian et al. proposed improvements to CNNProm [19]. They used support vector machines to highlight the importance of element sequences of eukaryotic promoters (nine elements included), compressing the non-element sequences of the promoter. The promoter sequences were also one-hot encoded. Subsequently, Ref. [7] proposed DeePromoter, adding a long short-term memory (LSTM) layer to the architecture. They also stopped using coding sequences as negative examples, instead generating them from the positive ones by replacing random parts, which increases the robustness of the method against false positives. Another novelty is the incorporation of dropout layers to increase robustness and prevent overfitting. pcPromoter-CNN [22] is a CNN model for promoter prediction and classification of sigma sub-classes that follows a cascading architecture of binary classifiers. It first separates promoters from non-promoters; then, for each promoter, it checks whether the sequence belongs to a first sigma class; if not, it continues with the next class, and so on. Finally, Ref. [18] developed iPromoter-BnCNN, which classifies promoters into five sigma categories using a series of cascading binary classifiers, as pcPromoter-CNN does. In iPromoter-BnCNN, each binary classifier is a CNN with four parallel branches.
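Two of the data-preparation ideas mentioned above, one-hot encoding and deriving negative examples from positive ones by replacing random parts, can be sketched as follows. This is only an illustration under stated assumptions: the channel order A, C, G, T, the chunking scheme, and the parameters `n_chunks` and `n_replace` are hypothetical and are not the settings reported in the cited works.

```python
import random

ALPHABET = "ACGT"  # channel order is an assumption; published tools may differ


def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors, one per base."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in seq.upper():
        vec = [0, 0, 0, 0]
        vec[index[base]] = 1  # raises KeyError on symbols outside ACGT
        encoded.append(vec)
    return encoded


def make_negative(seq, n_chunks=4, n_replace=2, rng=None):
    """Build a negative example by overwriting random chunks of a positive one.

    n_chunks and n_replace are illustrative values, not published settings.
    """
    rng = rng or random.Random()
    size = len(seq) // n_chunks
    chunks = [seq[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    chunks.append(seq[(n_chunks - 1) * size:])  # last chunk takes the remainder
    for i in rng.sample(range(n_chunks), n_replace):
        chunks[i] = "".join(rng.choice(ALPHABET) for _ in chunks[i])
    return "".join(chunks)


positive = "TTGACATATAATGCCGTA"  # hypothetical sequence
negative = make_negative(positive, rng=random.Random(0))
print(one_hot("ACGT"))
print(positive, negative)
```

Because the untouched chunks still carry promoter-like fragments, negatives built this way are harder to separate from true promoters than random coding sequences, which is the intuition behind the robustness claim in the text.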
This study proposes a light two-stage promoter prediction and classification model based on a multiclass CNN, which we denote PromoterLCNN. The first stage distinguishes between promoters and non-promoters, and the second stage performs the sigma classification with a multiclass classification model. Following previous works in the literature such as [22,23,24], we use Chou et al.’s 2011 five-step rules [25] for a clear presentation and validation of the model. We applied the model to E. coli, using the RegulonDB v9.3 and v10.7 benchmark databases [26,27] for training and independent testing, respectively. We validate the results with k-fold cross-validation, examining four performance evaluation metrics to compare them with the best-performing methods found in the literature.
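As a reminder of how k-fold cross-validation partitions a dataset, the following minimal sketch generates the train/validation index splits. The function name, the fixed seed, and the round-robin fold assignment are assumptions made for illustration; they do not describe the authors' actual pipeline.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


# Each sample is used for validation exactly once across the k rounds.
splits = list(k_fold_indices(23, k=5))
for train, val in splits:
    print(len(train), len(val))
```

In practice the model is retrained from scratch on each training split and the metrics are averaged over the k validation folds.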
This paper is structured as follows. Section 2 presents our prediction and classification model. Section 3 presents the numerical results and a further discussion. Lastly, Section 4 states several conclusions and final remarks.
3. Results and Discussion
We train our model using k-fold cross-validation with k = 5. The performance results are summarized in Figure 5 and Figure 6, which compare our PromoterLCNN (Lc) with the pcPromoter-CNN (Pc) and iPromoter-BnCNN (Bc) methods in terms of accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC), for the training dataset (Figure 5) and the independent test dataset (Figure 6) mentioned in Section 2.1. PromoterLCNN performs better than or similarly to both methods. These figures are also presented as heatmaps, in which the colour intensity indicates the quality of each value.
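For reference, the four metrics can be computed from the entries of a binary confusion matrix as in the sketch below. It uses the standard textbook definitions; the exact formulation in Section 2 may be written differently, and treating each sigma class one-vs-rest is an assumption here. The counts in the example are made up.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    sn = tp / (tp + fn) if tp + fn else 0.0  # true-positive rate
    sp = tn / (tn + fp) if tn + fp else 0.0  # true-negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts, for illustration only.
m = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(m)
```

Note that Sp is undefined when a dataset contains no negative examples (the sketch returns 0.0 in that case), which matters when a test set has no non-promoters.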
As displayed in Figure 5, the results on the training database show that our approach (Lc) performs better than the pcPromoter-CNN (Pc) method for every metric and every promoter class. In fact, the average accuracy weighted by the number of elements in each class is 96.7% for Lc and 91.2% for Pc. From the sensitivity results, our method tends to produce a few more misclassifications than iPromoter-BnCNN on this dataset, and the specificity row suggests that these correspond to false-negative non-promoters, as the dataset is balanced. In other words, the approach is very efficient at detecting and classifying promoters, with minor contamination from non-promoters. In any case, the sensitivity values of PromoterLCNN are within a few points, above or below, of those of iPromoter-BnCNN, and they do not drop markedly as those of pcPromoter-CNN do for two of the classes. The MCC shows that the confusion matrix quality for most classes is very similar for the two leading methods. A few misclassifications on the training data might indicate that the network is generalizing correctly, as opposed to the overfitting problems found in shallow or classical learning methods. To verify this, we must achieve similar performance on the testing dataset.
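The class-size-weighted average used to summarize accuracy can be written in a few lines. The per-class accuracies and class sizes below are hypothetical placeholders, not the values behind Figure 5.

```python
def weighted_average(values, counts):
    """Average of per-class values, weighted by the number of samples per class."""
    total = sum(counts)
    return sum(v * c for v, c in zip(values, counts)) / total


# Hypothetical per-class accuracies and class sizes.
accs = [0.90, 0.80, 1.00]
sizes = [100, 50, 50]
print(weighted_average(accs, sizes))
```

Weighting by class size keeps small classes from dominating the summary, at the price of hiding poor performance on rare sigma classes, so the per-class values remain worth inspecting.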
Regarding the results on the test dataset presented in Figure 6, our approach outperforms pcPromoter-CNN, as the weighted average accuracies are 89.6% for Lc and 83.0% for Pc, and achieves results comparable to (and even slightly better than) those of iPromoter-BnCNN in terms of accuracy. PromoterLCNN produces similar or even fewer false negatives, as illustrated by the specificity values of promoters (Sp). However, it is important to recall Table 1 for a tempered analysis: each promoter class has only a few examples, and there are no non-promoters. Therefore, the most relevant statistics here are those obtained for
and
(i.e.,
). These statistics show that our approach is better at classifying promoters, at the cost of failing to detect a few of them (false negatives), compared to iPromoter-BnCNN.
Having established the competitive performance of our approach, it is important to note the parsimony of our proposal. We present an architecture appreciably lighter than pcPromoter-CNN and iPromoter-BnCNN. The former uses a cascading architecture that first separates promoters from non-promoters and then classifies sigma sub-classes sequentially, with nine different layers each [22]. The latter uses the same cascading architecture, with four parallel branches (four layers in each branch) converging into four layers at the end [18]. In contrast, our method has only two stages with eight different layers each. This significantly reduces the computing time for training, prediction, and hyperparameter optimization, taking a tenth of the time required by iPromoter-BnCNN and 30% less than pcPromoter-CNN.
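The structural difference between a cascade of binary classifiers and a two-stage design can be sketched with placeholder models. Everything here is hypothetical: the callables stand in for trained networks, the class labels and their order are illustrative, and the call counter only shows how many models are evaluated per sequence, not actual running times.

```python
def cascade_predict(seq, is_promoter, binary_stages, fallback):
    """Cascade: one binary model per class, tried in order. Returns (label, calls)."""
    calls = 1
    if not is_promoter(seq):
        return "non-promoter", calls
    for label, model in binary_stages:
        calls += 1
        if model(seq):
            return label, calls
    return fallback, calls


def two_stage_predict(seq, is_promoter, multiclass):
    """Two stages: promoter detection, then a single multiclass decision."""
    calls = 1
    if not is_promoter(seq):
        return "non-promoter", calls
    return multiclass(seq), calls + 1


# Toy stand-ins for trained models (assumptions, not real classifiers).
toy_detector = lambda s: "TATAAT" in s
toy_stages = [
    ("sigma70", lambda s: False),
    ("sigma24", lambda s: False),
    ("sigma32", lambda s: True),
]
toy_multiclass = lambda s: "sigma32"

print(cascade_predict("GGTATAATGG", toy_detector, toy_stages, "sigma28"))
print(two_stage_predict("GGTATAATGG", toy_detector, toy_multiclass))
```

In the cascade, the number of model evaluations grows with the position of the class in the chain, and each binary model must be trained and tuned separately; the two-stage design always evaluates at most two models.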
4. Conclusions
This work presents a two-stage promoter prediction and classification model based on a multiclass convolutional neural network, called PromoterLCNN. The first stage of the architecture distinguishes between promoters and non-promoters, and the second stage performs the sigma classification with a straightforward multiclass classification model, in contrast to standard approaches found in the literature. We use Chou et al.’s five-step rules for model presentation and validation, with the E. coli RegulonDB v9.3 and v10.7 benchmark databases for training and independent testing, respectively.
We trained the model with k-fold cross-validation and assessed the results on an independent test dataset. We found that our method outperforms one of the most competitive methods in the literature. Compared with the more recent state-of-the-art approaches, our proposal achieves competitive accuracy, with better promoter-type classification. Remarkably, PromoterLCNN has a lighter architecture than other models, leading to shorter hyperparameter tuning, training, and prediction times without compromising classification quality, an attractive property for molecular or synthetic biologists working with nucleotide sequences on a daily basis. Taking part of a genome or a newly synthesized sequence as input, PromoterLCNN might help researchers and users working in bacterial genomics, molecular biology, and bioinformatics to identify bacterial promoters and classify them into each of the subclasses validated in this study.