Design of Machine Learning Models for the Prediction of Transcription Factor Binding Regions in Bacterial DNA †

: Transcription Factors (TFs) are proteins that regulate the expression of genes by binding to their promoter regions. There is great interest in understanding in which regions TFs will bind to the DNA sequence of an organism and the possible genetic implications that this entails. Occasionally, the sequence patterns (motifs) that a TF binds are not well deﬁned. In this work, machine learning (ML) models were applied to TF binding data from ChIP-seq experiments. The objective was to detect patterns in TF binding regions that involved structural (DNAShapeR) and compositional (kmers) characteristics of the DNA sequence. After the application of random forest and Glmnet ML techniques with both internal and external validation, it was observed that two types of generated descriptors (HelT and tetramers) were signiﬁcantly better than the others in terms of prediction, achieving values of more than 90%.


Introduction
The prediction of specific transcription factor (TF) binding regions in the DNA of bacterial organisms is a challenging task, especially when the TF binding motifs are not well defined or there are certain structural parameters of the DNA structure at play that classical models do not take into account. Advances in machine learning (ML) techniques have made it possible to create models capable of incorporating DNA structural parameters in the prediction of TF binding sites in genomic sequences.
In this work, using ML techniques, we were able to predict with high accuracy whether a nucleotide sequence would be a region where the TF would bind or not. Our work has been previously published as a Master's Thesis at the Universitat Oberta de Catalunya (UOC) [1]. The DNA sequences used as positive data were extracted from the article by Adhikari et al. [2], obtained using the ChIP-seq technique with the GcrA TF in the bacterial organism Brevundimonas subvibrioides.

Creation of Negative Sequence Set
From the 879 peaks (nucleotide sequences where the target TF had bound), two types of nucleotide sequences were created from these in order to generate the database of negative sequences.
On the one hand, biologically plausible replicates, here termed Replicates, were created. For these, the peaks were located in the reference genome, and their subsequences were classified according to whether they were located relative to a gene: intergenic, intragenic, upstream, or downstream. From this, 879 sequences homologous to the positive dataset were generated matching the composition of each one but using regions of the organism's genome where the TF had not been bound.
On the other hand, a negative dataset was generated with a pseudo-replicate process, termed Boots replicates. In this case, the generation of negative data was carried out by extracting trimers that existed in the original peak until the target length was completed.

Descriptor Extraction
The FASTA file of each of the datasets (879 peaks + 879 replicates for each of the cases) was introduced to the R DNAshapeR library, and vectors of values were obtained for each of the 4 selected elements: HelT, MGW, ProT, and Roll. From the data vectors obtained for each sequence, histograms generated for each sequence were used as descriptors. In this case, 25 descriptors were chosen by DNAshape and dataset algorithms.
Descriptors were also calculated by counting the appearance of k-mers in each of the sequences. Monomers, dimers, and tetramers (4, 16, and 256 combinations, respectively) were studied.

Machine Learning Models
Random forest [3] and generalized linear model (Glmnet) [4] were used in this work. A nested cross-validation was carried out, using a validation for the search of the best hyperparameters through a holdout and a second validation with a 10-fold CV to validate the model, taking the average of 5 iterations of this technique. Thus, the performance values reflected in Figure 1 represent the average of the 50 models generated for each descriptor and each ML model.

Results
The total number of datasets to carry out the ML models was seven sets for each of the datasets (Replicates and Boots_replicates), resulting in a total of fourteen sets of descriptors to be trained with the two algorithms. The models obtained by the algorithms were evaluated using the area under the receiver operating characteristics curve (AUCROC). The main results are described in the subsection presented below.
HelT and Tetramers as Key Descriptors Figure 1 illustrates the performances of the AUCROC for both datasets and all descriptors. On the left side, we can observe the results obtained for the Replicates, while the right side shows those corresponding to Boots_replicates. The first four sets represented belong to the structural descriptors extracted by the DNAshape library, while the last three correspond to k-mers counts.
As can be observed, both the HelT and tetramers descriptors produce the ML models with the highest performance, reaching up to 0.75 in the Replicates and almost 0.9-1 in the Boots_replicates.

Discussion
The use of these descriptors is due to the fact that we seek to represent the sequences not only in terms of base reading but also in terms of their structural features. These have been described as two different modes for protein-DNA recognition that are key and necessary to predict the behavior of TFs in DNA [5].
Of the seven descriptors studied, it was observed that different ML models with a high level of prediction could be developed to determine the TF binding regions across the genome. In particular, both HelT and tetramer counts proved to be particularly predictive.
Similarly, it has been found that the most significant tetramers for ML algorithms largely correspond to the motif that was found in the reference paper for the data used [2] with the MEME program, as presented in Figure 2.  [2] for GcrA TF in Brevundimonas subvibrioides.
Funding: This work has received financial support from the Xunta de Galicia and the European Union (European Social Fund (ESF)).

Data Availability Statement:
The database used for the development of this study it is available at the Gene Expression Omnibus (GEO) under accession number GSE138845.