Evaluation of the Abundance of DNA-Binding Transcription Factors in Prokaryotes

The ability of bacteria and archaea to modulate metabolic process, defensive response, and pathogenic capabilities depend on their repertoire of genes and capacity to regulate the expression of them. Transcription factors (TFs) have fundamental roles in controlling these processes. TFs are proteins dedicated to favor and/or impede the activity of the RNA polymerase. In prokaryotes these proteins have been grouped into families that can be found in most of the different taxonomic divisions. In this work, the association between the expansion patterns of 111 protein regulatory families was systematically evaluated in 1351 non-redundant prokaryotic genomes. This analysis provides insights into the functional and evolutionary constraints imposed on different classes of regulatory factors in bacterial and archaeal organisms. Based on their distribution, we found a relationship between the contents of some TF families and genome size. For example, nine TF families that represent 43.7% of the complete collection of TFs are closely associated with genome size; i.e., in large genomes, members of these families are also abundant, but when a genome is small, such TF family sizes are decreased. In contrast, almost 102 families (56.3% of the collection) do not exhibit or show only a low correlation with the genome size, suggesting that a large proportion of duplication or gene loss events occur independently of the genome size and that various yet-unexplored questions about the evolution of these TF families remain. In addition, we identified a group of families that have a similar distribution pattern across Bacteria and Archaea, suggesting common functional and probable coevolution processes, and a group of families universally distributed among all the genomes. Finally, a specific association between the TF families and their additional domains was identified, suggesting that the families sense specific signals or make specific protein-protein contacts to achieve the regulatory roles.


Introduction
Gene regulation is crucial for optimal processes in the cell and as the first action to achieve expression during adaptation of metabolic responses to environmental conditions. In this context, regulation of gene expression at the transcriptional level, where DNA-binding transcription factors (TFs) play a fundamental role, allows the organisms to modulate the synthesis of specific genes depending on the metabolic requirements, stress responses, or food availability, among others. Hence, TFs interact with their DNA-binding sites around or overlapping the promoter-binding site [1,2], and in consequence allow or block access to the RNA polymerase, i.e. activating or repressing gene expression. In general, TFs are two-domain proteins, with a DNA-binding domain (DBD) in either the amino or carboxy terminus, which is involved in specific contacts with the regulatory region of the corresponding cognate genes, and an additional domain associated with diverse functions such as ligand binding or protein-protein interactions [3,4]. To date, diverse studies have shown that some TF families are common to bacteria and archaea, suggesting that the mechanisms affecting gene expression could be similar in the cellular domains of both of these groups of prokaryotes [5,6].
Therefore, a variety of factors are involved in the diversity of TFs and their families, such as the lifestyles. For instance, bacteria that have free-living lifestyles, such as Pseudomonas aeruginosa or Escherichia coli, bear a much larger number and variety of genes encoding transcriptional proteins than do intracellular pathogens that thrive in more stable biotopes [7,8]. In contrast, archaea organisms seem to have a lower proportion of TFs than bacteria, suggesting the existence in archaea of alternative mechanisms to compensate for the apparent deficit of protein regulators, including conformations of diverse protein complexes as a function of metabolic status [6,9].
Hence, to understand the association between the expansion patterns of different protein regulatory families, 1351 completely sequenced bacterial genomes, which represent adaptive designs for evolutionary classification, were analyzed. This analysis is important to understand their contribution to gene regulation in different lineages and provide insights into the functional and evolutionary constraints imposed on different classes of regulatory factors in bacterial and archaeal organisms. In this context, abundant families are not widely distributed across all bacteria and archaea. In contrast, certain small families are the most widely distributed. This difference might be associated with different phenomena, such as evolutionary constraints by regulatory mechanisms, as the case of LexA and LysR families. Our results also suggest that in larger genomes, regulatory complexity may possibly increase as a result of the increasing number of some TF families.

Bacterial and Archaeal Genomes Analyzed
A total of 5321 prokaryotic genomes from the NCBI Refseq genome database [10] were downloaded, and to exclude any bias associated with the overrepresentation of bacterial or archaeal genomes of one genus or species, we employed a web-based tool and a genome similarity score (GCCa) of ≥0.95 [11] as the limit to consider a genome non-redundant; this method resulted in a set of 1321 representative genomes.

Identification of DNA-Binding Domains Associated with TFs
We retrieved 16,712 hidden Markov models (HMMs) from the PFAM database and used them to scan 5321 genomes, using the program pfam_scan.pl with an E-value of ≤10 −3 and with the option of clan_overlap "activated" (to show overlapping hits within clan member families; this step only applies to Pfam-A families). In a posterior step, 111 PFAMs associated with DNA-binding TFs were retrieved from diverse databases containing information on regulatory proteins, such as the DBD (DNA-binding domain) database, Regulon Database, and Database of transcriptional regulation in Bacillus subtilis (DBTBs). We also identified relevant information by manual curation to identify proteins devoted to gene regulation. (The complete list of PFAM IDs is included as supplementary material Table S1).

Protein Domain Enrichment Analysis
To evaluate the content of protein domains associated with the 111 families of TFs, their structural domains were determined by considering the PFAM assignments and enrichment analysis for each group. To this end, we used a one-tailed Fisher's exact test (FET) to perform enrichment analysis, because it is related to the hypergeometric probability and can be used to calculate the significance (p-value) of the overlap between two independent datasets. We set statistical significance at a p-value of −10. Together with the FET, we also determined the false discovery rate (FDR) of the tests in order to account for type I errors. Corrections for multiple testing were performed using the Benjamini and Hochberg step-up FDR-controlling procedure to calculate adjusted p-values. All analyses were performed using the R software environment for statistical computing and graphics and the package multtest [12].

The Repertoires of TF Families Correlate with Genome Sizes
In order to identify how TF families are distributed as a function of genome size, the Pearson correlation (R-value) was calculated for the abundance of each family against the number of open reading frames (ORFs) associated with all genomes. From this, nine TF families were identified as correlating with genome size (R ≥ 0.6), such as the Trans_reg_C (PF00486) and GerE (PF00196) families, which include two-component systems, and seven families associated with a one-component system [GntR (PF00392), TetR_N (PF00440), MarR (PF01047), HxlR (PF01638), MerR_1 (PF13411), CSD, and HTH_AraC families] ( Figure 1 and Table 1). These families followed a similar trend of duplication and loss events as a function of genome dynamics, reinforcing the notion that increased gene complexity also requires the development of mechanisms for gene regulation at the transcription level [13], i.e., when the genome is duplicated, members of these families are also duplicated, but when gene loss occurs these families are affected, decreasing the size of the family. An interesting observation of these families is the fact that they are regulating central functions in the organisms, such as carbon sources uptake (HTH_AraC) and resistance to multiple drugs, as antibiotics or heavy metals (MarR and MerR_1), among others. Conversely, 102 families did not exhibit an evident correlation with the genome size, suggesting that a large proportion of duplication or gene loss events occur independently from genome size, such as the highly abundant families LysR (PF00126) and HTH_3 (PF01381), associated to regulate amino acid biosynthesis and a hypothetical family widely distributed along the organisms.

The Abundance of Families is Not Homogeneous Across the Genomes
In order to determine how regulatory proteins abundant are along the prokaryotic genomes, a total of 225,999 TFs were identified in 1321 bacterial and archaeal genomes. These proteins were clustered into 111 different families, and their abundances and distributions along the genomes were evaluated. In this regard, some families are quite heterogeneous in terms of their abundances; for instance, 34 of these families include fewer than 100 proteins per group, such as the transcriptional repressor of hyc and hyp operons, HycA_repressor (PF11046), or the PerC transcriptional activator (PF06069), whereas two of them [HTH_1 (LysR) or PF00126, and TetR_N or PF00440] include more than 20,000 members per group. See Table 1. Therefore, to evaluate how the abundance of TF families correlated with the bacterial and archaeal genome sizes, we calculated the coefficient of variation (CV), a measure of the dispersion of data points in a data series around the mean. In this regard, large families showed a minor variation among the genomes, whereas small families exhibited a wide variation (presence and abundance) among the organisms analyzed in this work. For instance, the large families Trans_reg_C  Table 1.
When the abundance of the families was analyzed in detail, 12 of them were identified with more than 10 members per phylum (on average), whereas the rest of the families contained a low number of copies per phylum. In Figure 2 we show that seven families with more than 10 members were identified in Actinobacteria, 6 in Proteobacteria and Acidobacteria, and 5 are abundant in Firmicutes. There are three families in Verrumicrobia, two families each in Chlorobi, Cyanobacteria, Nitrospora, Planctomyces, and Bacterioidete. The family HTH_24 [or AsnC (PF13412)] was found to be abundant in Euryarchaeota, and the family Trans_reg_C (PF00486) is abundant in Deinococcus. Indeed, the family AsnC (HTH_24), abundant in Euryarchaeota, has been described as a group with global regulators in bacteria and archaea, suggesting a role in global regulation [14].

The Abundance of Families is Not Homogeneous Across the Genomes
In order to determine how regulatory proteins abundant are along the prokaryotic genomes, a total of 225,999 TFs were identified in 1321 bacterial and archaeal genomes. These proteins were clustered into 111 different families, and their abundances and distributions along the genomes were evaluated. In this regard, some families are quite heterogeneous in terms of their abundances; for instance, 34 of these families include fewer than 100 proteins per group, such as the transcriptional repressor of hyc and hyp operons, HycA_repressor (PF11046), or the PerC transcriptional activator (PF06069), whereas two of them [HTH_1 (LysR) or PF00126, and TetR_N or PF00440] include more than 20,000 members per group. See Table 1. Therefore, to evaluate how the abundance of TF families correlated with the bacterial and archaeal genome sizes, we calculated the coefficient of variation (CV), a measure of the dispersion of data points in a data series around the mean. In this regard, large families showed a minor variation among the genomes, whereas small families exhibited a wide variation (presence and abundance) among the organisms analyzed in this work. For instance, the large families Trans_reg_C (PF00486), with 14,446 members, exhibits an average 10.7 ± 10.91 proteins per genome (CV = 1.01); GntR (PF00392), with 13263 members, has 9.90 proteins per genome (CV = 1.33), and TetR_N (PF00440) had 12.02 proteins per genome (CV = 2.6). In contrast, the Histone_HNS family (PF00816) has 434 members (CV = 3.67), the AP2 (PF00847) family has 77 members (CV = 5.49), and the CitT (PF12431) family has 82 proteins (CV = 5.05). These results suggest that small families, such as AP2 (PF00847), CitT, or Histone_HNS, have a heterogeneous distribution and abundance among bacteria and archaeal genomes, whereas large families have a small CV, i.e. they have a more homogeneous distribution of size among the genomes. See Table 1.
When the abundance of the families was analyzed in detail, 12 of them were identified with more than 10 members per phylum (on average), whereas the rest of the families contained a low number of copies per phylum. In Figure 2 we show that seven families with more than 10 members were identified in Actinobacteria, 6 in Proteobacteria and Acidobacteria, and 5 are abundant in Firmicutes. There are three families in Verrumicrobia, two families each in Chlorobi, Cyanobacteria, Nitrospora, Planctomyces, and Bacterioidete. The family HTH_24 [or AsnC (PF13412)] was found to be abundant in Euryarchaeota, and the family Trans_reg_C (PF00486) is abundant in Deinococcus. Indeed, the family AsnC (HTH_24), abundant in Euryarchaeota, has been described as a group with global regulators in bacteria and archaea, suggesting a role in global regulation [14].

Correlation of TF Families among Bacteria and Archaeal Genomes
In order to evaluate if the families show a common distribution pattern that would allow us to hypothesize about potential correlation patterns and in consequence families working together, we calculated their coefficient of correlation, and a matrix of ALL versus ALL was created. From this matrix, a hierarchical cluster (HCL) analysis was achieved with a Manhattan distance and support tree with average linkage algorithm, with correlation uncentered as a similarity measure [15]. From this analysis, a total of 90 families were included in 15 clusters, whereas 21 families were not included in an evident cluster. Therefore, the clustering of families with similar distribution patterns suggests the existence of common distribution patterns as a consequence of regulation via similar mechanisms, such as the cluster in which members of the YoeB_toxin (PF06769), PhdYeFM_antitox (PF02604), and ParE_toxin (PF05016) families, all of which belong to toxin-antitoxin systems and are associated with the clan CL0136, were included. Indeed, proteins of these families interact among themselves to regulate postsegregation cell killing systems that might function as regulatory switches under stress conditions [16], and are involved in initiate cell death in bacterial and archaeal cultures and to content against the infection by phages or to regulate subpopulations [17].
Another interesting cluster is integrated by the flagellar regulation families (PF05247 FlhD and PF05280 FlhC); histone-like proteins such as Histone_HNS (PF00816); the Phage_AlpA (PF05930), BolA (PF01722), and Arc (PF03869) familes; and the Pro_dh-DNA_bdg (PF14850) family. All these familes exhibit a similar correlation pattern of distribution. In this regard, proteins of the bacterial flagellar transcriptional activator (FlhC) combine with FlhD to form a regulatory complex in E. coli [18] or members of the histone-like nucleoid-structuring (H-NS) protein, which plays a role in the formation of nucleoid structure. In addition, AlpA is in a family that consists of several short bacterial and phage proteins that are related to the E. coli protein AlpA, whereas BolA causes round morphology and may be involved in switching the cell between elongation and septation systems during cell division [19]. It has also been suggested that BolA induces the transcription of penicillin-binding proteins 6 and 5 [20]. In summary, these findings suggest that families work together to regulate common functions in bacteria and archaeal genomes, such as the FlhD and FlhC families, or AlpA and BolA families, and opens diverse correlations to be further analyzed in functional and structural terms to identify potential protein-protein contacts or similar regulatory mechanisms.

Distributions of Families among All the Genomes
To determine how the families are distributed among the complete collection of bacterial and archaeal genomes and to determine if there are families that are universally distributed, the distributions of the 111 PFAMs were traced along the 1321 genomes, and their rates of occurrence were calculated, considering the rate of total presence of a PFAM against the total number of organisms. Therefore, a value close to 1 indicates that the family is present in 100% of the organisms, whereas a value near 0 indicates that the family is absent in all the organisms. Table 1 and Figure 3. From this distribution, a set of 12 families were considered universally distributed, because they were found as in at least 80% of the total organisms, and they could be considered the basic core of regulators associated with prokaryotes. In this dataset, the following families were identified: HTH_3 (PF01381), Bac_DnaA_C (PF08299), Bac_DNA_binding (PF00216), Fur (PF01475), HTH_5 (PF01022), MerR_1 (PF13411), HTH_Crp_2 (PF13545), HTH_24 (PF13412), MarR (PF01047) and TetR_N (PF00440), HTH_1 (PF00126), and Trans_reg_C (PF00486). The rest of the families identified in all the genomes can be interchanged or lost among the bacteria and archaea as a consequence of their lifestyles. In general, the set of universal families is comprised of highly abundant groups, as TetR (highly abundant) and those that are not necessarily the most abundant ones, suggesting that families with a small number of members are also fundamental for the regulation of basic processes, such as ferric uptake regulator Fur, that connects iron transport and utilization enzymes with negative-feedback loop pairs for iron homeostasis [21]; Fur has been identified across a large diversity of organisms [22][23][24]. Similarly, DnaA is involved in initiation of chromosomal replication [25], a fundamental process of all organisms; or LexA-like proteins, with a wide distribution along the bacteria and archaeal genomes, suggesting that the SOS response might be a universal adaptation of bacteria to DNA damage [26]. This distribution, together with the probable coevolution of the LexA recognition domain and its binding motif [27], indicates that the regulated genes must also contain a conserved binding motif in the upstream regions. In summary, we suggest a probable scenario concerning the distribution of universal or widely distributed families, with a conservation of the regulated genes such as the SOS regulon and LexA, with few probable recruitments of additional TFs to regulated regulons, such as occurs in the evolution of regulatory networks [28].
Genes 2019, 12, x FOR PEER REVIEW 10 of 14 bacteria to DNA damage [26]. This distribution, together with the probable coevolution of the LexA recognition domain and its binding motif [27], indicates that the regulated genes must also contain a conserved binding motif in the upstream regions. In summary, we suggest a probable scenario concerning the distribution of universal or widely distributed families, with a conservation of the regulated genes such as the SOS regulon and LexA, with few probable recruitments of additional TFs to regulated regulons, such as occurs in the evolution of regulatory networks [28].

Structural Domains Associated with Families
To gain insights into the association between the structural domains connected with the families, these groups of proteins were analyzed in terms of their domains. In total, 2694 different domains were identified to be associated 111 families. In Figure 4, the association between those domains and families, shows that abundant families do not have a high diversity of structural domains, i.e. small families, constrained to specific phyla, contain a large proportion of additional domains, such as the Fez1 with few members associated to more than 300 different domains. Similarly, large and universal families also contain a large number of domains, such as the Trans_reg_C with around 160. Therefore, to determine if there is a specific association between the domains identified and the families, an enrichment analysis using a one-tailed FET was performed, and a significance of p-value of less than −10 was considered. Table 1 shows the number of domains identified in all the families, and in Figure  5 (and supplementary material S1), a network representation of all associations between TFs and additional domains (enriched domains) is shown. From this network representation, we evaluated the most important nodes (TF families) by using the Maximal Clique Centrality (MCC) method, because it has been described to show excellent performance and precision in predicting essential proteins from networks [29]. Based on this approach, we identified a set of 10 families as highly important and specifically connected with their respective domains, including Fez1 (PF06818), bZIP_1 (PF00170), and bZIP_2 (PF07716), sharing domains among them, and suggesting that they could be associated with similar regulatory processes. In contrast, HTH_3 (PF01381) is strongly associated with the Methyltransf_22 (PF13383), suggesting that one of the most common architectures of this family contains those domains. In addition, the HTH_23 (PF13384), HTH_38 (PF13936), GerE

Structural Domains Associated with Families
To gain insights into the association between the structural domains connected with the families, these groups of proteins were analyzed in terms of their domains. In total, 2694 different domains were identified to be associated 111 families. In Figure 4, the association between those domains and families, shows that abundant families do not have a high diversity of structural domains, i.e. small families, constrained to specific phyla, contain a large proportion of additional domains, such as the Fez1 with few members associated to more than 300 different domains. Similarly, large and universal families also contain a large number of domains, such as the Trans_reg_C with around 160. Therefore, to determine if there is a specific association between the domains identified and the families, an enrichment analysis using a one-tailed FET was performed, and a significance of p-value of less than −10 was considered. Table 1 shows the number of domains identified in all the families, and in Figure 5 (and supplementary material S1), a network representation of all associations between TFs and additional domains (enriched domains) is shown. From this network representation, we evaluated the most important nodes (TF families) by using the Maximal Clique Centrality (MCC) method, because it has been described to show excellent performance and precision in predicting essential proteins from networks [29]. Based on this approach, we identified a set of 10 families as highly important and specifically connected with their respective domains, including Fez1 (PF06818), bZIP_1 (PF00170), and bZIP_2 (PF07716), sharing domains among them, and suggesting that they could be associated with similar regulatory processes. In contrast, HTH_3 (PF01381) is strongly associated with the Methyltransf_22 (PF13383), suggesting that one of the most common architectures of this family contains those domains. In addition, the HTH_23 (PF13384), HTH_38 (PF13936), GerE (PF00196), MerR_1 (PF13411), HTH_24 (PF13412), and HTH_11 (PF08279) are described in the dataset as regulators with a large diversity of domains (more than 60 different domains), and are not linked among them, suggesting that they contain specific domains with few domains that are shared with other families of TFs. Therefore, we suggest that combinations between the DNA-binding domains and their associated domains significantly increase the sensing of diverse signal compounds, decreasing signaling cross talk and making the response to environmental stimuli in bacterial and archaeal organisms more efficient.
Genes 2019, 12, x FOR PEER REVIEW 11 of 14 (PF00196), MerR_1 (PF13411), HTH_24 (PF13412), and HTH_11 (PF08279) are described in the dataset as regulators with a large diversity of domains (more than 60 different domains), and are not linked among them, suggesting that they contain specific domains with few domains that are shared with other families of TFs. Therefore, we suggest that combinations between the DNA-binding domains and their associated domains significantly increase the sensing of diverse signal compounds, decreasing signaling cross talk and making the response to environmental stimuli in bacterial and archaeal organisms more efficient.

Discussions
In this work, we evaluated 111 families of DNA-binding TFs on bacterial and archaeal genomes, how abundant they are in prokaryotic genomes, and how they are distributed according to genome size. For the examined families, we found a set of nine families whose distribution correlates with the genome size and that represent more than 40% of the total of TF identified by HMM profiles. These families have been intimately associated with diverse and central functions in the organisms, such as the two-component systems (Trans_reg_C and GerE), multiple antibiotic resistance responses (TetR_N and MarR), or carbon sources uptake, virulence, and nitrogen assimilation (HTH_AraC), among others. The correlation between the abundance of these families and the genome size reinforces the notion that increased gene complexity also requires the development of mechanisms for gene regulation at the transcription level [13]. In contrast, 56.3% of the collection do not exhibit a clear correlation with the genome size, suggesting that a large proportion of families have independent evolutionary events associated with their increasing, such as duplications or gene losses, opening questions to be further explored, such as how many families in a genome are product of lateral gene transfer or what occurs with the regulated genes or how many families with a similar distribution pattern across Bacteria and Archaea, are product coevolution processes. In addition, we found a specific association between the DNA-binding domains and their associated companion domains, as it has previously described [3], suggesting that the scaffold to protein-protein interactions could be conserved among members of the same family contacts, as occurs in the Crp family and that their association in diverse bacterial and archaeal genomes could increase the ability of the organisms to recognize and respond to diverse environmental stimuli [30]. This result opens the opportunity to predict and modify the probable ligands to understand the diversity of signals that modulate the activity of transcription factors, as it has been identified for E. coli [31].
Finally, based on a correlation matrix of all families, we identified a probable coevolution processes of families devoted to regulate similar processes, such as the members of the YoeB_toxin (PF06769), PhdYeFM_antitox (PF02604), and ParE_toxin (PF05016) families exhibited similar distribution patterns among all the bacterial and archaeal genomes, suggesting that the regulation to

Discussions
In this work, we evaluated 111 families of DNA-binding TFs on bacterial and archaeal genomes, how abundant they are in prokaryotic genomes, and how they are distributed according to genome size. For the examined families, we found a set of nine families whose distribution correlates with the genome size and that represent more than 40% of the total of TF identified by HMM profiles. These families have been intimately associated with diverse and central functions in the organisms, such as the two-component systems (Trans_reg_C and GerE), multiple antibiotic resistance responses (TetR_N and MarR), or carbon sources uptake, virulence, and nitrogen assimilation (HTH_AraC), among others. The correlation between the abundance of these families and the genome size reinforces the notion that increased gene complexity also requires the development of mechanisms for gene regulation at the transcription level [13]. In contrast, 56.3% of the collection do not exhibit a clear correlation with the genome size, suggesting that a large proportion of families have independent evolutionary events associated with their increasing, such as duplications or gene losses, opening questions to be further explored, such as how many families in a genome are product of lateral gene transfer or what occurs with the regulated genes or how many families with a similar distribution pattern across Bacteria and Archaea, are product coevolution processes. In addition, we found a specific association between the DNA-binding domains and their associated companion domains, as it has previously described [3], suggesting that the scaffold to protein-protein interactions could be conserved among members of the same family contacts, as occurs in the Crp family and that their association in diverse bacterial and archaeal genomes could increase the ability of the organisms to recognize and respond to diverse environmental stimuli [30]. This result opens the opportunity to predict and modify the probable ligands to understand the diversity of signals that modulate the activity of transcription factors, as it has been identified for E. coli [31].
Finally, based on a correlation matrix of all families, we identified a probable coevolution processes of families devoted to regulate similar processes, such as the members of the YoeB_toxin (PF06769), PhdYeFM_antitox (PF02604), and ParE_toxin (PF05016) families exhibited similar distribution patterns among all the bacterial and archaeal genomes, suggesting that the regulation to initiate cell death in bacterial and archaeal cultures is widely distributed to content against the infection by phages or to regulate subpopulations [17].

Conclusions
In conclusion, diverse scenarios might occur, depending notably on the family of TF associated, such as those abundant and universal families devoted to regulate amino acid biosynthesis (LysR) or antibiotic resistance (TetR and MarR), or those less abundant ones such as the LexA family, whose distribution suggest that the SOS response might be a universal adaptation of prokaryotic organisms to DNA damage [26].