Evolution and Classification of tfd Gene Clusters Mediating Bacterial Degradation of 2,4-Dichlorophenoxyacetic Acid (2,4-D)

The tfd gene clusters (tfdI and tfdII), originally discovered in plasmid pJP4, are involved in the bacterial degradation of 2,4-dichlorophenoxyacetic acid (2,4-D) via the ortho-cleavage pathway of chlorinated catechols. They share this activity, with respect to substituted catechols, with clusters tcb and clc. Although great effort has been devoted over nearly forty years to exploring the structural diversity of these clusters, their evolution has been poorly resolved to date, and their classification is clearly obsolete. Comparative genomic and phylogenetic approaches revealed that all tfd clusters can be classified as one of four different types. The following four-type classification and new nomenclature are proposed: tfdI, tfdII, tfdIII and tfdIV(A,B,C). Horizontal gene transfer between Burkholderiales and Sphingomonadales provides phenomenal linkage between tfdI, tfdII, tfdIII and tfdIV type clusters and their mosaic nature. It is hypothesized that the evolution of tfd gene clusters proceeded within first (tcb, clc and tfdI), second (tfdII and tfdIII) and third (tfdIV(A,B,C)) evolutionary lineages, in each of which the genes were clustered in specific combinations. Their clustering is discussed through the prism of hot spots and driving forces of various models, theories, and hypotheses of cluster and operon formation. Two hypotheses about series of gene deletions and displacements are also proposed to explain the structural variations across members of clusters tfdII and tfdIII, respectively. Taken together, these findings reconstruct the phylogeny of tfd clusters, delineate their evolutionary trajectories, and allow the contribution of various evolutionary processes to be assessed.


Introduction
Currently, tfd gene clusters are model objects for studying the microbial acquisition of xenobiotic degradation capacity. They encode enzymes involved in the degradation of 2,4-dichlorophenoxyacetic acid (2,4-D), which poses human health and ecological risks [1,2] but is still used worldwide as an herbicide in agriculture [3]. Continued exploration of microbial 2,4-D degradation in various environments in China [4,5], Brazil [6], Vietnam [7,8], Russia [9] and Japan [10] suggests that this problem is still in the spotlight.
Previously, conjugative plasmids containing genes of the 2,4-D and 4-chloro-2-methylphenoxyacetic acid (MCPA) degradation pathways were described under the name "TFD" (TFD plasmids) [11]. At present, the abbreviation tfd designates two clusters of genes involved in the degradation of 2,4-D, which are located on the pJP4 plasmid of strain Cupriavidus pinatubonensis JMP134 (previously identified as Ralstonia eutropha, Alcaligenes eutrophus, Wautersia eutropha and Cupriavidus necator). Each of these clusters, tfdB I F I E I D I C I T (designated as tfd-I or tfd I) and tfdKB II F II E II C II D II R (designated as tfd-II or tfd II), encodes a core set of genes for the ortho-cleavage pathway of chlorocatechol. In the case of cluster tfd II, this core set of genes is extended by the tfdA and tfdK genes [12-18].
The ortho-cleavage pathway of chlorinated catechols is a catabolic function shared by the tfd gene clusters on the one hand and the tcb and clc gene clusters on the other. The tcb gene clusters are located on the catabolic plasmid pP51, which contains two gene clusters, tcbAB and tcbRCDEF. The tcbAB cluster controls the degradation of chlorinated benzenes to chlorinated catechols, while tcbRCDEF is directly responsible for the degradation of chlorinated catechols [23,24]. The clc gene cluster (chlorocatechols-Clc) of plasmid pAC27, with a clcRABDE architecture, controls the degradation of 3-chlorobenzoate (3-CB) via the ortho-cleavage pathway of chlorinated catechols [25-27]. It has been shown that functionally related proteins of the tfd, tcb and clc gene clusters have high levels of similarity and identity between their amino acid and nucleotide sequences. Additionally, all clusters have shown strong DNA homology and similar organization [28-31]. Schlömann and colleagues [32] noted that, phylogenetically, the tfd-I, tcb and clc gene clusters diverged from a common ancestral chlorocatechol pathway. However, further phylogenetic analyses have indicated independent recruitment of diverse genes during assemblage of the tfd gene cluster [33] and the probable independent evolution of the tfdA gene [34].
Comparative genomic and phylogenetic approaches were applied to fill these knowledge gaps. The findings allowed the proposal of a new classification and nomenclature of these gene clusters, into tfd I, tfd II, tfd III and tfd IV(A,B,C) types. Additionally, the findings indicated that tfd clusters are a unique family of mosaic clusters whose clustering occurred in several evolutionary lineages with active recruitment of ancestral genes from Burkholderiales and Sphingomonadales through horizontal gene transfer. Possible paths for the clustering of tfd genes within each lineage are discussed in light of the various models of operon formation. These results constitute a further contribution to the understanding of bacterial genome organization and will be beneficial for the correct annotation of tfd clusters, as well as for further studies of their diversity, propagation, and evolution.

Results
The tfd, tcb and clc gene clusters were identified from publicly available genomes and plasmids by: (i) using canonical gene clusters of the plasmids pJP4 (AY365053), pP51 (M57629) and pAC27 (M16964) as NCBI BLAST query sequences; and (ii) the analysis of published articles. The searches were performed on 1 October 2023, and clusters whose matches overlapped two or more genes were included in the analysis. In total, these search methods identified eighty-five tfd, tcb and clc gene clusters with different genetic structures.
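The inclusion rule above (matches to two or more genes) can be illustrated with a minimal sketch. It assumes that BLAST hits have already been reduced to (accession, matched gene) pairs; the function name, the input format, and the accessions `ACC_A` to `ACC_C` are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def clusters_with_min_genes(hits, min_genes=2):
    """Return accessions whose hits cover at least `min_genes` distinct query genes.

    `hits` is an iterable of (accession, query_gene) pairs. This input format is
    an assumption made for illustration, not the pipeline used in the study.
    """
    genes_by_accession = defaultdict(set)
    for accession, gene in hits:
        genes_by_accession[accession].add(gene)
    return sorted(
        accession
        for accession, genes in genes_by_accession.items()
        if len(genes) >= min_genes
    )


# Hypothetical hits: ACC_B matches the same gene twice, so it is excluded.
hits = [
    ("ACC_A", "tfdC"), ("ACC_A", "tfdD"),
    ("ACC_B", "tfdC"), ("ACC_B", "tfdC"),
    ("ACC_C", "tfdE"), ("ACC_C", "tfdF"), ("ACC_C", "tfdB"),
]
kept = clusters_with_min_genes(hits)
print(kept)
```

Counting distinct genes rather than raw hits keeps a sequence that matches one gene several times from passing the filter.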
The genetic structure of each tfd I gene cluster was determined and is represented in Figure 1. Although the genetic architecture tfdB I F I E I D I C I T is well conserved in most members of the tfd I type, comparative analyses suggested that another five clusters had an incomplete arrangement with deletions of the tfdB I or tfdT genes. Amongst them were the tfdF I E I D I C I T clusters of the plasmid pNK84 (AP024328) and of P. phytofirmans OLGA172 (Burkholderia sp. R172 = Burkholderia sp. OLGA172) (CP014578). It should be noted that an incomplete cluster with a tfdD I C I TT genetic architecture was found earlier in P. phytofirmans OLGA172. A tfdF I E I D I C I type structure was found in an uncultivated bacterium (AB478351). It is also interesting to note that, in addition to the complete tfd I gene cluster of plasmid5, an incomplete cluster with the tfdB I F I E I D I structure, fully identical at the DNA sequence level, was found almost immediately downstream on the opposite strand of the plasmid (Figure 1). The DNA sequence identity of the aligned complete clusters of this type varied between 92.7% and 100%. The pkk4 plasmid cluster had the lowest identity with respect to the others, ranging from 92.7% to 96.8%. An interesting feature is that the tfdT gene was truncated both in the pJP4 plasmid and in the vast majority of pPO line plasmids (with the exception of pPO4). Thus, the vast majority of tfd I type clusters are represented by a complete set of genes without genetic rearrangements.
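A range such as "92.7% to 100%" is the minimum and maximum over all pairwise comparisons of aligned sequences. The sketch below shows one way to compute such a range. It assumes that identity is counted over columns where neither sequence has a gap; this definition is an assumption, since the text does not state how identity was calculated, and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b):
    """Percent identity of two aligned sequences, ignoring columns with a gap.

    Excluding gap columns is an assumption for illustration; other definitions
    (for example, dividing by the full alignment length) give other values.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == "-" or base_b == "-":
            continue  # skip gap columns
        compared += 1
        if base_a == base_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_range(aligned):
    """Return (min, max) percent identity over all pairs in a name -> sequence dict."""
    values = [
        percent_identity(aligned[first], aligned[second])
        for first, second in combinations(sorted(aligned), 2)
    ]
    return min(values), max(values)


# Hypothetical toy alignment of three cluster sequences.
toy_alignment = {
    "cluster_1": "ACGTACGTAC",
    "cluster_2": "ACGTACGTAC",
    "cluster_3": "ACGTTCGTAA",
}
toy_range = identity_range(toy_alignment)
print(toy_range)
```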

The tfd II Gene Cluster
A blastn search revealed thirteen clusters with a genetic architecture closely related to the tfd II seven-gene cluster of the pJP4 plasmid, namely, tfd II KBFECDR. All the clusters had plasmid localization in bacterial hosts belonging to the Burkholderiaceae family. Four of them had a tfd II AKBFECDR eight-gene structure (plasmids pDB1 (JQ436721), p712 (JQ436722), pEMT3 (JX469827) and plasmid5 (CP038640)), with the tfdA II gene located immediately downstream of the tfdK II gene in the cluster, in contrast with the prototype plasmid pJP4. Thus, they had a complete set of genes in their structures (Figure 1). It should be noted that the tfdA II gene was absent from the tfd II cluster in the pJP4 plasmid. It was located on the opposite strand downstream from the tfdR II gene of the cluster, together with the tfdS II gene, and was separated from them by the open reading frames ORF31 and ORF32. However, unlike pJP4, three of the four plasmids with the eight-gene structure (pDB1, p712, and pEMT3) did not have the tfd I gene cluster, whereas plasmid5 did contain it.

The tfd III Gene Cluster
A large group of eleven tfd III type clusters sharing a common gene structure but differing in length was identified by blastn search (details of the new classification and nomenclature are described in Sections 3 and 4). Some were annotated without specifying the type of tfd cluster: tfdDRCEBKAF [36], tfdRCDEF [59], tfdRCEBKA [37], and tfdAKBECR [39]. Others had been assigned to the tfd I and tfd II types, namely tfdRC II E II BKA [59], tfdFAKB I E I C I D I R I, and tfdC II E II B II [42], or were unpublished (plasmid pkk4) or not annotated [60] (Table S1). The more complete tfd III FAKBECDR eight-gene structures were encoded only on the mega-plasmids pM7012 (AB853026) and pkk4. They were fully identical to each other and shared 67.7-68.2% identity at the nucleotide level with the complete tfd II AKBFECDR clusters of the tfd II type (plasmids p712, pEMT3, pDB1 and plasmid5). In both plasmids, the tfdF III gene was located downstream of the tfdA III gene. Another feature was the presence of incomplete versions of this type of cluster, with structures tfdBECRR and tfdBECR, immediately downstream of the tfdR III gene on the opposite DNA strand. In these reduced clusters, the tfdD III gene was completely deleted, while tfdR III was partially reduced and, in pkk4, was represented by two copies. The plasmid regions that included both clusters were almost identical at the nucleotide level between pM7012 and pkk4. The clusters with structures tfd III BECRR and tfd III BECR both shared 77.6% identity with both tfd III FAKBECDR clusters. Importantly, in the mega-plasmid pkk4, the tfd I type cluster was located upstream from tfd III FAKBECDR (Figure 1).
The clusters with structures tfd III F,AKBECR were identified in plasmids pIJB1 (JX847411) and pEST4011 (AY540995). In them, the tfdF III gene was also located downstream of the tfdA III gene, as in the mega-plasmids pkk4 and pM7012 described above, but there were two small ORFs between them. Additionally, in contrast to pM7012 and pkk4, these clusters lacked a tfdD III gene and the short clusters tfd III BECR/BECRR. Both plasmids pAKD25 (JN106170) and pAKD26 (JN106171), from the same bacterial host [39], had clusters with a tfd III AKBECR gene structure without the tfdF III and tfdD III genes. The shortest structures of that type were identified in plasmids pRK1-5 (CP062809) and pTV1 (AB028643) and in Delftia acidovorans P4a (AY078159). They had clusters with tfd III KBECR, tfd III AKBE and tfd III BECR gene structures, respectively. Interestingly, the core set of genes of this tfd III cluster type (tfd III BECR), with the exception of D. acidovorans P4a, varied in nucleotide identity between 77.3% and 100%.
An important distinguishing feature of this group of clusters, in addition to a similar structure, was the presence of tcb gene clusters adjacent to the tfd III gene clusters. In total, nine clusters were identified, of which three had a full set of tcbRCDEF genes with ORF3, five had the structure tcbDORF3 and one had tcbRC. The plasmids pkk4 and pM7012 were exceptions, as they did not possess any tcb gene clusters. The complete tcb clusters were identified on the opposite strand almost immediately downstream of the tfdA III genes of D. acidovorans P4a and plasmid pAKD26, as well as the tfdB III gene of plasmid pRK1-5. Moreover, in D. acidovorans P4a, the tcbRC cluster was identified immediately upstream of the tfdE III gene. This was flanked by an ORF encoding a transcriptional regulator. The other plasmids, pEST4011, pIJB1, pAKD25, and pTV1, possessed a tcbDORF3 cluster. Interestingly, pEST4011 possessed two copies of that cluster. The detailed synteny of the regions mentioned above is depicted in Figure 1.

The tfd IV Gene Clusters from Sphingomonas and Bradyrhizobium
This group combines non-traditional clusters that have variable gene structures and differ significantly from the canonical tfd gene clusters of the I and II types. The published clusters were annotated as dccEA I D I, dccA II D II, dccEA I D I [62], tfdC2E2, tfdC2E2F2, tfdDRFCE, tfdDRF [38,63], cnbCDEF [64], tfdBCDEFKR [41], tfdEC, tfdCE [7], tfdBaFRDEC [10], tfdEICIFIRDI,K,BI and tfdDII,FIIEIICII,BII [46]. Nevertheless, comparative genomic analysis showed that the genetic structure of this tfd IV type cluster could be divided into three main subtypes, A, B and C (Figure 2a) (details of the new classification and nomenclature are described in Sections 3 and 4).
The most common structure, subtype A (tfd IVA ECFRD,B), was found in the bacterial strains Sphingobium herbicidovorans MH and Sphingopyxis sp. KK2 and on two plasmids, pDB-1 and pCADAB1. According to Nielsen et al. [63], the strain Sphingomonas sp. TFD44 also possesses tfd IVA ECFRD,B, but this sequence was absent from publicly available databases. Therefore, the early variant of the Sphingomonas sp. TFD44 sequence, with a tfd IVA ECFRD structure without the tfdB IVA gene, which was sequenced by Thiel and colleagues (2005) [38], was used for analysis. An interesting feature of the tfd IVA ECFRD,B subclusters was the reverse orientation of the tfdR IVA gene compared with the tcb, clc and other tfd clusters. Normally, that gene has the opposite orientation to the other genes in the cluster. Moreover, the tfdD IVA gene also had another orientation, opposite to the other cluster genes (with the exception of the pDB-1 plasmid). In the strain Sphingomonas histidinilytica BT1 5.2 (WMBU01000047) and on the plasmid pMSHV (CP020539) of S. herbicidovorans MH, described above, less clustered variations with structures tfd IVA RD,B,CE and tfd IVA FRD,B,CE, respectively, were found. It should be noted that there was a high degree of synteny between the majority of the members of this subtype (Figure 2b). The most common tfd IVA ECFRD structures of subtype A shared an identity at the nucleotide level ranging from 88.6% to 100%.
The second subtype of this cluster, subtype B (tfd IVB D,FEC/FEC/D,FEC,B), was identified on the plasmids pDB-4 (CP102388) and pHSL1 (CP018222) and in the strains Pseudomonas stutzeri ZWLR2-1 (GU181397) and Sphingomonas sp. tfd44. The last of these had the short variant without the tfdD IVB gene, tfd IVB FEC. The whole set of genes in that subtype was ordered in one direction, in contrast with subtype A (Figure 2a). Interestingly, on pHSL1 and pDB-4, the tfd IVB FEC gene structure was sandwiched into the cluster of genes encoding the three-component Rieske-type [2Fe-2S] dioxygenase (anthranilate 1,2-dioxygenase) of B. cepacia DBO1, designated AntDO-3C and consisting of the AndAaAbAcAd genes. The hybrid cluster had the structure tfdD IVB AndAaAbtfd IVB FECAndAdAc. The other strains, P. stutzeri ZWLR2-1 and Sphingomonas sp. tfd44, had shorter structures, tfdD IVB AndAaAbtfd IVB FECAndAd and AndAaAbtfd IVB FECAndAd, respectively. The tfd IVB FEC core set of genes shared 84.0-95.2% identity at the nucleotide level. Thus, that subtype was distributed mostly in the Sphingomonadaceae family, with the exception of P. stutzeri ZWLR2-1.
The third subtype, C, was present on contigs of Bradyrhizobium sp. RD5-C2 (BOVL01000048) and S. histidinilytica BT1 5.2 (WMBU01000019) as gene structures with different degrees of gene assemblage, tfd IVC CEDRF,B and tfd IVC CED,R,F, respectively. The order and orientation of genes in that subtype differed from both subtypes A and B (Figure 2a).
The obtained results indicated that three bacterial strains possessed two subtypes of tfd-like clusters. Specifically, Sphingomonas sp. TFD44 (tfd44) and Sphingopyxis sp. DBS4 had the A and B subtypes, while S. histidinilytica BT1 5.2 had the A and C subtypes.

The tcb and clc Gene Clusters
A blastn search revealed sixteen sequences with high identity at the nucleotide level with the canonical clcRABDE cluster of plasmid pAC27 (M16964). Among them were gene clusters annotated as clc and under other names: clcr1a1b1d1e1 [52], dccEDCBAR [53] and tfdFE(tctC)DCS [7]. Additionally, some clusters were sequenced by different authors but belonged to the same bacterial strain: Pseudomonas knackmussii B13 [52,61-66], Paraburkholderia xenovorans LB400 [67,68], Pseudomonas aeruginosa strain JB2 [69-71] and Pandoraea pnomenusa MCB032 [51,72]. As a result, the most fully sequenced gene clusters were taken for further analysis based on the latest data deposited in GenBank (Table S1). From a genetic architecture point of view, all the clusters were highly conserved and had a common set of clcRABDE genes, as well as ORF3 genes, in their structure. Within the clc cluster, the nucleotide sequences shared an identity of between 96.3% and 100%. The plasmid pAC27 (M16964) and Pseudomonas aeruginosa 142 (AF161263) possessed a truncated clcR gene. The ORF3 of strain Diaphorobacter sp. JS3051 was represented by two truncated ORFs. There were no deletions, inversions or duplications of genes in the identified clusters. Figure 3a illustrates the results of a comparative genomic analysis of the identified clusters and flanking ORFs. Moreover, the clusters of strains B. petrii DSM 12804 (AM902716) and P. knackmussii B13 (HG322950) and of plasmid unnamed 2 of the strain P. pnomenusa MCB032 (CP015373) had a high level of synteny downstream of the clcE gene. Downstream of the clcR gene, a high level of synteny was observed in B. petrii DSM 12804 (AM902716), P. knackmussii B13 (HG322950), P. xenovorans LB400 (CP008760), P. aeruginosa JB2 and Diaphorobacter sp. JS3051 (CP065406). Almost all the gene clusters, with three exceptions (pAC27, paaa and unnamed 2), had non-plasmid localization. Most of the bacteria carrying these clusters belonged to the families Alcaligenaceae, Burkholderiaceae, and Comamonadaceae of the order Burkholderiales of Betaproteobacteria, with the exception of a few pseudomonads and Escherichia coli JM103 from Gammaproteobacteria.
Comparative genomic analysis has shown a high level of synteny of flanking ORFs for all the clusters (Figure 3b). Interestingly, a tcbAB gene cluster was identified near the tcbRCDEF cluster only in the genome of Bordetella petrii DSM 12804. Most of the bacteria carrying these clusters belonged to the families Alcaligenaceae, Burkholderiaceae and Comamonadaceae of the order Burkholderiales of Betaproteobacteria, with the exception of Sphingomonas sp. C8-2 and Pseudomonas sp. P51 from Alpha- and Gammaproteobacteria, respectively.
A common feature of both the clc and tcb clusters was that the gene encoding maleylacetate reductase (clcE and tcbF, respectively) was flanked immediately downstream by an ORF containing a conserved AraC binding domain. The exceptions were the strains P. aeruginosa 142 and Alcaligenes sp. NyZ215 and the plasmids pAC27, pAKD26 and pRK1-5. The ORFs across these clusters shared a similarity in the range of 42.5% to 100%. Meanwhile, similarity across the clc and tcb clusters ranged from 97.3% to 100% and from 71.5% to 100%, respectively.
Comparative Genomics for the tfd, tcb and clc Clusters
Structurally, clc, tcb and all types of tfd clusters (excluding tfd IV) shared a common structure with a unidirectional set of genes encoding catabolic reactions and a regulatory protein gene in the opposite direction (Figure 4).
Like clc and tcb, the tfd II and tfd III clusters were structurally closely related to each other, but differed in terms of tfdF gene localization. Both tfd II and tfd III clusters possessed a reverse order of the genes encoding chloromuconate cycloisomerase (tfdD II and tfdD III) and chlorocatechol 1,2-dioxygenase (tfdC II and tfdC III) compared with clc, tcb and tfd I. However, localization of the tfdB gene in tfd I, tfd II and tfd III clusters was identical. The identity between tfd II and tfd III clusters ranged from 67.7% to 68.2%. At the same time, tfd I clusters shared from 49.0% to 52.2% and from 49.5% to 52.5% identity at the nucleotide level with tfd II and tfd III, respectively.
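The gene-order comparison described here can be sketched as follows. The example maps each gene letter to the enzyme role named in this article, orients every cluster so that the regulator gene comes first, and then checks whether the chlorocatechol 1,2-dioxygenase gene precedes the chloromuconate cycloisomerase gene. It assumes that the written gene strings (tcbRCDEF, clcRABDE, tfdBFEDCT, tfdKBFECDR, tfdFAKBECDR) reflect physical gene order and it ignores strand; the function names are hypothetical.

```python
# Gene letter -> role, following the enzyme names used in this article.
ROLE = {
    "tcb": {"R": "regulator", "C": "dioxygenase", "D": "cycloisomerase",
            "E": "hydrolase", "F": "reductase"},
    "clc": {"R": "regulator", "A": "dioxygenase", "B": "cycloisomerase",
            "D": "hydrolase", "E": "reductase"},
    "tfd": {"T": "regulator", "R": "regulator", "C": "dioxygenase",
            "D": "cycloisomerase", "E": "hydrolase", "F": "reductase",
            "B": "hydroxylase", "K": "transporter", "A": "2,4-D dioxygenase"},
}


def roles_from_regulator(family, letters):
    """Translate gene letters to roles and orient the list so the regulator is first."""
    roles = [ROLE[family][letter] for letter in letters]
    if roles[-1] == "regulator":
        roles.reverse()
    return roles


def dioxygenase_before_cycloisomerase(family, letters):
    """True if the dioxygenase gene precedes the cycloisomerase gene after orienting."""
    roles = roles_from_regulator(family, letters)
    return roles.index("dioxygenase") < roles.index("cycloisomerase")


# Canonical gene strings as written in this article (assumed to reflect gene order).
clusters = {
    "tcb": ("tcb", "RCDEF"),
    "clc": ("clc", "RABDE"),
    "tfd I": ("tfd", "BFEDCT"),
    "tfd II": ("tfd", "KBFECDR"),
    "tfd III": ("tfd", "FAKBECDR"),
}
order = {
    name: dioxygenase_before_cycloisomerase(family, letters)
    for name, (family, letters) in clusters.items()
}
print(order)
```

Under these assumptions the check returns True for tcb, clc and tfd I and False for tfd II and tfd III, which matches the reversed order of these two genes described in the text.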

The Phylogenetic Analysis of Deduced Protein Sequences Encoded by the tfd, tcb and clc Gene Clusters
Phylogenetic analysis of the corresponding deduced protein sequences allowed the evolutionary fate of genes in the tfd, tcb and clc clusters, as well as the role of horizontal gene transfer, to be assessed.
Comparative analysis of the protein sequences of α-ketoglutarate-dependent 2,4-D dioxygenases of the tfd II and tfd III clusters showed that similarity between them ranged from 87.2% to 100%. The ranges of similarity between the protein sequences of TfdAs inside each cluster were 93.8-100% and 96.9-100% for tfd II and tfd III, respectively.

2,4-D Transport Protein (tfdK)
The ML tree topology recovered the clades of the tfd II and tfd III clusters as monophyletic with strong support. Each lineage further diverged into two well- and strongly supported sister clades (Figure S1b). The two clades of the first lineage were formed by the TfdKs of plasmid5 and pDB1 on the one hand, and of p712, pEMT3, pJP4, pPO1 and pPO26 on the other. The second lineage diverged into two clades, the first of which was formed by almost all the transport proteins of the tfd III clusters, and the second of which was formed by pkk4 and pM7012.
Comparative analysis of the amino acid sequences of the 2,4-D transport proteins of the tfd II and tfd III clusters, encoded by the tfdK II and tfdK III genes, showed that similarity between them varied from 80.8% to 100%. The similarity between the amino acid sequences of TfdKs inside each cluster ranged from 94.4% to 100% and from 93.8% to 100% for the tfd II and tfd III clusters, respectively.

2,4-DCP Hydroxylase (tfdB)
The results of the ML analysis clearly showed strongly supported paraphyly of the tfd IV cluster (subcluster B) with respect to tfd I, tfd II and tfd III. Nevertheless, the monophyly of the tfd I clade and of the other clades was poorly supported (Figure S1c). Interestingly, subcluster A of the tfd IV cluster resolved as paraphyletic with respect to the tfd II and tfd III clusters, but that result was not supported by bootstrap. Moreover, the tfd III clade was recovered as strongly supported and paraphyletic with respect to the tfd II cluster.
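The terms monophyletic and paraphyletic used throughout this section can be illustrated on a toy tree. The sketch below assumes a rooted tree written as nested tuples with hypothetical leaf names; it is not the tree from Figure S1c and does not reproduce the ML analysis or its bootstrap support. A group is treated as monophyletic when its members are exactly the leaves of one subtree.

```python
def leaves(node):
    """Return the set of leaf names under a node (a leaf is a string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def is_monophyletic(tree, group):
    """True if `group` is exactly the leaf set of some subtree of the rooted tree."""
    group = frozenset(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted toy tree in which the type II leaves are nested inside type III.
toy_tree = ((("III_a", ("II_a", "II_b")), "III_b"), "I_a")

type_ii = {"II_a", "II_b"}
type_iii = {"III_a", "III_b"}

ii_is_clade = is_monophyletic(toy_tree, type_ii)
iii_is_clade = is_monophyletic(toy_tree, type_iii)
combined_is_clade = is_monophyletic(toy_tree, type_ii | type_iii)
print(ii_is_clade, iii_is_clade, combined_is_clade)
```

In this toy tree the type III leaves alone do not form a clade, but they do once the nested type II leaves are added. This is the pattern described as one clade being paraphyletic with respect to another.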
A comparative analysis of the amino acid sequences of 2,4-DCP hydroxylases encoded by the tfdB I, tfdB II, tfdB III, and tfdB IV genes showed that they shared similarities of between 47.3% and 100%. The similarities between the amino acid sequences of TfdBs inside each cluster were in the range of 78.0-100% (tfd I), 95.9-100% (tfd II), 91.3-100% (tfd III), and 64.3-100% (tfd IV), respectively.

Chlorocatechol 1,2-Dioxygenases (tcbC, clcA, tfdC I, tfdC II, tfdC III, tfdC IV)
The ML tree topology indicated that all the strongly supported clades uniting the chlorocatechol 1,2-dioxygenases of all the tfd, clc and tcb gene clusters (with the exception of subcluster C from the tfd IV cluster) resolved as monophyletic with moderate support (Figure 5a). The clades uniting chlorocatechol 1,2-dioxygenases from the tfd II and tfd III clusters were recovered as sister clades with strong support. The monophyly of chlorocatechol 1,2-dioxygenases for tfd I, tfd II and tfd III was not significantly supported. At the same time, the clades uniting chlorocatechol 1,2-dioxygenases from the clc and tcb clusters resolved as moderately supported sister clades. Within the tfd III clade, two well-supported sister clades were recovered, uniting proteins from plasmids pkk4 and pM7012 in the first clade and other plasmids in the second clade. The monophyly of the clades uniting proteins from subtypes A and B of the tfd IV cluster was well supported.
A comparative analysis of the amino acid sequences of chlorocatechol 1,2-dioxygenases in these clusters, encoded by the tcbC, clcA, tfdC I, tfdC II, tfdC III, and tfdC IV genes, showed that they shared a similarity of between 37.9% and 100%. The similarity between the amino acid sequences of chlorocatechol 1,2-dioxygenases inside each cluster was in the range of 95.2-100% (tcb), 96.9-100% (clc), 85.4-100% (tfd I), 95.7-100% (tfd II), 96.5-100% (tfd III) and 38.0-100% (tfd IV), respectively.

Chloromuconate Cycloisomerases (tcbD, clcB, tfdD I, tfdD II, tfdD III, tfdD IV)
In ML analysis, almost all the clades were resolved as well-supported monophyletic clades (Figure 5b). The exceptions to this were the two strongly supported clades uniting the chloromuconate cycloisomerases from the tfd II and tfd III clusters. At the same time, tfd II was recovered as a poorly supported paraphyletic clade, with the tfd III clade falling within it. In the monophyletic lineage, all clades received strong support. At the same time, two sister clades with low support were recovered within the clade uniting chloromuconate cycloisomerases from the tcb cluster. The monophyly of the clades uniting chloromuconate cycloisomerases from clc and tcb was moderately supported. However, together with the tfd I clade, they were recovered as monophyletic with strong support. Within the tfd IV cluster, three subclusters (A, B, and C) received good or strong support and were found to be monophyletic with moderate support.
A comparative analysis of the amino acid sequences of chloromuconate cycloisomerases in these clusters, encoded by the tcbD, clcB, tfdD I, tfdD II, tfdD III, and tfdD IV genes, indicated that they shared a similarity of between 45.6% and 100%. The similarity between the amino acid sequences of chloromuconate cycloisomerases inside each cluster was in the range of 80.3-100% (tcb), 98.7-100% (clc), 92.7-100% (tfd I), 94.8-100% (tfd II), 100% (tfd III) and 49.5-100% (tfd IV), respectively.

Chlorodienelactone Hydrolases (tcbE, clcD, tfdE I, tfdE II, tfdE III, tfdE IV)
By ML analysis, chlorodienelactone hydrolases were recovered as two monophyletic, well- and strongly supported lineages (Figure 5c). The first included clusters tcb, clc, tfd I, and, surprisingly, subcluster A of the tfd IV cluster. Each cluster formed a strongly supported clade. The tcb and tfd I clusters appeared as a sister group, although they received low support. The second lineage included the tfd II, tfd III, and tfd IV (subclusters A and B) clades with good and strong support. The clade comprising tfdE III chlorodienelactone hydrolases was recovered as paraphyletic, with tfd II cluster members falling within that clade. However, that result was only moderately supported. Subclusters A and B of tfd IV were recovered as sister clades with well-supported topology.

Maleylacetate Reductases (tcbF, clcE, tfdF
This phylogenetic analysis recovered these proteins, encoded by all the analyzed clusters (with the exception of maleylacetate reductases from tfd II), as monophyletic with good support (Figure 5d). As such, clc, tcb, tfd I, tfd IV and tfd III each clustered in well- and strongly supported clades. The monophyly of maleylacetate reductases from clc, tcb and tfd I was strongly supported. The tfd IV cluster was recovered as paraphyletic, with the tfd III cluster falling within that clade, albeit with low bootstrap support. The ML analysis recovered tfd II as monophyletic with strong support and positioned the clade consisting of pDB1 and plasmid5 as the sister of the clade formed by all the other plasmids, also with strong support. All of the above confirmed the polyphyly of maleylacetate reductases.

2.2.8. Transcriptional Regulator (tcbR, clcR, tfdT, tfdR II/tfdS, tfdR III, tfdR IV)
The monophyly of the transcriptional regulators encoded by clc, tcb, tfd I, tfd II and tfd III clusters (with the exception of tfd IV) was well supported in ML analyses (Figure S2). The nodes, which are crucial for understanding the phylogenetic relationships between clc, tcb, tfd I, tfd II and tfd III clusters, were weak or unsupported (with the exception of the tfd II and tfd III clusters). Nevertheless, each of the analyzed clusters, tfd I, tfd II, tfd III, clc and tcb, formed its own strongly supported clade. The clade consisting of TfdR III transcriptional regulators was recovered as a strongly supported paraphyletic clade, with the tfd II cluster members falling within that clade. The monophyly of tfd IV cluster transcriptional regulators was moderately supported, and within the tfd IV cluster, both subclusters, I and III, were recovered to be monophyletic with strong and good support, respectively. Thus, transcriptional regulators were recovered as polyphyletic.
In summary, congruence between all the phylogenies of the proteins involved in catechol degradation through the ortho-cleavage pathway among the tcb, clc and tfd I clusters could be concluded. The proteins of other clusters, tfd II, tfd III and tfd IV, recovered both congruent and incongruent phylogenies. The proteins of cluster tfd II were congruent with each other, as were the proteins of cluster tfd III, with one exception, namely, the protein maleylacetate reductase, which showed an incongruent phylogeny. Incongruence within cluster tfd IV was observed among chlorocatechol 1,2-dioxygenases and chlorodienelactone hydrolase proteins.
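The terms monophyletic and paraphyletic used throughout the preceding subsections can be made concrete with a small check on a rooted tree. The following is a minimal sketch, not the procedure used in this work: the tree is a hypothetical toy topology written as nested tuples, the leaf names are invented for illustration, and rooting is assumed. It mimics the situation described for the chloromuconate cycloisomerases, where the tfd III clade falls within tfd II.

```python
def leaves(node):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    result = frozenset()
    for child in node:
        result |= leaves(child)
    return result


def clades(node):
    """Yield the leaf set of every subtree of a rooted tree, leaves included."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group):
    """A group is monophyletic if some subtree contains exactly its members."""
    group = frozenset(group)
    return any(clade == group for clade in clades(tree))


# Hypothetical toy topology (assumption): type III sequences nested in type II.
TOY_TREE = ((("II_a", ("II_b", ("III_a", "III_b"))), "I_a"), "IV_a")

print(is_monophyletic(TOY_TREE, {"III_a", "III_b"}))                  # True
print(is_monophyletic(TOY_TREE, {"II_a", "II_b"}))                    # False
print(is_monophyletic(TOY_TREE, {"II_a", "II_b", "III_a", "III_b"}))  # True
```

In this toy tree the type II sequences alone do not form a clade (they are paraphyletic with respect to type III), whereas types II and III together do.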

The New Classification Scheme and Nomenclature of tfd Gene Clusters
The obtained results clearly indicate that tfd I and tfd II type clusters are well conserved in relation to their structures and can continue to be classified as type I and type II, without any changes in classification. Meanwhile, the two other groups of clusters are separate types of tfd clusters with substantive structural changes and different evolutionary origins. Taking this into account, they should be given independent designation numbers, type III and type IV. Thus, based on both the historical continuity of the designation of tfd gene clusters [15,17] and the results of both comparative genomics and protein phylogeny, a new classification scheme is proposed, categorizing tfd gene clusters into four types, I, II, III and IV (A, B and C), alongside a new nomenclature (for details of the syntax, see Section 4).

The Role of Horizontal Gene Transfer (HGT) and Gene Displacement in the Mosaic Nature of tfd Gene Clusters
It is generally accepted that horizontal gene transfer (HGT) is the major process in bacterial evolution; its role has been proved in numerous studies. Mosaic operons contain genes transferred by HGT, which are characterized by the incongruence of their phylogeny with other genes [74]. Clusters, as well as operons, can also be mosaic in nature [75]. Nevertheless, the analysis of tfd gene clusters from the standpoint of incongruence (or so-called discrepancies) in their phylogeny has only been performed in a few papers [33,76].
The congruence of the phylogeny of all tcb and clc cluster proteins clearly illustrates that they are not mosaic and are spread entirely by horizontal transfer. Moreover, the same conclusion can be drawn about the core five-gene part of cluster tfd I responsible for the catechol ortho-cleavage pathway. Previously, it has been pointed out that from a phylogenetic point of view these clusters diverged from a common ancestral ortho-cleavage pathway for chlorocatechols [33].
Very intriguing findings follow from the congruence of almost all the proteins of clusters tfd II and tfd III, with the exception of maleylacetate reductases (tfdF II and tfdF III). These findings may be explained in light of an HGT event followed by differential gene loss: the displacement of an ancestral tfdF III (which probably shared common ancestry with tfdF II) by a functionally equivalent maleylacetate reductase gene homologous to tfdF IV. This phenomenon, a homolog displacement, is probably selectively neutral [77] but is one of those that led to the origin of a different type of tfd cluster.
Analyzing the four proteins of cluster tfd IV (TfdC IV, TfdD IV, TfdE IV and TfdF IV) directly involved in the catechol pathway, it becomes obvious that subclusters A and B are not mosaic and are entirely distributed by HGT. In contrast, subcluster C, in the case of the chlorocatechol 1,2-dioxygenases (TfdC IVC) and chlorodienelactone hydrolases (TfdE IVC) proteins, had other evolutionary ancestors. These findings classify this subcluster as a mosaic.
Interestingly enough, the results of phylogenetic analysis clearly revealed that the tfdB gene for ancestral 2,4-DCP hydroxylase protein was assembled by clusters tfd I, tfd II, tfd III and tfd IV after distribution through HGT. However, with regard to gene tfdB, the results obtained in this work are inconsistent with an earlier conclusion about its independent evolution [78].
Thus, HGT between orders Burkholderiales and Sphingomonadales provided phenomenal linkage between tfd I, tfd II, tfd III and tfd IV type clusters. As a result, these orders have proven to be the most adapted to repeated exposures to the herbicide 2,4-D in soils.

Evolution Lineage including Homologous tfd I , tcb and clc Gene Clusters
All the obtained results point to the existence of the lineage including three types of analyzed clusters, namely, tcb, clc and tfd I, which are considered homologs. This conclusion is supported by at least five lines of evidence: (i) the clusters have similar core five-gene structures (genes responsible for the ortho-cleavage of catechols) and an identical order of genes, especially those encoding chloromuconate cycloisomerases and chlorocatechol 1,2-dioxygenases; (ii) the protein tree topologies suggest a common origin; (iii) high identities and similarities of DNA and protein sequences; (iv) the presence of ORF3 and the flanking ORF encoding the AraC family conserved domain in tcb and clc; and (v) the clusters are widespread mainly across the order Burkholderiales (and even in a genome of the same strain, for example, B. petrii DSM 12804). Previous studies have suggested gene homology and possible evolutionary relatedness between genes of the first-described clusters tcb, clc and tfd I [27,29,31].
The results apparently suggest that tcb, clc and tfd I clusters are descendants of a single ancestral cluster, which further evolved by adapting to different substrates. The absence of ancestral cluster data, high conservation, and almost complete absence of genetic rearrangements among tcb, clc and tfd I clusters prevent the causes and mechanisms of the main models of cluster formation from being delineated, with one exception. Nevertheless, the prevalent plasmid localization (with the exception of clc cluster) clearly suggests that they are the hot spots of cluster evolution and propagation. In turn, this indicates a possible assemblage of these clusters according to the 'Scribbling Pad' model [79]. The clc clusters are relatively poorly localized on plasmids compared with tfd I and tcb, while at the same time remaining more conserved at the nucleotide sequence level. This also correlates well with this model. The tcb and clc gene clusters retained a high synteny of their own and flanking gene structure, suggesting this is the ancestral state. Interestingly, the conservation of tcb cluster genes was higher than that of the flanking ORFs, which could suggest their high priority to bacterial hosts.
Perhaps tfd I type clusters evolved from a common ancestor with clusters tcb and clc by clustering the tfdB I gene (described above) and eliminating ORF3, resulting in the acquisition of the ability to hydroxylate 2,4-dichlorophenol (2,4-DCP) to 3,5-dichlorocatechol (3,5-DCC). This is supported by the absence of this gene in the clusters of the pNK84 plasmid and the P. phytofirmans OLGA172, which, according to the topology of all ortho-pathway proteins, diverged from a common ancestor earlier than the main group. Thus, they probably diverged before the tfdB I gene was assembled. This allowed this cluster to more narrowly specialize in the degradation of 2,4-DCP into 3,5-DCC.

Evolution Lineage including Homologous tfd II and tfd III Type Clusters
This lineage of tfd gene clusters is comprised of two types, tfd II and tfd III, that have a similar structure and whose proteins are almost entirely homologous. The exception is the chloromaleylacetate reductase protein, which has a different evolutionary origin to those types (described above). There is no doubt that these types of clusters originated from the division of a common prototype cluster into two branches, as confirmed by the obtained results. By analogy with the evolutionary lineage including clusters tcb, clc and tfd I, only one of the potentially possible models of its formation can be distinguished for the prototype cluster, namely, the 'Scribbling Pad' model [79], since, obviously, exclusively plasmid localization is a consequence of plasmid-mediated clustering and propagation. Apparently, the crucial point in the evolutionary division of this lineage into two types was the homolog displacement of the tfdF gene in the common prototype cluster (described above).
Currently, it is obvious that the full tfd II cluster with a complete tfd II AKBFECDR eight-gene structure is more widely distributed among plasmids than the tfd II KBFECDR seven-gene structure of the pJP4 plasmid. Nevertheless, the recent report about pPO line plasmids with their overall structure essentially the same as the pJP4 plasmid and with deletion variations of cluster tfd II [44] puts much into perspective. Apparently, one consequent series of genetic rearrangements, primarily deletions, in pJP4 or pJP4-like plasmids, has resulted in the current diversity and propagation of these cluster variations. Since, of all the resulting variations, the seven-gene pJP4 cluster was the first to be explored, it became canonical.
The features typically associated with tfd III type clusters can be derived through a series of rearrangements leading to incomplete clusters with tfd III AKBECR structure (excluding pRK1-5 and D. acidovorans P4a). It is probable that the ancient tfd II AKBFECDR underwent: (i) successful tfdF II gene displacement that has resulted in a future tfd III FAKBECDR structure of the cluster (plasmids pkk4 and pM7012); (ii) simultaneously or sequentially successful displacement of the tfdF II gene and unsuccessful displacement of tfdD II that has resulted in a future tfd III F,AKBECR structure of the cluster (plasmids pIJB1 and pEST4011); and (iii) simultaneously or sequentially unsuccessful displacement of the tfdF II and tfdD II genes that has resulted in a future tfd III AKBECR structure of the cluster (plasmids pAKD25, pAKD26, pTV1). It is important to note that plasmid pTV1 also has the tfd III AKBECR structure, but the tfdA III gene was partially sequenced in further study [78]. These findings are inconsistent with the version proposed by Sakai et al. [42] about the probability of the assemblage of the tfd II type cluster recruiting genes from tfd I.
Moreover, the structure of tfd III AKBECR has undergone further rearrangements. In D. acidovorans P4a, the two-gene cluster tfd III CR was replaced with a two-gene cluster tcbCR which encoded the functionally identical proteins of the tcb cluster. This led to the emergence of a hybrid cluster tfd III AKBEtcbCR. Obviously, recombination has occurred between the tfd III cluster and the tcb cluster located almost immediately downstream. Apparently, the plasmid pRK1-5 also had the structure tfd III AKBECR, but genes tfd III AK were deleted. Interestingly, plasmids pRK1-5, pAKD26 and D. acidovorans P4a share a similarly close proximity between the tfd III and tcb clusters, indicating that their ancestral state was tcbRCDEF and tfd III AKBECR and the other structure rearrangements occurred later. The latter is also indicated by the fact that plasmids pEST4011, pIJB1, pAKD25 and pTV1 have tfd III AKBECR and tcb clusters consisting of the ORF3 and tcbD gene instead of a complete tcb cluster. Finally, the ancestral state of tcbRCDEF and tfd III AKBECR is confirmed by the fact that plasmids pAKD25 and pAKD26 were isolated from the same bacterial host [39].

Evolution Lineage including Unique tfd IV Type Clusters
The tfd IV clusters have a gene structure with incomplete clustering, which significantly distinguishes them from the above-mentioned tfd I, tfd III, clc and tcb gene clusters. A second shared feature is the occurrence of several gene structures and the absence of the tfdA gene. Whereas the above clusters have a complete set of genes, tfd IV members have only five genes in the cluster. The results of both synteny and protein phylogenetic analysis revealed three sublineages inside tfd IV type clusters which evolved separately, but shared a common ancestor for the majority of encoded proteins.
The major subtype A involves seven clusters, three of which were plasmid-localized. Five of them share a five-gene structure and a nearby tfdB IVA gene, tfd IVA ECFRD,B (Figure 2b), suggesting a common path for that sublineage clustering. The induction of genes of that tfd IV cluster sublineage in Sphingomonas sp. TFD44 in the presence of 2,4-D was previously noted by Thiel and colleagues (2005) [38].
The members of the second subtype, B, of tfd IV type clusters with the tfd IVB D,FEC/FEC/D,FEC,B structure for the cluster, to date, evolved as parts of pathways responsible for the degradation of multiple compounds. The presented results indicate that tfd IVB type clusters are assembled with several genes encoding the anthranilate dioxygenase (AntDO-3C) enzyme of B. cepacia DBO1. Those genes are responsible for the degradation of anthranilate (2-aminobenzoate) [80]. Analogous recruitment was found by Liu et al. (2011) [64] for 2-chloronitrobenzene (2CNB) degrading genes. Nevertheless, the same cluster in Sphingomonas sp. TFD44 (tfd IVB FEC cluster) encoded 2,4-D degradation and had the tfdD gene in another place in the genome [38]. Since the above-mentioned clusters tfd IVB D,FEC/FEC/D,FEC,B have an almost identical structure, high identity and synteny, apparently, they were distributed initially amongst order Sphingomonadales, similar to other tfd IV type clusters, and later were acquired by P. stutzeri ZWLR2-1. The plasmid pDB-4 has the most complete tfd IVB D,FEC,B cluster and the presence of the tfdB IV gene also indicates the relatedness of the first and second subtype of tfd IV type clusters (Figure 2a). Additionally, it is probable that the above-mentioned dioxygenases show broad substrate specificity, and that tfd IVB D,FEC/FEC/D,FEC,B are capable of degrading modified catechols which occur as intermediates in the catabolism of many compounds, including anthranilate, 2CNB, 2,4-D and others.
The third sublineage of tfd IVC type clusters, tfd IVC CEDRF,B/CE,D,R,F, to date, can be defined by two clusters with different levels of clustering in contigs of Bradyrhizobium sp. RD5-C2 (BOVL01000048) and Sphingomonas histidinilytica BT1 5.2 (WMBU01000019), respectively. The first one was designated as tfdBaFRDEC and is involved in 2,4-D degradation from dichlorophenol with the same degradation pathway as that of C. pinatubonensis JMP134 [10]. The S. histidinilytica BT1 5.2, which is capable of degrading 2,4-D, possesses the second cluster, tfd IVC CE,D,R,F, which was annotated by Nguyen and colleagues (2021) as tfdF,S,D,EC [7].
The main structural feature across members of this lineage, incomplete clustering, highlights the possible directions, driving forces and models of clustering in contrast with other lineages. The extreme rarity of genes of this evolutionary lineage among bacteria may lead them to form clusters as a protective reaction in order to avoid evolutionary loss and promote propagation across bacteria. This would be very consistent with the 'Selfish operon model' [81] and apparently contradicts persistence as a driving force of their clustering [82]. Moreover, the proven HGT, the absence of adjacent insertion elements (IS) with one exception, and duplicated genes, actually oppose co-regulation theory [83], the 'IDE model' [84] and SNAP hypothesis [85], respectively. Interestingly, the revealed direct contribution of plasmids to genetic rearrangements in subtype A indicates that clustering can proceed in two different models.
Currently, Sphingomonas and Bradyrhizobium genera are classified as I and III classes of 2,4-D degraders possessing the second system, cadABCD gene cluster, responsible for the initiation of degradation of chlorophenoxyacetic acids [86]. The cad clusters were identified in members of all tfd IV subtypes. It was assumed that they were involved in the initial stages of 2,4-D degradation [7,10,41,63]. Nevertheless, some members of Sphingomonas and Bradyrhizobium could have their own versions of the first system, the tfdA gene and its homologs, named tfdA-like and tfdAα, respectively [34].
Thus, tfd IV clusters are unique cad-dependent 2,4-D degradation clusters which have evolved into three subtypes which provide Sphingomonas and Bradyrhizobium with great competitive advantages.

Databases and Data Collection
All DNA and protein sequences, as well as additional information, were obtained from both finished and unfinished genome sequencing records which are publicly available at NCBI resources [87]. The plasmids and genomes of 56 bacterial species were analyzed. Most belonged to the order Burkholderiales from Betaproteobacteria and a few belonged to Alpha- and Gammaproteobacteria.

Gene Annotation
Two ORF prediction tools, the web version of the NCBI ORF finder and SnapGene Viewer v.4.2.1, were used. The predicted ORFs were then annotated using NCBI blastp [87] similarity searches against the UniProtKB/Swiss-Prot (swissprot) database.
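To illustrate what ORF prediction involves at its simplest, the following sketch scans the three forward reading frames for ATG-to-stop stretches. It is a deliberately reduced illustration and not a reimplementation of the tools named above: it assumes ATG as the only start codon and the standard stop codons, ignores the reverse strand, and the function name and `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based; `start` is the first base of ATG and `end` is
    one past the last base of the stop codon. `min_codons` counts coding
    codons, including ATG but excluding the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the candidate and keep scanning
    return orfs


# Toy sequence (invented): ATG AAA TTT TAG embedded in frame 2.
print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))  # [(2, 14, 2)]
```

A fuller treatment would also scan the reverse complement and translate each ORF before the similarity search.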

Comparative Genomic Analyses
The canonical gene clusters of the plasmids pJP4 (AY365053), pP51 (M57629) and pAC27 (M16964) were used as templates for NCBI blastn [87] similarity searches against the nucleotide collection (nr/nt) and whole-genome shotgun contigs (wgs) databases. Genomic regions adjacent to the clusters were searched and analyzed in order to determine synteny between clusters and flanking genes. For incomplete genome sequence projects in contig format, this strategy was not possible and gene cluster architecture was inferred based on nucleotide sequence identity.
Multiple sequence alignments of DNA and protein sequences of analyzed gene clusters were obtained using MAFFT at default parameter settings [88]. Values of identity and similarity between corresponding nucleotide and protein sequences of analyzed clusters were calculated using SIAS (http://imed.med.ucm.es/Tools/sias.html, accessed on 15 May 2023) with default parameters.
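As an illustration of how a within-cluster identity range (such as the ranges reported per cluster type in the Results) can be derived from an alignment, the sketch below computes pairwise percent identity over aligned sequences and reports the minimum and maximum. It is not the SIAS implementation: the choice to skip columns containing a gap in either sequence is an assumption made here for simplicity, and the function names and toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Columns with a gap ('-') in either sequence are skipped (an assumed
    denominator; other conventions give different values).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def identity_range(aligned):
    """Return (min, max) pairwise identity for a dict of name -> sequence.

    Requires at least two sequences.
    """
    names = sorted(aligned)
    values = [percent_identity(aligned[p], aligned[q])
              for i, p in enumerate(names) for q in names[i + 1:]]
    if not values:
        raise ValueError("need at least two sequences")
    return min(values), max(values)


# Toy alignment (invented sequences).
toy = {"seq1": "ACGT", "seq2": "ACGA", "seq3": "ACGT"}
print(identity_range(toy))  # (75.0, 100.0)
```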
Comparative synteny analysis was performed by comparing linearized maps oriented according to the structure of gene clusters to depict gene arrangements, gene retention and orientation. Detailed synteny maps were visualized and blastn identity comparisons were generated using Easyfig v.2.2.2 [89].
A new classification scheme for tfd gene clusters into four types, I, II, III and IV (A, B and C), and a new nomenclature were proposed. The updated nomenclature of tfd gene clusters was formulated based on the following syntax: the three italicized lowercase letters followed by Roman numerals as a subscript refer to the number of the cluster type. For example, tfd I BFEDCT (in short, tfd I), tfd II AKBFECDR (in short, tfd II), tfd III FAKBECDR (in short, tfd III), tfd IVA ECFRD,B/tfd IVB CEF,D/tfd IVC CEDRF,B (in short, tfd IVA, tfd IVB, tfd IVC). It is worth noting that there is no need to designate each gene as belonging to a certain type in the cluster structure (for example, tfdB I F I E I D I C I T I), since tfd clusters do not form hybrid clusters among themselves.
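The nomenclature syntax can be sketched as a small parser. This is only an illustration under stated assumptions: it reads a plain-text rendering in which the type (subscript in print) is separated from the gene letters by spaces, for example "tfd IVA ECFRD,B"; it assumes that only type IV carries a subtype letter (A, B or C); and it treats a comma as separating blocks of genes that are not directly contiguous. The `ClusterName` type and `parse_cluster_name` function are hypothetical names.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ClusterName(NamedTuple):
    cluster_type: str          # Roman numeral: "I", "II", "III" or "IV"
    subtype: Optional[str]     # "A", "B" or "C" for type IV, otherwise None
    blocks: Tuple[str, ...]    # gene-letter blocks, split at commas


_PATTERN = re.compile(r"^tfd\s+(IV|III|II|I)([ABC])?\s+([A-Z]+(?:,[A-Z]+)*)$")


def parse_cluster_name(text):
    """Parse a plain-text cluster name such as 'tfd II AKBFECDR'."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized cluster name: {text!r}")
    cluster_type, subtype, genes = match.groups()
    if subtype is not None and cluster_type != "IV":
        raise ValueError("subtype letters are assumed for type IV only")
    return ClusterName(cluster_type, subtype, tuple(genes.split(",")))


print(parse_cluster_name("tfd II AKBFECDR"))
print(parse_cluster_name("tfd IVA ECFRD,B"))
```

Slash-separated alternatives such as tfd IVB D,FEC/FEC/D,FEC,B are not handled here; each variant would need to be parsed separately.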
For clusters tcb and clc, a reverse spelling of the gene order, tcbFEDCR and clcEDBAR, was proposed, to correspond to the order of genes in other types of ortho-pathway clusters (tfd I , tfd II and tfd III ).

Alignments and Phylogenetic Analyses
Multiple sequence alignments for all trees were performed with MAFFT using default parameter settings [88]. The evolutionary history was inferred by using the maximum likelihood method based on the JTT matrix-based model. Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using a JTT model, and then selecting the topology with a superior log likelihood value. Evolutionary analyses were conducted in MEGA7 [90] with 1000 bootstrap replicates.
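The bootstrap support values reported for the trees rest on resampling alignment columns with replacement. The sketch below shows only that resampling step, under the assumption of a simple dictionary-based alignment; the function name and toy sequences are hypothetical, and tree inference on each replicate (done in MEGA7 in this work) is not reproduced here.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Return one bootstrap replicate of an alignment.

    `aligned` maps sequence names to equal-length aligned strings. Columns
    are drawn with replacement, and the same columns are used for every
    sequence so that homology between rows is preserved.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    length = lengths.pop()
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in aligned.items()}


# Toy alignment (invented); a fixed seed makes the replicate reproducible.
toy = {"seq1": "ACGT", "seq2": "AGGT", "seq3": "TCGA"}
replicate = bootstrap_alignment(toy, random.Random(0))
print(replicate)
```

With 1000 replicates, the support for a clade is the percentage of replicate trees in which that clade appears.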

Conclusions
The tfd clusters are a unique family of diverse mosaic gene cluster types interlinked by their activity with regard to 2,4-D and other chlorinated aromatic compounds. Widespread mainly across the orders Burkholderiales and Sphingomonadales, the extraordinary reservoirs of genes involved in the ortho-cleavage pathway of 2,4-D and catechols, these diverse tfd clusters as well as the highly conserved tcb and clc gene clusters enable microbes to exploit a variety of xenobiotic-polluted niches. Systematization of both sequenced and published data for over forty years and subsequent analysis through comparative genomic and protein phylogeny approaches has resulted in new insights into the evolution, classification and nomenclature of these clusters. Application of the classification scheme proposed in this work provides a powerful approach for future exploration, especially for the correct annotation of tfd, tcb and clc clusters, as well as in the field related to their distribution and evolution across diverse bacteria.

Figure 1. Comparative genomic analysis of (a) tfdI, (b) tfdII, and (c) tfdIII gene clusters showing their genomic rearrangements and evolutionary relationships between themselves and tcb gene clusters. Clusters and adjacent regions are shown by linear visualization and open reading frames (ORFs) are represented by arrows; the clusters are indicated by the color key (bottom). The degree of identity between clusters is indicated by the intensity of grayscale-shaded regions according to blastn as shown in the heat key (bottom right). The scale in kilobase pairs (kbp) is shown at the bottom right of each cluster.


Figure 2. Comparative genomic analysis of tfdIV gene clusters. (a) Genomic rearrangements and evolutionary relationships between subtypes A, B and C. (b) Synteny analysis showing putative gene assembly of the tfdIV gene cluster (subtype A). Clusters and adjacent regions are shown by linear visualization and open reading frames (ORFs) are represented by arrows; the clusters are indicated by the color key (bottom). The degree of identity between clusters is indicated by the intensity of grayscale-shaded regions according to blastn as shown in the heat key (bottom right). The scale in kilobase pairs (kbp) is shown at the bottom right of each cluster.


Figure 3. Comparative genomic analysis of (a) clc and (b) tcb gene clusters showing their synteny. Clusters and adjacent regions are shown by linear visualization and open reading frames (ORFs) are represented by arrows; the clusters are indicated by the color key (bottom). The degree of identity between clusters is indicated by the intensity of grayscale-shaded regions according to blastn as shown in the heat key (bottom right). The scale in kilobase pairs (kbp) is shown at the bottom right of each cluster.

Figure 4 represents the comparative genomic analysis of the clc and tcb gene clusters of B. petrii DSM 12804, tfd I cluster of plasmid5, tfd II cluster of pEMT3, and tfd III cluster of pkk4 plasmid. The clc and tcb clusters were closely related structurally. Both possessed an ORF3 open reading frame with an unknown function and an identical order of genes encoding proteins with the same activity. Additionally, the tfd I clusters were closely related to them, especially by tandem localization of genes encoding chloromuconate cycloisomerases (clcB, tcbD and tfdD I) and chlorocatechol 1,2-dioxygenases (clcA, tcbC and tfdC I). Nevertheless, there were two differences between clc and tcb and the tfd I cluster, namely, the presence of the tfdB I gene and the absence of ORF3. The results of multiple alignments of complete clc, tcb and tfd I clusters revealed that clc and tcb shared between 46.2% and 50.2% identity and between 47.5% and 50.2% identity at nucleotide level with tfd I, respectively. When comparing complete clc and tcb gene clusters, they were shown to have a similarity in the range of 62.0-63.2%.

Figure 4. Comparison of gene structure of complete tfd, clc and tcb clusters. Clusters and adjacent regions are shown by linear visualization and open reading frames (ORFs) are represented by arrows; the clusters are indicated by the color key (bottom). The degree of identity between clusters is indicated by the intensity of grayscale-shaded regions according to blastn as shown in the heat key (bottom right). The scale in kilobase pairs (kbp) is shown at the bottom right of each cluster.


Figure 5. Phylogenetic classification of (a) chlorocatechol 1,2-dioxygenases, (b) chloromuconate cycloisomerases, (c) dienelactone hydrolases, and (d) maleylacetate reductases of tfd, tcb and clc gene clusters. Each color corresponds to one cluster. Bootstrap support values from the maximum likelihood (ML) analyses higher than 50% are indicated above branches. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site.
