Pan-Cellulosomics of Mesophilic Clostridia: Variations on a Theme

The bacterial cellulosome is an extracellular, multi-enzyme machinery that efficiently depolymerizes plant biomass by degrading plant cell wall polysaccharides. Several cellulolytic bacteria have evolved various elaborate modular architectures of active cellulosomes. We present here a genome-wide analysis of a dozen mesophilic clostridial species, including both well-studied and yet-undescribed cellulosome-producing bacteria. We report here, for the first time, the presence of cellulosomal elements in the latter, thus expanding our knowledge regarding the prevalence of the cellulosomal paradigm in nature. We explored the genomic organization of key cellulosome components by comparing the cellulosomal gene clusters in each bacterial species and the conserved sequence features of the specific cellulosomal modules (cohesins and dockerins), against the background of their phylogenetic relationship. Additionally, we performed comparative analyses of the species-specific repertoire of carbohydrate-degrading enzymes for each of the clostridial species, and classified each cellulosomal enzyme into a specific CAZy family, thus indicating its putative enzymatic activity (e.g., cellulase, hemicellulase, or pectinase). Our work provides, for this large group of bacteria, a broad overview of the blueprints of their multi-component cellulosomal complexes. The high similarity of their scaffoldin clusters and dockerin-based recognition residues suggests a common ancestor and/or extensive horizontal gene transfer, as well as potential cross-species recognition. In addition, the sporadic spatial organization of the numerous dockerin-containing genes in several of the genomes suggests the importance of the cellulosome paradigm in the given bacterial species. The information gained in this work may be utilized directly or developed further by genetically engineering and optimizing designer cellulosome systems for enhanced biotechnological biomass deconstruction and biofuel production.


Introduction
The plant cell wall forms a complex structure of cellulose fibers embedded into a colloidal mixture of hemicellulose, pectin, and lignin [1]. Cellulolytic microorganisms are prevalent in natural lignocellulose-containing habitats abundant in plant cell walls, such as soil, wood, rumen, and termite guts, or in man-made sewage sludge or compost piles [2][3][4]. They employ various strategies to efficiently hydrolyze cellulose and hemicellulose of wood and plants into simple hexose and pentose sugars that will be directed to their carbohydrate metabolism and cell construction [5]. One strategy for fiber deconstruction selected by various aerobic or anaerobic bacteria and fungi, is the secretion of
For profiling the cellulosomal system of each genome, we focused on its specific properties: the number of scaffoldins and dockerin-containing proteins which are potentially coded; the nature of the cellulosomal protein modules (i.e., types of cohesins, dockerins, and breakdown of CAZymes into categories); and genomic organization and sequence conservation of genes coding for cellulosomal components.
The cellulosomal systems that were observed reflect different degrees of complexity (Table 1). Small variations were observed in the number of cohesins, ranging from 3 cohesins in C. saccharoperbutylacetonicum up to 15 in C. sufflavum, with the number of scaffoldins varying from 2 to 7.
The majority of the examined species code for only 2-3 scaffoldins, while C. cellulovorans, C. termitidis and C. sufflavum code for more than 5 scaffoldins (although some of these may result from incorrect or inadequate assembly of the genome). However, great variation was observed in the number of dockerin-bearing proteins, whereby C. saccharoperbutylacetonicum, C. bornimense, and C. acetobutylicum code for strikingly few dockerins (≤10), whereas other species contain a range of 28-88 dockerin-containing proteins. Similarly, 2-3-fold variation was also observed in the total number of CAZymes coded in the genome, ranging from 60 enzymes (C. bornimense) to 218 (C. termitidis). Nevertheless, when considering draft genomes, the number of scaffoldins may have been underestimated; the number of dockerin-containing proteins would also be affected, but to a lesser extent.
Moreover, assembly issues (especially in draft genomes) may result in gene duplication and distortion in numbers and disposition of repeated modular components, such as cohesin and X2 modules.

Conserved Patterns in the Orthologous Sca Gene Cluster
In all the examined species, the major scaffoldin gene, termed cip (originally referred to as "cellulosome-integrating protein"), is typically organized on the chromosome in a large cluster of 5 to 16 genes, with most species having 10 to 12 genes (Figure 1), in which the cip gene is the first gene. It is followed downstream by genes coding for cellulolytic enzymes belonging to GH families 48, 9, and 5, which play key roles in cellulosomal cellulose degradation [48][49][50]. Among the genes of the cluster lies a conserved gene, termed orfX, which codes for a cohesin-containing protein (up to 97% sequence similarity among the mesophilic bacterial species). The overall gene organization of the cluster is comparable in all species, suggesting that the cellulosomes of the mesophilic bacteria originated from a common ancestor. Nevertheless, we still observed two patterns of gene architecture among the different bacteria, and divided the species into two groups based on this gene cluster organization (Figure 1). The Group I mesophilic clostridia have an identical organization of their first six genes, which encode the major scaffoldin (Cip), followed by the GH8 enzyme, two GH9s, and the mysterious cohesin-containing OrfX protein. Thereafter, minor swapping of GH5 and GH9 enzymes ensues. An additional gene can be found in individual species, such as the C. cellulolyticum gene cluster, which contains a single PL11 gene at the 3′-end of the cluster. The cluster organization is more conserved in the genomes of closely related cellulolytic bacteria, such as C. cellulolyticum, Clostridium sp. BNL1100 and C. josui. Intriguingly, C. sufflavum presents two copies of the cip and the GH48 genes, which may be the result of a gene duplication event. In contrast, Group II species do not contain a GH8 gene, and instead display a GH74 or GH44 gene. Remarkably, C. bornimense is the only species coding for an enzyme at the 5′-end of the cluster, upstream of the cip gene [51].
Figure 1. Schematic representation of the gene cluster harboring the major scaffoldin gene, followed by genes coding for dockerin-containing cellulolytic enzymes, which are organized in a similar order along the gene cluster of the marked species. The major scaffoldin gene is represented by cip; numbers denote the family of glycoside hydrolases; X stands for the orfX gene; asterisks (*) mark draft genomes that have more than two contigs; slashes (//) indicate that the ORF may not be complete, because it was located at the end of a contig.
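
The comparison of gene order across clusters can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the species labels and gene orders below are hypothetical placeholders, not the arrangements shown in Figure 1, and `shared_prefix_length` is a helper name introduced here. It counts how many leading genes are identical, in the same order, across a set of clusters.

```python
from typing import Sequence


def shared_prefix_length(clusters: Sequence[Sequence[str]]) -> int:
    """Count the leading genes that are identical, in order, in every cluster."""
    if not clusters:
        return 0
    count = 0
    # zip stops at the shortest cluster, so truncated clusters are handled safely.
    for genes_at_position in zip(*clusters):
        if len(set(genes_at_position)) != 1:
            break
        count += 1
    return count


# Hypothetical gene orders for illustration only (not read from Figure 1).
toy_clusters = {
    "species_A": ["cip", "GH48", "GH8", "GH9", "GH9", "orfX", "GH9", "GH5"],
    "species_B": ["cip", "GH48", "GH8", "GH9", "GH9", "orfX", "GH5", "GH9"],
    "species_C": ["cip", "GH48", "GH8", "GH9", "GH9", "orfX", "GH9", "GH9", "GH5"],
}

conserved_prefix = shared_prefix_length(list(toy_clusters.values()))
print(f"Genes sharing an identical order from the 5' end: {conserved_prefix}")
```

In this toy input the three clusters agree over their first six genes and diverge afterwards, which mimics the kind of conserved leading block followed by minor swapping described for Group I.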

Modular Organization of the Major Scaffoldin Gene
The modular organization of the major scaffoldins of the mesophilic clostridia shows both striking similarity and intriguing variety among the species. Evaluation of the relationship between cohesins can be exemplified for C. papyrosolvens, in which sequence analysis identified a 137 kDa scaffoldin protein, bearing an N-terminal CBM3 followed by six type-I cohesin modules, which are interspersed with conserved X2 modules (Figure 2). Most of the scaffoldins from the other clostridial species contain a CBM3 at the N-terminus and six cohesins, and exhibit modular protein architectures strikingly similar to that of C. papyrosolvens, with permutations in the number and position of the X2 modules. A similar architecture is also conserved in the CipC protein of C. cellulolyticum and CbpA of C. cellulovorans, but these scaffoldins contain eight and nine cohesins, respectively. In C. saccharoperbutylacetonicum and C. bornimense, the scaffoldins contain two and three cohesins, respectively. The number of scaffoldin-borne X modules ranges from one in C. josui to eight in C. sufflavum. Most of the scaffoldins exhibit a trimodular CBM3-X2-Coh arrangement at their N-terminus, except those of C. acetobutylicum and C. saccharoperbutylacetonicum, which bear two X2 modules between their CBM3 and Coh modules. Intriguingly, the scaffoldins of C. saccharoperbutylacetonicum and C. bornimense exhibit two copies of the CBM3 at the N-terminus. The cip gene is incomplete in the draft sequences of C. cellobioparum and C. termitidis, where the sequences are either interrupted or truncated at the end of a contig of the draft genome.

Regulation of the Sca Gene Cluster by a Conserved σA-Dependent Promoter
Remarkably, the 5′-upstream region of the first cip gene in each cluster is conserved among all the species. This region was previously reported as the cip-cel operon promoter, which undergoes transcriptional regulation [52]. This conserved putative promoter sequence, upstream of the major scaffoldin gene, ranges from 862 bp in C. termitidis to 1286 bp in C. cellobioparum (Figure 3). Previously, Abdou and colleagues [52] reported an unusually remote promoter of the cipC gene in C. cellulolyticum ATCC 35319 (ortholog of the H10 strain). In that study, a single σA-dependent promoter (P1) was determined between nucleotides -671 and -643 with respect to the ATG start codon, generating a 638 nt 5′-UTR (untranslated region) of the cipC mRNA. A recent mRNA-seq study suggests that the C. cellulolyticum sca gene cluster functions as an operon, and confirms that a single promoter is located at the 5′-end of cipC [28]. The primary cip-cel transcript harbors at least five post-transcriptional processing sites, which suggests a post-transcriptional regulatory model for cellulosomal loci.
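
The coordinates quoted above can be turned into sequence slices with a small helper. The sketch below is illustrative only: it assumes that position -1 is the nucleotide immediately 5′ of the ATG start codon, the function name `upstream_window` is introduced here, and the 1000 nt upstream sequence is synthetic (lowercase filler with the promoter window marked by "P"), not a real cipC sequence.

```python
def upstream_window(upstream: str, start: int, end: int) -> str:
    """Return nucleotides from position `start` to `end`, both inclusive.

    Positions are negative and counted back from the start codon, with -1
    assumed to be the nucleotide immediately 5' of the ATG.
    """
    if not (start <= end < 0):
        raise ValueError("expected start <= end < 0")
    if -start > len(upstream):
        raise ValueError("upstream sequence is too short for this window")
    n = len(upstream)
    return upstream[n + start : n + end + 1]


# Synthetic upstream region: filler, a marked 29 nt window, then 642 nt of filler.
toy_upstream = "a" * 329 + "P" * 29 + "a" * 642

promoter = upstream_window(toy_upstream, -671, -643)  # P1 window from the text
utr = upstream_window(toy_upstream, -638, -1)         # 5'-UTR of 638 nt

print(len(promoter), len(utr))
```

Under this convention the -671 to -643 window is 29 nt long, and a transcript starting at -638 carries a 638 nt 5′-UTR, consistent with the figures quoted from [52].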
We used the C. cellulolyticum cipC 5′-UTR sequence as a query to mine available genomes of mesophilic cellulosome-producing bacteria, and found an extraordinary conservation of a putative promoter motif very far from the predicted start codon of the major scaffoldin gene in the following species: C. josui, C. papyrosolvens, C. cellobioparum, Clostridium sp. strain BNL1100, and C. termitidis. We also observed additional putative SigI-associated promoters upstream of the main scaffoldin gene in C. thermocellum, C. straminisolvens JCM21531, C. cellulovorans 743B (ATCC 35296, DSM 3052), and C. acetobutylicum ATCC 824 [53]. Figure 3 shows a strong conservation of the aligned promoter sequences, and supports the hypothesis of a possible regulatory role of an extended 5′-UTR in the regulation of post-transcriptional events, which might implicate the translation step of scaffoldin expression.
Figure 3. The promoter has two patterns of conservation, one in the related mesophiles, and a second in thermophiles and other complex cellulosomes (denoted † in designated species), as follows: (A) The σA (RpoD)-dependent promoter and cognate transcription start site (S1) have been experimentally identified as a major region of the C. cellulolyticum H10 cipC gene [52] and its orthologs [54]. The two T nucleotides of S1 are underlined, as well as sequences predicted to be −35, −16 and −10 elements of the cipC promoter; (B) aligned sequences are related to the recently identified RpoD-dependent promoter of the C. thermocellum cipA gene [54]. TSS2 is a transcriptional start site position, while −35 and −10 are elements of the cipA promoter. In both panels (A and B), the 5′-UTRs (untranslated regions) are shown partially, and the number of nucleotides between the last nucleotide of each sequence and the predicted initial codon for methionine (Met) is provided. The two WebLogos were generated from the sequences shown in each alignment, and they suggest putative promoter consensuses in the two groups of cellulolytic species.

Sequence Conservation in Cohesins and Dockerins Suggests Cross-Species Recognition
In order to compare the sequence conservation of key cellulosomal components among the mesophilic cellulolytic clostridia, namely the cohesins and dockerins, we used BLAST to search the newly sequenced genomes for cohesins of the major scaffoldins, using known modules as query sequences. Overall, most bacteria harbor more than ~70 dockerin-containing proteins and fewer than a dozen cohesin modules, organized in a handful of scaffoldins. The exceptions are C. sufflavum, with 15 cohesins; C. saccharoperbutylacetonicum, with 8 dockerin-bearing enzymes; and C. bornimense, with only 5 dockerins identified in its genome (Table 1).
Analysis of the phylogenetic relationship among the 59 cohesins from the major scaffoldins of all examined species supports the distinction of two major evolutionary groups of species (red and blue branches in Figure 4). This may suggest a common ancestor for all these species, which further evolved along two distinct routes, separating the scaffoldin cohesins of C. acetobutylicum, C. cellulovorans, C. bornimense, and C. saccharoperbutylacetonicum (Group I in Figure 4) from those of the other mesophiles (Group II in Figure 4), with the C. acetobutylicum cohesins representing the most remote group of outliers. This is in accordance with previous 16S rDNA analysis showing a distinction between C. cellulovorans and related sequences [55]. The dendrogram indicates that C. papyrosolvens cohesins are similar to those of C. cellulolyticum and C. josui (suggesting cross-species recognition), and are distinct from those of C. acetobutylicum and C. cellulovorans (the latter two are separated on different branches of the tree).
We next compared the sequence conservation of dockerin modules. The dockerin is typically a module of ~70 amino acids that resides within carbohydrate-degrading enzymes, usually at the C terminus, and serves to anchor the enzyme into the cellulosome by direct interaction with cohesin modules on the scaffoldin (Figure 5). In general, the dockerin modules of the different species share high sequence similarity, and the dockerin modules of C. cellulolyticum and C. papyrosolvens show greater than 90% sequence similarity. In Figure 5, we observed that the dockerin organization is maintained among all species examined. This includes the two typically conserved repeats, each consisting of a calcium-binding loop followed by an "F helix", which are connected by a variable linker region [56]. A conserved N-terminal Gly residue and the canonical pattern of Asp/Asn are retained within the cellulosomal clostridial mesophiles at the calcium-coordinating positions 1, 3, 5, 9, and 12 (Figure 5).
The nature of the "specificity determinants" (i.e., residues at positions 10, 11, 17, 18, and 22 within the repeated segment) is also preserved among the mesophiles [57][58][59]. Yet, while in the complex cellulosome of the thermophile C. thermocellum, residues 10/11 are usually occupied by conserved Ser/Thr (Figure 5), comparison of dockerin profiles of the mesophilic cellulosome-producing bacteria indicates conservation of Ala/Leu(Ile) in these positions instead, suggesting generally similar dockerin-binding specificities (Figure 5). Interestingly, C. cellulovorans shows a similar pair of residues at the 10/11 position, whereas C. acetobutylicum has unique residues in that position, as does C. saccharoperbutylacetonicum.
Figure 4. Bootstrap values are denoted, and branches below 80% bootstrapping were collapsed. Two major branches of the dendrogram (red and blue) separate C. acetobutylicum, C. cellulovorans, C. bornimense, and C. saccharoperbutylacetonicum from the other mesophiles.
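
The positional conventions used above for a dockerin repeat can be made concrete with a short sketch. It assumes 1-based numbering within a single repeated segment of at least 22 residues, as in the text. The two segments are made-up strings for illustration, not sequences from Figure 5, and the function names are introduced here.

```python
CALCIUM_POSITIONS = (1, 3, 5, 9, 12)         # expected Asp/Asn (D/N)
RECOGNITION_POSITIONS = (10, 11, 17, 18, 22)  # putative specificity determinants


def residues_at(segment: str, positions) -> dict:
    """Map 1-based positions within a repeat segment to their residues."""
    if len(segment) < max(positions):
        raise ValueError("segment is shorter than the highest requested position")
    return {pos: segment[pos - 1] for pos in positions}


def has_calcium_pattern(segment: str) -> bool:
    """True if every calcium-coordinating position holds Asp (D) or Asn (N)."""
    return all(res in "DN" for res in residues_at(segment, CALCIUM_POSITIONS).values())


def classify_10_11(segment: str) -> str:
    """Label the residue pair at positions 10/11 using the classes named in the text."""
    pair = residues_at(segment, (10, 11))
    if pair[10] == "A" and pair[11] in "LI":
        return "Ala/Leu(Ile)"
    if pair[10] in "ST" and pair[11] in "ST":
        return "Ser/Thr"
    return "other"


# Made-up 22-residue segments (illustrative only).
toy_mesophile_like = "DVNGDGKVNALDFALLKKYLLG"
toy_thermophile_like = "DVNGDGKVNSTDFALLKKYLLG"

for name, seg in [("mesophile-like", toy_mesophile_like),
                  ("thermophile-like", toy_thermophile_like)]:
    print(name, has_calcium_pattern(seg), classify_10_11(seg),
          residues_at(seg, RECOGNITION_POSITIONS))
```

A real analysis would first locate the two repeats within each dockerin from a multiple alignment. The sketch skips that step and only shows how the position numbers map onto residues.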

Sporadic Spatial Organization of the Cellulosomal Genes along the Bacterial Chromosome
The physical organization of the cohesin- and dockerin-containing proteins was evaluated using BLAST sequence search against each genome (Figure 6). Such an analysis was applied only to complete genome sequences or those bearing two large assembly contigs (thus excluding C. papyrosolvens, C. sufflavum, C. cellobioparum, and C. termitidis from this analysis). Most dockerin-containing genes were sporadically distributed along the chromosome in species with a high (>10) copy number of dockerins (Clostridium sp. BNL1100, C. josui, C. cellulolyticum, and C. cellulovorans), except for two gene clusters. One cluster, which appears in all species, is the sca gene cluster, which contains cohesins coded in the Cip scaffoldin and in the orfX gene, together with dockerin-containing enzymes of that operon (Figure 6). An additional cluster is the "xyl-doc" cluster, encoding 14 dockerin-containing hemicellulases, which was originally reported in C. cellulolyticum (Ccel_1229-1242) [60]. BLAST searches using this cluster showed that it is also conserved in Clostridium sp. BNL1100. The sporadic spatial organization of the numerous dockerin-containing genes in the genome suggests the importance of the cellulosomal paradigm in those bacterial species. However, such a conclusion could not be statistically validated for species with only a few dockerins (C. acetobutylicum, C. saccharoperbutylacetonicum, and C. bornimense).
Figure 6. Arrangement of cohesins and dockerins along the bacterial chromosomes of cellulosome-producing mesophiles. Cohesins (blue triangles) and dockerin modules (red triangles) were searched by BLAST and located on the bacterial chromosome. Known clusters of dockerins (the xyl-doc cluster) and the sca gene cluster are marked in blue and black rectangles, respectively, whereas most other dockerin-containing genes were distributed along the chromosome.
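
One simple way to tell clustered genes from sporadically distributed ones, given their chromosomal coordinates, is to group neighbouring positions by a maximum gap. The sketch below is an assumption-laden illustration rather than the procedure used for Figure 6: the coordinates are invented, the `max_gap` and `min_size` thresholds are arbitrary, and the chromosome is treated as linear (circularity is ignored).

```python
from typing import Iterable, List


def group_by_gap(positions: Iterable[int], max_gap: int) -> List[List[int]]:
    """Group sorted positions; neighbours at most `max_gap` bp apart share a group."""
    groups: List[List[int]] = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def find_clusters(positions: Iterable[int], max_gap: int, min_size: int) -> List[List[int]]:
    """Keep only the groups that contain at least `min_size` genes."""
    return [g for g in group_by_gap(positions, max_gap) if len(g) >= min_size]


# Invented start coordinates (bp) of dockerin-containing genes, for illustration.
toy_positions = [10_000, 12_000, 14_500, 16_000,
                 400_000, 900_000, 905_000, 1_500_000, 2_200_000]

clusters = find_clusters(toy_positions, max_gap=5_000, min_size=3)
scattered = [g for g in group_by_gap(toy_positions, max_gap=5_000) if len(g) < 3]

print("clusters:", clusters)
print("scattered groups:", scattered)
```

With these toy inputs, the four genes near the start of the chromosome form one cluster and the remaining genes fall into small scattered groups, which is the qualitative picture described for species with many dockerins.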

Profiling the Carbohydrate-Active Enzymes in the Cellulosome-Producing Mesophiles
The identification of cellulosome-related carbohydrate-active enzymes (CAZymes) is key for understanding the complex functions of carbohydrate degradation in cellulolytic bacteria. We profiled the elaborate reservoir of dockerin-containing cellulases using the comprehensive CAZy classification system [30]. This enabled the identification of numerous glycoside hydrolases (GHs), carbohydrate esterases (CEs), polysaccharide lyases (PLs), and proteins bearing carbohydrate-binding modules (CBMs) (Figure 7). In cases in which the proteins bear a dockerin module, the latter mediates the incorporation of the cellulase into the cellulosomal scaffoldin via the cohesin-dockerin interaction.
Closer analysis reveals that glycoside hydrolases (GHs) contribute the major fraction of the total number of CAZymes (Figure 7A and Table 1). Notably, C. cellulovorans has an exceptionally high number of polysaccharide lyases (15 PLs). Differences are also observed in the number of CBMs, ranging from 22 copies in C. saccharoperbutylacetonicum to 95 in C. termitidis, and the variation is even more pronounced for the cellulose-binding family 3 alone. Among the genomes analyzed, the varying number of non-catalytic modules (cohesins, CBMs) did not correlate with the number of catalytic modules (CAZymes, either with or without dockerins) (Figure 7A and Table 1). This may suggest that the complexity of a cellulosome is not a trivial statistical function of the number of modules, and that additional parameters may be involved, such as gene organization, types of binding modules or gene regulation.
The vast majority (91%) of the dockerin-containing proteins are secreted enzymes, wherein the proteins possess a signal peptide sequence (some bacteria have unique signal peptide sequences, which are often not identified by the SignalP server). A wide variety of carbohydrate-degrading modules, i.e., GHs, CEs, and PLs, can be identified in the dockerin-encoding genes, suggesting diversity in enzymatic activity. Of note are the genomes of C. termitidis, C. cellobioparum and C. saccharoperbutylacetonicum, which bear more than 140 GH enzymes.
The catalytic modules are collectively associated with dozens of different non-catalytic CBMs, which were identified in each mesophile, and notably expanded in C. termitidis (Figure 7B and Table 1). In C. papyrosolvens, the most abundant GH families are GH5, GH9, and GH43, which constitute over 50% of the enzymatic domains identified.
When comparing the CAZomes of two very closely related cellulosome-producing mesophilic bacteria, C. cellulolyticum and C. papyrosolvens (which exhibit 98.9% similarity in their 16S rRNA sequences), several differences could be noted. Whole-genome analysis of C. papyrosolvens revealed 98 GH domains and 66 CBMs, representing a notable increase compared to the 91 and 54 domains observed in C. cellulolyticum. Included in the C. papyrosolvens GH families are GH25 and GH36, of which there are no representatives in the C. cellulolyticum genome (Figure 7B). Conversely, GH65 and GH73 domains are each found in single copies in C. cellulolyticum, but are absent in C. papyrosolvens. The differences in numbers may be attributed to the size of the genomes, which are 4.92 Mb for C. papyrosolvens and 4.07 Mb for C. cellulolyticum. Yet, these data indicate pronounced diversity of CAZymes and related domains beyond the cellulosome-associated components, and suggest that, like other cellulolytic bacteria, the various individual mesophilic clostridial species have evolved several specific strategies for carbohydrate degradation, some similar to, but others distinct from, those of their close relatives.
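
The pairwise comparison above can be reproduced as a small calculation. The sketch below uses only the counts, genome sizes, and species-specific GH families quoted in the text. The "shared core" of families is a placeholder assumption added so that the set operations have something in common to subtract, and it is not a complete family list for either genome.

```python
# Counts and genome sizes as quoted in the text.
genomes = {
    "C. papyrosolvens": {"gh_domains": 98, "cbms": 66, "size_mb": 4.92},
    "C. cellulolyticum": {"gh_domains": 91, "cbms": 54, "size_mb": 4.07},
}

# Placeholder shared families (assumption), plus the differences named in the text.
shared_core = {"GH5", "GH9", "GH48"}
families = {
    "C. papyrosolvens": shared_core | {"GH25", "GH36"},
    "C. cellulolyticum": shared_core | {"GH65", "GH73"},
}


def per_mb(count: int, size_mb: float) -> float:
    """Normalize a domain count by genome size (domains per Mb)."""
    return count / size_mb


densities = {
    name: {
        "gh_per_mb": per_mb(g["gh_domains"], g["size_mb"]),
        "cbm_per_mb": per_mb(g["cbms"], g["size_mb"]),
    }
    for name, g in genomes.items()
}

only_papyrosolvens = families["C. papyrosolvens"] - families["C. cellulolyticum"]
only_cellulolyticum = families["C. cellulolyticum"] - families["C. papyrosolvens"]

for name, d in densities.items():
    print(f"{name}: {d['gh_per_mb']:.1f} GH/Mb, {d['cbm_per_mb']:.1f} CBM/Mb")
print("only in C. papyrosolvens:", sorted(only_papyrosolvens))
print("only in C. cellulolyticum:", sorted(only_cellulolyticum))
```

With these inputs the GH density is about 20 per Mb for C. papyrosolvens and about 22 per Mb for C. cellulolyticum, and the CBM densities are both close to 13 per Mb, so the absolute counts scale roughly, though not exactly, with genome size.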

Discussion
In early work, selected anaerobic mesophilic bacteria were found to exhibit distinctive characteristics consistent with the production of cellulosomes [41,61]. The list of such bacteria was later extended in additional studies by the sequencing of scaffoldin genes in other mesophilic cellulolytic clostridia [37,39]. With the advent and progression of the era of genome sequencing, additional cellulosome-producing, mesophilic clostridia were discovered. Surprisingly, the different species display great similarity in their cellulosomal components, which includes the nature of their enzyme-integrating scaffoldin subunit, the types (and usually number) of dockerin-bearing enzymes, and the amino acid residues that occupy positions in the dockerin consistent with recognition of the cohesin counterpart. Moreover, several very basic cellulosome genes are contained in a telltale gene cluster on the chromosome in all of the mesophilic clostridial species, which includes genes coding for the major scaffoldin subunit, the mysterious single-cohesin-containing OrfX, the major family 48 cellulase, and other cellulases from families 5 and 9.
Our work herein links the need of a given cellulolytic bacterium to express various fibrolytic activities and the genome-wide coding of key cellulosomal components in different mesophilic cellulosome-producing bacteria. On the one hand, the current work further demonstrated the relatedness among the cellulosome-producing mesophiles, each of which possesses a simple cellulosomal architecture compared to the complex multi-scaffoldin cellulosomes of other clostridia and ruminococci. The mesophilic clostridia share common features that distinguish their cellulosomes from those of other species, such as the similar organization of the sca gene cluster, which was observed for C. cellulolyticum, Clostridium sp. BNL1100, C. papyrosolvens and C. josui, along with conserved functional and sequence profiles of their cohesin and dockerin modules. This similarity suggests that the sca gene cluster, with its collection of cellulosomal component genes, was horizontally transferred among these mesophiles from a common ancestor [26]. On the other hand, we noted differences in the type and proportions of key CAZyme components among the mesophiles. These may reflect a specialized repertoire of carbohydrate-degrading strategies that has evolved in each bacterium, tailored to its particular habitat, lifestyle, physical conditions, or interactions with other organisms.
The dockerin profile of the mesophilic cellulosome-producing bacteria includes the definitive repeated calcium-binding loop and adjacent helix segment, but differs in its conserved putative recognition residues from those of the complex cellulosome-producing bacteria, e.g., C. thermocellum, B. cellulosolvens, C. clariflavum, and R. flavefaciens. This may suggest collective species-specific preferences that eliminate cross-species binding with the cohesin-bearing scaffoldins of the complex cellulosome-producing bacteria, as was observed previously [62]. In contrast, most, but not all, of the cohesin-dockerin interactions of the mesophilic clostridia appear to share the same general recognition residues, which may indicate general cross-species interaction of their scaffoldins and enzyme subunits in nature, and would imply their coexistence in the same ecological niche. In any case, evolutionary forces have not acted to change these residues during speciation processes [5,63]. Unlike the majority of the mesophilic clostridia, however, distinct alternative recognition residues are evident in C. saccharoperbutylacetonicum and C. bornimense. It is also currently enigmatic why these two species have two CBM3s in their respective scaffoldins, together with a reduced number of cohesins and a similarly reduced number of dockerin-bearing enzymes. It seems that their abridged cellulosomes would assume a supportive role to the much larger collection of free enzymes in these species. Nevertheless, the presence of typical cellulosome-based cellulases, i.e., GH48, GH9, and GH5 enzymes, may indicate their significance for the parent bacterium in the degradation of recalcitrant forms of cellulosic biomass.
Further studies are needed to elucidate the interactions of the cellulosomal components of the newly described species, such as C. sufflavum, C. termitidis, C. saccharoperbutylacetonicum and C. bornimense. This is also true for a full understanding of the role of the "inactive" cellulosome complex of C. acetobutylicum, which has little or no detectable cellulolytic activities, but maintains a conserved scaffoldin, dockerins, and CAZymes (including the dominant GH48 enzyme and other long-established types of cellulases [9]). Genomes of a second C. papyrosolvens and several other strains of C. acetobutylicum have also been sequenced, but were omitted from this study. Likewise, additional related cellulosome-producing mesophilic clostridia, such as Clostridium puniceum, Herbinix luporum, Clostridium hungatei, Clostridium roseum, etc., have not been included herein. Moreover, the contribution of additional, recently sequenced mesophilic, but complex, multi-scaffoldin cellulosome-producing bacteria, such as Clostridium alkalicellulosi and Bacteroides (Pseudobacteroides) cellulosolvens [19], will also shed light on the cellulosomal models of the mesophilic bacteria.
Hydrolysis of cellulosic substrates is a major biotechnological challenge. Reconstitution of the biological principle of native cellulosomes and their application as components for chimeric designer cellulosomes [64][65][66][67][68] may provide a basis for improved cellulolytic activity. The cellulosome complexes of the mesophilic clostridia contain a wealth of polypeptide modules that can be utilized for numerous applications. Cohesin and dockerin modules can also be fused to various non-cellulolytic biologically active macromolecules for use in a large range of affinity-based systems. The developing nanotechnologies will require a diversity of such "Lego"-like molecular adaptors or connecting modules. The components discovered and analyzed in each cellulosome-producing bacterium now join the growing library of divergent cohesins, dockerins, and other cellulosome-related modules, and may contribute to future applications as "spare parts" for fabrication of defined nanoassemblies.