De Novo Metagenomic Analysis of Microbial Community Contributing in Lignocellulose Degradation in Humus Samples Harvested from Cuc Phuong Tropical Forest in Vietnam

: Weaimedtoinvestigatethemicrobialdiversity, minelignocellulose-degradingenzymes/proteins, and analyze the domain structures of the mined enzymes/proteins in humus samples collected from the Cuc Phuong National Park, Vietnam. Using a high-throughput Illumina sequencer, 52 Gbs of microbial DNA were assembled in 2,611,883 contigs, from which 4,104,872 open reading frames (ORFs) were identiﬁed. Among the total microbiome analyzed, bacteria occupied 99.69%; the ﬁve ubiquitous bacterial phyla included Proteobacteria, Bacteroidetes, Actinobacteria, Firmicutes, and Acidobacteria, which accounted for 92.59%. Proteobacteria (75.68%), the most dominant, was 5.77 folds higher than the second abundant phylum Bacteroidetes (13.11%). Considering the enzymes/proteins involved in lignocellulose degradation, 22,226 ORFs were obtained from the annotation analysis using a KEGG database. The estimated ratio of Proteobacteria/Bacteroidetes was approximately 1:1 for pretreatment and hemicellulases groups and 2.4:1 for cellulases. Furthermore, analysis of domain structures revealed their diversity in lignocellulose-degrading enzymes. CE and PL were two main families in pretreatment; GH1 and GH3-FN3 were the highest domains in the cellulase group, whereas GH2 and GH43 represented the hemicellulase group. These results validate that natural tropical forest soil could be considered as an important source to explore bacteria and novel enzymes/proteins for the degradation of lignocellulose.


Introduction
Lignocellulose, which is composed of celluloses, hemicelluloses, and lignin, is derived from numerous sources that include agricultural crops and forest residues, along with bioenergy crops and forest products [1]. Lignocellulose, an abundant, sustainable, and renewable biomass, is used for the production of biofuels and other valuable products [2]. Biomass, an environmentally friendly entity, represents an inexpensive alternative to depleted fossil fuel resources. Hence, it can decrease global climate change and contribute to a sustainable and greener future [3]. Thus, the degradation of lignocellulose has gathered considerable attention from scientists and governments worldwide [4][5][6]. The hydrolysis of lignocellulose by chemical and physical methods effectively breaks down the rigid bonds. Nevertheless, this usually generates secondary products that are possibly environmentally hazardous, which inhibit consequent steps in the process [7,8]. Therefore, the approaches Humus was sampled at sites (GPS at 20.27776; 105.71137, within 10 km radius) in Cuc Phuong. Cuc Phuong National Park is located in Ninh Binh province, in the region of the Red River Delta in Vietnam. Furthermore, Cuc Phuong Park is the largest nature reserve and one of the primary biodiversity sites in Vietnam. The annual average temperature in Cuc Phuong is 20.6 • C. The annual humidity and precipitation are 90% and 2138 mm, respectively. Forty-five humus samples were collected from the degraded deadwood points with the growth of white-rot fungi ( Figure 1). Each sample was taken about 100 g surrounding white-rot fungi. The pH values of all samples range between 6.9 and 7.3. After collection, the humus samples were transferred into an ice box and then stored at 4 • C.
The humus samples were pooled and homogenized in PBS buffer (137 mM NaCl, 2.7 mM KCl, 1.4 mM KH 2 PO 4 , and 10 mM Na 2 HPO 4 , pH 7.4). Consequently, the samples were centrifuged to remove the impurities, once at low-speed (500 rpm) for 10 min and then twice at 600 rpm for 10 min. Thereafter, microorganisms in the samples were harvested by centrifuging at 5000 rpm for 1 min. The obtained pellets were suspended in PBS buffer (pH 7.4) supplemented with 20% glycerol and stored at −80 • C. Metagenomic DNA was extracted from bacterial samples prepared from 10 g humus. Each sample was suspended in 20 mL lysis buffer, containing 100 mM Tris, 100 mM EDTA, 100 mM Na 2 HPO 4 , 1.5 M NaCl, and 1% CTAB, (pH 8) with 0.1 mg/mL protease K and incubated at 37 • C for 30 min with gentle shaking. After incubation, the sample was treated with 3 mL 20% SDS and continually incubated at 65 • C for 30 min. Subsequently, the supernatant was collected by centrifuging the samples at 7000 rpm for 5 min. Then, phenol/chloroform/isoamyl alcohol (25:24:1, v/v) mixture was added to purify the DNA samples. The upper aqueous layer was harvested after centrifugation at 6500 rpm for 10 min at 4 • C. The DNA sample was precipitated using 6 mL isopropanol, followed by centrifugation at 13,000 rpm for 10 min. The DNA pellet was washed with 70% ethanol and resuspended in 300 µL of distilled water. The extracted DNA was verified using agarose gel electrophoresis, and the DNA concentration and quality were measured using a Nanophotometer P330 (Implen GmbH, Munich, Germany). Three preparations were combined together, and the mixed metagenomics DNA had a concentration of 113.25 µg/µL with A 260/280 and A 260/230 values of 1.925 and 2.235, respectively. In addition, the contamination by the inhibitors of DNA polymerase in the sample was assessed using PCR by amplifying 16S rDNA. A total DNA sample of approximately 100 µg was dispatched to BGI-Hong Kong Co. Ltd. for deep metagenome sequencing.

Metagenomic Sequencing
The metagenomic DNA from the humus samples was sequenced using Illumina HiSeq 2500 (Illumina, San Diego, CA, USA) to generate paired-end reads of 150 bps. The raw sequence data were filtered by SOAPnuke to remove noise, which was described as follows: reads containing 5% or more ambiguous base (N base); read containing adapter sequences (default: 15 bases overlapped by reads and adapter); reads containing 50% or lower quality (Q < 20) base. Subsequently, filtered reads were assembled de novo with both softwares IDBA (version 1.1.0) [22] http://i.cs.hku.hk/_alse/hkubrg/projects/idba_ud/, accessed on 3 August 2019 and MEGAHIT (version 1.0) softwares [23] https://github.com/voutcn/ megahit, accessed on 3 August 2019 with a series of different k-mer sizes in parallel. The optimal k-mer was used to assemble the clean reads to contigs. Then, the clean reads were mapped to the assembled contigs through Bowtie 2 [24] with parameters "-p8-verysensitive-local-k 100-score-min L,0,1.2" to get the coverage information for revising the contigs. MetaGeneMark (version 2.10, default parameters) was used to predict genes from assembled contigs [25]. The predicted genes were clustered using CD-HIT [26], with a sequence identity threshold of 95% and an alignment coverage threshold of 90% [27]. The metagenomics sequences are available in the sequence read archive (SRA) under the accession number PRJNA715592.

Taxonomic Assignment
Taxonomic assignment of genes was performed with Blastp by aligning them against the NR database (accessed on 9 August 2019). Analysis of Nr BLAST output files was performed using the program MEGAN (version 4.6) [26]. This software reads the results of a BLAST comparison as input and attempts to place each read on a node in the NCBI taxonomy using the LCA algorithm [26]. The NCBI taxonomy was displayed as a tree, and the size of each node was scaled to indicate how many reads have been assigned to the corresponding taxon. The relative abundance of each taxonomy level from the same taxonomy was summed. The taxonomic level correlation was drawn using the Krona complement tool in Excel.

Functional Annotation
All the predicted genes were blasted against public databases, including SwissProt, KEGG (Kyoto Encyclopedia of Genes and Genomes) [28], EggNOG (Evolutionary genealogy of gene: Non-supervised Orthologous Groups, Version: 3.0) [29], and Nr (nonredundant protein sequence database) with e-value lower than 10 −5 [26], and retrieved proteins with the highest sequence similarity with given genes along with their functional protein annotations.

Mining Genes Encoding Lignocellulose-Degrading Enzymes
In this study, we focused on the investigation of data obtained from the KEGG database. Compared to the other data, genes for lignocellulases were dominantly defined in KEGG. Therefore, genes encoding lignocellulose-degrading enzymes/proteins were initially mined according to KEGG's functional annotation and specially assigned according to Enzyme Commission number (EC) [28]. Based on annotated KEGG data, we obtained all the ORFs encoding enzymes/proteins related to the pretreatment and hydrolysis of lignocellulose. The code of the ORFs was linked with the total taxonomic profile of the humus to generate a combined function of lignocellulose degradation and taxonomy from the Nr and SwissProt database. In addition, the amino acid sequences of the lignocellulase were investigated for the presence of domain structures using Pfam (http://pfam.xfam.org, accessed on 22 December 2020) and HMMER from dbCAN databases [30] (Figure 2). Moreover, the metagenome data were annotated using the Pfam database containing multiple alignments and hidden Markov model-based profiles (profiles HMMs) of complete protein domains [31]. Profile HMMs of available domains related to lignocellulosedegrading enzymes (including pretreatment enzymes, such as lytic polysaccharide monooxygenase) from the Pfam database were collected. Using the collection of profile HMMs, the homologous sequences were scanned against the metagenomic data sequences. The scan was executed by HMMER software (version 3.1) with the criteria as follows: the domain-based score value no less than 30, profile coverage greater than 0.5, and bias/score ratio less than 0.1 ( Figure 2).

Metagenome Sequencing of Cuc Phuong Tropical Forest Soil
The metagenomic DNA of bacteria in the humus collected from the Cuc Phuong tropical forest sequenced using the Illumina HiSeq platform was applied to assess the diversity of the microbial community and the potential proteins/enzymes involved in biomass degradation. From approximately 100 µg of metagenomic DNA, we obtained 345,471,086 high-quality clean reads and approximately 52 Gbs clean base. The assembly of clean data yielded 2,611,883 contigs with a total length of 2346 Mbs. The longest contig was 611,845 bps, and the N50 contig size and average contig were 1117 and 898 bps, respectively. Based on the assembly data, MetaGeneMark identified 4,104,872 protein-coding genes equivalent to 2074 Mbs. The N50, average, and the maximum gene lengths were 615, 505, and 20,541 bps, respectively (Table 1).

Taxonomic Composition of Microbial Community in Cuc Phuong's Soil
From the 51.82 Gbs obtained from the metagenomic DNA data of Cuc Phuong tropical forest humus, which surrounded white-rot fungi that strongly degrade fallen forest trees, 4,104,872 genes were identified to encode proteins, among which 3,923,046 genes (equivalent to 95.57%) were annotated in the Nr database. The genes were classified using MEGAN (version 4.6) analysis, where 3,896,881 genes were assigned to bacteria, archaea, eukaryotes, and viruses. Bacteria were dominant with 3,884,879 ORFs accounting for 99.69% of the total identified ORFs, while the remaining belonged to archaea with 293 genes, eukaryotes with 1144 genes, and 10,565 genes representing the virus. The genes of the bacterial kingdom were affiliated to 111 phyla, 83 classes, 170 orders, 406 families, 1971 genera, with only 738 classified as species (Table 2). For deeper bacterial analysis, 93.29% of total genes were identified at the phylum level. Notably, Proteobacteria was the most abundant bacterial phylum, with 3,106,400 identified genes accounting for 75.68%, followed by Bacteroidetes at 13.11%. In addition, Actinobacteria, Firmicutes, and Acidobacteria accounted for 1.6%, 1.4%, and 0.8%, respectively. Thus, the five dominating phyla totally accounted for 92.59%, and the number of predicted genes originating from Proteobacteria was 5.77 folds higher than that of Bacteroidetes. The high abundance of these bacteria indicates that these phyla are important and play key roles in the humus bacterial community.

Functional Profile of DNA Metagenome from Cuc Phuong's Humus
In order to obtain additional information to assess the functional potential associated with the microbial community, a set of metagenomic DNA from the humus samples was applied in databases such as Nr BLAST, SwissProt, KEGG, and eggNOG. Among the 4,104,872 identified ORFs, 3,925,740 ORFs corresponded to 95.64% of the total genes and were predicted to have functional annotation in at least one database (Table 3). Data obtained from the KEGG database were further investigated, and the pathway results were summarized into categories: cellular processes (cluster I), environmental information processing (cluster II), genetic information processing (cluster III), human diseases (cluster IV), metabolism (cluster V). Metabolism was the most dominant, which related to the growth of the microbial community, representing approximately 70% of the total defined ORFs in KEGG, where carbohydrate metabolism was observed to have 297,103 ORFs ( Figure 4).
In order to gain further insight into the mechanism of lignocellulose degradation by the communities, we specifically observed the distribution of domains of the predicted lignocellulases using the Pfam and HMMER databases. Among the 22,226 ORFs encoding lignocellulases annotated in KEGG, only 3582 ORFs (16.12%) were complete, in which 3084 ORFs (equivalent to 86.1% of the completed ORFs) were assigned domains. For the pretreatment enzyme/protein group, 198 domain-containing complete ORFs were divided into 19 domain types. Carbohydrate Esterases (CE) and Polysaccharide Lyases (PL) were the main families that yielded approximately 62%; other domains such as abhydrolase, tannase, copper oxidase were also discovered. Family CE8 was the most predominant domain, accounting for 32% in pretreatment ( Figure 6A, Table S3).
Similarly, the completed 1058 ORFs containing domains belonging to the cellulase group were defined into 81 domain types. Major domains were GH families that accounted for more than 80%, representing GH1 and GH3-FN3 with 245 ORFs and 220 ORFs, respectively, followed by GH8, GH5, and GH4. Furthermore, other groups such as peptidase M42, FN3, GH3, GH43 were also identified in the data. Comparing the enzymes annotated by KEGG and domains, GH3 was the predominant domain in enzyme β-glucosidase. In particular, GH3 collocated with the FN3 module, which assists in enzyme conformation and activity, was the most predominant domain with 220 ORFs in the enzyme group. In addition, GH8 was attributed to the endoglucanase group, followed by GH5 at the second level. In particular, the 6-phospho-beta-glucosidase group had only two types of domains present, which included GH1 and GH4. Moreover, combining domain and taxonomy showed that the GH1, GH8, and GH4 domains prevailed in Proteobacteria with 77%, 91%, and 95%, respectively. In contrast, GH3-FN3 and GH5 were equally divided in both Proteobacteria and Bacteroideles. Overall, in the taxonomy for domain-containing complete enzymes involving cellulose degradation, Proteobacteria dominated Bacteroideles and other phyla ( Figure 6B, Table S4).
Additionally, our analysis showed that 151 domain types of 1828 completed ORFs were adopted from hemicellulase data, indicating the diversity of hemicellulase domains. In practically all domains, GH families were allocated diversely, which outperformed GH2 featured for xyloglucan-active β-D-galactosidase, followed by GH43 present in certain enzymes such as α-L-arabinofuranosidase, xylan 1,4-beta-xylosidase, oligosaccharide reducing-end xylanase, with the equivalent ratio in Bacteroidetes and Proteobacteria phyla; whereas CE3, abhydrolase, and GH28 were predominant in Proteobacteria phyla, and other abundant domains, such as GH29, GH95, were mainly present in Bacteroidetes ( Figure 6C, Table S5).
Alternatively, using profile HMMs of lytic polysaccharide monooxygenase (LPMO) and multiple-copper oxidase (MCO) collected from Pfam to search and annotate the predicted ORFs in the metagenome data, we found 69 hits and 901 hits that belonged to LPMOs and MCOs, respectively. All LPMO hits contained the LPMO10 domain, whereas the MCO hits consisted of at least one or more of the four copper oxidase. Similarly, 224 ORFs were annotated as catalase/peroxidase, which enclosed the catalase domain, and 53 ORFs were defined as feruloyl esterase. Approximately half of the annotated hits were complete ORFs.
Notably, 37 MCO hits, which were annotated by profile HMMs, could not be annotated with NCBI, KEGG, SwissProt, eggNOG databases (Table 5).

Discussion
In most soils, common phyla that are usually abundant are Proteobacteria, Bacteroidetes, Acidobacteria, Actinobacteria, and Firmicutes [13,32]. The ubiquitously dominant phyla were significantly found in the biomass study of Arundo donax, Eucalyptus amaldulensis, and Populus nigra using high-throughput sequencing of the 16S rRNA gene [11]. Using metagenomics analysis, two soil samples near phosphate rock chemical plants in Shuangsheng revealed that the most dominant bacteria were Proteobacteria at 38.56% and 57.85% [33]. Other taxa, including Acidobacteria, Verrucomicrobia, Cyanobacteria, and Planctomycetes were also present in the samples. Furthermore, other findings revealed that the phyla Acidobacteria and Proteobacteria have a higher relative abundance in soil environments, such as forest soil [18,21], crop soil [19], and lettuce soil [20]. In litter and deadwood samples, soils usually contain abundant copiotrophic bacteria from phyla Proteobacteria and Bacteroidetes. The bacterial communities exhibit successional stages along the decay process in litter and wood [13]. Consistent with our study, the five phyla, Proteobacteria, Bacteroidetes, Actinobacteria, Firmicutes, and Acidobacteria, were predominant, which accounted for 92.59% of total phyla. It is noted that in our study Proteobacteria was the most dominant phylum accounting for 75.68%, especially the most abundant of Gammaproteobacteria class (61.70%) belonging to this phylum in the metagenome from humus samples of Cuc Phuong tropical forest. It has been known that in soil samples, pH is one of the major factors affecting the composition and diversity of the bacterial community [34][35][36]. Each group of bacteria usually grows at a narrow optimal pH range. Some studies showed that the abundance of Proteobacteria subgroups positively relates to neutral pH, and some Acidobacteria subphylums grow at acid pH range, whereas the effect of pH on Bacteroides abundance is not clear [34,37]. Perhaps Bacteroides growth relates to other factors such as nutrient composition than pH. In our humus samples, the range of pH was 6.9-7.3. Therefore, it is possible to be suitable for the most abundant of Proteobacteria subgroups, typically Gammaproteobacteria.
Certain studies also showed that Bacteroidetes usually represent about 10% of the total microbiota in soils [38]. Additionally, bacteria belonging to the phylum Bacteroidetes are known to contain several genes encoding the enzymes for polysaccharide degradation [39,40]. A study of high-Arctic peat soils of Svalbard showed that most genes assigned to lignocellulose degraders belonged to phyla Bacteroidetes, Actinobacteria, and Verrucomicrobia, representing approximately 70% of these genes [41]. In our results, Proteobacteria was present in the soil microbiota, but Bacteroidetes appear to play a vital role in lignocellulose degradation. The majority of identified enzymes/proteins involved in lignocellulose degradation were assigned to Proteobacteria and Bacteroidetes. The estimated ratio of Proteobacteria/Bacteroidetes harboring genes coding lignocellulose enzymes was 1:1 in pretreatment and hemicellulase groups and 2.4:1 in cellulose. In particular, the result showed that Flavobacteriales of phylum Bacteroidetes were one of the most dominant orders harboring putative lignocellulose enzymes. Thus, both phyla contain members recognized for their role in the degradation, which were discovered in Cuc Phuong National Park humus that surrounded the sites prominent with white-rot fungi degrading deadwood were similar in microbial construction as observed in other soil investigations [11,42,43]. Other findings have described the lignocellulose degradation capabilities of bacterial strains belonging to Proteobacteria, Bacteroidetes, Actinobacteria, and Firmicutes [44][45][46][47][48]. Recently, the role of bacteria of phylum Bacteroidetes in polymeric carbon degradation in soils has been investigated using the KEGG database [17]. However, the analyzed results are noted to be highly dependent on the analytical method and the selected databases.
For enzyme classification and domain analysis, we identified several important enzymes involved in the degradation of three major components of lignocellulose, which include cellulose, hemicellulose, and lignin. For lignin degradation, CE and PL (815 of the 2,808,791 ORFs annotated by KEGG, accounting for 0.03%) were predominant domains, which were consistent with our previous data [49]. Additionally, we mined 10 genes encoding laccases, including 5 complete genes that encode proteins containing 3 domains of Cu3-Cu-Cu2 in the metagenomics data from the humus samples. Laccase is an important enzyme of the multicoperoxidase group involved in lignin degradation. Laccases are mainly found in fungi and plants, but few reports described bacterial laccases. Furthermore, studies showed that laccases are predominant in soil environments [12,18,50,51]. In previous investigations, this enzyme was not identified in our metagenomics data obtained from microbial termite's gut, goat's rumen [49,52]. The presence of laccase in the humus is probably due to deep sequencing and sufficient oxygen level available for the soil microbiota. However, the number of ORFs coding for enzymes/proteins involved in lignocellulose pretreatment in this study was lower than that of the other studies. For example, a study investigated bacterial genes coding for lignocellulases in soil samples collected from a tropical forest in the Luquillo Experimental Forest, Puerto Rico, by sequencing the metagenomic DNA using Roche 454 GS FLX Titanium technology (Branford, CT). Among 68,911 functional annotated genes, 3133 CE (accounted 4.48%), 413 lignin oxidases (0.59%), 282 PL (0.41%) and 240 lignin-degrading auxiliary oxidases (0.35%) were identified [53]. The low abundance of the lignolytic enzymes in our data may be attributed to the characteristics of the sample collected from the sites of white-rot fungi degrading deadwood. The composition of the bacterial community in deadwood can be considerably influenced by fungal communities [51]. In nature, white-rot fungi are the most efficient, completely degrading lignin into CO 2 and H 2 O [54]. This ability of fungi is due to the secretion of extracellular enzymes, [55] including lignin peroxidase, manganese peroxidase, oxy oxidoreductase/laccase, and other accessory enzymes [56] when growing under limited nutrients (C:N ratio), especially in wood and soil [57].
Once degraded, cellulases and hemicellulases can be attached to the loosened lignocellulose. The overview of lignocellulases in this study is similar to the abundance patterns across the rain forest soil [53] and compost [58]. In soil, fungi normally degrade polymers into organic compounds of low molecular mass, which is preferentially used by bacteria [59,60]. In our data, GH1, GH8, and GH4 representing β-glucosidase, endoglucanase, and 6-phospho-β-glucosidase, respectively, were the most frequent typical enzyme families derived from the phylum Proteobacteria. In contrast, in other metagenomic data, GH5 affiliated with endoglucanase was dominant and belonged to phylum Bacteroidetes [49]. Notably, the enzymes were not collocated with other activity domains or non-catalytic accessory domains. These domains were perhaps featured for cellulose-degrading enzymes from soil's metagenome with the predominance of Proteobacteria and contained only one catalytic domain. Remarkably, the domain analysis showed that approximately 90% of the GH3 domain in β-glucosidase was associated with an accessory module FN3, and the construct was the most abundantly associated domain type, and the GH3-FN3 architecture was observed in both Protoeobacteria and Bacteroideles. Furthermore, the FN3 module, a non-catalytic domain, is known to decompress cellulose surface [61] and assists in enzyme conformation and activity [62][63][64]. This result was also consistent with previous findings that β-glucosidase contains FN3 but not CBM [65,66]. Our previous investigation mined GH3 from goat rumen, in which the GH3 domain associated with the FN3 domain accounted for 90.9% [63]. This is the first study investigating the bacterial community and the diversity of lignocellulases from bacteria in humus samples surrounding white-rot fungi degrading deadwood in tropical forest Cuc Phuong, Vietnam. Interestingly, the bacterial community structure and abundance pattern of cellulases and hemicellulases in the humus were similar to the samples from the soil in the rainforest. Due to the enzymes secreted by white-rot fungi for lignin conversion, lignolytic enzymes in the humus were low in abundance.

Supplementary Materials:
The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/d14030220/s1, Table S1: Microbial community structure in humus samples surrounding white-rot fungi degrading deadwood in the Cuc Phuong forest at kingdom, phylum, order, and genus levels; Table S2: Analysis of community structure of the humus bacteria harboring genes for lignocellulose degradation at phylum and order level; Table S3: Bacterial phyla possess domain-containing complete ORFs encoding protein/enzyme involved in lignocellulose pretreatment; Table S4: Bacterial phyla possess domain-containing complete ORFs encoding cellulases; Table S5: Bacterial phyla possess domain-containing complete ORFs encoding hemicellulose.