PhyloFunDB: A Pipeline to Create and Update Functional Gene Taxonomic Databases

The increase in sequencing capacity has amplified the number of taxonomically unclassified sequences in most databases. The classification of such sequences demands phylogenetic tree construction and comparison to currently classified sequences, a process that demands the processing of large amounts of data and use of several different software. Here, we present PhyloFunDB, a pipeline for extracting, processing, and inferring phylogenetic trees from specific functional genes. The goal of our work is to decrease processing time and facilitate the grouping of sequences that can be used for improved taxonomic classification of functional gene datasets.


Introduction
Several studies employ the amplification of functional genes for taxonomic profiling of specific microbial communities, such as genes involved in nitrogen (nirK, nosZ, amoA) or carbon (pmoA, mcrA) cycling. However, the taxonomic classification relies on databases, which requires representatives of as many taxonomic groups as possible in order to obtain the most reliable taxonomic assignments [1]. Developments in high-throughput sequencing have increased the number of unclassified sequences in databases such as Genbank, as most sequences are derived from cultivation-independent studies [2][3][4]. Grouping sequences by similarity, analyzing their clusters in phylogenetic trees, and comparing the position of unknown sequences to known clades is a recognized approach for improving the taxonomy of currently unclassified sequences [4]. This process, however, demands the download and processing of sometimes large amounts of data using several different software, a procedure that we aim to facilitate and integrate with PhyloFunDB.
Here, we present PhyloFunDB, a pipeline for extracting specific genes of choice, processing sequences, and inferring phylogenetic trees that can be used to assign or confirm microbial taxonomic groups. Such groups can be employed for an improved taxonomic classification of functional gene datasets. Additionally, we provide a second configuration option to update the databases and trees generated by the pipeline, where the newest sequences can be extracted, processed, and added to the databases and phylogenetic trees that need updates.
In order to demonstrate the functionality of the pipeline, we chose the mcrA gene to perform the procedure. The mcrA gene produces the alpha subunit of the methyl coenzyme M reductase complex, which is fundamental in the archaeal methane metabolism [5,6]. Methanogenic communities play an important role in the global greenhouse gas budget, as they produce methane under anoxic conditions and have been observed in wetlands, values. Output files from PhyloFunDB include the .treefile with the OTU representatives, a .fasta file containing the complete set of sequences, and the associated .txt taxonomy file, which can be used for the manual curation of the database. Moreover, after building a specific gene database, updating with new sequences is also possible. A second configuration option is available (Figure 3), for downloading and processing the newest sequences uploaded to the NCBI database. In the config.yaml file (Figure 2), the user can set the parameter update to true and add a date range for downloading new sequences, which go through the same downstream processing as in the first workflow. At the end, however, only the new OTUs are added to the reference tree, and the fasta sequences are appended to the old fasta file. The new OTUs are placed in the reference tree by using the evolutionary placement algorithm (EPA) [21] from RAxML. In the config.yaml file, the user also provides the paths to the reference tree and database archives where the new sequences should be appended (path_to_tree, path_to_seqs, path_to_db and path_to_tax parameters).  Configuration file with parameters defined by the user. Configuration file for the database producing pipeline, containing parameters for query search, minimum sequence size, OTU and distance matrix cutoff, and FrameBot database status. When the update parameter is set to true, the user also has to input the date range for the download of the new sequences and paths to the files of the database to be updated.

Figure 2.
Configuration file with parameters defined by the user. Configuration file for the database producing pipeline, containing parameters for query search, minimum sequence size, OTU and distance matrix cutoff, and FrameBot database status. When the update parameter is set to true, the user also has to input the date range for the download of the new sequences and paths to the files of the database to be updated.
Moreover, after building a specific gene database, updating with new sequences is also possible. A second configuration option is available (Figure 3), for downloading and processing the newest sequences uploaded to the NCBI database. In the config.yaml file (Figure 2), the user can set the parameter update to true and add a date range for downloading new sequences, which go through the same downstream processing as in the first workflow. At the end, however, only the new OTUs are added to the reference tree, and the fasta sequences are appended to the old fasta file. The new OTUs are placed in the reference tree by using the evolutionary placement algorithm (EPA) [21] from RAxML. In the config.yaml file, the user also provides the paths to the reference tree and database archives where the new sequences should be appended (path_to_tree, path_to_seqs, path_to_db and path_to_tax parameters).

Results and Discussion
It took approximately 40 h to retrieve and process 31,202 sequences, and to generate the final archives using 8 CPU threads on a shared multicore server with ample compute power (80 Intel Xeon Gold 6230 CPU threads, 75 GB of RAM) running Ubuntu 20.04 LTS. Retrieving sequences from NCBI and frameshift correction by FrameBot [14] were the longest steps (approximately 24 and 14 h, respectively). Following this, we performed the manual checking of mcrA OTUs representatives and groups. After obtaining the database files, the manual curation is necessary to check the taxonomy/clustering of the sequences in the inferred tree and improve the classification of the unassigned/unclassified sequences. In addition to the appended taxonomy, it is also possible to download metadata from all sequences using the NCBI Entrez API and add that information to the created database. For downloading metadata associated with our mcrA gene sequences, we used the following command line: "esearch -db nucleotide -query "mcrA[gene]"|efetch -format gpc|xtract -insd source organism mol_type strain country isolation_source|sort|uniq >metadata_mcrA.txt".
After downloading the metadata, we checked whether there were cultivated representatives within the OTU groups, as the OTU representative sequence chosen by Mothur is not always a cultivated/known organism. It can be checked in the file "interm/{gene}.aligned.good.filter.unique.pick.good.filter.an.{cutoff_otu}.rep.names". Next, we used the taxonomy of the known sequences to improve the taxonomy of unknown sequences. Unclassified clusters were identified in the inferred tree based on the closest classified sequences (Figure 4). Many of the sequences were identified as order-like or family-like classifications, due to a lack of a close classified representative. In general, the analysis will also depend on how many sequences from the database are well classified, which varies depending on the gene. Using the metadata downloaded from NCBI, it is also possible to add the environmental origin of the sequences to their taxonomy string. After checking and refining or correcting the taxonomy strings of the OTU representatives, we ex-        After the manual curation, we obtained a database with 31,202 sequences, containing 1112 sequences classified at least to family level, which allowed the improvement the classification of the other 30,090 sequences. The tree possessed 142 representatives classified at genus level, which helped to improve the grouping and classification of other 219 OTUs. This analysis helped improving the classification of several unclassified sequences, which can be used as a support in amplificon studies to give a better overview of the taxonomy of mcrA datasets. However, supplementary analysis, such as protein sequence trees, and more information about the origin of the sequences should be included in order to properly resolve mcrA phylogeny.
It is worth noting that we are applying a distance clustering method, which groups taxa by perceived similarity among sequences, instead of evolutionary descent, which can, therefore, result in classification errors. This approach was chosen to decrease the complexity of the datasets, which, otherwise, would result in the individual analyses of thousands of sequences. Nonetheless, the user can avoid the clustering step, including every single sequence in the final tree; however, the processing time will depend on the final number of sequences and bootstrap test may take weeks or even months. Running the pipeline does not require a large number of threads or RAM memory, for instance, running mcrA gene used approximately 2.5 gb. The longer steps are due to the availability of NCBI servers and FrameBot software v 1.2.0 (East Lansing, MI, USA) [14]. FrameBot is a user-friendly java-based tool that, even though useful, has not been improved in recent years. An improvement in the performance of this tool would require the involvement of its original developers. Nonetheless, the speed of the pipeline will depend on the number of sequences employed. The tools we chose do not use considerable resources and were suitable for our purposes; however, comparisons with other clustering and alignment software could be performed, decreasing the processing time even more and increasing group placement accuracy in the phylogenetic trees. Among interesting options are the clustering software CD-HIT and Swarm [22], aligners PRANK [23] and Clustal Omega [24], and HMM-FRAME [25] for frameshift detection and correction.
To our knowledge, there are no pipelines for gene database creation that download sequences and produces phylogenetic trees. The most similar pipeline is SATé [26], which performs sequence alignment and tree estimation, and performance comparisons could be performed as future steps towards improvement of the pipeline. Further enhancements could include capacity to handle several genes at the same time, since the current best way to produce databases for more than one gene is to run the pipeline for the different genes in parallel. In addition, tests with different parameters and software for more accurate alignments, clustering, and tree generation could be executed.
Overall, PhyloFunDB allowed automated retrieval and processing of a considerable number of sequences, which were further used for the evaluation and improvement of mcrA gene taxonomy. Further analysis can be performed by the user to produce deeper analyses, including the metadata made available together with the sequences downloaded. In addition, PhyloFunDB can be used in parallel for the processing of several different genes at the same time, speeding up the production of a collection of gene databases.