Microbiome research has offered promising insights into microbial contributions to human health [1
] and environmental dynamics [2
]. Microbiome responses can be studied by a variety of approaches, including genome and transcriptome sequencing (metagenomics and metatranscriptomics, respectively), protein expression profiling (metaproteomics), and metabolite characterization (metabolomics). Over the years, metagenomics has been the major approach for most microbiome studies, mainly because of advances in sequencing technology [3
] and development of statistical and analytical tools [4].
Recent trends in microbiome research have shown the promise of other “omic” approaches, with metaproteomics receiving particular attention as a complement to the more mature metagenomics approaches [5
]. Metaproteomic studies identify the proteins that are actively being expressed by a microbial community under specific conditions [7
]. Researchers have been promoting the potential benefits of metaproteomics for a better understanding of microbiome dynamics—particularly since it can provide insights into the functional state of the microbial community, beyond what can be predicted by metagenomics alone [6].
Although the metaproteomics approach has been used for more than a decade, it is still emerging and has not yet become routinely utilized by the microbiome research community, primarily due to the technical difficulties associated with the approach. However, with recent advances in sample preparation, improved sensitivity of protein detection by mass spectrometry (MS), and new informatics tools for data analysis and interpretation, more researchers are turning to metaproteomics and realizing its potential in microbiome research [9].
Metaproteomics research holds promise in its ability to offer mechanistic insights into microbiome activity by performing functional analysis on identified peptides and proteins [10
]. For example, microbiome studies have shown that the suite of metabolic pathways within microbiota from different persons tends to remain relatively consistent, even though microbial taxa may display considerable variation between individuals [11].
One of the key areas of advancement in metaproteomics over the past decade lies within the branch of informatics. New approaches continue to emerge across all the core areas of metaproteomics informatics, which include: (a) protein sequence database generation methods for microbial communities [12
]; (b) database search methods for matching tandem mass spectrometry (MS/MS) data to peptide sequences [17
]; and (c) interpretation methods and tools for taxonomic and functional analysis (Figure 1).
Despite these many advances, metaproteomic informatics remains very much a work in progress because of many unresolved challenges. Unlike single-organism proteomics, the protein sequence FAST-All (FASTA) databases for metaproteomics, which contain the predicted proteomes of multiple organisms, can be extremely large and complex [18
]. It is not uncommon for the in silico translation of metagenome assemblies into a predicted metaproteome to contain hundreds of thousands to millions of predicted protein sequence entries. To reduce the possibility of mis-assigned spectra, it is common practice to also include a FASTA-formatted host database and common laboratory contaminant proteins (e.g., skin keratins, proteases). For example, a study of the human oral microbiome would include human epithelial cell proteins as part of the host database, in addition to microbial proteins from the consortia that form dental plaque. Algorithms and strategies for matching MS/MS spectra to peptide sequences by database searching have been modified to address this challenge—in particular, addressing the decreased sensitivity of peptide matches due to increased false discovery rates in large databases [18
], and challenges of protein identity inference due to sharing of proteins across multiple organisms in the database (e.g., the meta-protein concept in [22]).
Another significant challenge presented by metaproteomic informatics is that many disparate, specialized software tools must be used within each of the core areas required for successful data analysis and interpretation. For most researchers, these programs are difficult to access, master and operate. This presents a significant barrier for many researchers who could otherwise benefit from using metaproteomics approaches in their research.
Here we introduce new resources aimed at increasing access to advanced metaproteomic informatics tools and facilitating training in their use, thereby breaking down the barriers that hold back many researchers seeking to utilize metaproteomics in their work. The tools are housed in the Galaxy for proteomics (Galaxy-P) platform [23
], which offers a user-friendly interface. Disparate software tools can be accessed and operated in an automated manner within a unified operating environment, which can be scaled to meet the demands of large-scale data analysis and informatics, as is often required in multi-omic approaches such as metaproteomics [24
]. These resources were developed via a unique community-based effort, which leverages a consortium of leading experts from the metaproteomics research community, including a mixture of developers, data scientists and wet-bench researchers. These researchers participated in a contribution-fest (see z.umn.edu/mphack2016
for more information), wherein specific software was selected, deployed, tested and optimized within the Galaxy framework. In this manuscript, we describe not only the resources we have made available through this community-based effort, but also the process used to successfully achieve our goals. The accessible resources should help to increase wider adoption of metaproteomic informatics tools, as well as provide a framework for future collaborative efforts to make cutting-edge metaproteomic informatics tools available to the greater research community.
2. The Metaproteomics Gateway
2.1. Description of the Accessible Resources
Metaproteomics analysis of mass spectrometry data involves multiple core steps including database generation, MS/MS spectral matching to peptide sequences, taxonomic analysis and functional analysis. Below, we describe the general strategies and software currently available within these core areas, along with the process by which our consortium selected tools for deployment and dissemination via Galaxy-P. Since the main goal of this work was to provide documentation that facilitates training in and mastery of these software tools and workflows, we have provided step-by-step training instructions and related information in Supplement S1 (z.umn.edu/supps1
). We have built a publicly accessible metaproteomics instance, or gateway (z.umn.edu/metaproteomicsgateway
), for the purposes of providing access to documentation and other instructional materials, and an opportunity for hands-on training using example datasets and optimized metaproteomics workflows (See Table 1
). Full instructions are provided at this site for registering in this gateway and gaining access to all materials.
2.2. The Playground: The Galaxy-P Platform
Galaxy-P is an extension of the open-source Galaxy bioinformatics platform, which utilizes a web-based interface to access any instance, whether housed locally or remotely. The Galaxy interface includes a Tool menu (on the left of the screen—Figure 2), a central main viewing pane, and a History menu (on the right side of the screen—Figure 2).
2.3. The First Step: Protein Sequence Database Generation Using a Galaxy-Based Tool
The composition of the protein sequence database used to match MS/MS spectra to sequences has a profound effect on the depth and reliability of identified peptides and inferred proteins in metaproteomics [14
]. The source of the sample, the sample preparation methods utilized, and the focus of the specific study all play a role in determining the composition of the protein sequence database. The results are only as good as the sequence database used: for example, if a peptide sequence present in the sample is absent from the database, neither the peptide nor the protein it is associated with can be identified. Conversely, if the protein sequence database includes many proteins that are not actually contained in the sample being analyzed (e.g., a database containing all known bacterial proteins), the database can be so large that it decreases the sensitivity for identifying peptides that are truly in the sample. Thus, generating optimized databases for metaproteomics is not trivial. Ideally, the database would be constructed based on the known taxonomic makeup of the sample being analyzed—which can be achieved by metagenomic analysis of the sample or by selecting publicly available taxonomic metagenomics databases, if these exist for the sample in question.
During the contribution fest, several options for protein sequence database generation were considered. We first looked at options already available within the Galaxy-P platform. One option was the use of publicly available taxonomic repositories specific to certain sample types or environments [26
]. A tool in Galaxy-P (Protein Database Downloader) was already in place for automated generation of databases based on information available from repositories including the Human Microbiome Project, the Human Oral Microbiome database, and the EBI metagenomics resource.
Another option already available within the Galaxy-P suite of tools is a tool for generating customized protein sequence databases from a list of genera thought to be in a sample. In some cases, a list of genera is available through previous published studies and can be useful in generating a protein sequence database [32
]. In particular, 16S rRNA sequencing is used to assign operational taxonomic units (OTUs) in the form of species, genera or phyla. This can serve as a guide for generating a customized protein sequence database. Galaxy-P houses a tool to work through the UniProt Application Programming Interface (API) and extract protein sequences for all of the genera or phyla within a given list, generating a customized database for the metaproteomic analysis.
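As a rough illustration of this genus-list strategy, the sketch below builds a query URL for retrieving FASTA-formatted sequences for a list of genera. The endpoint and query field (`rest.uniprot.org`, `taxonomy_name`) reflect the current public UniProt REST API as we understand it and are assumptions; they are not necessarily what the Galaxy-P tool calls internally.

```python
from urllib.parse import urlencode

# Public UniProt REST streaming endpoint (assumed; the Galaxy-P tool's
# internal API usage may differ).
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"

def uniprot_fasta_url(taxa, reviewed_only=False):
    """Build a UniProt REST query URL that retrieves FASTA-formatted
    protein sequences for a list of genus (or phylum) names."""
    clauses = [f'taxonomy_name:"{t}"' for t in taxa]
    query = " OR ".join(clauses)
    if reviewed_only:
        # Restrict to Swiss-Prot (reviewed) entries for a smaller database.
        query = f"({query}) AND reviewed:true"
    return f"{UNIPROT_STREAM}?{urlencode({'query': query, 'format': 'fasta'})}"
```

Fetching the resulting URL (e.g., with `urllib.request`) would stream back a FASTA database covering all proteins annotated under the listed taxa, analogous in spirit to what the Galaxy-P database-generation tool automates.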
Given these already existing tools, we decided to direct our efforts to deploying more cutting-edge tools for database generation, which follows recent trends in using metagenomics information to generate more accurate protein sequence databases tailored to the taxonomic make-up of any given sample [33
]. In particular, whole metagenome sequencing offers increased taxonomic resolution over 16S rRNA sequencing, thus enabling more accurate taxonomic and functional categorization of identified sequences [38].
Targeting tools that leveraged emerging methods in whole metagenome sequencing, we considered two approaches. One was the recently described Omega (overlap-graph metagenome assembler), a software tool for assembly of shotgun metagenome data that can be used along with the Sipros algorithm for database generation and matching to MS/MS data [39
]. The second was a novel method and software (called Sixgill) described by May et al. that uses a ‘metapeptide database’ derived from shotgun metagenomics sequencing [15
]. The database generated using this method is optimized for MS/MS data, thereby providing more rapid and accurate peptide-to-spectrum matching. In the original publication, the method was used on two ocean samples that had undergone whole-genome metagenomic sequencing, and was shown to offer a significant increase in the number of identifications (presumably due to a more accurate and compact database) as compared to a metaproteome sequence database assembled using standard methods, as well as to the comprehensive sequence database from the NCBI repository.
Given its demonstrated performance and optimized algorithm for utilizing large-scale, whole genome sequence data, we chose to implement the Sixgill software in Galaxy-P (Figure 3
). We have provided step-by-step instructions for the use of Sixgill to create a metapeptide database, as well as the necessary input data, as described in Supplement S1
. The deployed Sixgill tool provides a ‘build’ function, which generates a tab separated value (TSV) file containing the amino acid sequence of metapeptides along with other metrics. The Sixgill ‘makefasta’ function utilizes this information to generate a FASTA-formatted peptide database, which is compatible with database searching programs.
2.4. The Next Steps: Using a Galaxy Workflow
Galaxy also offers an option of generating a Galaxy ‘workflow’ which contains all the processing steps and software tool parameters for a particular analysis—except for the input or output data. Usually, workflows consist of multiple software tools, which are run in an automated, sequential manner, where outputs from one tool provide the input data for the next tool—ideally suited for multi-step analyses that are inherent to metaproteomic data analysis. Once built and optimized, workflows can be saved such that they become a main operational unit for analyzing different datasets in an efficient manner. Saved workflows can also be shared with other Galaxy users—thus promoting dissemination, reproducibility and collaboration.
The remaining three steps comprising our metaproteomics informatics resource (spectral matching, taxonomy analysis and functional analysis) are encapsulated in a single workflow (Figure 4
). The starting data inputs to this workflow are MS/MS data files (in the form of mascot generic files, MGFs) and the FASTA-formatted metapeptide sequence database generated in step 1 above. The second step (spectral matching) yields identified metapeptides that act as inputs for the third step (taxonomy analysis) and fourth step (functional analysis). For functional analysis, an additional input file with Gene Ontology (GO) terms is also required.
In our specific workflow built for training purposes, the inputs are MGF files (from Bering Strait ocean samples), which are searched against the metapeptide database (generated by running Sixgill on metagenomics data). In order to save time, we have trimmed the MGF datasets and the Bering Strait metapeptide database from those provided in the manuscript by May et al. [15
]. Users should refer to Supplement S1
for detailed instructions on how to use the workflow on the example dataset.
2.5. The Second Step: Spectral Matching
Sequence database searching algorithms that are able to match MS/MS spectra to peptide sequences contained in large databases (e.g., 10⁶
or more sequences) have also been developed specifically for metaproteomics applications [40
]. Selecting from the available software for metaproteomic sequence database searching must balance the following factors: (a) ability to effectively use large databases while still sensitively matching spectra to peptide sequences; (b) speed of the core algorithm, along with scalability for execution on parallel computing infrastructure, enabling the processing of large datasets using large sequence databases in a reasonable timeframe; and (c) the ability to generate outputs with robust false discovery rate (FDR) estimations, that are also compatible with downstream processing steps for taxonomic and functional analysis.
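Factor (c) above rests on target-decoy FDR estimation. A minimal sketch of that calculation, assuming peptide-spectrum matches are available as (score, is_decoy) pairs and using the standard decoys-over-targets estimate:

```python
def psms_at_fdr(psms, threshold=0.01):
    """Return target PSMs accepted at the given false discovery rate.

    `psms` is a list of (score, is_decoy) tuples. The FDR at a score
    cutoff is estimated as decoys / targets among PSMs above the cutoff
    (the standard target-decoy estimate). For simplicity this walks down
    the score-sorted list and stops the first time the estimate exceeds
    the threshold, which is slightly conservative compared to production
    tools such as PeptideShaker."""
    accepted = []
    targets = decoys = 0
    for score, is_decoy in sorted(psms, key=lambda p: -p[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets > threshold:
            break
        if not is_decoy:
            accepted.append((score, is_decoy))
    return accepted
```

The key intuition for metaproteomics is visible here: the larger the database, the more high-scoring decoy matches occur by chance, so the score cutoff must rise and fewer true peptides survive at a fixed FDR.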
Multiple strategies have been suggested to increase the sensitivity of peptide identifications for the large sequence databases encountered in metaproteomics. This includes an iterative database searching workflow [42
], a cascaded database search method [43
] and a two-step method for searching large databases [44
]. Muth et al. have recommended using a database sectioning approach, such that searches against subsets of a large database may increase the number of high confidence identifications [18
]. The same group has proposed the use of de novo spectral matching in tandem with traditional sequence database-dependent methods [18
], as well as the use of multiple database search algorithms, such as those offered by the SearchGUI tool [46
], to increase the numbers of confident metapeptide identifications.
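One simple way to realize the multiple-search-algorithm idea is to keep only peptides reported by a minimum number of engines. The sketch below is a toy consensus filter for illustration; it is not how SearchGUI and PeptideShaker actually combine and statistically re-score engine results.

```python
from collections import Counter

def consensus_peptides(engine_results, min_engines=2):
    """Combine peptide identifications from several search engines,
    keeping peptides reported by at least `min_engines` of them.
    `engine_results` maps an engine name to its list of identified
    peptide sequences. A toy model of multi-engine consensus."""
    counts = Counter()
    for peptides in engine_results.values():
        counts.update(set(peptides))  # count each engine at most once
    return {pep for pep, n in counts.items() if n >= min_engines}
```

Real pipelines combine evidence more carefully (e.g., by merging scores rather than taking a simple vote), but the idea that agreement between independent algorithms raises confidence is the same.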
For the workflow deployed in our informatics resource, we chose a relatively straightforward approach for spectral matching. We used the SearchGUI tool already deployed in Galaxy, utilizing X!Tandem as the sequence database search algorithm of choice. Although the Galaxy-deployed SearchGUI tool offers multiple database search algorithms (e.g., MS-GF+, MyriMatch, OMSSA, Comet, MS-Amanda and Novor), X!Tandem was determined to have a balance of speed and sensitivity that made it a good choice, especially for a training resource. The outputs from SearchGUI are further filtered and statistically analyzed using the companion PeptideShaker tool [47
], which provides outputs compatible with downstream processing. Supplement S1
provides detailed instructions on the sequence database-searching step in this workflow, including a description of the small-scale input data we have provided for training purposes.
2.6. The Third Step: Taxonomic Classification
In metaproteomic studies, the identified microbial peptides can be used to determine the taxonomic composition of the sample. A number of options exist for taxonomic classification from the metapeptide data, some of which were already deployed in Galaxy-P. The Unipept tool, deployed previously in Galaxy-P [24
], maps sequences to annotated microbial organisms contained in the UniProt knowledgebase and subjects these to lowest common ancestor (LCA) analysis to provide a list of taxon identifications (at the level of kingdom, phylum, genus or species, if possible). The BLAST-P tool, also previously implemented in Galaxy-P [23
], can match peptides to microbial proteins contained in the comprehensive NCBI non-redundant (nr) database, followed by taxonomic classification using MEGAN software [48
] for metaproteomics data analysis [44].
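The lowest common ancestor (LCA) step used by Unipept can be illustrated with a short sketch: walk the taxonomic lineages of all organisms matching a peptide from the root downward, and keep the deepest rank on which they all agree. This is a simplified model of the idea, not Unipept's implementation:

```python
def lowest_common_ancestor(lineages):
    """Compute the lowest common ancestor of a set of taxonomic lineages,
    each given root-first (e.g., superkingdom, phylum, ..., species), as a
    simplified model of Unipept-style peptide taxonomy assignment.
    Returns None when the lineages disagree even at the root."""
    if not lineages:
        return None
    lca = None
    for ranks in zip(*lineages):  # walk ranks from the root downward
        if len(set(ranks)) == 1:
            lca = ranks[0]  # all lineages still agree at this rank
        else:
            break
    return lca
```

A peptide found only in one species can thus be assigned at species level, while a peptide shared across a phylum is assigned only at phylum level, which is exactly why LCA analysis yields taxon identifications "at the level of kingdom, phylum, genus or species, if possible".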
During the metaproteomics contribution-fest, a number of new tools and extensions to new tools were considered for deployment in Galaxy. For example, taxonomy classification tools from the MetaProteomeAnalyzer [22
] were considered, which process peptides identified via multiple database searching engines using information from the UniProt and National Center for Biotechnology Information (NCBI) repositories. Another tool under consideration was Prophane (https://mikrobiologie.uni-greifswald.de/en/resources/metaproteomics-data-analyses/prophane/
), which uses the CLUSTAL W sequence alignment tool and other annotation tools to perform taxonomic classification.
Ultimately, the work stemming from the contribution fest focused on extending the functionality in Galaxy-P of the Unipept tool [19
]. As mentioned above, Unipept was already deployed in Galaxy-P, providing textual outputs of taxonomic classes (Figure 5
). We extended this function, adding the capability of visualizing taxonomic groups by packaging recently added visualization capabilities of Unipept into the Galaxy-based tool (Figure 5
). With this functionality, the outputs from the metaproteomics workflow run in Galaxy-P now offer the user the option of launching a visualization window of the taxonomic results (Figure 4
). Details about this functionality within the workflow are provided in Supplement S1.
2.7. The Fourth Step: Functional Analysis
Metaproteomics has a distinct advantage in determining the functional signature associated with a microbial community under a specific condition based on identification of the proteins that are actually being expressed [44
]. However, characterizing the functional state from a collection of expressed proteins is not trivial. Functional annotation based on a protein profile requires several components: a controlled vocabulary (or ontology) that represents protein function, databases containing annotations of known proteins or protein families with terms from these vocabularies, and alignment tools that map functional annotations within data repositories to the experimentally identified peptides or proteins. Many ontologies exist and often focus on different aspects of function: the Gene Ontology [51
] and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [52
] are two of the most prominent. A number of databases use diverse methodologies to assign function to proteins or protein groups—these include the InterPro [53
], and “evolutionary genealogy of genes: Non-supervised Orthologous Groups” (eggNOG) [54
] databases. Finally, tools to map functional annotations from these databases to experimentally identified proteins are often database-specific, such as the eggNOG-mapper [55
] and InterProScan [56
]. In addition, MEGAN6 can be used to carry out InterPro2GO, KEGG, SEED and eggNOG analysis to determine the distribution of functions amongst expressed proteins in the microbiome [48].
Beyond mapping functional annotations to identified proteins, visualization of the collective functional categories is the next desirable step. Here, various options are available—with potential for deployment in the Galaxy platform. For single GO terms of interest, the QuickGO browser [57
] enables the user to view the full term definition, as well as to browse closely related terms. For large lists of GO terms, the ‘reduce and visualize Gene Ontology’ (REVIGO) tool [58
] allows the reduction of GO terms to a representative subset and several visualizations of the resulting smaller list. The Prophane suite of tools can also be used to determine the distribution of functions in a microbiome sample and visualize them. MetaProteomeAnalyzer provides enzyme and pathway display options where proteins grouped by UniProt ontologies (e.g., biological process or molecular function), EC (Enzyme Commission) numbers and KEGG pathways can be visualized [22].
Although all of these options have great potential for functional annotation and visualization, our community-based efforts focused on utilizing the Galaxy-deployed Unipept tool and its Pept2Prot option, which maps identified peptide sequences to proteins. The proteins are then mapped to GO terms for molecular function, biological process and cellular component, and this GO term mapping information is used to group the results into functional categories (Figure 6
). The grouping was performed using a Galaxy tool, Query Tabular, tailored to automate extraction and grouping of tabular data results. These results are presented as a tabular output for further downstream analysis, such as with visualization software. Details are provided in Supplement S1
about the tools involved in this functional annotation step, along with instructions.
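The grouping step can be sketched as follows: map each identified peptide to its proteins, look up each protein's GO terms, and tally the terms within the three GO namespaces. The dictionaries here are hypothetical stand-ins for the Pept2Prot output and the GO-term input file described above:

```python
from collections import Counter

def summarize_go_terms(peptide_to_proteins, protein_to_go, namespace_of):
    """Tally GO annotations of identified peptides within the three GO
    namespaces (molecular function, biological process, cellular
    component), mimicking the tabular functional summary. All three
    input mappings are hypothetical stand-ins for the workflow's files."""
    summary = {"molecular_function": Counter(),
               "biological_process": Counter(),
               "cellular_component": Counter()}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            for term in protein_to_go.get(protein, []):
                ns = namespace_of.get(term)
                if ns in summary:
                    summary[ns][term] += 1
    return summary
```

The resulting per-namespace counts correspond to the kind of tabular output that can then be passed to downstream visualization software.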
2.8. Links to Accessible Resources for Training
The main goal of our contribution fest was to provide a means for researchers to access and learn the operation of cutting-edge metaproteomics tools. We have provided several avenues for researchers to access and train in the use of these tools (See Table 1
). We have established a Metaproteomics Gateway, composed of a publicly accessible Galaxy instance containing the tools, workflows and example data described in this manuscript. Supplement S1
provides a detailed description for the use of this gateway. We have also provided our documentation and training instructions within the Galaxy Training Network repository (http://galaxyproject.github.io/training-material/
), a central resource for providing documentation on Galaxy-based tools and platforms. Our tools and workflows have also been made openly available through the Galaxy Tool Shed and on GitHub. We hope that these resources, which also include an introductory video, will encourage researchers to incorporate metaproteomics studies into their research.
In conclusion, we have described accessible resources aimed at training researchers in the use of advanced metaproteomic informatics tools, with the intent of increasing the adoption of metaproteomics by the wider research community. These tools have been made available through a unique, community-based process, which has leveraged a community of metaproteomic informatics experts, as well as the powerful Galaxy platform. We would like to emphasize that the use of Galaxy was highly enabling for this work, as it provides a unified environment for operating many disparate tools required in metaproteomics, as well as a platform that can be used to promote training and usage by the larger community.
Several other points are worth noting from the work we have described here. It is evident that, for each of the core steps described in the metaproteomic data analysis pipeline, many valuable software tools already exist. During our contribution fest, our consortium of researchers was only able to deploy, test and optimize a select few of these tools. Work is ongoing on implementing additional tools. In the future, we anticipate an increased need for visualization, quantitation and statistical tools in metaproteomics research, which will aid in biological interpretation. It is our hope that this manuscript serves as an invitation to others to join our collaborative community and help make additional high-value tools for metaproteomics available. Again, our usage of the open-source Galaxy-P platform for deployment and dissemination provides a playground for developers to come ‘play’ in and collaborate with other like-minded researchers from around the world. We also hope that the ‘shareable’ workflows developed will facilitate the undertaking of global-scale research projects.
It is our hope that this manuscript will help establish a framework for continued, community-based efforts at making cutting-edge metaproteomics tools available to others, along with the necessary documentation and hands-on training resources to educate researchers in their use. Ultimately, we hope this approach will yield great dividends in increasing the adoption of metaproteomic approaches by more researchers, which will help catalyze a better understanding of the molecular characteristics of dynamic microbial communities and microbiomes.