Research over the past years has established the importance of the gut microbiota for human health and has shown that a disturbed equilibrium is involved in the development of disease [1]. Therefore, scientists have begun characterizing the microbiome of the human gut in healthy and diseased states. Today, most microbiome studies rely on 16S rRNA or shotgun metagenome sequencing to provide a taxonomic description of the microbiome [2]. Such a description does not, however, necessarily reflect proteomic and metabolic activity and, thus, may lack direct functional information. Other omics technologies, such as metaproteomics, metatranscriptomics or metabolomics, can supplement the genomic approaches by providing a molecular view of cellular processes at a more direct functional level [3]. The term metaproteomics was introduced by Wilmes and Bond [4] as well as Rodriguez-Valera [5] as “the large-scale characterization of the entire protein complement of environmental microbiota at a given point in time”. The main conceptual advantage of metaproteomics is that it can add functional annotations to the description of the microbiome. In addition, metaproteomics can detect proteins from both the host and the microbiota simultaneously and, thus, aid in the characterization of host-microbiome interactions [6].
So far, metaproteomic studies have mainly reported variations in the microbiome of healthy people [7], changes as a result of antibiotic treatment [9] or the role of the microbiome in chronic gut inflammation, such as Crohn’s disease, inflammatory bowel disease or ulcerative colitis [10], as well as in obesity [12] and diabetes [14]. Here, we present the first metaproteomic study of the gut microbiome in leukemia patients colonized with multidrug-resistant Enterobacteriaceae (MRE). Infections with multidrug-resistant pathogens during hospitalization are becoming a critical problem. Leukemia patients are affected particularly often because they have a compromised immune system and are regularly exposed to pathogens during extended periods in hospitals. In addition, leukemia patients are frequently treated with antibiotics, which alter the microbiome that confers resistance to intestinal colonization by exogenous bacteria [15]. Hence, leukemia patients frequently acquire secondary infections during hospitalization. Therefore, it is important to better understand how the gut microbiota may prevent colonization by pathogens and how such information may be utilized in the clinical management of patients.
Although sample preparation protocols in metaproteomics are becoming standardized for clinical studies [16] and the very high performance of liquid chromatography-tandem mass spectrometry (LC-MS/MS) allows the efficient collection of metaproteomic data, the actual analysis of these data still faces major challenges [17]. These include the lack of truly comprehensive bacterial sequence databases, the demand for considerable computational power, and a shortage of functional and taxonomic annotation [18]. Estimates for fecal samples suggest the potential presence of up to 1,000,000 unique proteins [19], leading to enormous sequence databases. On top of requiring high computational power and large storage systems for data handling and processing, such excessive search spaces result in a significant loss of peptide identification sensitivity. While this issue can be partially addressed by using sample-specific databases generated by genomics or transcriptomics, the absence of annotations for these creates the need for large-scale sequence similarity searches (e.g., with the basic local alignment search tool (BLAST) [20]) to obtain taxonomic and functional information, which again requires great computational effort. Furthermore, mapping peptides to proteins and taxa is not trivial because many (usually tryptic) peptides are shared by homologous proteins [21]. In addition, metaproteomic analysis is challenged by the high complexity of the samples, the wide dynamic range of the species present and of their protein expression levels and, importantly, by large inter- and intra-patient variability [22].
Despite these challenges, analysis of the metaproteome is important. Therefore, we embarked on the first gut metaproteome study of leukemia patients whose gut is colonized with MRE. We analyzed 212 fecal samples from 56 patients and provide, to our knowledge, one of the largest clinical metaproteomic datasets to date. In the present manuscript, we report on the analysis of these data, highlight the main challenges and draw some conclusions that may guide scientists and clinicians when designing and conducting metaproteomic projects.
Metaproteomics is a young and developing field of research. PubMed currently (20th October 2018) lists a total of ~500 metaproteomics publications, whereas ~7500 scientific reports are published per year in proteomics as a whole (Supplementary Figure S7). Although metaproteomic analysis has seen substantial progress, major challenges remain to be overcome. When designing future clinical metaproteomic studies, our data suggest that it is advisable to include longitudinal sampling systematically for each patient and to keep sampling intervals short and consistent within the cohort. Clinical studies often suffer from small sample sizes and, therefore, poor statistical power [53]. We emphasize this point for future clinical study designs, as generating the actual metaproteomic data is no longer a bottleneck. As in all clinical studies, it is important to record as much accurate metadata as possible about the patients and their treatments to be able to account for potential confounding factors and to distinguish interesting effects from uncontrolled factors.
For the time being, the generation of sample-specific databases is highly recommended to support comprehensive peptide and protein identification. While this requires metagenome sequencing of each sample, it mitigates the inevitable loss in confident peptide identifications when using community-based resources, such as the integrated gene catalog (IGC). In addition, we propose that transcriptomics data from RNA-Seq could be another way to generate databases and would likely capture even better the contribution of individual species and proteins to overall protein expression. These approaches do, however, come at the price of having to generate metagenomes for each sample and to process each of these into a list of protein sequences. That may not always be feasible in terms of cost and time, in which case IGC is the next best alternative. However, large database sizes come with several issues. In particular, the ability to distinguish correct from incorrect matches is strongly impaired. The concept of false discovery rate (FDR) estimation as defined by Elias and Gygi [54] comes with the assumption that the database is a comprehensive representation of the real search space. This assumption is, most of the time, not justified for metaproteomic samples. One option to overcome such an artificial loss of identifications is to include semi-supervised machine learning algorithms like Percolator [28] or Nokoi [55] for scoring peptide-spectrum matches (PSMs). Another approach to circumvent the issue of large search spaces is the clustering of peptides [56] or the use of two-step searches as proposed by Jagtap et al. [58]. However, some controversy exists in the field of metaproteomics as to what degree the latter method leads to an underestimation of the true FDR. To solve this problem for metaproteomics, a major rethinking of peptide match scoring is necessary. We anticipate that substantial progress will be made when using synthetic peptides as a ground truth for training predictors of tandem mass spectra. Large collections of synthetic peptides are becoming available through initiatives such as the ProteomeTools project [59]. The use of sample-specific sequence databases for peptide identification also limits, at least to some degree, the demand for large computational power and storage capacity.
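To make the target-decoy reasoning behind FDR estimation more concrete, the following minimal Python sketch ranks PSMs by score and derives q-values from the number of decoy hits. It is an illustration only: the function name `target_decoy_qvalues`, the input format and the simple decoys/targets estimator are assumptions made for this example and do not reproduce the search pipeline used in this study.

```python
def target_decoy_qvalues(psms):
    """Estimate q-values for PSMs from a target-decoy search.

    psms: list of (score, is_decoy) tuples, higher score = better match
          (hypothetical input format chosen for this sketch).
    Returns a list of (score, is_decoy, q_value), sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each score threshold: decoys / targets.
    targets = 0
    decoys = 0
    fdrs = []
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at which a PSM would still be accepted,
    # obtained by taking the running minimum from the worst score upwards.
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


def accepted_targets(psms, q_cutoff=0.01):
    """Return target PSMs whose q-value does not exceed the cutoff."""
    return [r for r in target_decoy_qvalues(psms) if not r[1] and r[2] <= q_cutoff]
```

In very large search spaces, more decoy sequences reach high scores by chance, so fewer target PSMs pass a fixed q-value cutoff. This is the loss of sensitivity described above.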
Further obvious challenges of metaproteomics are sample variability and the high proportion of missing values, the latter impairing the use of many statistical methods that require complete data matrices. Due to their high species complexity, metaproteomic samples generally show high variability. Sample variability could be enhanced in the present patient cohort due to the administration of chemotherapy, which increases the mutational load in bacteria [60], and the administration of antibiotics, which alters the intestinal microbiome composition [51]. This variability may be attenuated by increasing the dynamic range of the analytical workflow, e.g., by using deep fractionation of peptides prior to LC-MS/MS analysis or depletion of non-bacterial contaminants. This would, however, imply increased time requirements and cost, and the production of an even higher ‘data mountain’. In addition, this may not always be feasible, especially in large clinical studies or when sample availability is low [61]. In contrast to mainstream proteomics, which makes strong use of intensity-based abundance estimation, metaproteomics is still largely confined to spectral counting methods because only a few peptides are detected in many samples, which limits the accuracy with which changes can be measured [62]. In addition, most quantification methods require robust normalization of the data. In microbiology, samples are regularly normalized to the sample input weight (here feces). This may or may not be a fair representation of the actual amount of, and variation in, bacteria and protein in a sample. It has become standard procedure in proteomics to normalize input material for LC-MS/MS measurements on the basis of total peptide or protein content to ensure equal depth of analysis and reproducible identification [63]. This may not be possible in metaproteomics: comparability and normalization may be compromised because the feces may contain proteinaceous material other than that contributed by bacteria and the host. Many researchers are turning their attention to data-independent acquisition (DIA) strategies because they promise to improve reproducibility and quantification and could decrease the level of missing values. Yet, spectrum annotation and the availability of suitable spectral libraries for DIA are still challenging for single proteomes and, in our view, the concepts and tools need to be much improved before DIA is applicable to complex metaproteomic samples.
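As a small illustration of the two quantities discussed above, the following Python sketch scales spectral counts to the total count of each sample and reports the proportion of missing values in the resulting sample-by-protein matrix. The data layout and the function names `normalize_spectral_counts` and `missing_fraction` are assumptions made for this example; total-count scaling is only one of several possible normalization choices and is not presented here as the procedure used in this study.

```python
def normalize_spectral_counts(counts):
    """Scale spectral counts so that each sample sums to 1.

    counts: dict mapping sample -> dict mapping protein -> spectral count.
            Proteins not detected in a sample are simply absent
            (hypothetical layout chosen for this sketch).
    """
    normalized = {}
    for sample, proteins in counts.items():
        total = sum(proteins.values())
        if total == 0:
            normalized[sample] = {}
        else:
            normalized[sample] = {p: c / total for p, c in proteins.items()}
    return normalized


def missing_fraction(counts):
    """Fraction of empty cells in the sample-by-protein matrix."""
    all_proteins = set()
    for proteins in counts.values():
        all_proteins.update(proteins)
    cells = len(all_proteins) * len(counts)
    if cells == 0:
        return 0.0
    observed = sum(len(proteins) for proteins in counts.values())
    return 1.0 - observed / cells


# Toy example with invented counts for two samples.
toy_counts = {
    "sample_1": {"protein_A": 3, "protein_B": 1},
    "sample_2": {"protein_A": 2},
}
toy_normalized = normalize_spectral_counts(toy_counts)
toy_missing = missing_fraction(toy_counts)
```

Scaling to the total count makes samples comparable in relative terms only. It does not resolve the question raised above of how much of the measured material stems from bacteria, the host or other sources.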
An entirely different option to circumvent missing values at the peptide/protein level is to compare abundances of Gene Ontology (GO) terms and taxonomic distributions. Although this shows promising results, clear taxonomic and functional annotation is not always feasible in metaproteomics. Because peptides can be shared between different proteins of the same organism or between multiple organisms, the protein inference problem in metaproteomics is even more pronounced than in single-organism proteomics [21]. Unequivocal annotation to one species is, therefore, often not possible. To circumvent this problem, peptides and proteins are often mapped to the lowest common ancestor (LCA) as first described by Huson et al. [64]. However, this clearly results in a loss of information and potentially ambiguous annotations, limiting its applicability to higher phylogenetic levels, such as classes or phyla. Still, it is a very practical approach that does provide functional annotation and, thus, helps in the interpretation of metaproteomic data. This was strongly facilitated by the recent extension of the frequently used Unipept metaproteome analyzer, which now not only maps the LCA at the peptide level but also annotates peptides with GO terms and Enzyme Commission (E.C.) numbers. This functionality offers an alternative to other commonly used protein-based tools, such as MEGAN (Metagenome Analyzer) [65], eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) [66] and KEGG (Kyoto Encyclopedia of Genes and Genomes) [67]. Of note, since Unipept works at the peptide level, it simplifies data analysis by side-stepping the need for BLAST searches, especially for sample-specific genomic databases. It needs to be mentioned, however, that, to our knowledge, no validation study comparing peptide-level and protein-level annotation has been published. Despite this, both are frequently used, and there is no consensus on this point in the field of metaproteomics.
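The LCA mapping described above can be illustrated with a short Python sketch. It assumes that every taxon in which a peptide occurs is given as a lineage listed from the root to the most specific rank. The function name `lowest_common_ancestor` and the input format are hypothetical and chosen for this example only; they are not the interface of Unipept or MEGAN.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None.

    lineages: list of lineages, each a list of taxon names ordered from
              the root to the most specific rank (hypothetical format).
    """
    if not lineages:
        return None
    lca = None
    # Walk down all lineages in parallel and stop at the first rank
    # where they disagree (or where the shortest lineage ends).
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            lca = taxa_at_rank[0]
        else:
            break
    return lca


# A peptide shared by two genera of the same family resolves to the family.
e_coli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia",
          "Escherichia coli"]
k_pneumoniae = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                "Enterobacterales", "Enterobacteriaceae", "Klebsiella",
                "Klebsiella pneumoniae"]
shared_peptide_lca = lowest_common_ancestor([e_coli, k_pneumoniae])
```

The more organisms share a peptide, the higher the LCA moves in the tree, which is why the information loss noted above grows with sequence conservation.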
This report describes the taxonomic composition and functional processes of the gut microbiome of patients during the course of MRE gut colonization. Further improvements in data analysis strategies and study designs are needed to explore the processes and interactions in the microbiome and the host in more detail. Such improvements would help to elucidate the mechanisms of microbiome-provided colonization resistance against multidrug-resistant pathogens (e.g., bacteriocins) and the influence of the microbiome on disease development following transplantation (e.g., graft versus host disease) or on chemotherapy efficacy. We are confident that with new technology and software, most of the challenges will eventually be solved, enabling future studies to move from merely describing changes in taxonomic and functional composition to revealing significant protein-centric molecular and functional processes.