Metaproteomic and 16S rRNA Gene Sequencing Analysis of the Infant Fecal Microbiome

A metaproteomic analysis was conducted on the fecal microbiome of eight infants to characterize global protein and pathway expression. Although mass spectrometry-based proteomics is now a routine tool, analysis of the microbiome presents specific technical challenges, including the complexity and dynamic range of member taxa, the need for well-annotated metagenomic databases, and high inter-protein sequence redundancy and similarity. In this study, an approach was developed for assessment of biological phenotype and metabolic status, as a functional complement to DNA sequence analysis. Fecal samples were prepared and analysed by tandem mass spectrometry and a homology-based meta-clustering strategy was used to combine peptides from multiple species into representative proteins. In total, 15,250 unique peptides were sequenced and assigned to 2154 metaclusters, which were then assigned to pathways and functional groups. Differences were noted in several pathways, consistent with the dominant genera observed in different subjects. Although this study was not powered to draw conclusions from the comparisons, the results obtained demonstrate the applicability of this approach and provide the methods needed for performing semi-quantitative comparisons of human fecal microbiome composition, physiology and metabolism, as well as a more detailed assessment of microbial composition in comparison to 16S rRNA gene sequencing.

The infant microbiome shows dynamic changes from birth until around three years of age, when it approximates that of the adult [2,12]. After oxygen in the gut is consumed by pioneering microbes in the first days of life, anaerobic species of Bifidobacterium and Bacteroides typically dominate in breastfed, vaginally born healthy infants [1, 2,[9][10][11][12]. In these infants, the early colonizers have been shown to be vertically transmitted from the mother [9,12]. C-section delivery, which limits normal vertical transfer during birth, can result in several months delay in this profile, which is temporarily substituted by higher levels of facultative anaerobes and skin microbes, such as Propionibacterium and

Results
Metaproteomics analysis, based on shotgun sequencing of the enriched bacterial pellets, resulted in the identification of approximately 3500 unique bacterial peptide sequences in each sample and 14,216 overall (Table 1). These identifications were based on 40,924 independent spectra. By comparison, the number of host peptide spectra detected represents 18% of the total, which is the best estimate of the relative proportion of host protein in the samples.
Using the identified bacterial peptides, insights can be gained into the bacterial genera/species distribution as well as the functionality of the microbiome. The former was addressed first and a total of 253 different bacterial species could be detected by interrogating the proteomics data (Supplemental Tables S1 and S2). To evaluate these results, a more traditional technique, pyrosequencing, was applied to the same samples which led to the identification of 30 different species (Supplemental Table S3). A comparison was made between 16S rRNA gene sequencing and proteomics in terms of the commonly identified bacterial taxa and their relative abundance. The limited resolution of 16S rRNA analysis is well known [19,20], but the more abundant genera were readily detected by both methods, which allowed for a comparison of measured rank order and/or abundance. In Figure 1, the top 12 bacterial genera and one bacterial family were compared with a heat map. 1 Useful annotation is based on association of a characterized gene name. 2 Spectral counts are the number of spectra assigned to a peptide sequence. 3 Unique peptide sequences can be assigned to only a single protein group. 4 Metaclusters are created by clustering peptides sharing the same gene name and/or peptide evidence and/or the same Uniref50 identifier, independent of species assignment.

Bacterial
With useful annotation 1 32,469 10,675 1,033 Without useful annotation 8,455 3,541 889 Human With useful annotation 8,951 1,034 212 1 Useful annotation is based on association of a characterized gene name. 2 Spectral counts are the number of spectra assigned to a peptide sequence. 3 Unique peptide sequences can be assigned to only a single protein group. 4 Metaclusters are created by clustering peptides sharing the same gene name and/or peptide evidence and/or the same Uniref50 identifier, independent of species assignment.
Using the identified bacterial peptides, insights can be gained into the bacterial genera/species distribution as well as the functionality of the microbiome. The former was addressed first and a total of 253 different bacterial species could be detected by interrogating the proteomics data (Supplemental Tables S1 and S2). To evaluate these results, a more traditional technique, pyrosequencing, was applied to the same samples which led to the identification of 30 different species (Supplemental Table S3). A comparison was made between 16S rRNA gene sequencing and proteomics in terms of the commonly identified bacterial taxa and their relative abundance. The limited resolution of 16S rRNA analysis is well known [19,20], but the more abundant genera were readily detected by both methods, which allowed for a comparison of measured rank order and/or abundance. In Figure 1, the top 12 bacterial genera and one bacterial family were compared with a heat map. Data is summarized at the genus level, except for the family Enterobacteriaceae which incorporates multiple genera to simplify comparisons. Values are colored in shades of green to yellow to red, indicating low, medium and high abundance, respectively. Actual read counts and spectral counts are also provided. Subjects are identified by a letter code here and in other figures and tables (E, T, P, N, M, C, B, R).
Multiple genera (Enterobacter/Escherichia/Shigella for pyrosequencing and Klebsiella/Escherichia/Citrobacter and 11 other genera for proteomics) were combined under their family Enterobacteriaceae, to simplify comparison, since these genera did not match well between methods, potentially due to the high level of homology of 16S rRNA gene sequences within this family. Read counts were used as the metric for pyrosequencing and spectral counts for proteomics. The scale of the two measures was similar and the general distribution of the genera was comparable. For example, subject R stood out as it was nearly completely lacking, by both measures, the genus PYROSEQUENCING (Read count) PROTEOMICS (Spectral count)  1  493  2396 0  0  5  0  0  2  135  758  2  2  0  2  6   Bacteroides  246  1935 0  0  9  0  0  0  230  341  213  13  24  10  19  14   Actinomyces  0  42  0  0  0  0  0  0  1  23  11  2  3  2  2  3 Haemophilus Multiple genera (Enterobacter/Escherichia/Shigella for pyrosequencing and Klebsiella/Escherichia/ Citrobacter and 11 other genera for proteomics) were combined under their family Enterobacteriaceae, to simplify comparison, since these genera did not match well between methods, potentially due to the high level of homology of 16S rRNA gene sequences within this family. Read counts were used as the metric for pyrosequencing and spectral counts for proteomics. The scale of the two measures was similar and the general distribution of the genera was comparable. For example, subject R stood out as it was nearly completely lacking, by both measures, the genus Bifidobacterium which was replaced by Enterobacteriaceae in this antibiotic-treated subject. Similarly, the genus Parabacteroides was on par with Bifidobacterium in subject P (C-section born, and vaccinated just prior to fecal sample collection) while Bacteroides showed elevated abundance in subject T. Both methods mostly agreed in their profile of the monozygotic twins, N and M, including the observation that subject M had relatively lower levels of Veillonella and Lactobacillus than subject N. The Spearman rank order correlation between the two datasets is 0.719 while the Pearson correlation is 0.795, further demonstrating a close match in relative quantitation of taxonomic composition by the two methods.
Although the two methods were consistent in taxonomic profiling, there were many more cases of low representation or missing data in the pyrosequencing data. Furthermore, these 12 genera and one family accounted for most of the pyrosequencing data, while the proteomic data identified many other genera (see "unknown/others" in Figure 1). In total, 65 additional genera were identified only through proteomics (Supplemental Table S2).
The composition of each individual microbiome was clearly unique at the genus level. While it is difficult to resolve below this level with pyrosequencing, the metaproteomics data could resolve at the species and even strain level (Supplemental Table S4). Since Bifidobacterium was the dominant genus in all but one subject, the species distribution within that genus was examined in each subject ( Figure 2). A high degree of heterogeneity was observed, although the twins had, by far, the most similar profiles. The dominant species in most subjects was B. breve, which increased in representation with age. B. bifidum was the second most common species in the monozygotic twins, while B. longum subsp. infantis was dominant in one of the youngest subjects (C), and B. pseudocatenulatum was most common in subject P, the youngest subject analysed and born by C-section.
the genus Parabacteroides was on par with Bifidobacterium in subject P (C-section born, and vaccinated just prior to fecal sample collection) while Bacteroides showed elevated abundance in subject T. Both methods mostly agreed in their profile of the monozygotic twins, N and M, including the observation that subject M had relatively lower levels of Veillonella and Lactobacillus than subject N. The Spearman rank order correlation between the two datasets is 0.719 while the Pearson correlation is 0.795, further demonstrating a close match in relative quantitation of taxonomic composition by the two methods.
Although the two methods were consistent in taxonomic profiling, there were many more cases of low representation or missing data in the pyrosequencing data. Furthermore, these 12 genera and one family accounted for most of the pyrosequencing data, while the proteomic data identified many other genera (see "unknown/others" in Figure 1). In total, 65 additional genera were identified only through proteomics (Supplemental Table S2).
The composition of each individual microbiome was clearly unique at the genus level. While it is difficult to resolve below this level with pyrosequencing, the metaproteomics data could resolve at the species and even strain level (Supplemental Table S4). Since Bifidobacterium was the dominant genus in all but one subject, the species distribution within that genus was examined in each subject ( Figure 2). A high degree of heterogeneity was observed, although the twins had, by far, the most similar profiles. The dominant species in most subjects was B. breve, which increased in representation with age. B. bifidum was the second most common species in the monozygotic twins, while B. longum subsp. infantis was dominant in one of the youngest subjects (C), and B. pseudocatenulatum was most common in subject P, the youngest subject analysed and born by C-section. In Table 2, the number of bacterial strains identified by one or more unique spectra for each subject is provided. If relying on as little as one spectrum or peptide, between 45 and 99 strains could be detected in each subject. If five spectra per strain was used as a threshold for higher confidence, between 9 and 33 strains could be detected in each infant, although the absolute numbers detected can also be affected by the amount of sample analyzed which can differ slightly between subjects. Patterns of relative strain abundance were similar to those observed at the genus level, with Bifidobacterium representing the largest group in most infants, twins presenting similar distributions and the infant treated with antibiotics having an altered distribution ( Figure 3). In Table 2, the number of bacterial strains identified by one or more unique spectra for each subject is provided. If relying on as little as one spectrum or peptide, between 45 and 99 strains could be detected in each subject. If five spectra per strain was used as a threshold for higher confidence, between 9 and 33 strains could be detected in each infant, although the absolute numbers detected can also be affected by the amount of sample analyzed which can differ slightly between subjects. Patterns of relative strain abundance were similar to those observed at the genus level, with Bifidobacterium representing the largest group in most infants, twins presenting similar distributions and the infant treated with antibiotics having an altered distribution ( Figure 3).
In addition to taxonomic classification, the functionality of the microbiome is another important aspect that can be investigated by proteomics. As outlined above, due to the high complexity of these samples and the different genera, species and strains detected in each subject, comparison of the same sequence across subjects would not be productive. Instead, peptides representing the equivalent protein in each species were clustered together, based on sharing the same gene name and/or peptide evidence and/or the same Uniref50 identifier. The resulting protein metaclusters allowed each microbiome to be treated as a set of common functional features, providing higher power to the analysis. As a result, changes in enzyme pathway representation could be more readily detected. In this study, the total number of microbial metaclusters detected was 1922 (Table 1). An additional 212 host metaclusters were found, representing 11% of the total.   In addition to taxonomic classification, the functionality of the microbiome is another important aspect that can be investigated by proteomics. As outlined above, due to the high complexity of these samples and the different genera, species and strains detected in each subject, comparison of the same sequence across subjects would not be productive. Instead, peptides representing the equivalent protein in each species were clustered together, based on sharing the same gene name and/or peptide evidence and/or the same Uniref50 identifier. The resulting protein metaclusters allowed each microbiome to be treated as a set of common functional features, providing higher power to the analysis. As a result, changes in enzyme pathway representation could be more readily detected. In this study, the total number of microbial metaclusters detected was 1922 (Table 1). An additional 212 host metaclusters were found, representing 11% of the total.
Of the bacterial metaclusters, only 1033 (54%) were associated with a useful annotation. However, when a subset of the remaining individual metaclusters was verified manually, they were all successfully annotated using other databases (e.g., RefSeq, KEGG Pathway, etc.), and mapped to genes included in other metaclusters, suggesting that most or all of the metaclusters could likely be annotated successfully and further combined. Improvements to the metaclustering algorithm may be achieved by either using a different homology database (e.g., COG) or combining multiple databases to provide successful annotation and better clustering potential. Of the bacterial metaclusters, only 1033 (54%) were associated with a useful annotation. However, when a subset of the remaining individual metaclusters was verified manually, they were all successfully annotated using other databases (e.g., RefSeq, KEGG Pathway, etc.), and mapped to genes included in other metaclusters, suggesting that most or all of the metaclusters could likely be annotated successfully and further combined. Improvements to the metaclustering algorithm may be achieved by either using a different homology database (e.g., COG) or combining multiple databases to provide successful annotation and better clustering potential.
In order to compare the metaproteome of each subject, pairwise scatter plots were generated, based on the number of detected spectral counts in each metacluster ( Figure 4). As expected, there was a high correlation between the two monozygotic twins (R 2 = 0.84). Two unrelated males, both breastfed, were highly similar but less so than the twins (R 2 = 0.60). However, when the antibiotic-treated subject was compared to an untreated subject, the correlation was low (R 2 = 0.29). In order to compare the metaproteome of each subject, pairwise scatter plots were generated, based on the number of detected spectral counts in each metacluster ( Figure 4). As expected, there was a high correlation between the two monozygotic twins (R 2 = 0.84). Two unrelated males, both breastfed, were highly similar but less so than the twins (R 2 = 0.60). However, when the antibiotictreated subject was compared to an untreated subject, the correlation was low (R 2 = 0.29). Because the antibiotic-treated subject was so different from the seven non-treated subjects in terms of both taxonomy and metacluster spectral counts, we chose this comparison to evaluate a method for comparing metabolic pathway expression. This was accomplished by inputting metaclusters with useful annotation into the KEGG Pathway mapping tool [36,37] (Figure 5). The most affected pathways were subsequently evaluated in more detail ( Figure 6).  Because the antibiotic-treated subject was so different from the seven non-treated subjects in terms of both taxonomy and metacluster spectral counts, we chose this comparison to evaluate a method for comparing metabolic pathway expression. This was accomplished by inputting metaclusters with useful annotation into the KEGG Pathway mapping tool [36,37] (Figure 5). The most affected pathways were subsequently evaluated in more detail ( Figure 6).
In order to compare the metaproteome of each subject, pairwise scatter plots were generated, based on the number of detected spectral counts in each metacluster ( Figure 4). As expected, there was a high correlation between the two monozygotic twins (R 2 = 0.84). Two unrelated males, both breastfed, were highly similar but less so than the twins (R 2 = 0.60). However, when the antibiotictreated subject was compared to an untreated subject, the correlation was low (R 2 = 0.29). Because the antibiotic-treated subject was so different from the seven non-treated subjects in terms of both taxonomy and metacluster spectral counts, we chose this comparison to evaluate a method for comparing metabolic pathway expression. This was accomplished by inputting metaclusters with useful annotation into the KEGG Pathway mapping tool [36,37] (Figure 5). The most affected pathways were subsequently evaluated in more detail ( Figure 6).  The urea cycle enzymes were well represented in untreated subjects, but only one enzyme was detected in the antibiotic-treated subject. A very different picture is observed for fatty acid biosynthesis, which was upregulated in the antibiotic-treated subject. A mixed result was obtained for the TCA cycle, which lies at the crossroads of the urea cycle and fatty acid biosynthesis and may reflect the differences in microbial energy metabolism between antibiotic-treated and untreated subjects. The urea cycle enzymes were well represented in untreated subjects, but only one enzyme was detected in the antibiotic-treated subject. A very different picture is observed for fatty acid biosynthesis, which was upregulated in the antibiotic-treated subject. A mixed result was obtained for the TCA cycle, which lies at the crossroads of the urea cycle and fatty acid biosynthesis and may reflect the differences in microbial energy metabolism between antibiotic-treated and untreated subjects.

Discussion
In this study, we evaluated the potential benefit of metaproteomics to the field of human microbiome research. Currently, the majority of microbiome studies are based on high throughput DNA sequencing methods, which have revealed the complex nature of the taxa involved, but with only limited insights on functional activity.
In a direct comparison of the taxonomies detected by each method, the mass spectrometry data essentially replicated the phylogenetic composition at the bacterial genus level, providing a similar but potentially more in-depth taxonomic map of each subject than obtained with 16S rRNA sequencing. Bifidobacteria represented the largest group in all infants, except for the infant treated with antibiotics. Bifidobacteria are typically abundant in the infant microbiome since they are adapted to use human milk oligosaccharides (HMO) as their source of carbon [38]. In addition, high levels of Bacteroides/Parabacteroides can be seen to distinguish several of the non-treated subjects. These genera are known to be well adapted to feed on mucosal glycans present in the gut and share pathways for feeding on HMO [39]. With either approach, it was clear that the antibiotic-treated subject had a

Discussion
In this study, we evaluated the potential benefit of metaproteomics to the field of human microbiome research. Currently, the majority of microbiome studies are based on high throughput DNA sequencing methods, which have revealed the complex nature of the taxa involved, but with only limited insights on functional activity.
In a direct comparison of the taxonomies detected by each method, the mass spectrometry data essentially replicated the phylogenetic composition at the bacterial genus level, providing a similar but potentially more in-depth taxonomic map of each subject than obtained with 16S rRNA sequencing. Bifidobacteria represented the largest group in all infants, except for the infant treated with antibiotics. Bifidobacteria are typically abundant in the infant microbiome since they are adapted to use human milk oligosaccharides (HMO) as their source of carbon [38]. In addition, high levels of Bacteroides/Parabacteroides can be seen to distinguish several of the non-treated subjects. These genera are known to be well adapted to feed on mucosal glycans present in the gut and share pathways for feeding on HMO [39]. With either approach, it was clear that the antibiotic-treated subject had a completely different profile from the other subjects, with the Enterobacteriaceae family mostly replacing Bifidobacterium. However, proteomics revealed that the antibiotic-induced increase of Enterobacteriaceae was more specifically associated with two Klebsiella species, which are well-known opportunistic pathogens [40].
The metaproteomic analysis also allowed assessing the species distribution within the genus Bifidobacterium, which not only revealed a high degree of heterogeneity within each subject, but also different profiles between subjects. The species B. breve was most dominant in the majority of infants, followed by other typical infant-type bifidobacterial species, namely B. longum subsp. infantis, B. longum subsp. longum and B. bifidum. Although the number of infants was low, the proportion of B. breve increased with age, a pattern that was observed previously in exclusively breastfed infants, and in infants receiving a combination of breastfeeding and infant formula [41]. Consistent with previous observations [42], B. longum subsp. infantis represents a majority only in breastfed younger infants, e.g., due to lacto-N-neotetraose, a HMO component, which favors this HMO-adapted species as well as B. breve [43,44] over those that use mucus, like Bacteroides [39].
Proteomics also provided additional information on lower abundance genera as well as significant information at the species and strain level. A total of 183 different taxa were identified by proteomics with strain-level assignments, with an average of 71 in each subject, representing 147 different species and 76 different genera. Of course, these conclusions rely on the assumption that the database used was sufficiently complete. By comparison, pyrosequencing identified only 13 genera overall, with a range of 2 to 8 per subject and no information at the species or strain level. The additional detail afforded by the metaproteomics analysis should be instrumental to better understanding environmental perturbations such as the effects of disease, antibiotic treatments and diet. While this study was too small to make many generalizations about observed differences, the high degree of similarity of the twins at the strain level and the overall high correlation between proteomics and pyrosequencing at the genus level demonstrate that the data is far from random. Interestingly, the profile of the twins closely matches one other subject (B) but less so the remaining subjects. The antibiotic-treated subject differs greatly from the others, but several other subjects are similarly unique, likely for other reasons, including age, mode of delivery, diet or other factors not accounted for in the available clinical information. In future metaproteomics studies of the microbiome, it appears that performing 16S rRNA pyrosequencing on the same samples would offer little benefit. Only a limited number of organisms were identified by this method, most of which were already part of the Human Microbiome Project Database (26 out of 30), and those not present in the publicly available database only represented <0.002% of the total identified spectra.
In addition to the taxonomy information provided by proteomics, a quantitative picture was generated at the enzyme level, which was used to assess pathway and, therefore, metabolic activity. Due to the diversity of bacteria in the gut, each unique protein had a low level of representation in the metaproteomic data, making it difficult to draw conclusions about the status of a single species. However, by grouping functionally similar proteins from each species, metaclusters were created that provided a means of comparing subjects, treating the microbiome of each as the average of all representative functional features. With this approach, it was possible to make high level ( Figure 4) as well as detailed comparisons (Figures 5 and 6).
The highest degree of correlation at the metacluster level, observed for the twins (Figure 4), confirmed the strong influence of host genetics and shared environment on the microbiome [45]. In contrast, the antibiotic-treated subject showed the lowest correlation at the metacluster level in a pairwise comparison to a non-treated subject (Figure 4). To assess the differences in metabolic activity of this infant compared to the others, we assigned the metaclusters to enzyme pathways and functional groups. This pathway analysis revealed that untreated infants showed high levels of urea cycle enzymes (Figures 5 and 6). Bifidobacterium, dominant in this group, is well adapted to utilize urea, abundant in milk, and turn it into a source of nitrogen usable for the host [46]. The antibiotic-treated subject was high in fatty acid biosynthesis enzymes, potentially due to the inability of Klebsiella to efficiently metabolize milk sugars and therefore relying on milk fat instead. The TCA cycle also showed distinct and consistent differences, with nitrogen metabolism enzymes (gdhA, gltB) upregulated in the untreated subjects and carbon metabolism enzymes (sucA, sucB, sdhA, mdh) upregulated in the treated subject.
Further improvements may yield even more comprehensive results. Many of the lower abundance species and metaclusters were represented by small numbers of spectra, limiting confidence in both identity and quantitation. This is mainly due to limitations of the instrumentation to fully sample all peptide ions with each analysis and can be improved by additional fractionation, repeated analysis of each sample, and/or using even more advanced mass spectrometers. Of the metaclusters identified, many were not associated with a meaningful function when using the UniRef database. Improving the quality of the metacluster functional annotation seems achievable by refining the clustering criteria and/or using databases providing better functional annotation [28,47,48] and could potentially bring a deeper understanding of the various pathways and biological functions involved. Finally, cost and throughput should be noted. Mass spectrometry-based approaches involve more effort and take longer than DNA-based analyses, so there is a trade-off for the information obtained. Fecal samples do require more processing steps than most other proteomics samples and fractionation of the digested proteins was necessary to improve the number of peptides detected, thus also increasing the number of mass spectrometry analyses for each sample. These requirements will certainly impact the size of such studies, but the additional insight obtained from the relative quantitation of hundreds of protein metaclusters in each sample, which cannot be otherwise obtained, will likely offset the effort required.

Sample Collection
Fecal samples were obtained from 8 infants between 2 to 5 months of age and living in the Netherlands. Parents were asked to voluntarily participate and information on gestational age, age at collection, gender, feeding mode (breastfed/formula fed), mode of birth and use of concomitant medication was collected (Table 3). Written informed consent was obtained from all participants. Fecal samples were collected by the parents from diapers into 10 ml stool containers (Greiner Bio-One, Kremsmünster, Austria) and kept at 4 • C. Within one hour after collection, laboratory assistants prepared homogenized stool aliquots (5-10) in 2 mL centrifuge tubes, which were subsequently kept and transported on dry-ice for analysis by either 16S rRNA gene sequencing or proteomics. Upon arrival at the respective laboratories and prior to evaluation, samples were kept at −80 • C.

16S rRNA Gene Sequencing and Bioinformatics
Fecal aliquots were thawed on ice and 20 to 60 mg of each aliquot was mixed with 450 µL DNA extraction buffer (100 mM Tris-HCl, 40 mM EDTA, pH 9.0) and 50 µL of 10% sodium dodecyl sulfate. Phenol-chloroform extractions combined with mechanical lysis of bacterial cells by bead-beating was performed as described by Matsuki et al. [49] except that extracted DNA was re-suspended in 0.1 mL of TE (10 mM Tris-HCl, 1 mM EDTA, pH 8.0). The V3-V5 regions of the 16S rRNA gene were amplified using forward primer 357F, and a 'bifidobacteria-optimised' reverse primer 926Rb [17]. The reverse primers included a 12 base-pair error-correcting Golay barcode. PCR was carried out in quadruplicate as previously described [18]. Replicate amplicons were pooled and purified and pyrosequencing was carried out on a 454 GS FLX (Roche, Branford, CT, USA) following the Roche Amplicon Lib-L protocol. The 'Quantitative Insights Into Microbial Ecology' (QIIME) v1.9.0 package was used to analyse sequence data [50]. Sequence alignment was carried out using the SILVA rRNA database (SSU_REF111) [51] as reference. Chimera filtering, clustering at 97% sequence identity into operational taxonomic units, and taxonomic assignment were performed using the USEARCH and UCLUST algorithms [52,53]. Rarefaction was performed and singletons were removed. The resulting taxonomic compositions (read counts and relative abundances) were summarized at the genus level.

Proteomics Sample Preparation and Analysis
Each 0.5 gram of test sample was resuspended in 50 mM sodium phosphate buffer (pH8), 0.01% acid-labile surfactant (RapiGest SF Surfactant, Waters, Milford, MA, USA) and placed on a horizontal platform shaker for 10 min at 100 oscillations per min at 20 • C. Remaining large particulate was removed from the suspension with 4 cycles of low-speed centrifugation at 200× g for 15 min. After each cycle, the supernatant was collected and kept at 4 • C. At the end, the pooled supernatants were centrifuged at 16,000× g for 15 min at 20 • C to collect bacteria. The bacterial pellets were then resuspended in 25 mM Tris-HCl, 150 mM NaCl, 6 M guanidine hydrochloride, pH 7.6 + 0.01% RapiGest and sonicated for 10 min, pulse ON for 30 s and OFF for 10 s. After denaturation, samples were reduced with 5mM tris(2-carboxyethyl)phosphine (Thermo Fisher, Waltham, MA, USA), diluted to decrease the guanidine hydrochloride concentration to 1M, and the protein concentration determined by a Bicinchoninic acid assay (Pierce, Waltham, MA, USA). Based on the determined concentration, trypsin was added overnight, followed by a second addition and a 4 h incubation, both at 37 • C (1:50 (w:w) enzyme:protein ratio each addition, Promega, Madison, WI, USA). Samples were acidified with 0.1% trifluoroacetic acid (Sigma, St. Louis, MO, USA), desalted using Oasis ® HLB plates (Waters) and vacuum evaporated. Sample study peptides (130 µg) were fractionated using strong cation exchange (SCX) chromatography. Elution was performed using a linear salt gradient and peptides were collected into 8 fractions and freeze dried. SCX fractionation performance was monitored by injecting a peptide mix (Leucine Enkephalin, Oxytocin, Angiotensin I, [Arg8] Vasopressin, [Val4] Angiotensin III, α-Endorphin, in-house preparation) at the beginning and end of each SCX run. Following freeze drying of the SCX fractions, they were resuspended in 1% trifluoroacetic acid and desalted using Oasis ® HLB plates. The eluate from the desalting step was split and transferred into two 96-well plates, one plate for LC-MS/MS analysis and the other plate as a back-up. Both plates were vacuum evaporated and stored at −20 • C until analysis by LC-MS/MS.
For protein identification and metaclustering, LC-MS/MS spectra were submitted to a database search using Mascot software v2.2.06 (Matrix Science, Columbia, SC, USA), with search parameters: enzyme = trypsin, allowed missed cleavages = 2, peptide tolerance = 20 ppm, MS/MS tolerance = 0.05 Da, variable modifications = Deamidation (N), Oxidation (M). A custom database was used which included all bacteria from the Human Microbiome Project [54] (downloaded from Uniprot 20140605), plus additional taxonomies identified on the same 8 fecal samples via 16S rRNA gene sequencing (downloaded 20140605), and Uniprot Human (reviewed entries only, downloaded 20140120). A decoy reverse database was used to evaluate the false positive error rate, and Peptide/Protein Teller was used to derive the simplest list of proteins to explain observed peptides with False Discovery Rate = 1.6% at the protein level. Identified proteins were grouped into metaclusters if they shared the same gene name and/or peptide evidence and/or the same Uniref50 identifier [55]. For the subsequent analyses, only organisms/genera/metaclusters/proteins with unique peptide evidence were considered. A peptide was considered unique if its sequence could only be assigned to a single organism, genus, metacluster or protein. The unique nature of a peptide was determined independently at the corresponding level. Conversely, an organism, genus, metacluster or protein were considered identified by a single unique peptide sequence, although the level of confidence in each identification increased with the number of corresponding unique and non-unique peptide assignments.
Protein metaclusters with peptide redundancy removed at the genus level, were submitted to pathway analysis using the 'KEGG Mapper-Search&Color Pathway' mapping tool [36,37], and comparative analysis was performed on the different infant categories (e.g., treated with antibiotic, feeding mode, etc.). Pathways with obvious differences between infant categories were evaluated further by extracting the corresponding pathway genes and spectral count data.

Availability of Data and Materials
Raw pyrosequencing data for all samples are available in the Sequence Read Archive under accession number PRJEB19801. The mass spectrometry proteomics data have been deposited into the ProteomeXchange Consortium via the PRIDE [56] partner repository with the dataset identifiers PXD006033 and 10.6019/PXD006033.

Ethics Approval and Consent to Participate
Prior to inclusion in the study, written informed consent was obtained from the parents of all subjects to evaluate fecal microbiota composition and to include relevant subject metadata (age, gender, breast or formula fed, vaginal or C-section delivery, and treatments proximal to sample collection (antibiotics, vaccination)).
The project proposal, including the Informed Consent text and recruitment materials, was submitted to an accredited Medical Research Ethics Committee (Independent Review Board Nijmegen (IRBN)) for confirmation that this project did not need a formal ethical committee review according to Dutch law.

Conclusions
Metaproteomics data yielded far more information than what was possible from 16S rRNA gene sequencing of the same samples. More organisms were identified with better resolution in each subject. In addition, metabolic pathway information was obtained that can provide a deeper insight into the physiological state of the microbiome, potentially providing diagnostic insights that cannot be derived from DNA sequencing analysis alone. This approach may be particularly important from a diagnostic perspective, given that changes to the proteome may occur more rapidly or independently from taxonomy distribution changes in response to challenges. Protein and pathway changes are also much more readily reconciled with changes to metabolite concentrations, thus opening the door to a better understanding of the impact of the microbiome on the metabolome. For these reasons, fecal metaproteomics is a powerful technology that may improve characterization of the gut microbiome and catalyzes a further understanding of its impact on health and disease outcomes. Author Contributions: H.W., J.K. and D.C. conceived and designed the experiments. L.C., H.W., S.T., J.W.G., A.T. and D.C. analyzed and interpreted the data and were major contributors in writing the manuscript. J.P. analyzed the proteomics data. All authors read and approved the final manuscript.