Metaproteomic Analysis of an Oral Squamous Cell Carcinoma Dataset Suggests Diagnostic Potential of the Mycobiome

Oral squamous cell carcinoma (OSCC) is the most common head and neck malignancy, with an estimated 5-year survival rate of only 40–50%, largely due to late detection and diagnosis. Emerging evidence suggests that the human microbiome may be implicated in OSCC, with oral microbiome studies putatively identifying relevant bacterial species. As the impact of other microbial organisms, such as fungi and viruses, has largely been neglected, a bioinformatic approach utilizing the Trans-Proteomic Pipeline (TPP) and the R statistical programming language was implemented here to investigate not only bacteria, but also viruses and fungi in the context of a publicly available, OSCC, mass spectrometry (MS) dataset. Overall viral, bacterial, and fungal composition was inferred in control and OSCC patient tissue from protein data, with a range of proteins observed to be differentially enriched between healthy and OSCC conditions, of which the fungal protein profile presented as the best potential discriminator of OSCC within the analysed dataset. While the current project sheds new light on the fungal and viral spheres of the oral microbiome in cancer in silico, further research will be required to validate these findings in an experimental setting.


Introduction
Oral cancer carries a heavy global burden, being the most common head and neck malignancy worldwide, with an observed 377,713 new cases in 2020, and one of the leading causes of death in India among males [1,2]. Oral squamous cell carcinoma (OSCC) is the predominant form of oral cancer, with many aetiological factors, including age, alcohol, tobacco use, and the traditional chewing of areca nuts in regions such as South Central Asia [3][4][5].
Among OSCC patients, the overall 5-year survival rate is an estimated 40-50% [6]. However, intervention and treatment in patients presenting with early stage oral cancer show drastically improved outcomes, with an estimated 78-92% 3-year survival rate [7]. As with all cancers, proliferation and apoptotic pathways play a key role in OSCC [8], with recent evidence demonstrating the ability of oral bacterial microfilms to modulate cell proliferation [9]. This, coupled with a growing body of evidence, suggests that the oral microbiome may be implicated in the development of OSCC [10]. Thus, these oral microorganisms may exhibit potential as biomarkers to support OSCC disease diagnosis.
The association between microorganisms and carcinogenicity was first demonstrated in the 1990s through Helicobacter pylori [11], with modern estimates now suggesting H. pylori is causally related to 60-90% of all gastric cancers [12]. Microbiome research has suggested three likely primary mechanisms for the promotion of microbial carcinogenesis, including inflammation and activation of Toll-like receptors; secretion of microbial genotoxins, such as colibactin; and alteration of metabolic states [13]. With the advent of next-generation sequencing, many researchers are now able to probe the associations between cancer and the microbiome using culture-free metagenomic approaches, including, in particular, 16S rDNA sequencing. For OSCC, many different species, including periodontitis-related Figure 1. Distribution of microbial species between NAT and OSCC conditions: Venn diagrams displaying number of viral, bacterial, and fungal species identified by at least two unique proteins and their allocation to NAT, OSCC, or both conditions based on log-fold change cut-offs. Species inferred by proteins with a mean intensity log2 fold-change of ≤−1 or ≥1 were assigned to NAT and OSCC conditions, respectively, with those between this range being considered present in both conditions.
Only two main kingdoms were identified in both NAT and OSCC conditions for viruses: Bamfordvirae and Heunggongvirae, as well as an "Unclassified" category. The percentage composition for these were 43, 29, and 29, respectively, for NAT, and 17, 33, and 50, respectively, for OSCC (Supplementary Figure S2). Distribution of microbial species between NAT and OSCC conditions: Venn diagrams displaying number of viral, bacterial, and fungal species identified by at least two unique proteins and their allocation to NAT, OSCC, or both conditions based on log-fold change cut-offs. Species inferred by proteins with a mean intensity log 2 fold-change of ≤−1 or ≥1 were assigned to NAT and OSCC conditions, respectively, with those between this range being considered present in both conditions. For fungi, eight phyla were observed in both NAT and OSCC conditions: Ascomycota, Basidiomycota, Mucoromycota, Chytridiomycota, Zoopagomycota, Microsporidia, Cryptomycota, and Blastocladiomycota. The percentage composition for these were 45,26,9,7,7,4,1, and 1, respectively, for NAT, and 46, 30, 9, 6, 5, 3, 1, and >1, respectively, for OSCC ( Figure 2).
Only two main kingdoms were identified in both NAT and OSCC conditions for viruses: Bamfordvirae and Heunggongvirae, as well as an "Unclassified" category. The percentage composition for these were 43, 29, and 29, respectively, for NAT, and 17, 33, and 50, respectively, for OSCC (Supplementary Figure S2).

Identification of Differentially Abundant Fungal Proteins
Benjamini-Hochberg statistical t-testing identified a total of 6 out of 97, 14 out of 201, and 196 out of 937 significantly differentially abundant viral, bacterial, and fungal proteins, respectively (adjusted p-value < 0.05) (Figure 3 and Supplementary Figure S3). Due to the low number of significant viral and bacterial proteins, reflecting the low viral and bacterial diversity seen in Figure 1, we will focus on the fungal microbiome. Of the top 30 most significantly differentially abundant proteins, 2 were enriched in the NAT condition, with the remaining 28 enriched in the OSCC condition ( Figure 3 and Table 1).    Here, a separation between NAT and OSSC samples was observed, with only four OSCC samples clustering together with the healthy NAT patient samples. Clustering for viral and bacterial proteins was similarly performed, although no clear separation was observed due to the low number of proteins (Supplementary Figure S4). Here, a separation between NAT and OSSC samples was observed, with only four OSCC samples clustering together with the healthy NAT patient samples. Clustering for viral and bacterial proteins was similarly performed, although no clear separation was observed due to the low number of proteins (Supplementary Figure S4).

Enrichment of Nucleocytoplasmic Large DNA Viruses in the Oral Virome
Regarding viruses, statistical testing identified only six viral proteins, which significantly differed in abundance between NAT and OSCC samples, with half of these being uncharacterised proteins and the other half corresponding to proposed replication and heat shock proteins. Four of the six proteins mapped back to members of the Phycodnaviridae family, a group of giant viruses referred to as nucleocytoplasmic large DNA viruses (NCLDVs) [35]. Microalgae serve as the natural hosts for NCLDVs, though many foods for human consumption use microalgae to fortify protein content and supplement nutrition [36]. The enrichment of these proteins in the OSCC condition suggest another possible association between OSCC and diet, as microalgae food safety, especially in the context of contaminants, is not well characterised [37]. However, this would require further investigation, given the low overall number of viral proteins observed here. Of note, no HPV proteins could be detected at all, remaining consistent with the screening by Huang et al. to ensure only HPV-negative samples were used in the analysed dataset.

Periodontal Pathogens and Opportunistic Bacteria Are Enriched in OSCC
Statistical testing of bacterial protein fold-changes revealed 14 proteins that were predominantly significantly enriched in OSCC samples. Twelve of these mapped back to individual species, with only one species being inferred by two proteins here. The majority of these are normal constituents of the human microbiome and have low virulence, but act as opportunistic pathogens in immunocompromised individuals. This includes Paracoccus yeei [38], Pseudomonas luteola [39], Staphylococcus epidermidis [40], Cardiobacterium hominis [41], and Staphylococcus lugdunensis [42]. These bacteria likely represent "passengers" in the "passenger-turning-driver" model [43]. Most notably, S. lugduenensis represents a normal human commensal, which flourishes during oral infection, expressing hemolysin virulence factors that promote further inflammatory states that may favour cancer growth [44][45][46]. In addition to these opportunistic pathogens, other inferred species include Variovorax paradoxus, Eubacterium minutum, and Microbacterium flavescens, which are notably enriched in periodontitis [47][48][49][50]. V. paradoxus, in particular, was the only species here inferred by two proteins with documented biofilm-formation phenotypes [51]. There is reasonable evidence suggesting periodontitis increases the risk of developing oral cancer [52], with one metaanalysis identifying a 2-5 fold increased oral cancer risk associated with periodontitis [53].

An Unexpected Fungal Diversity Is Observed in the OSCC Patient Samples
Current research on fungi within the oral microbiome has been largely eclipsed by its bacterial members, though advancements in omics approaches has renewed interest in this area, with high-throughput sequencing revealing a complex fungal microbiome, or mycobiome [54].
Existing literature suggests the oral mycobiome is predominated by fungi from the phylum Ascomycota, followed by Basidiomycota [30], and this trend was observed in the current study, where these two phyla predominated in the oral mycobiome composition in both NAT and OSCC tissue. However, the current study additionally identified substantial diversity from the additional phyla Mucoromycota, Zoopagomycota, Chytridioomycota, and Microsporidia, with the overall proportion of these phyla remaining relatively stable between NAT and OSCC.
Unexpectedly, the current results identified a greater number of fungal proteins and species, compared to both bacteria and viruses. Though fungi have been described as comprising <0.1% of the human microbiome, this estimate is based on cfu [55] and likely under-estimates the true fungal frequency. These cfu measurements require microbial cultures, and a large proportion of fungi from the human microbiome are unable to be cultured [56][57][58]. In addition, despite being numerically underrepresented, the generally larger cell size of fungal species has been speculated to contribute a proportionally larger amount of biomass [55]. It is also possible that the increased number of identified fungal proteins was due to the use of surgical antiseptics, such as chlorhexidine, povidone iodine, and, less commonly, alcoholic disinfectants [59][60][61]. Alcohols remain effective against bacteria, but demonstrate low efficacy against fungi and fungal spores [60,62]. Conversely, povidone iodine and chlorhexidine exhibit greater fungicidal activity, with povidone iodine demonstrating additional efficacy against spores [59,60,63,64]. Due to their prevalence, many experiments have demonstrated the efficacy of povidone iodine and chlorhexidine against Candida species specifically [64][65][66][67], likely explaining the lack of Candida species identified in the current study. Limited information, however, could be found regarding the efficacy of these antiseptics on other fungal genera, with the additional possibility that surface cleaning of the biopsy site is ineffective against intraepithelial fungal species that have penetrated the mucosal layer [68].

Fungal Proteins Implicate Pathogens Capable of Soft Tissue Damage
Following statistical testing, 196 fungal proteins were identified as being significantly differentially abundant between the NAT and OSCC conditions. While a large number of fungal species were inferred, many of these were poorly characterised, with no documented pathogenicity or documented interactions in the human oral cavity. Similar to identified bacteria, several inferred fungi were documented to be, or be associated with, opportunistic pathogens, including Verruconis gallopava [69], Syncephalastrum racemosum [70], and Dimargaris cristalligena [71,72], all of which were inferred to be more abundant in OSCC tissue. Some more notable and well-characterised fungal species inferred from proteins include Lichteimia corymbifera, Malassezia sympodialis, and Paracoccidioides brasiliensis. L. corymbifera is capable of causing mucormycosis fungal infection, which can lead to ulceration of the oral cavity [73,74], and was recently examined to be strongly associated with mobile tongue OSCC in a recent metagenomic study examining tumour tissue against non-tumor tissue controls [75]. Lichteimia species are one of the predominant causative agents of mucormycosis in Europe, with maxillo-facial and pulmonary infections as common clinical presentations [76,77]. Clinical patient studies have additionally observed mucormycosis to be more commonly associated with hematological malignancies, such as acute leukemia and lymphoma, though this is possibly a result of opportunistic infection rather than tumour initiation [77,78]. M. sympodialis is considered to be a normal commensal of the human oral microbiome [79], though a recent in vivo study utilising mouse tumour models demonstrated increased Malassezia abundance in pancreatic tumour mice compared to controls, with further knockout mice suggesting that tumour progression is driven by activation of the mannose-binding lectin-C3 complement cascade [80]. P. brasiliensis is a fungal yeast capable of paracoccidioidomycosism fungal infection, which may present as oral lesions and other soft-tissue damage [81,82]. In patients with both paracoccidioidomycosism and OSCC, these malignancies often occur in the same region or adjacent tissues, highlighting a potential role of this fungus in cancer aetiology [83]. It has been hypothesised that continuous stimulation of epithelial cells may predispose these cells to malignant transformation, and that these may persist due to fungal-impaired macrophage and natural killer cell activity [84]. This association, however, is not conclusive due to the overall low number of experimental studies examining this effect [84]. In a retrospective study examining patients diagnosed with both paracoccidioidomycosism and cancer, 62.5% presented lung tumours, with the majority of these being classified as squamous cell carcinoma [85]. While the authors conclude that paracoccidioidomycosism appears to increase the risk of cancer, particularly lung cancer, there is little research on whether paracoccidioidomycosism infection has a role in tumor initiation, or if it acts as a "passenger" following oncogenesis.
It is surprising to note that, among the inferred species enriched in OSCC, Candida albicans, a species that has been reported to dominate the OSCC mycobiome landscape [30], was not observed here. This is, again, likely due to antiseptic use during specimen collection, as C. albicans is observed to be susceptible to povidone iodine, chlorhexidine, and alcohol disinfectants, such as isopropanol [64][65][66][67]86].

Clustering of the Mycobiome Protein Profile Shows Diagnostic Potential
When compared to bacteria and viruses, unsupervised hierarchical clustering of fungal proteins identified in at least 50% of patients provided the greatest discrimination between NAT and OSCC samples. Apart from four OSCC samples, all NAT samples clustered together based on this fungal protein profile. This superior discrimination is likely achieved through the high identification rate of fungal proteins, potentially attributed to the use of chemical disinfectants that have biased the overall microbial composition. Nonetheless, it is still promising to identify that the oral mycobiome, which has been understudied, may be used as potential indicators of oral carcinogenesis.
Further research into the fungal composition of the oral cavity is needed, with investigations using unbiased samples being potentially capable of identifying novel biomarkers for the diagnosis of OSCC.

Data Collection
The selected HPV-negative MS dataset of head and neck squamous cell carcinoma (HNSCC; including OSCC) was downloaded from the Clinical Proteomic Tumour Analysis Consortium public repository (accessed August 2021), as reported by Huang et al. [31]. This portal, however, has since been retired, with the data now accessible in the Proteomic Data Commons repository with the identifier PDC000221 (https://pdc.cancer.gov/pdc/ study/PDC000221; accessed on 20 January 2022). Data-dependent acquisition (DDA) was used for the generation of this data.
Huang et al. [31] collected 109 treatment-naive primary tumours and matched blood samples from tumours of samples from mainly the oral cavity and the larynx, with few samples from the lip, hypopharynx, and otopharnyx. Sixty-six tumours had matched normal adjacent tissues (NATs). Specimen inclusion was based on the maximal percent in the pathology criteria and best weight, with clinical details of all samples provided in Huang et al. [31]: Table S1. One sample was excluded as it was HPV-positive. Information regarding erosive/productive phenotypes was not available. These samples were tandem mass tag (TMT)-labelled (11-plex) with a reference standard included in the first channel of each TMT set, produced by pooling prepared peptide solutions from 87 HNSCC and 50 healthy, normal, adjacent tissue (NAT) samples [31].
We have included only HPV-negative samples from the oral cavity for the current analysis, comprising 49 total patient samples, of which 23 had matched normal adjacent tissue (NAT) and 26 had tumour tissue only. While tumour stage information was available, analysis was performed on all cancer stages together, compared to non-cancer conditions, due to large differences in sample numbers between the tumour stage groups (Stage I: 4, Stage II: 12, Stage III: 10, and Stage IV: 23).
Viral and fungal reference proteomes were downloaded from UniProt (https://www. uniprot.org/proteomes; release number 2021_03, accessed on 2 June 2021) in FASTA format. Bacterial reference sequences were downloaded from the expanded Human Oral Microbiome Database (eHOMD) (version 9.1.4; updated 9 September 2020) [87,88]. Details of each can be seen in Table 2.
For fungal sequences, due to the large search space, CD-HIT software was implemented to perform clustering at 90% sequence identity [89]. An approximate 1.32-fold reduction in search space was achieved, with 8,370,376 initial fungal protein entries reduced to 6,326,765 protein clusters. An additional human reference proteome was downloaded from UniProt (Release 2021_04, 29 September 2021) and appended to each microbial database to improve false discovery rate (FDR) performance.

Trans-Proteomic Pipeline Analysis
A modified version of our recently developed generic protocol was implemented [90]. TPP (ver. 6.0.0) was used for primary analysis of the publicly available MS data (http: //tools.proteomecenter.org/TPP.php; accessed on 20 January 2022) [91]. TPP is a freely available platform for the complete analysis of MS data, including software for file conversion, database searching, peptide validation, and protein inference. Proprietary raw files were converted in TPP's MSConvert to mzML before database searching in TPP's Comet. Preset mass modifications of 229.162932 to both the peptide N-termini and lysine residues with additional clearing of the 125.5-131.5 m/z range were implemented to account for TMT-labelling. Defaults were also used for remaining Comet parameters, with an optimised peptide mass tolerance of 5 ppm implemented. The corresponding pepXML output files were subsequently analysed in TPP's PeptideProphet for peptide validation. Accurate mass binning implemented using ppm, a minimum peptide length of 9 amino acids, and known decoy hits were used to pin down the negative distribution (as detailed in [90]). TPP's Libra software was also utilised using the default values for TMT-11 channel labelling, with intensity values normalised against the reference standard. The PeptideProphet outputs were then analysed in ProteinProphet for protein inference. Each TMT reaction (comprising 24 fractions) searched against a particular database was analysed separately, using default settings and a similar implementation of Libra software for TMT 11-channel labelling.

Secondary Analysis and Visualisation
Data exported from ProteinProphet were processed using R (version 4.1.2) in RStudio (version 2021.09.1, build 372) software. Protein assignments with low ProteinProphet probability scores were filtered to maintain a 1% protein level FDR, followed by filtering of human proteins, and then proteins that could not be uniquely identified to a single species. In addition, only proteins that were inferred by at least 2 unique peptides (minimum length of 9 amino acids) and observed in ≥2 matched patient samples were retained.
Filtered high-probability protein assignments were used as best indicators of species inference. Assignment to OSCC, NAT, or both conditions was based on log 2 fold-change cut-offs (log 2 OSCC intensity N AT intensity ). Species inferred by proteins with an intensity log 2 foldchange ≤−1 were assigned to the NAT condition, whilst conversely, species inferred by proteins with intensity log 2 fold-change ≥1 were assigned to the OSCC condition. Species inferred by proteins between these cut-offs were allocated to a "core" condition present in both OSCC and NAT samples. For species inferred by more than one protein, the mean log 2 fold-change of all proteins was used to assign the species to a particular condition. Inferred species were matched against the NCBI taxonomy database (https: //ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump; accessed on 1 February 2022), and taxonomic information was visualised as sunburst plots using the 'plotly' package [92].
The 'gplots' package was used to perform unsupervised hierarchical clustering of proteins that were observed in ≥50% of patient samples [93]. Complete linkage clustering was used with Euclidean distance, with an additional log 10 transformation of the protein fold-change data, and these data were visualised as heatmaps.
Statistical pairwise t-testing was performed on all significantly identified proteins to determine differentially abundant proteins between NAT and OSCC based on the foldchange of Libra values (corresponding to the relative intensity of the TMT-labels). Adjusted p-values were calculated using Benjamini-Hochberg correction, with an adjusted p-value cut-off of p < 0.05 for significance. Results of this statistical analysis were visualised as volcano plots using the 'ggplot2 package [94].

Conclusions
The current study examined the oral microbiome of a public oral cancer dataset using a metaproteomic bioinformatic approach. To the best our knowledge, this is the first such report with low numbers of viral and, surprisingly, bacterial species identified, complemented by a high fungal diversity. This was likely due to biases within the dataset, including the use of surface disinfectants. Though poorly characterised, fungal proteins differentially abundant in OSCC implicated several species capable of causing ulceration and soft tissue damage, including V. gallopava, S. racemosum, and D. cristalligena. Hierarchical clustering of the fungal protein profile also resulted in the best separation between the NAT and OSCC conditions, suggesting that the understudied oral mycobiome may have diagnostic biomarker potential for OSCC. Future experiments withholding the use of surface disinfectants would be needed to further validate these findings. A multi-omics approach combining metagenomic, metatranscriptomic, and metaproteomic experiments on the same set of patient samples-balancing all cancer stages and, specifically, from erosive and productive phenotypes-would provide biomarkers with increased confidence. Additionally, potential further work comparing the oral cancer microbiome to other epithelial cancers would allow for the identification of common microbial species between these that may also provide insight into disease aetiology. Data Availability Statement: The R code used for the secondary analysis of the MS data can be found at GitHub through the following link: https://github.com/hehestevenhe/MDPI_OSCC_ microbiome/tree/main/RFiles (accessed on 2 June 2021).